Task force/Offline/IRC/2010-01-05

From Strategic Planning

walkerma: Hi Philippe! Amgine, hello! Kelson, wizzy, are you there?

[10:01am] Kelson: walkerma: hi
[10:01am] wizzy: walkerma:
[10:01am] walkerma: Great - shall we get started? Heiko can't make it today, but he can come on Thursday, so I suggest we focus on the recommendations for today. Is that alright?
[10:02am] walkerma: Or is there anything else that urgently needs to be covered?
[10:02am] Amgine: Breakfast.
[10:02am] Amgine: But we can skip that for now.
[10:02am] Philippe|Wiki: oh yes, Breakfast. critical
[10:02am] walkerma: There are a couple of un-discussed things we should probably at least touch upon today, even if it's just as part of "let's not forget.."
[10:03am] wizzy: past dinnertime here
[10:03am] Amgine: <hasn't read the last two logs... went through the holiday vacation nightmare>
[10:03am] Laura|Away is now known as LauraHale.
[10:04am] walkerma: We are now one week away from submitting our recommendations, and so at the end I'd like us also to consider what "leg work" and research still needs to be done, before it's too late!
[10:04am] • Philippe|Wiki nods.
[10:04am] walkerma: Amgine - yes, I was away too, I'm just getting back into things - fortunately I have something very precious this week - TIME!
[10:04am] walkerma: (i.e. I'm not teaching, grading etc)
[10:05am] wizzy: I don't like all the new-fangled stuff on the strategy wiki - watchlists don't really work, and the fluid threads thing I hate
[10:05am] Philippe|Wiki: grrrrrrr, where's my bot?
[10:05am] Philippe|Wiki: There aren't any changes to watchlists, as far as I know....
[10:06am] walkerma: Have you looked over Recommendation No. 1 recently?
[10:06am] Amgine: Agree about the question of research.
[10:06am] Amgine: Uhm, just a second.
[10:06am] walkerma: http://strategy.wikimedia.org/wiki/Talk:Task_force/Recommendations/Offline_1
[10:07am] Philippe|Wiki: (just FYI, a full list of your recommendations can be found at http://strategy.wikimedia.org/wiki/Task_force/Recommendations/Offline)
[10:07am] walkerma: Sorry, that was the talk page for it
[10:07am] Amgine: Just made a save, so reload
[10:07am] randmontoya left the chat room. (Remote closed the connection)
[10:07am] randmontoya joined the chat room.
[10:08am] Amgine: Okay, I supposedly have access to create that survey on a WMF limesurvey server.
[10:08am] Amgine: I haven't done so yet, because permissions to do so arrived around the time I was leaving on holiday nightmare 2009.
[10:08am] wizzy: The Static HTML tree dumps:[3] are no alternative <-- Maybe they are not, but for me they are an essential end product
[10:09am] walkerma: Amgine - can you send that survey out in the next day or two? Do you need specific info from us?
[10:09am] _jfelipe left the chat room.
[10:09am] walkerma: Such as people to send the survey to?
[10:09am] Amgine: Well, take a look at the questions and answers I have on the talk page and give feedback.
[10:09am] walkerma: Or do you have a list of "targets"!?
[10:10am] Amgine: And then, yes, I'll want everyone to give me a list of targets to e-mail with invitations.
[10:10am] walkerma: Amgine: I think the survey looks excellent!
[10:10am] wizzy: url?
[10:10am] Amgine: http://strategy.wikimedia.org/wiki/Talk:Task_force/Recommendations/Offline_1
[10:10am] walkerma: (As far as my limited tech skill lets me follow it)
[10:11am] Amgine: It's a logic-branching survey, so I think the most questions anyone will see is 5.
[10:11am] wizzy: I see no survey - or do you want comments on that page ?
[10:11am] walkerma: It's a draft of the proposed survey
[10:12am] Amgine: wizzy: it's a series of questions, and answers, which would be presented to developers and project managers. And yes, I'd love to have comments.
[10:13am] wizzy: I guess my main comment is that I wish to be able to generate the non-semantic HTML dumps somehow
[10:13am] walkerma: Kelson: What do you think about this survey? And what do you think about the recommendation in general? You're currently one of the most active processors of content, so your input would be very useful
[10:13am] Amgine: That's currently possible, wizzy, through several tools.
[10:14am] Kelson: walkerma: I have to read it.
[10:14am] wizzy: with pictures ?
[10:14am] walkerma: wizzy: If we're looking to make re-use easy, I think we need to be able to offer several formats
[10:15am] walkerma: wiki-syntax, semantic XML, basic HTML, openZIM
[10:15am] walkerma: And maybe a mobile phone format too
[10:15am] walkerma: in time, at least
[10:15am] wizzy: I certainly agree a higher abstraction is necessary - but I don't want to lose sight of my needs
[10:15am] Amgine: Picture dumps is an interesting question I haven't looked at, actually.
[10:16am] walkerma: Is that reasonable? Or am I completely off base here?
[10:16am] wizzy: I want to be able to generate a raw, apache-serveable, html tree with a Makefile or script
[10:18am] wizzy: without redlinks, with pictures, categories a bonus
[10:18am] Amgine: wizzy: are you referring to a mirror of a WMF site?
[10:18am] wizzy: yes - for our article subset
[10:19am] wizzy: just like BozMo's schools-wikipedia
[10:19am] Amgine: As far as I am aware, there is no automated system which can do that. There are two problems:
[10:20am] walkerma: Like the guy from Texas did - right?
[10:20am] Amgine: 1) images are not available via the dump.
[10:20am] wizzy: walkerma: yes
[10:20am] Amgine: 2) it requires removing redlinks unless you are only hiding them via css.
[10:20am] Amgine: #2 could be automated. I just don't know of anyone who has done so.
[10:21am] wizzy: Kelson did it
[10:21am] Amgine: #1 could be automated, but it would not be via the dump.
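The redlink clean-up discussed here (Amgine's point #2) is straightforward to script. A minimal, hypothetical sketch, not any of the tools mentioned in the log, assuming MediaWiki's usual markup in which links to missing pages carry `class="new"`:

```python
import re

def strip_redlinks(html: str) -> str:
    """Replace MediaWiki red links (links to nonexistent pages,
    marked with class="new") with their plain link text."""
    # MediaWiki renders a redlink roughly as:
    # <a href="/w/index.php?title=Foo&action=edit&redlink=1"
    #    class="new" title="Foo (page does not exist)">Foo</a>
    pattern = re.compile(r'<a [^>]*class="new"[^>]*>(.*?)</a>', re.DOTALL)
    return pattern.sub(r'\1', html)
```

Applied over a static HTML tree, this keeps ordinary blue links untouched and unwraps only the red ones, which is the "removing" half of the CSS-or-strip choice Amgine mentions.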
[10:21am] Philippe|Wiki: Is wizzy's need a common one, or is it a one-off? I'd rather stay focused on high level things - if it's something that many people need to do, fantastic... if it's something that only wizzy needs, I'm not sure it needs to be a specific recommendation. I'm not tech enough to evaluate.
[10:21am] walkerma: Amgine: Tom Bylander has done this for the Version 0.7 set, by taking Kelson's kiwix set (which has redlinks removed etc) and processing that
[10:22am] Kelson: Amgine: I have a more or less fully automated system
[10:22am] Amgine: How?
[10:22am] Amgine: And is it from the dump?
[10:22am] walkerma: Unfortunately both the Kiwix server and the Texas one are both down at the moment. http://ai.cs.utsa.edu/wikipedia0.7/
[10:23am] walkerma: http://en.mirror.kiwix.org/index.php/Main_Page
[10:23am] wizzy: I see it as the most universally-usable format - give it to someone on a USB stick, serve it in a small LAN school environment
[10:23am] Amgine: wizzy: Have you seen linterweb/okawix?
[10:23am] Kelson: walkerma: which kiwix server is down?
[10:23am] walkerma: The one I just linked to
[10:24am] Amgine: en.mirror.kiwix.org
[10:25am] Typhoon left the chat room. (Read error: 104 (Connection reset by peer))
[10:25am] Kelson: walkerma: yes, that's normal... I think we do not need it currently... I will migrate it from my personal PC to my Kiwix server for the next version.
[10:25am] wizzy: Amgine: I have. I believe there is a way it can be used to serve the content, but I would just wget -r the whole thing to get what I want
[10:25am] Kelson: readonly version is available here: http://tmp.kiwix.org:4201/A/1WJP
[10:26am] walkerma: Kelson: Thanks
[10:26am] Kelson: Amgine: think we can speak about that later, OK?
[10:26am] Amgine: Wizzy: They sell usb drives of whichever WMF project you are interested in.
[10:26am] Amgine: They also sell usb drives with whatever subset of articles you want from that project.
[10:27am] Typhoon joined the chat room.
[10:27am] Amgine: Kelson: Sure.
[10:27am] wizzy: Kelson's is great
[10:27am] walkerma: OK, I'd like to get back to the recommendations.
[10:28am] walkerma: Can we propose the 4-5 formats I suggested?
[10:28am] Huib left the chat room.
[10:29am] walkerma: If so, how will each one be produced? Is Kelson producing suitable output? If so, can we propose that he be supported by the WMF?
[10:30am] Amgine: You have 4 formats, but consider that media dumps are simply not available at the moment (although Commons media: namespace text is available in a dump.)
[10:30am] StrategyBot joined the chat room.
[10:31am] wizzy: is 'semantic XML' at the top of the tree ?
[10:32am] walkerma: Amgine: Kelson's output contains pictures. If you look at this page:
[10:32am] walkerma: http://tmp.kiwix.org:4201/A/KRE
[10:32am] walkerma: you see that it contains a picture and a graph, but the video content is missing
[10:32am] Kelson: For people interested in a solution to automatically get an importance-sorted list of articles for any WP... I can explain a little bit how my scripts work after the meeting.
[10:32am] Amgine: I am looking at that. It's not an archive of content.
[10:34am] walkerma: Kelson: Yes, I want to talk on the phone with you- sorry I was too late yesterday to do so
[10:34am] Kelson: walkerma: about video... there are a few limits currently (and one is the ZIM format itself)... but this will be fixed soon (not too complicated).
[10:34am] walkerma: Kelson: Not a real problem right now, there are so few videos on WP anyway. I knew about that one cos I uploaded it!
[10:34am] Kelson: walkerma: we are also starting with openzim to speak about streaming capabilities in the zimlib.
[10:35am] walkerma: Is that page currently in openZIM format?
[10:35am] walkerma: Or still in wikisyntax?
[10:35am] walkerma: Sorry if it's a stupid question
[10:36am] Kelson: walkerma: Everything you see at tmp.kiwix.org:4201 is a ZIM file: http://tmp.kiwix.org/zim/0.8/wikipedia_en_wp1_0.7_30000+_05_2009_beta2.zim
[10:36am] Kelson: walkerma: the HTTP server is a slightly modified version of the openzim standard zimreader
[10:37am] walkerma: So I see a dump tree going from raw Wikisyntax to openZIM, and from there to raw HTML and to XML. Is that right? Please forgive me if I
[10:37am] walkerma: am completely off here!
[10:37am] Kelson: wizzy: I have also not forgotten your wish for an easy HTTP zimserver, packaged and with a working search engine... I'm working on it.
[10:38am] walkerma: That seems (to me) to be how things are working right now
[10:38am] wizzy: Kelson ++
[10:38am] walkerma: Kelson: What did you think of Tom Bylander's HTML version made from the Kiwix mirror?
[10:38am] Kelson: walkerma: almost right... but it has nothing to do with XML
[10:39am] walkerma: Kelson: If Amgine wanted to create a semantic XML version from http://tmp.kiwix.org:4201/A/1WJP could she do so?
[10:39am] walkerma: or he?
[10:39am] Kelson: walkerma: this is what people need in school... and in fact everywhere you have a local LAN.
[10:39am] Amgine: Kelson: I think we need to ask questions of you now, since we are not getting beyond this point.
[10:40am] Amgine: What is the source from the WMF? are you using a dump?
[10:41am] Kelson: Amgine: I do not use WMF XML dumps, I mirror directly against a mediawiki instance... in fact against a wikipedia live instance.
[10:42am] wizzy: but you re-create that live instance from the XML dump ?
[10:42am] Amgine: <nods> This is what I thought. So nothing about recommendation 1 relates to your product.
[10:42am] Kelson: Amgine: but I could work with XML dumps... should have to write a parser.
[10:42am] Amgine: Which is what recommendation 1 is about.
[10:43am] Amgine: I don't know why you do so, though, if you're using the API to create your .zim files.
[10:43am] Amgine: why you would do so...
[10:43am] Kelson: Amgine: yes that is true (about recommendation #1)
[10:43am] wizzy: but it is more than just a parser - it includes pictures ? Or is that the easy bit ?
[10:43am] Kelson: Amgine: I'm not directly concerned... although the topic is interesting
[10:43am] Amgine: Dumps do not contain media, wizzy.
[10:44am] walkerma: Amgine: I took recommendation to mean - we need an easy-to-use version of WP content. Could some of the official "dump"s be in Kelson's format?
[10:44am] wizzy: so parsing the XML dump is not enough to create a zim file
[10:44am] Amgine: Getting the pictures would have to work the way Kelson already does it: pulling them from the server before creating the offline product.
[10:45am] Kelson: Amgine: I do so because (1) I want a general solution... not WMF-oriented (2) a few things are not in the dumps (like pictures) (3) I always work only on selections, etc.
[10:45am] Amgine: walkerma: Yes, they could be, but that first recommendation is about how things are currently done, and how we can do things to make that easier.
[10:46am] walkerma: Can we recommend that Kelson's releases be made part of the "official" system?
[10:46am] Philippe|Wiki: You can recommend anything you want.
[10:46am] Amgine: Kelson: I think your solution is excellent. However, I suspect the WMF would prefer if all third parties did not use live sources all the time, as they would prefer to reduce server loads.
[10:46am] Kelson: Amgine: a point: I do not build a ZIM against the data delivered by the Mediawiki Web services... but I build a Mediawiki mirror with this data, and after that I build a static HTML dump against the mediawiki mirror... and with this static HTML, I build a ZIM file.
[10:46am] walkerma: We're recommending what changes should be made, surely?
[10:47am] wizzy: We need One True Source - a parser for the XML, and after a script to pull the media
[10:47am] Amgine: Kelson: Could you explain that? You build a mirror from a dump?
[10:48am] Kelson: Amgine: I build a Mediawiki mirror... and I fill it with the data delivered by the source Mediawiki instance.
[10:48am] Kelson: Amgine: in fact I have a script which can mirror contents from a source Mediawiki to a target Mediawiki.
[10:49am] Amgine: Using API?
[10:49am] walkerma: When we met with Erik, Kul, Tomasz, Brion at Wikimania, they seemed very happy to endorse Kelson's work and the openZIM format
[10:50am] Kelson: Amgine: yes, using the API... for example http://it.mirror.kiwix.org is a mirror of roughly the 120,000 most important articles of it.wikipedia.org
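Kelson's API-based mirroring relies on standard MediaWiki API calls. As a rough sketch of the kind of request involved (the function name and base URL here are illustrative, not Kelson's actual script), `action=query` with `export` returns pages as export XML:

```python
from urllib.parse import urlencode

def api_export_url(api_base: str, titles: list[str]) -> str:
    """Build a MediaWiki API URL that exports the given pages as XML.
    action=query with export/exportnowrap is a standard MediaWiki API
    call; the api_base passed in is illustrative."""
    params = {
        "action": "query",
        "titles": "|".join(titles),  # the API accepts batches of titles
        "export": 1,
        "exportnowrap": 1,
        "format": "xml",
    }
    return api_base + "?" + urlencode(params)
```

A mirroring script would fetch such URLs in batches and import the resulting XML into the target wiki, which is broadly the source-to-target copy Kelson describes.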
[10:51am] Amgine: Kelson: how long does this take to mirror? is it faster than downloading a dump and importing it via mwdumper?
[10:51am] Amgine: Also, how do you feed recent changes?
[10:51am] walkerma: Maybe we could get Tomasz to start producing openZIM and static HTML versions of Wikimedia content, and putting these out as alternative versions to the wikisyntax dumps?
[10:52am] Philippe|Wiki: Time check: we're 50 minutes into the mtg and still on the first recommendation
[10:52am] Kelson: Amgine: using MWdumper is maybe faster... but mwdumper is not able, for example, to tell you which templates are necessary for a page.
[10:52am] Amgine: That would likely be a good choice, but not 'instead of' but 'in addition to'
[10:52am] Kelson: Amgine: mirroring the text of 100.000 articles is not fast.. but it's ok.
[10:53am] Kelson: Amgine: I think like you....
[10:53am] Amgine: Kelson: I understand what you're doing. How are you feeding recent changes?
[10:53am] walkerma: Philippe: Yes, noted! Can we bring this to a close?
[10:54am] Kelson: Amgine: if I had enough time... I would code an XML parser to be able to have the choice.
[10:54am] walkerma: We need to move onto nos. 2 and beyond
[10:54am] Kelson: Amgine: Just going through the list again, this is currently the only solution.
[10:54am] Amgine: Okay, let us do so.
[10:55am] walkerma: But before we do, I'd like to ask that we agree on what goes into the main proposal for no. 1. What are we recommending?
[10:56am] walkerma: So far, it mainly says "What we do right now is not good enough"
[10:56am] walkerma: We have a heaven-sent opportunity to recommend what SHOULD BE DONE
[10:57am] wizzy: I think we recommend a parser for XML, that can generate openzim or html or whatever, and then something to fill in templates and media
[10:57am] walkerma: if we want to extend the reach of WM projects around the world
[10:57am] wizzy: maybe a something that talks to MW-API for templates and media
[10:58am] Amgine: Well, my top choice of recommendations is completely different: Standardize article content and structure across projects and languages.
[10:58am] walkerma: Aha! This gets to a question I had wanted to raise
[10:58am] walkerma: I see two completely different aspects buried inside no. 1
[10:59am] walkerma: (a) We need more metadata to allow us to structure and organize the content
[10:59am] walkerma: (b) We need formats for re-using content that are easier to work with
[11:00am] Amgine: <nods> Two recommendations out of one assertion is not unexpected.
[11:00am] walkerma: These may (or may not) be related - but we need to address BOTH parts of this in our recommendation, or possibly split them into two if necessary (I'd prefer not)
[11:01am] wizzy: I think we need one master source - we shouldn't recommend openzim *and* xml
[11:01am] walkerma: I think wizzy, for example, is mainly focusing on (b) while Amgine is more interested in (a)
[11:02am] walkerma: So I think that however we organize the recommendations, we need to come up with concrete answers to (a) and (b) separately (while considering the other)
[11:02am] Amgine: <nods>
[11:02am] walkerma: We've mainly focused on (b) today. Let's try to get a consensus on that first
[11:03am] walkerma: I described how the system works currently - but I accept that may not be the best way.
[11:03am] Amgine: For me, these different formats resolve to "presentation" rather than "data" or "logic".
[11:03am] wizzy: I am quite happy that b) is generated from a)
[11:04am] Amgine: So, I think most presentations should be easily supported by WMF.
[11:04am] walkerma: Kelson: Should we have an XML format as the main format for re-use?
[11:04am] walkerma: Or should it be openZIM? Or both?
[11:04am] wizzy: but if there is no way to get (b) from (a), I have a problem
[11:05am] walkerma: wizzy said earlier:"I think we recommend a parser for XML, that can generate openzim or html or whatever, and then something to fill in templates and media"
[11:05am] Kelson: walkerma: I think XML exports, as we know them, is necessary. This is the best solution if you want to mirror a whole WP (for example).
[11:05am] walkerma: Is wizzy's statement our consensus?
[11:06am] walkerma: Kelson: Thanks. Are OpenZIM versions more suitable for smaller selections, then, say <= 100k articles?
[11:07am] Kelson: walkerma: the ZIM file format and the zimlib can deal with huge amounts of data... and at the same time can work on really small devices.
[11:07am] Kelson: this is pretty impressive.
[11:08am] wizzy: Kelson: do the templates come as a separate xml dump ? should we write a second parser for them ?
[11:08am] Amgine: I think OpenZIM is a great model for offline or limited internet access. It's the best model for reading, rather than editing.
[11:08am] Kelson: Our smallest device is the Ben NanoNote: http://www.qi-hardware.com/products/ben-nanonote/
[11:09am] Kelson: wizzy: no, they come in the same dump if I remember correctly... but you do not know which ones you need if you work on a selection.
[11:10am] walkerma: Kelson: If XML exports were regularly available from the WMF, would they become your source for generating openZIM versions of content? Or is your current method more efficient?
[11:10am] Kozuch_ joined the chat room.
[11:10am] Amgine: Templates are available in some dumps, not all. There is at least one proposal for a dump with expanded templates. API has a module for expanding templates on the fly.
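Knowing which templates a selection needs, the gap Kelson says mwdumper leaves, can be approximated by scanning wikitext for transclusions. A simplified, hypothetical sketch (real template syntax has nesting, parser functions and magic words that a complete solution would have to handle):

```python
import re

def referenced_templates(wikitext: str) -> set[str]:
    """Collect the names of templates transcluded by a piece of
    wikitext. A rough first pass: nested transclusions and parser
    functions ({{#if:...}} etc.) are deliberately ignored."""
    names = set()
    for match in re.finditer(r"\{\{\s*([^{}|#]+?)\s*[|}]", wikitext):
        name = match.group(1).strip()
        # Page titles are case-insensitive in their first letter
        names.add(name[0].upper() + name[1:])
    return names
```

Running this over every article in a selection yields the template set that must be fetched alongside the article text.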
[11:11am] Kelson: walkerma: I currently have no reason to choose another way; mine works fine.
[11:11am] wizzy: (content xml, template xml) -> PARSER (article list, output format) + (some media puller) -> final output format
[11:12am] Amgine: The Mediawiki parser does not have a written specification, therefore there are no third-party parsers which are complete.
[11:12am] Amgine: *and* they keep modifying the ... thing.
[11:12am] wizzy: so let us recommend the writing of one, to Kelson's specifications
[11:12am] walkerma: OK, wizzy, could you do a rewrite of recommendation no. 1 taking this IRC chat into account? Amgine, Kelson, could you review this and edit as needed?
[11:13am] Amgine: Sure.
[11:13am] wizzy: yes
[11:13am] Kelson: walkerma: one point: I currently need the SQL dumps to deal with automatic selection.
[11:13am] walkerma: Try to take the broadest view, bearing in mind multifarious end uses
[11:13am] Amgine: multinefarious end users...
[11:13am] wizzy: but you can generate the SQL dumps from the XML dumps ?
[11:14am] walkerma: Kelson - thanks, noted!
[11:14am] Kelson: walkerma: ok, I will review http://strategy.wikimedia.org/wiki/Task_force/Recommendations/Offline
[11:14am] Amgine: wizzy: Not directly, no.
[11:14am] wizzy: what is missing ?
[11:14am] Philippe|Wiki: Kelson, just FYI, that reccs page is made of substs of the actual recommendation pages, but you'll see their addresses when you review it.
[11:14am] Amgine: It needs to be imported to MySQL, and then an SQL dump generated from that.
[11:15am] wizzy: but in principle all the information is there, and this hypothetical parser could do it ?
[11:15am] Kelson: wizzy: hmmm... yes, that should be true, but using them directly (I do not need all of the information) is quicker.
[11:15am] Kelson: wizzy: ... and this is complicated enough ... http://kiwix.svn.sourceforge.net/viewvc/kiwix/selection_tools/build_selection.sh?revision=791&view=markup
[11:15am] wizzy: quicker and easier is why we write this recommendation
[11:16am] walkerma: OK, can we get to (a), namely to talk about the metadata aspect of this? We seem to be drifting in that direction anyway
[11:16am] walkerma: A reminder:
[11:16am] walkerma: (a) We need more metadata to allow us to structure and organize the content
[11:17am] walkerma: No. 1 currently says "Provide a (real) XML dump that represents wiki pages as semantically annotated XML rather than wiki text"
[11:17am] walkerma: I want us to focus on the "semantically annotated" part of that
[11:17am] Amgine: <grin> walkerma: if you have time after the meeting, I'd love to talk with you about metadata vs structure (implied metadata)
[11:18am] walkerma: Amgine: Certainly - can we talk by phone? I think they have office hours here on IRC in 40 minutes time
[11:18am] Philippe|Wiki: no
[11:18am] Philippe|Wiki: not 'til tonight
[11:18am] Amgine: Mmm, maybe. Or you can join me in #Pleonasm
[11:19am] walkerma: I realise that the two things are related
[11:19am] walkerma: structure and metadata, that is
[11:19am] Amgine: <nods>
[11:19am] walkerma: But what I'd like to bring in is some of the other things that have been mentioned
[11:19am] walkerma: Such as the UDC category system, the Dublin Core for XML, etc
[11:20am] Amgine: <nods>
[11:20am] walkerma: http://strategy.wikimedia.org/wiki/Proposal:Dublin_Core
[11:20am] walkerma: http://strategy.wikimedia.org/wiki/Task_force/Offline/UDC_categorisation
[11:21am] walkerma: This all needs to be brought together under the umbrella of one broad recommendation, I think
[11:21am] walkerma: Amgine: Can you give us your view on this?
[11:22am] Amgine: Mmm, let me take a quick look at Dublin Core
[11:22am] Amgine: <just ran downstairs to grab the sandwich meats sans bread>
[11:23am] sherrod joined the chat room.
[11:23am] peteforsyth_ joined the chat room.
[11:25am] Amgine: Okay... Categorization first:
[11:26am] Amgine: The possibility of using sane, standardized knowledge categorization would be a major step toward simplifying processing of Wikipedia content, especially for applications such as OpenZIM readers.
[11:27am] Amgine: The UDC schema allows the most nuanced categorization, and may be used without copyright issues *except* we would not be able to actually publish the categorization without a license.
[11:28am] walkerma: Maybe we could reach an agreement with them?
[11:28am] Amgine: Possibly, but it is the core of their business model.
[11:28am] millosh left the chat room. (Read error: 60 (Operation timed out))
[11:29am] walkerma: I negotiated such a deal with Chemical Abstracts - we can now openly publish about 8000 Chemical Abstracts numbers. That deal came out of a threat to sue us, originally...!
[11:29am] walkerma: Anyway, continue!
[11:29am] Amgine: Dublin Core is a great idea, conceptually. I'm not sure how it would work in application on unstructured flat files.
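For context, a Dublin Core serialization of article metadata might look like this sketch. The `metadata` wrapper element and the field values are illustrative; the element names (title, subject, language, ...) come from the fifteen standard DC elements:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

def dublin_core_record(fields: dict[str, str]) -> str:
    """Serialize simple article metadata as Dublin Core elements.
    Field names should be DC element names; the wrapper element
    is a placeholder, not part of the DC standard."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for name, value in fields.items():
        el = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        el.text = value
    return ET.tostring(root, encoding="unicode")

record = dublin_core_record({
    "title": "Zinc",
    "subject": "Chemistry",
    "language": "en",
})
```

The open question Amgine raises stands: emitting such records is trivial, but deriving the field values from unstructured flat wiki pages is the hard part.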
[11:31am] Philippe|Wiki: OK, i'm going to have to move on to my next appt before long. I'll keep myself in this channel though, so that my log finishes.
[11:32am] Amgine: Tah Philippe|Wiki
[11:32am] walkerma: Philippe: Thanks! Hopefully we'll get through this soon!
[11:32am] Philippe|Wiki: Some parting thoughts, though: don't get in the weeds.
[11:32am] Philippe|Wiki: Keep your ideas high level, and we can come back to how to execute them.
[11:32am] Philippe|Wiki:
[11:32am] Amgine: <thinks we're well and truly aground already>
[11:33am] Philippe|Wiki: ....or possibly, grounded
[11:33am] You are now known as Philippe|Away.
[11:34am] walkerma: What I'd like to know is how much of the metadata would come from (a) manual tags etc. added by editors; (b) autogenerated tags generated by bots or scripts (as we did when generating the index for Version 0.7); and (c) what could be produced automatically in other ways
[11:35am] walkerma: For (a) I mean NEW metadata. (c) Might use existing categories, etc but process them to generate more useful metadata
[11:35am] lyzzy left the chat room.
[11:36am] Amgine: With the SMW initiative, most of the metadata would come from (a), and would require additional accretions to the parser.
[11:36am] wizzy: for (c) how about checking last editor and flagging potential good revisions ?
[11:36am] walkerma: wizzy - YES!
[11:36am] peteforsyth left the chat room. (Read error: 110 (Connection timed out))
[11:36am] peteforsyth_ is now known as peteforsyth.
[11:37am] Amgine: I'm currently working with developers who have been parsing Wikipedia and Wiktionary content for linguistic/dictionary data.
[11:37am] Amgine: From *structure* they are now able to build about 4 million metadata tokens.
[11:37am] howief joined the chat room.
[11:38am] walkerma: Amgine: Can you indicate some examples of these tokens?
[11:38am] Amgine: http://toolserver.org/~hippietrail/enwiktjsontrans.xml
[11:38am] Amgine: This is a few pages of *just* the gloss translations.
[11:39am] Amgine: I think fully extended this would be about 80 million tokens, just the word translations on en.wiktionary.
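The structure-derived tokens Amgine describes can be illustrated with en.wiktionary translation templates such as {{t|fr|chat}}. A simplified, hypothetical extractor (real entries carry gender, transliteration and other parameters that it ignores):

```python
import re

def gloss_translations(wikitext: str) -> list[tuple[str, str]]:
    """Extract (language code, translation) pairs from
    en.wiktionary-style {{t|...}} / {{t+|...}} templates.
    Only the first two positional parameters are read."""
    pairs = []
    for m in re.finditer(r"\{\{t\+?\|([a-z-]+)\|([^|}]+)", wikitext):
        pairs.append((m.group(1), m.group(2)))
    return pairs
```

Each pair is one metadata token in Amgine's sense: no editor typed "fr means chat"; the fact is implied by where the template sits in the entry's structure.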
[11:40am] Amgine: When you begin to consider that quantity of data generation you begin to realize why the WMF devs really don't want much to do with metadata processing.
[11:40am] wizzy: and on wikipedia - what tags ?
[11:40am] walkerma: One of the main purposes of the 1.0 project has been (indirectly) to generate tons of metadata for en:WP. WikiProjects supply the data through manual tags, and they do this because they get a nice format of metadata about their own articles - VERY useful for them, and also for us
[11:41am] Amgine: wizzy, on wikipedia they are mining the infoboxes.
[11:41am] Amgine: Let me see if I have a link to some of the data.
[11:41am] walkerma: Amgine - Yes, that is very important
[11:42am] Amgine: walkerma: They are focused exclusively on the language data... and no I don't have the links in this OS; they're in the linux side of this hd.
[11:44am] walkerma: Amgine: Could you & I work on listing some of the tags we'd like to see? Could we try to make these conform to Dublin Core? This list needn't be complete - just some examples to show the overall strategy
[11:44am] Amgine: Mmm, I am not sure I know enough to be of much help?
[11:44am] walkerma: Could you write some examples for Wiktionary, at least?
[11:45am] walkerma: I don't know that any of us are real experts here..
[11:46am] walkerma: I may ask Gerard Meijssen for his thoughts
[11:46am] Amgine: Heh.
[11:46am] millosh joined the chat room.
[11:46am] Amgine: Let me ask KipCool, too.
[11:47am] walkerma: This page shows an example of a database made from Chemboxes. Of course there is dBpedia as well!
[11:47am] walkerma: http://wikipedia.chemspider.com/Search.aspx?q=all
[11:48am] walkerma: A lot of the data on that page is generated by code on the ChemSpider servers
[11:48am] Amgine: <phone>
[11:49am] wizzy: that is a lot of stuff
[11:49am] walkerma: OK, I don't think we have time to cover no. 2 properly.
[11:49am] walkerma: (I mean recommendation no 2)
[11:50am] walkerma: So I think that may need to wait until Thursday. But I'd like for us to decide what we need to research before then, and which emails need to be sent
[11:50am] wizzy: do we need to schedule another meeting ?
[11:51am] walkerma: In my email, I proposed Thursday at this time, then again next Tuesday before we submit our final proposals
[11:51am] wizzy: ok
[11:51am] walkerma: I also think we should use the on-wiki discussion pages - I'll start a thread or two this afternoon
[11:53am] walkerma: There is one topic we haven't really addressed at all, namely the interaction between the offline and online versions of projects. This has two main aspects - updating people's collections (probably relatively straightforward) and editing/writing offline then uploading
[11:54am] walkerma: I will start a thread on that on-wiki, since we don't have time to discuss that aspect now. I've asked SJ a set of 11 questions via email, and that topic is one that interests him and the OLPC people, I think
[11:55am] walkerma: You can also see some of this in this proposal:
[11:55am] walkerma: http://strategy.wikimedia.org/wiki/Proposal:Distributed_Wikipedia
[11:56am] walkerma: So, I think Amgine is going to work on the survey, and perhaps add something on metadata???
[11:56am] walkerma: I will also contribute to that
[11:56am] walkerma: wizzy will do a rewrite on no. 1
[11:57am] walkerma: Kelson will also contribute to the technical description on formats for dumps etc
[11:57am] Amgine: sorry about that,
[11:58am] walkerma: Is that right? We will also need some "reliable sources" to support some of our "facts" and assertions in these recommendations
[11:58am] Amgine: <nods>
[11:58am] wizzy: ok
[11:59am] walkerma: Perhaps for Rec 1 metadata topic (a) up above), Amgine and I can find some relevant support for these, while Kelson and wizzy could find sources to back up their assertions for (b) above (the formats for dumps)
[12:00pm] walkerma: I will contact Gerard and Amgine will contact KipCool
[12:00pm] walkerma: I will also contact the other Cellphone people regarding the cellphone releases for Rec no. 2
[12:00pm] walkerma: Is that it? What else do we need to dig up?
[12:01pm] wizzy: I see my refactoring as trying to streamline Kelson's process - and make sure that we deal with any missing info, like templates and media
[12:07pm] walkerma: wizzy - can you make it on Thursday?
[12:07pm] Amgine: I was snowbound over the holidays in North Dakota.
[12:07pm] wizzy: walkerma: I think so
[12:08pm] walkerma: Great. Kelson - will you be around on Thursday at 1800h UTC?
[12:08pm] wizzy: Amgine: fun people ?
[12:08pm] Thorncrag is now known as Thorncrag_.
[12:09pm] Amgine: No. The state is not known for its entertainment or intellectual qualities.
[12:11pm] wizzy: kinda why I asked
[12:12pm] walkerma: OK, should we close there?
[12:12pm] wizzy: bot: off
[12:15pm] walkerma: Talk to you on Thursday