Task force/Offline/IRC/2009-11-24

From Strategic Planning
Discussion of Wikimedia Foundation's Strategy Project
17:54 GerardM-: Where is philippe on about something ?
17:57 Amgine: bugzilla
17:57 Amgine: In 3 minutes, TF/Offline
17:58 Amgine: Hullo Hejko...
17:59 walkerma has joined (n=chatzill@admin-151-108.potsdam.edu)
17:59 hejko: Hi Amgine, hi GerardM-
17:59 hejko: hi walkerma
17:59 walkerma: hello!
18:08 hejko_: I am sorry, lost the connection and had to answer the phone. Last was "hello" by walkerma :)
18:08 Amgine: [09:00am] Amgine: So, shall we begin?
18:08 Amgine: [09:00am] walkerma: I'm wondering if Andrew Cates is here?
18:08 Amgine: [09:01am] walkerma: He was going to try to make it, but he's never used IRC before - I sent him some instructions
18:08 Amgine: [09:01am] Amgine: Hejko: didn't we start an agenda somewhere?
18:08 Amgine: [09:02am] walkerma: Last time we agreed to discuss  (a)  What is the best way to reach schools in developing countries? (b) How should we produce, publish and distribute electronic releases? Book releases?
18:08 Amgine: [09:02am] Amgine: <nods>
18:08 Amgine: [09:02am] walkerma: Do you have more detail?
18:08 Amgine: [09:02am] Amgine: No. I did do some research on finding other projects which are doing exactly this:
18:08 Amgine: [09:03am] Amgine: http://www.kunnafoni.org/
18:08 Amgine: [09:03am] walkerma: Good for you!  I'm sorry I've been so busy in real life/quiet on here
18:08 Amgine: [09:03am] Amgine: Of course, OLPC
18:08 Amgine: [09:03am] Amgine: For b) I have done more research.
18:08 Amgine: [09:04am] Amgine: Most reusers prefer structured data, rather than the mostly flat revision data we produce.
18:09 walkerma: Amgine: Can you explain what you mean about structured vs flat data?
18:09 walkerma: Do you mean categories, indexes, etc?
18:09 Amgine: Well, there's metadata, such as the categories etc.
18:09 hejko_: walkerma: the XML dumps are not really XML when it comes to articles which are still wiki markup
18:10 hejko_: i think reusers would prefer it if the article text were represented in XML, XHTML or something else that can be easily worked on.
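
A minimal sketch of hejko_'s point, not part of the original log. It assumes a heavily simplified, namespace-free stand-in for a dump file (real dumps carry an export namespace and many more fields); `SAMPLE_DUMP` and `page_texts` are hypothetical names. The XML layer parses cleanly, but the article body comes back as raw wiki markup that the re-user still has to parse.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one page of a dump (assumption:
# real dumps use a namespaced schema with more metadata per revision).
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Potsdam, New York</title>
    <revision>
      <text>'''Potsdam''' is a town in [[St. Lawrence County, New York|St. Lawrence County]]. {{Infobox settlement|name=Potsdam}}</text>
    </revision>
  </page>
</mediawiki>"""


def page_texts(dump_xml: str) -> dict[str, str]:
    """Map each page title to the content of its revision text element."""
    root = ET.fromstring(dump_xml)
    return {
        page.findtext("title"): page.findtext("revision/text")
        for page in root.iter("page")
    }


texts = page_texts(SAMPLE_DUMP)
# The XML parser is done here, yet links and templates are still wiki markup.
print(texts["Potsdam, New York"])
```

The printed text still contains `[[...]]` links and a `{{...}}` template call, which is the part every re-user currently has to handle on their own.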
18:10 Amgine: But also most articles include an introduction, follow a semi-standard layout, and include references.
18:11 Amgine: <nods> There's something called the Text Encoding Initiative: TEI DTD
18:11 hejko_: the mwlib framework we build would allow processing dumps, expanding templates and returning a real XML representation of the articles
18:11 Amgine: It's sort of the xml swiss army knife for textual documents.
18:11 hejko_: TEI is complicated
18:11 Amgine: Very.
18:12 Amgine: hejko: One of the most structured projects is Wiktionary.
18:12 hejko_: i'd prefer to use XHTML where possible and use semantic annotations for things like galleries and timelines that are not covered by XHTML
18:12 hejko_: MathML for formulas, etc.
18:13 hejko_: we already have a working DocBook XML export prototype
18:13 hejko_: this could be a good start if that'd become a recommendation.
18:13 Amgine: <nods> What I think this TF can state, fairly categorically, is the flat revision output is a stumbling block to re-use?
18:14 walkerma: Amgine: These "reusers" you refer to - do you mean people like OLPC, or people in the country's education department, or what?  I want to clarify.
18:14 hejko_: this could be really useful for any reusers (e.g. scientific) as they currently all have the same recurring challenge: parse MW markup to something they can work with.
18:15 Amgine: A re-user would be any group trying to present or process Mediawiki content - not just Wikipedia, but primarily Wikipedia.
18:15 Amgine: So, WikiReader is one.
18:15 Amgine: But so is the Wiktionary Lookup Tool (go to http://fr.wikinews.org and double click on any non-linked word)
18:15 walkerma: OK, so you're looking at all reusers, not just schools
18:16 Amgine: Yes. In order to get stuff *into* schools, someone has to figure out a way to present the content. That usually means processing a dump.
18:16 walkerma: Understood!  I can see that a more semantic version of the information would be much more useful, though I can't see us turning all of our chemistry content into CML
18:17 walkerma: http://en.wikipedia.org/wiki/Chemical_Markup_Language
18:17 Amgine: <nods> I think that would be a project-specific effort.
18:18 Amgine: I think we're talking more about the larger structures, right hejko?
18:19 hejko_: right. but if projects choose to use microformats to encode more semantics, these of course would automatically be available to re-users as well.
18:19 walkerma: One thing I think is critical, is being able to select the most important parts of an article, such as the lead paragraph, the main data from the infobox, etc
18:20 Amgine: Template expansion into dump...
18:20 Amgine: Whoa, before we get into the technicals, that isn't really the focus of this task force, is it?
18:22 walkerma: Hi brassratgirl!  Whenever space or speed becomes a limiting factor, we need to find a way to pick the really important parts of an article
18:22 hejko_: Amgine: but we certainly could conclude that "If processing of dumps gets way easier, this will result in more successful offline projects"
18:23 brassratgirl: walkerma: interesting. like for the school packages? for what purpose?
18:23 Amgine: <nods> Agree.
18:23 brassratgirl: for my students, it's the intro & the bibliography :)
18:23 Amgine: For schools is our primary, but also other low internet accessibility regions.
18:23 Amgine: brassratgirl: Not an abstract?
18:24 Amgine: Although that might be subsumed in the concept of an intro.
18:24 walkerma: Example: For a distribution in Cameroon, we may choose NOT to include every complete article on every village and hamlet in the US.  However, it would not take up much space to say "Potsdam is a one-horse town in Northern New York, population XYZ" - which may be all anyone in Cameroon needs to know.
18:26 Amgine: walkerma: <abstract size:1024>abstract</abstract><summary size:140>Potsdam is a one-horse town...</summary>
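
A rough sketch of the abstract/summary idea above, not part of the original log. It assumes the article is already available as plain text with paragraphs separated by blank lines; `truncate`, `lead_entry` and the `entry` wrapper element are hypothetical names, and the `size:1024` shorthand is written as a regular `size="1024"` attribute so the output is well-formed XML.

```python
import xml.etree.ElementTree as ET


def truncate(text: str, limit: int) -> str:
    """Cut text to at most `limit` characters, preferring a word boundary."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) <= limit:
        return text
    # Reserve one character for the ellipsis, then back up to the last space.
    cut = text[: limit - 1].rsplit(" ", 1)[0]
    return cut + "\u2026"


def lead_entry(title: str, article_text: str) -> ET.Element:
    """Build an element holding two size-limited versions of the lead."""
    lead = article_text.strip().split("\n\n", 1)[0]  # first paragraph only
    entry = ET.Element("entry", title=title)
    for tag, size in (("abstract", 1024), ("summary", 140)):
        node = ET.SubElement(entry, tag, size=str(size))
        node.text = truncate(lead, size)
    return entry


article = "Potsdam is a town in northern New York.\n\nHistory\n\nMore text."
print(ET.tostring(lead_entry("Potsdam", article), encoding="unicode"))
```

A distribution builder could then pick `summary`, `abstract` or the full text per article, depending on how much space the target medium allows.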
18:26 brassratgirl: walkerma: It depends a lot on what one envisions wikipedia being good for; and how much of those long-term uses one might plan for. No, you don't need all the census facts; but if someone's relocating to the U.S., suddenly all of that information becomes super useful
18:26 brassratgirl: I'd like to see slices: basic encyclopedia, sustainable technologies (with appropedia), geographical...
18:27 Amgine: ping appropedia
18:27 hejko_: i assume no images, right?
18:27 walkerma: But 9 times out of 10, when I consult Wikipedia (and others have said the same), I just want to know that this is a town in Poland, or an album by Black Eyed Peas
18:28 walkerma: That can be found in the lead.
18:28 Amgine: Depends, hejko.
18:28 walkerma: Amgine is right
18:28 Amgine: I would say images are vital for school applications, but limited in size/use.
18:28 hejko_: if we include images we don't need to care about reducing the text
18:28 Amgine: <grin> very true.
18:28 brassratgirl: hejko_: the text is actually a barrier to use in some cases though
18:28 brassratgirl: when it's too technical, etc.
18:30 walkerma: Say you have 3 million articles, but you may only have thumbnails from 30,000 of them.  What do you take from the 3 million?  Not images, obviously.  But even the full text may be inappropriate in some cases
18:30 Amgine: brassratgirl: Who (in addition to OLPC) is *currently* bringing WMF content to schools
18:31 walkerma: To me, this ability to organise content is related to the semantic data discussion earlier in this meeting
18:31 brianmc: The UK's Wikipedia for Schools project Amgine
18:31 Amgine: linky brianmc
18:31 hejko_: I think we should define some use cases plus the expected delivery medium and then decide what should be included. Or the other way  round :)
18:31 walkerma: Andrew was hoping to join us
18:31 brianmc: moment...
18:31 brassratgirl: Amgine: good question, I don't know off the top of my head. Sj would.
18:32 brianmc: linkies here, plus interview with Andrew, http://en.wikinews.org/wiki/2008-09_Wikipedia_for_Schools_goes_online
18:33 Amgine: Another example: Wikisource legal libraries of case law
18:33 hejko_: targeted uses. I think there are schools with OLPCs, schools with one PC and schools that only can afford some books. I am not sure what school we are currently talking about.
18:34 Amgine: kk.
18:37 Amgine: [09:36am] Amgine: In developing a solution, the implementation may result after the problem no longer exists.
18:37 walkerma: Sorry, I had a distraction
18:40 hejko_: sorry
18:40 Amgine: Targeted uses, for me, would be OER, reference material such as an 'intros-only' wikipedia, a dictionary, and electronic-reader versions of Wikisource texts which are not better available via Gutenberg.
18:41 walkerma: Amgine: Excellent!
18:42 Kelson has joined (n=Kelson@32-2.61-188.cust.bluewin.ch)
18:43 walkerma: Hi Kelson!  OK, so for schools in particular, we're thinking we want to develop a more structured format for offline material, so we can assemble collections for targeted uses.
18:43 Amgine: More structured data dump, yes.
18:44 Amgine: btw: brianmc's Wikipedia for Schools article leads eventually to http://schools-wikipedia.org/
18:44 walkerma: Kelson: Can you tell us how OpenZIM can handle semantic information, categories, lead paragraphs, data boxes, etc?
18:44 Kelson: walkerma: hi... would be great to have a doc. with the requirements... maybe the ZIM format can do that.
18:45 Amgine: Kelson: I understand a WMF dev was at the conference this past weekend.
18:46 Kelson: walkerma: nothing about semantic... but this is a little bit vague. A more detailed requirement would be welcome
18:46 walkerma: If you're unfamiliar with OpenZIM, please look at http://openzim.org/Main_Page
18:46 Kelson: Amgine: that is true
18:46 walkerma: It's the format for offline content that has been most supported by the Foundation
18:46 Amgine: I'm very unfamiliar with OpenZim, but I don't want to waste the TF time learning, so if Kelson could give a brief overview?
18:47 Kelson: walkerma: I'm not familiar with the concept of "lead boxes" or "data boxes", explanations would be also welcome
18:47 hejko_: basic question: does it store the raw wikitext?
18:47 hejko_: or is the wikimarkup converted to something else before stored?
18:48 Kelson: Amgine, hejko_: it stores what you want
18:48 walkerma: Kelson: http://en.wikipedia.org/wiki/Template:Chembox
18:48 walkerma: That's an example I'm familiar with
18:48 Kelson: walkerma: ok, so it can; we mainly store HTML in the ZIM files... so if you want a box in the HTML.... that's your choice ;)
18:49 walkerma: Kelson: I mean that you could search for things with a particular melting point, etc
18:49 Amgine: Kelson: The infobox walkerma points to is a template used on many pages with some small substitutions. The templates can be expanded using MW API.
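
A hedged sketch of the template expansion Amgine mentions, not part of the original log. It assumes the `expandtemplates` action of the MediaWiki web API as it is documented today, and the English Wikipedia endpoint as an example target; `expand_request_url` and `expand_templates` are hypothetical helper names, and the User-Agent string is a placeholder.

```python
import json
import urllib.parse
import urllib.request

# Assumed target wiki; any MediaWiki api.php endpoint could be substituted.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def expand_request_url(wikitext: str, title: str = "API") -> str:
    """Build the request URL that asks the wiki to expand templates."""
    params = {
        "action": "expandtemplates",
        "text": wikitext,    # markup containing template calls
        "title": title,      # page context used while expanding
        "prop": "wikitext",  # ask for the expanded wikitext
        "format": "json",
    }
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


def expand_templates(wikitext: str, title: str = "API") -> str:
    """Return the expanded wikitext (performs a network request)."""
    request = urllib.request.Request(
        expand_request_url(wikitext, title),
        headers={"User-Agent": "offline-tf-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["expandtemplates"]["wikitext"]


# Building the URL needs no network access; call expand_templates() to fetch.
print(expand_request_url("{{Chembox}}"))
```

Expanding page by page over the API is slow for a whole wiki, which is one reason a dump with templates already expanded came up earlier in this meeting.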
18:49 Kelson: A ZIM file stores content... and a little information about the content... it's mainly intended for Web content (HTML, pictures...).
18:50 Amgine: walkerma: That kind of data probably cannot be automated at the database dump. It would need to be processed at the content-editing point.
18:50 walkerma: OK
18:50 hejko_: walkerma: i think this is something that Semantic MediaWiki will offer in the future. maybe dbpedia already has this information.
18:51 walkerma: OK, I think this has been a really useful discussion, but I wonder if I could ask the question, how should we get this to schools that need it?
18:51 Kelson: Amgine: the openZIM format is not a project to develop a new wiki parser or template engine, so you have to store whole HTML pages
18:51 Amgine: Kelson: So it only works with a cached html format?
18:52 Kelson: Amgine: it works with static content... so you have to generate them before... if that is what you ask.
18:52 Amgine: Yes.
18:53 hejko_: walkerma: I think the WMF should not actually bring content to schools but rather provide the best possible tools to organizations that do this kind of thing.
18:53 walkerma: Good point, hejko_
18:54 Kelson: walkerma: NGOs with a simple install process should be able to do that... at least it currently works well, I think.
18:54 Amgine: Yes, That's my opinion as well.
18:54 hejko_: e.g. offer offline readers, PDFs (for printing books) and maybe a tool to create individual dumps (w/ or w/o images, w/ abstracts or full content, or with certain categories only, ...)
18:54 walkerma: But we can also initiate contact with those organizations, and make it openly available
18:55 walkerma: A lot of people - including those in NGOs - may think that the only offline version of Wikipedia is one that came off their printer
18:55 walkerma: We need to develop a way to make these releases known
18:56 walkerma: Do we know which NGOs are active in this area, besides SOS Children's Villages and OLPC?
18:56 walkerma: And what about UNESCO?
18:56 Amgine: A couple. The vast majority of such technology transfer groups are religious in nature.
18:57 hejko_: wikieducator.org is part of the "Commonwealth of Learning" or similar.
18:57 Kelson: walkerma: there are many; I think we mostly don't know what they do: Kunnafoni, Linux4Afrika, etc..
18:57 Amgine: UNESCO: good choice.
18:57 walkerma: Religious is fine, as long as we're not seen as supporting a scheme for proselytisation (is that a word?)
18:57 walkerma: If they are a religious group with a secular goal, like Christian Aid
18:57 Amgine: s or z, I think.
18:58 Amgine: Wiktionary has a 'crat heavily involved in Swahili education efforts.
18:58 walkerma: Kelson: Yes, we need to build better links to groups like Linux4Afrika - how can we do that?
18:59 Amgine: Hire someone to work under the communications and do it.
18:59 Kelson: walkerma: write an email ;) Have met them in Berlin 5 months ago... but no feedback until now.
18:59 hejko_: it might be good if the WMF had a success story that got published in the related media.
19:00 Amgine: Okay, we're nearing the 1 hour mark: I think we've made progress on these two goals as far as understanding what we don't know.
19:00 walkerma: Is there a conference where these groups all hang out, where one of our group could do a presentation?
19:00 walkerma: Amgine: Yes, though we didn't get onto books, that may need to wait for another day
19:01 Amgine: Kelson: I'm asking devs about how to get a dump of html cache pages.
19:02 Amgine: I'm not getting the answers I expected, but I believe I could build a cache within a 7 day time period.
19:03 Kelson: Amgine: ok, if you have a specific question, it would be possible to speak about that later, after the meeting.... or at any time at #kiwix or #openzim
19:04 Amgine: Thank you. I have to run now people, but will be back online. I'm normally available about 12 hrs a day online.
19:05 hejko_: Shall we schedule a meeting for the same time next week?
19:06 hejko_: "Please have recommendations complete by January 12, 2010." :)
19:06 walkerma: hejko_ Sounds good!  Andy Rabigliati (Wizzy) would like to meet on Dec 9th or 10th, the 10th is better for me.  But we could meet before then as well
19:07 hejko_: As long as we summarize our meetings, more frequent meetings are a good idea.
19:07 Amgine: Yes, next week...
19:08 walkerma: I can make it at the same time next Tuesday
19:08 walkerma: Hopefully Andrew C can join us
19:08 hejko_: Ok, can someone post a log of this meeting to the wiki? I'd then try to summarize it.
19:10 walkerma: For the agenda for next week, should we talk about mobile phone releases?