Task force/Offline/IRC

This task force aims to have regular IRC meetings.

Channel: #wikimedia-strategy

Next meeting

About IRC

Local timezones can be checked here.

You can access the chat by going to https://webchat.freenode.net/?channels=#wikimedia-strategy and filling in a username. Another option is http://chat.wikizine.org. For more information about IRC clients, go to the Wikipedia entry on IRC or the Meta page on Wikimedia IRC.

Logs

Summaries

November 10th, 2009

Full log is here.

This initial meeting aimed to define what topics needed to be addressed in order for us to achieve our goals, bearing in mind these questions. We mainly focussed on the second of these, "What are the delivery mechanisms that are likely to drive an increase in offline readership? Who are the organizations and entities currently doing this work?" We concluded that we should focus on the following:

  • Schools: Typically these would mainly use content in an electronic format or in book form. Schools are a natural distribution point for knowledge. One problem: our content is aimed at adults, not children, hence the need for initiatives like Wikipedia for schools, vikidia, Wikikids, etc.
  • NGOs: These were mentioned in passing, but not discussed in depth.
  • Delivery mechanisms:
    • Mobile phones are already hugely popular in developing countries. The next generation of low cost mobile phones will allow these users to access Wikipedia, if we have a mobile-friendly version available. For online use there's Wapedia, but can we get a compact offline version that can fit inside people's phones, and which is formatted well for a tiny screen? So far there is WikiPock; this gives text only, with no infoboxes or pictures.
    • Books may be wanted in schools. These would best be printed in the area where the books are to be used, by local publishers - this would keep costs down. PediaPress would be willing to work with these publishers.
    • WikiReader may have a role - see thewikireader.com.

Summary by Walkerma 23:43, 28 November 2009 (UTC)

November 24th, 2009

Full log is here

This second meeting's agenda was:

  • We want to come up with specific, viable answers to the following questions:
    • What is the best way to reach schools in developing countries?
    • How should we produce, publish and distribute electronic releases? Book releases?

However, none of the most experienced people in this area (SJ, BozMo, Wizzy) were able to be present at the start of the meeting. Instead, the discussion began by considering some of the formatting problems associated with offline content, and how these need to be addressed before we can produce truly useful content for schools.

  • Amgine raised this point, "Most reusers prefer structured data, rather than the mostly flat revision data we produce" - so we should aim to include metadata such as categories in our releases. Can we make the content more "semantic"? Could we consider the Text_Encoding_Initiative format as a standard, or is it too complicated? Later, Kelson joined us and explained that OpenZIM (the main supported offline format for Wikimedia) will soon include categories, and metadata are also mentioned here and in the OpenZIM roadmap.
  • True XML would be easier for reusers to work with, and Hejko's mwlib framework could help with that.
  • It would be useful if for some articles we could just take the lede paragraph, and perhaps key data from an infobox.
  • Hejko asked us to consider specific use cases - a school with many computers, or one with OLPC, or a school with mainly books.
  • Amgine suggested some targeted uses - OER, reference material such as an 'intros-only' Wikipedia, a dictionary, electronic-reader versions of Wikisource texts.
  • We all agreed with Hejko's statement, "the WMF should not actually bring content to schools but rather provide the best possible tools to organizations that do these kind of things". We should work with NGOs and UNESCO, and we discussed various ways to reach these NGOs (email, a Wikimedia academy, press stories on successes, conferences).
  • Some current projects bringing WP to schools, mentioned at various points:

Summary by Walkerma 00:27, 29 November 2009 (UTC)

Post-meeting meeting

After the official meeting, BozMo and Wizzy were able to join some TF members on IRC, and there was extensive discussion on what is needed in practice to bring offline releases to schools. I hope to post a summary of this soon, but in the meantime, there is a nice summary of this discussion on Wizzy's blog.

Summary by Walkerma 00:27, 29 November 2009 (UTC)

December 1st, 2009

Full log is here.

Meeting kicked off with discussion on cellphone releases. That discussion is summarised at Task_force/Offline_Task_Force/Cellphone.

Regarding readers and dump formats, we discussed the Zim file format. Custom files need custom readers. Alternatives were discussed, like compressed HTML trees and zip files, but speed concerns were raised for large zip files.

Comparing a ZIM reader and an HTML reader, Kelson pointed out that current HTML rendering engines such as Fennec need at least 128 MB of RAM, making them unsuitable at the moment.

Another point was raised regarding the locked-down nature of cellphone software. Curiously, the third world appears to have a much freer marketplace than the first world, so this was not considered a problem.

Continuing last week's debate on metadata, the weaknesses of the category system were raised again. For example, Walkerma described the following:

I asked Kelson in 2007 to give us a list of all articles that were in Chemicals categories. He gave us 22,000 articles, of which only 6,000 or so turned out to be actual chemicals. One was a bar in England, for example. It was listed under "Establishments in England serving alcohol" and alcohol was ultimately listed under the ethanol category
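The effect Walkerma describes can be reproduced on a toy example. The sketch below is only an illustration: the category graph and article names are invented (loosely following the quote), and the depth limit is one possible mitigation, not something the task force agreed on. It shows how following subcategories without limit pulls in unrelated articles.

```python
from collections import deque

# Hypothetical miniature category graph: category -> direct subcategories.
SUBCATEGORIES = {
    "Chemicals": ["Alcohols"],
    "Alcohols": ["Ethanol"],
    "Ethanol": ["Alcoholic drinks"],
    "Alcoholic drinks": ["Establishments in England serving alcohol"],
}

# Hypothetical membership: category -> articles filed directly in it.
ARTICLES = {
    "Chemicals": ["Sodium chloride"],
    "Alcohols": ["Methanol"],
    "Ethanol": ["Ethanol fuel"],
    "Establishments in England serving alcohol": ["A bar in England"],
}


def articles_under(root, max_depth=None):
    """Collect articles in `root` and its subcategories, breadth-first.

    max_depth=None follows subcategories without limit;
    max_depth=0 returns only the articles filed directly in `root`.
    """
    seen = {root}
    queue = deque([(root, 0)])
    found = []
    while queue:
        category, depth = queue.popleft()
        found.extend(ARTICLES.get(category, []))
        if max_depth is not None and depth >= max_depth:
            continue  # do not descend any further from this category
        for child in SUBCATEGORIES.get(category, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return found
```

With no limit the bar ends up among the "chemicals"; with `max_depth=1` it does not, at the cost of also losing some genuine chemicals filed deeper down.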

Amgine said Decimal Classification had been contacted. The full collection is copyrighted, but has a unique licence where we might not have to pay for it. It was suggested that this be used in parallel with the Category system. Also discussed at User_talk:JakobVoss.

Summary by Wizzy 12:14, 2 December 2009 (UTC)

December 10, 2009

Amgine said we must not focus too narrowly on Wikipedia content, and should consider sister projects (like Wiktionary?). Amgine also talked with Hampton Catlin (http://m.wikipedia.org), who suggested an API that serves HTML-parsed content, akin to the current API which serves wikisyntax.

Amgine's other proposals were rolled into Recommendation 1.

Quick summary by Wizzy 09:22, 29 December 2009 (UTC)

December 15, 2009

Walkerma discussed WikiPock - complete en, pt, es WP, no pictures or infoboxes yet. They are interested in customised collections - such as smaller collections with pictures, aimed at kids, or bigger collections where the top 30,000 articles might include some pictures. The problem with infoboxes and tables is the parser - see Recommendation 1.

Quick summary by Wizzy 09:34, 29 December 2009 (UTC)

December 22, 2009

Full log here.

Some of those present

wizzy asked what their starting point for collections is - HTML, or database dumps? hejko replied that they use the MW-API: they take wiki text and use a Python library to parse it into a document tree (see http://en.wikipedia.org/w/api.php and http://code.pediapress.com/).

Patrice said they had released a Symbian version for Nokia phones. He uses a proprietary format (soon to be open source). The V2 data format is in beta testing now; it renders tables and holds 3.1 million articles in under 4GB, with no pictures. Search only uses the title index. wizzy suggested title and first paragraph search.
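wizzy's suggestion of indexing the first paragraph as well as the title could look roughly like the sketch below. This is not how Patrice's reader works; the function names and sample articles are made up, and a real index for millions of articles would need a far more compact on-disk structure.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(articles):
    """Map each word to the set of titles whose title or first paragraph contains it.

    `articles` is a dict of title -> first paragraph (plain text).
    """
    index = defaultdict(set)
    for title, first_paragraph in articles.items():
        for token in tokenize(title) + tokenize(first_paragraph):
            index[token].add(title)
    return index


def search(index, query):
    """Return the sorted titles that contain every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return sorted(results)


# Hypothetical sample data.
SAMPLE = {
    "Ethanol": "Ethanol is a simple alcohol used as a fuel and solvent.",
    "Water": "Water is a transparent liquid and a common solvent.",
}
```

A search for "fuel" finds the Ethanol article here, which a title-only index would miss.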

pm27 showed http://www.okawix.com/?page=torrent&lang=en - downloads a zeno file, they are upgrading to openzim. No pictures, though the software is capable of pictures. http://blog.wikiwix.com/en/2009/12/07/okawix-et-openzim/ describes the zeno / openzim switch.

Patrice pointed out that the next generation of microSD will provide 64GB and up to 1TB capacity.

pm27 updates the working snapshot every two months, Patrice every 3/4 months.

The 30k articles are selected algorithmically - Patrice uses the list at http://toolserver.org/~cbm/release-data/2008-9-13/HTML/ and linterweb have their own assessor. Kelson said he would write it up.

walkerma said a new bot is being written, V2. hejko asked if it could emit book outlines that are compatible with the book tool's stored books format.

Other mediawiki projects were mentioned - WikiQuote and Wiktionary - no size problems here.

walkerma raised the issue of article versions. Currently each version is painfully checked by hand for vandalism. hejko suggested taking the full dump and creating a list of frequent editors, then selecting for each article the last version that was edited by a frequent editor. The WikiTrust project was mentioned favourably - http://wikitrust.soe.ucsc.edu/
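hejko's idea can be sketched in a few lines. The record layout (`title`, `rev_id`, `editor`, `timestamp`) and the `min_edits` threshold are assumptions for illustration only; a real dump would have to be parsed into this shape first, and what counts as a "frequent editor" was not defined in the meeting.

```python
from collections import Counter


def frequent_editors(revisions, min_edits):
    """Return the editors with at least `min_edits` revisions in the dump."""
    counts = Counter(rev["editor"] for rev in revisions)
    return {editor for editor, n in counts.items() if n >= min_edits}


def pick_versions(revisions, min_edits=3):
    """For each article, pick the latest revision made by a frequent editor.

    Returns a dict of title -> rev_id. Articles never touched by a
    frequent editor are left out.
    """
    trusted = frequent_editors(revisions, min_edits)
    chosen = {}
    # Walk the revisions oldest first so later ones overwrite earlier ones.
    for rev in sorted(revisions, key=lambda r: r["timestamp"]):
        if rev["editor"] in trusted:
            chosen[rev["title"]] = rev["rev_id"]
    return chosen


# Hypothetical revision records (ISO timestamps sort correctly as strings).
SAMPLE_REVISIONS = [
    {"title": "Ethanol", "rev_id": 1, "editor": "alice", "timestamp": "2009-12-01T10:00:00Z"},
    {"title": "Ethanol", "rev_id": 2, "editor": "alice", "timestamp": "2009-12-02T10:00:00Z"},
    {"title": "Ethanol", "rev_id": 3, "editor": "vandal", "timestamp": "2009-12-03T10:00:00Z"},
    {"title": "Water", "rev_id": 4, "editor": "alice", "timestamp": "2009-12-04T10:00:00Z"},
    {"title": "Water", "rev_id": 5, "editor": "bob", "timestamp": "2009-12-05T10:00:00Z"},
]
```

Note that this skips good edits by newcomers as well as vandalism, so the chosen version may be somewhat out of date.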

WikiTrust assigns a score to each author, secretly, that shows how often that author has been reverted. An author with a lot of unreverted edits builds up a high score of trust. Someone who is a vandal will normally be very obvious with such scoring, as they will have very low trust. The actual text of each article is marked up with its own trust rating based on who contributed the text. We think we could come up with an overall score for each version, based on adding up those trust scores for all the text - then find the most "trusted version". More discussion in the logs at 21:22.
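One way to turn per-text trust ratings into a per-version score is sketched below. This is not WikiTrust's own algorithm or API: the `(text, trust)` span format is hypothetical, and averaging by text length (rather than plain adding up) is an assumption made here so that a longer version does not win simply by being longer.

```python
def version_trust(spans):
    """Length-weighted mean trust of one version.

    `spans` is a list of (text, trust) pairs covering the version's text.
    """
    total_chars = sum(len(text) for text, _ in spans)
    if total_chars == 0:
        return 0.0
    return sum(len(text) * trust for text, trust in spans) / total_chars


def most_trusted_version(versions):
    """Return the rev_id with the highest score; ties go to the newest rev_id.

    `versions` is a dict of rev_id -> spans.
    """
    return max(versions, key=lambda rev_id: (version_trust(versions[rev_id]), rev_id))


# Hypothetical data: revision 102 adds a low-trust sentence to revision 101.
SAMPLE_VERSIONS = {
    101: [("Ethanol is an alcohol. ", 9.0)],
    102: [("Ethanol is an alcohol. ", 9.0), ("It is AWESOME!!! ", 0.5)],
}
```

Here revision 101 is selected, because the low-trust addition drags down the score of revision 102.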

Some discussion was had on the problem of porn in schools, or a version of wikipedia 'cleaned' of the more graphic images.

Summary by Wizzy 17:06, 23 December 2009 (UTC)

29 December 2009

Read through the IRC log, thanks.

Regarding the Sierra Leone request, I still come back to an HTML dump served up by a local apache server, accessible from the LAN. That is the way I do it, and I want that before openzim readers. Wizzy 15:18, 30 December 2009 (UTC)

5 January 2010

We started with Recommendation 1 - proposed as a survey for developers and project managers. User:Wizzy questioned how such XML dumps would be converted into HTML dumps for use in offline wikipedias. User:Amgine said such dumps could be converted, but inclusion of media had not yet been considered. The discussion then turned to User:Kelson's creation method, which currently does not use the dumps, but instead mirrors directly against a live Wikimedia server. Amgine said the WMF would prefer that third parties not use live sources, in order to reduce server load. Kelson said that if he had time, he would code an XML parser. Since Kelson always works with a narrow article selection, he also needs to narrow the included templates, something MWDumper cannot do.

Kelson currently uses the MW-API to pull required templates and media, like pictures. He also uses the raw SQL dumps to assist with narrowing the article selection.
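Narrowing the templates to those a selection actually needs could be approached as in the rough sketch below. It is not Kelson's method: the regular expression is a simplification that ignores parser functions, namespace prefixes and redirects (and would mistake magic words such as {{PAGENAME}} for templates), and the sample wikitext is invented.

```python
import re

# Matches "{{Name" followed by "|" or "}}"; skips "{{{parameter}}}" and "{{#if:...}}".
TEMPLATE_NAME = re.compile(r"(?<!\{)\{\{(?!\{)\s*([^|{}#:\n]+?)\s*(?=\||\}\})")


def templates_used(wikitext):
    """Return the names of templates transcluded directly in `wikitext`."""
    names = set()
    for match in TEMPLATE_NAME.finditer(wikitext):
        name = match.group(1).replace("_", " ").strip()
        if name:
            # The first letter of a template name is case-insensitive.
            names.add(name[0].upper() + name[1:])
    return names


def required_templates(article_texts, template_texts):
    """Return every template needed by the articles, directly or indirectly.

    `template_texts` is a dict of template name -> template wikitext.
    """
    needed = set()
    pending = []
    for text in article_texts:
        pending.extend(templates_used(text))
    while pending:
        name = pending.pop()
        if name in needed:
            continue
        needed.add(name)
        # Templates may transclude other templates, so follow those too.
        pending.extend(templates_used(template_texts.get(name, "")))
    return needed


# Hypothetical sample data.
SAMPLE_ARTICLES = ["{{Infobox chemical|name=Water}} Water is {{chem|H|2|O}}."]
SAMPLE_TEMPLATES = {
    "Infobox chemical": "{{Infobox|title={{{name}}}}}",
    "Infobox": "",
    "Chem": "{{{1}}}<sub>{{{2}}}</sub>{{{3}}}",
    "Unused": "{{Other}}",
}
```

For the sample selection only three of the four stored templates are needed, so "Unused" could be left out of the release.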

7 January 2010

Starting out with the cellphone as a platform, two major use cases were identified - 1) live viewing on a cell phone (like wapedia); 2) an offline repository with a viewer (like okawix). hejko pointed out that offline repositories might not work with ultra low cost handsets. Cellphone hardware specifications were raised again, but it was agreed (?) that we should target phones a notch up from basic - that have an SD Card slot and some kind of browser.

walkerma thought cellphone users are likely to be different from the school users, but convergence was discussed, as cellphones are a lot cheaper and more ubiquitous than netbooks or the like. Hejko thought that simple and cheap handsets will be the predominant ones used in developing countries, but SJ (and wizzy) think that smartphones will be there sooner than we think.

Who to engage to get content onto cellphones? In the first world, it is the networks. In the third world, many (most?) phones are unlocked - or are cheap Chinese imports. (wizzy thinks you should make content available for download cheaply - which requires talking to the networks - and provide readers for all platforms, including a basic common denominator of an HTML dump readable by the onboard phone browser.)

walkerma summarised as follows :-

  1. WP on cellphones is critical for reaching individual adult (and teen) users in developing countries, because their reach is much greater than that of the internet
  2. There are three strategies for supplying WM content to cellphone users: (a) Online access; (b) Offline access using specifically designed software
  3. We need to work with manufacturers and providers of cellphones to make sure WP will work well in upcoming products.
  4. We need to design our releases around likely memory capacities - SD card (2 Gig) or onboard (a text-only, xml section 0, dump might be only a few Meg)
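To get a feel for point 4, the "section 0" of an article can be approximated as everything before the first heading line. The sketch below assumes that definition and works on raw wikitext; it does not strip templates or markup, and the sample articles are invented, so the byte count it gives is only a rough guide to what a text-only lead dump would occupy.

```python
import re

# A wikitext heading line such as "== History ==" or "===Uses===".
HEADING = re.compile(r"^==+[^=].*==+\s*$", re.MULTILINE)


def lead_section(wikitext):
    """Return the text before the first heading (section 0), stripped."""
    match = HEADING.search(wikitext)
    lead = wikitext[:match.start()] if match else wikitext
    return lead.strip()


def lead_dump_size(articles):
    """Total UTF-8 size in bytes of the lead sections of all articles.

    `articles` is a dict of title -> full wikitext.
    """
    return sum(len(lead_section(text).encode("utf-8")) for text in articles.values())


# Hypothetical sample data.
SAMPLE_WIKITEXT = {
    "Ethanol": "Ethanol is an alcohol.\n\n== History ==\nOld.\n",
    "Water": "Water.",
}
```

Running `lead_dump_size` over a real article selection would show whether it fits in onboard memory or needs the SD card.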

Discussion moved to books - an SJ recommendation, being cheaper if locally printed, and not requiring electricity. Perhaps more suitable for projects like Wikieducator or Wikibooks. SJ thinks ultra-cheap books remain the best short-term way; we need much much better printing/publishing contacts.

(Presumably our requirement for an HTML dump would satisfy the printers for formatting ?)

A specific use case was identified - a reference book on the chemical elements. It might not be unique, but it could be bought LEGALLY for $3 in India instead of perhaps $5-10. We could also support them in creating custom selections (like chemistry, or mathematics).

Possible discussion topics

Please add any of your own below.

Content assessment and selection

For example - if space is limited, which Wikipedia articles should be included in a WP collection and which shouldn't? How do we rank articles? Importance, or quality, or some combination of the two? The new English 1.0 bot is now being tested, and it is looking really amazing. These are questions very dear to my heart, so I realise I'm very biased, but we should probably discuss these points at one of our meetings. Walkerma 04:37, 10 December 2009 (UTC)

Book collections

Putting together a book is a little different than "traditional" electronic content. The Books WikiProject and the new Book-Class tag for the English 1.0 bot are discussed in a recent Signpost article by Headbomb. I suggest we get a meeting with both Hejko and Headbomb to discuss how best to get books assembled. Walkerma 04:37, 10 December 2009 (UTC)

Other languages

We need to make sure that we find out what everyone is doing - not just en:WP and fr:WP. For example, the Tamil Wikipedia is looking at using the assessment scheme from en - are they putting together offline releases? We need to find such things out. Walkerma 04:37, 10 December 2009 (UTC)

What content?

Both Wizzy and SJ have mentioned that teachers consider Wikipedia to be the "killer app" in collections distributed to schools. But what about Wiktionary, Wikiquote, Wikiversity, Wikibooks, and other things? What about non-WMF projects like Wikihow, WikiTravel or WikiEducator? We need to make sure we take a broad look at things other than Wikipedia. Walkerma 04:37, 10 December 2009 (UTC)

Is offline about school applications? For example, the item which appears to me most fundamental is Wiktionary. Wiktionary's data can be used in almost 100% of all software - from translation to spell checkers to instant-look-up dictionaries - but it is nearly unused due to data dumps being inaccessible. - Amgine 17:12, 10 December 2009 (UTC)