Task force/Recommendations/Offline

Outline for Recommendation #1 - content reuse from WMF projects

Goal

Make reuse of content from WMF projects simpler, in order to support/grow the infrastructure for offline projects

Strategy

  1. Provide parsed, semantically annotated XML of article text (see the sketch after this list).
  2. Create and maintain a per-project article XML DTD standard with documentation, plus a stylesheet.
  3. Write and publish a MediaWiki parser specification.
  4. Write a reference parser implementation, plus a companion writer that outputs HTML with media, to ease conversion to other formats.
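
As a rough illustration of point 1, the sketch below (Python, standard library only) builds the kind of semantically annotated article XML that could be emitted. All element and attribute names here are hypothetical; in practice they would be fixed by the per-project DTD of point 2.

  # Hypothetical annotated-article XML, built with the standard library.
  # The element vocabulary is invented for illustration only.
  import xml.etree.ElementTree as ET

  article = ET.Element("article", {"title": "Oxygen", "lang": "en"})
  infobox = ET.SubElement(article, "infobox", {"type": "element"})
  ET.SubElement(infobox, "field", {"name": "symbol"}).text = "O"
  ET.SubElement(infobox, "field", {"name": "atomic-number"}).text = "8"
  section = ET.SubElement(article, "section", {"heading": "Characteristics"})
  ET.SubElement(section, "p").text = "Oxygen is a chemical element..."

  print(ET.tostring(article, encoding="unicode"))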

Assertion: Low accessibility of WMF content hinders widespread reuse

Sub assertion: In many countries, access to the Internet is limited

In many countries the Internet is unavailable or very slow - see Regional_bandwidth for details. There is clearly demand for our content - see this article as a recent example.

Sub assertion: Working with the current dumps[1] is hard

The current dumps provide no XML structure below the page level (see the sketch after this list)

  • XML dumps include raw wikitext, with no further parsing.
    • The text does not include expanded templates.
    • There is no MediaWiki parser specification.
      • Thus, there are no third-party parsers/tools.[2]
  • XML dumps could be parsed to include project-specific markup.
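
For illustration, here is a minimal Python sketch of reading such a dump; note that all a re-user can get at is the raw wikitext blob. The export namespace version (0.4 here) varies between dumps, so check the dump's header.

  # Stream a pages-articles dump and show that page text is raw wikitext.
  import xml.etree.ElementTree as ET

  NS = "{http://www.mediawiki.org/xml/export-0.4/}"  # version varies by dump

  for _, elem in ET.iterparse("pages-articles.xml"):
      if elem.tag == NS + "page":
          title = elem.findtext(NS + "title")
          text = elem.findtext(NS + "revision/" + NS + "text") or ""
          # Raw wikitext: templates like {{Infobox ...}} are NOT expanded.
          print(title, "->", text[:80].replace("\n", " "))
          elem.clear()  # dumps are many gigabytes; free memory as we go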

The static HTML tree dumps[3] are no substitute for a primary source

  • Relevant data is lost
  • There is no semantic encoding of the data
  • They should be generated by a parser for those who need them - see below

Sub assertion: Structured data is valuable to content readers AND data re-users

Fact: Metadata standards, such as semantic ontologies, are well supported on the Internet

Fact: The WMF content elements that are more structured/standardized are more heavily targeted by content re-users

  • The number-one content element mentioned in interviews is the infobox template (see the sketch after this list).
    • Corollary: one of the most common MediaWiki support requests on IRC is importing/using/creating en.WP infoboxes. (Thus infoboxes also represent a cost to WMF in technical support.)
  • Sister projects are sometimes differentiated only by their structured article forms.
    • Commons media pages contain prescribed content and do not contain proscribed elements.
    • Wiktionary articles have an extensive set of allowed sections in prescribed layouts, categorized in tight, specific structures.
    • Wikinews articles have standard forms, simple categorization, and are designed so the first paragraph serves as the article summary[4].
  • Specifications/guidelines for articles are helpful for editors and data re-users, if they are enforced.
    • Even extremely broad guides, if actually implemented, will dramatically reduce malformed data.
    • Layout and structure guidelines encourage automated formatting and improvement.
    • Written structures are a first step toward cross-language standardization.
    • Offline releases require indexes, which require structured data.
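
As a rough illustration of why structured elements attract re-users, the following Python sketch pulls fields out of an infobox with a naive line scanner. Real template syntax nests, so a production tool would need a full parser.

  import re

  wikitext = """{{Infobox person
  | name       = Ada Lovelace
  | birth_date = 10 December 1815
  | occupation = Mathematician
  }}"""

  # Grab "| key = value" lines; nested templates would defeat this.
  fields = {}
  for m in re.finditer(r"^\s*\|\s*(\w+)\s*=\s*(.*)$", wikitext, re.MULTILINE):
      fields[m.group(1)] = m.group(2).strip()

  print(fields)  # {'name': 'Ada Lovelace', 'birth_date': ..., ...}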

Sub assertion: Matching our content to standard library systems would help

Systems like the Dewey Decimal Classification allow libraries to classify books and other materials. A system such as the Universal Decimal Classification (UDC) might prove an appropriate way to organize our content. This would facilitate things like index generation for offline use and matching with non-Wikipedia resources. This is not essential, but should be explored as an option.

Fact: UDC and related systems are widely used for classifying knowledge, and are suitable for computer-based resources

See http://www.udcc.org/about.htm.

Outline for Recommendation #2 - use of cellphones

The third world has remarkably high cellphone penetration compared to Internet access. How can we leverage this?

Goal

Give the 3 billion people with no Internet connection access to Wikimedia content via cellphones.

Strategy

  1. Convince network providers and/or manufacturers to pre-install WP content on new cellphones.
  2. Support third-party developers/providers of open offline storage standards (such as OpenZIM), readers that use them (such as Linterweb), and proprietary offline solutions (such as WikiPock).
  3. Encourage development of non-Internet distribution systems, e.g. SMS article requests (see the sketch after this list).
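
As a purely speculative sketch of an SMS article request service (strategy point 3), the Python below shows only the message-splitting step; the article store and any carrier gateway are hypothetical.

  SEGMENT = 160  # characters per SMS segment

  # Hypothetical local store of pre-rendered plain-text articles.
  ARTICLES = {"Malaria": "Malaria is a mosquito-borne infectious disease ..."}

  def handle_request(title):
      text = ARTICLES.get(title, "Article not found.")
      # Reserve 8 characters per segment for a "(n/m) " prefix.
      step = SEGMENT - 8
      parts = [text[i:i + step] for i in range(0, len(text), step)]
      return ["(%d/%d) %s" % (n, len(parts), p) for n, p in enumerate(parts, 1)]

  for sms in handle_request("Malaria"):
      print(sms)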

Main assertion

Cellphones provide the most cost-efficient and far-reaching delivery mechanism. For example, even if it took an investment of 30 million USD to implement this goal, that would still be no more than one cent per person.

Further deliberations in support of the assertions below can be found at Task force/Offline/Cellphone and Task force/Offline/IRC.

Fact: Cellphone hardware is the most ubiquitous hardware platform in limited-internet markets

According to the Information Society, cellphones are owned by 33% of people in Africa, a number that is rising rapidly; by contrast, only 4% use the Internet, and only 1% have broadband access. In the Asia/Pacific region, the comparable numbers are 37%/15%/3%. Also see this article on market penetration of cellphones in the developing world.

Fact: There are already suppliers of Wikipedia offline for cellphones

See http://www.wikipock.com/ . The WikiPock collection is 7 GB, which includes all of the English WP as well as ES and PT, with no pictures and limited table rendering. A new version of the software, due out next month, will be open source, reduce storage space, and speed up searching; it works on a variety of platforms and gives full table rendering.

Patrick Collison's mobile phone version is also available for download, though only for the iPhone or iPod (which WikiPock does not serve).

Wapedia has proved to be a very popular site for online browsing of Wikipedia, as evidenced by its Alexa traffic rank (2,714 in the world, as of January 11, 2010). This suggests that people are happy to use Wikipedia on their cellphones.

See also WikiReader - http://thewikireader.com - a dedicated Wikipedia palmtop device.

Fact: Software requires a platform-specific reader, or might use the cellphone's built-in browser

Currently, the need for search and for compressed data formats suggests a dedicated reader (see the sketch below).
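
To illustrate the reader's two core jobs, here is a minimal Python sketch that stores articles compressed and looks titles up via binary search; formats such as OpenZIM solve the same problem properly.

  import bisect
  import zlib

  def build(articles):
      """articles: {title: html}. Returns (sorted titles, offsets, blob)."""
      titles, offsets, chunks, pos = sorted(articles), [], [], 0
      for t in titles:
          data = zlib.compress(articles[t].encode("utf-8"))
          offsets.append((pos, len(data)))
          chunks.append(data)
          pos += len(data)
      return titles, offsets, b"".join(chunks)

  def lookup(title, titles, offsets, blob):
      i = bisect.bisect_left(titles, title)  # binary search on sorted titles
      if i == len(titles) or titles[i] != title:
          return None
      start, size = offsets[i]
      return zlib.decompress(blob[start:start + size]).decode("utf-8")

  titles, offsets, blob = build({"Water": "<p>Water is ...</p>"})
  print(lookup("Water", titles, offsets, blob))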

Outline for Recommendation #3 - schools

Question

How can we promote the educational use of Wikipedia offline, especially in remote communities?

Strategy

  1. Work through schools. Provide the necessary data and infrastructure to organizations that aim to bring WMF content to schools.
  2. Schools need to consider national curricula and scholarly topics. We need to support targeted selection for them, including kids' topics.
  3. Small, specialist article collections, like chemistry or mathematics, would provide open content for book publishers.
  4. Content should include Wikipedia, Wiktionary, Wikibooks and possibly Wikinews.
  5. Distribution would be via USB stick or download, and as books.

Assertion: Schools are a natural distribution point for knowledge

Typically, schools would use the content in an electronic format or in book form. It could be targeted per seat, or served from a central point at the school.

Sub assertion: Schools share a common purpose with us

Schools exist to provide education, so if we want to promote education, schools provide the obvious distribution point for our content.

Sub assertion: Schools already use similar materials

Fact: Encyclopedias are established in the classroom

  • For example, World Book, the world's best-selling paper encyclopedia, gears a large number of its products towards schools.
  • Likewise, dictionaries and textbooks are widely used in the classroom, suggesting that Wiktionary and Wikibooks could find a natural market there.

Fact: Teachers can organize use of materials

The teacher is trained to disseminate educational information, use educational resources (including electronic resources) and engage students in learning.

Sub assertion: Content should include Wikipedia, Wiktionary, Wikibooks and possibly Wikinews

Fact: These correspond with what schools already use

See above for the first three. Some schools use newspaper archives in local libraries for researching world events, etc. Wikinews could provide some of this, at least for recent events.

Fact: Wikipedia, Wikibooks and electronic dictionaries have already proved popular in schools

The Wikipedia for Schools releases have proved popular, despite their small size (5,500 articles maximum), and their online traffic is almost as high as Citizendium's. See Task_force/Offline/SJ_Q&A for a description of how One Laptop Per Child has used Wikimedia content successfully in Latin America. User:Wizzy (on this task force) has extensive experience of distributing Wikipedia offline to schools in South Africa. In both cases, the Wikimedia content has been very well received.

Sub assertion: Content should be provided in an electronic format

For convenient re-use, the content should be made available in an electronic format that can be read on standard computers. This could be a browser-based reader, or free custom software. The content could be supplied on DVD or a USB memory stick. Computers are already established in many schools in developing countries (see [5]).

Fact: This is how Wikimedia content has been used successfully in the past

Fact: This takes full advantage of the wiki format (internal links, low cost per page, etc)

Sub assertion: Content should also be provided in book format

Schools traditionally use books as their primary educational resource. Even if electronic resources supersede books in many cases, books are likely to remain viable for some time to come. This is especially true in places where computers and electronic devices are less common, or where electric power supply is intermittent (see this example). Books can often be produced inside the target country at low cost.

Assertion: We should improve and encourage content aimed at kids

Fact: Our content is aimed at adults, not children

School content initiatives already exist, such as Wikipedia for Schools, Vikidia, WikiKids, etc.

Outline for Recommendation #4 - article selection

Question

How can we produce a variety of different page/article selections, to meet the varying needs of offline content users?

Strategy

A range of tools should be available to allow selections to be made for offline releases:

  1. The community is needed to flag articles for Importance and Quality.
  2. Hit statistics and other metadata are required to gauge article popularity and significance.
  3. Bots are needed to collect this information into tables.
  4. Article history, in conjunction with Flagged Revisions and WikiTrust, is needed to pick unvandalised versions.
  5. An online selection tool usable by publishers and other users is needed, using the information above plus categories and other metadata to emit custom dumps, probably as XML (see the sketch after this list).
  6. Such dumps will include templates and, optionally, media to be included in the final output, such as HTML or OpenZIM.
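
A hedged Python sketch of what the selection tool of point 5 might do: combine the community's Quality and Importance flags (points 1-3) with hit counts into a single score and keep the top N. The scales below follow the WP:1.0 assessment scheme, but the weights are invented for illustration.

  QUALITY = {"FA": 6, "GA": 4, "B": 3, "C": 2, "Start": 1, "Stub": 0}
  IMPORTANCE = {"Top": 4, "High": 3, "Mid": 2, "Low": 1}

  def score(article):
      q = QUALITY.get(article["quality"], 0)
      i = IMPORTANCE.get(article["importance"], 0)
      return 2 * q + 3 * i + article["monthly_hits"] / 10000.0  # weights arbitrary

  def select(articles, n):
      return sorted(articles, key=score, reverse=True)[:n]

  pool = [
      {"title": "Oxygen", "quality": "FA", "importance": "Top", "monthly_hits": 90000},
      {"title": "Pet rock", "quality": "Start", "importance": "Low", "monthly_hits": 4000},
  ]
  print([a["title"] for a in select(pool, 1)])  # ['Oxygen']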

Assertion: We need automatic tools to perform article selection

Fact: We cannot always use the whole project due to size constraints

Wiktionary and other projects are small enough to 'swallow whole' (at least in electronic releases) but the Wikipedias need a lot of trimming, especially if images are included. Book releases present an even greater challenge.

Fact: We can use quality, importance and popularity measures via the WP:1.0 project to aid selection

See en:Version_1.0_Editorial_Team and en:Version_1.0_Editorial_Team/Assessment and the main index at en:Version_1.0_Editorial_Team/Index. Also see fr:Projet:Wikipédia_1.0/Index and hu:Wikipédia:Cikkértékelési_műhely/Index.

On these projects, bots collect and tabulate metadata. WikiProject teams still need to tag articles on a Quality scale, and preferably on an Importance scale also. However, because assessment work is decentralized, performed by subject experts, and supported by bots, this approach has proved scalable; WikiProjects have now manually assessed more than 2 million articles on en, 370,000 articles on fr, and over 50,000 on hu.

Once the selection has been made, another tool needs to use category information and other metadata to generate indexes (see the sketch below).
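
A minimal Python sketch of that index step: group the selected titles by category and emit a sorted table of contents. A real tool would handle multiple categories per article and richer metadata.

  from collections import defaultdict

  selected = [("Oxygen", "Chemistry"), ("Hydrogen", "Chemistry"),
              ("Mitosis", "Biology")]  # (title, category) pairs

  index = defaultdict(list)
  for title, category in selected:
      index[category].append(title)

  for category in sorted(index):
      print(category)
      for title in sorted(index[category]):
          print("  " + title)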

Assertion: We need automatic tools to perform 'best-version' selection

Sub assertion: Vandalised versions of articles are a problem

Fact: Vandalism is a problem on Wikipedia

In the original selection for en:WP Version 0.7, we selected only versions from account holders, yet we still found approximately 200 examples of significant vandalism in our offline collection. For example, the article on a popular black comedian had the following opening statement: "Suc* my a** kids im a ni**er and my real name is charles reed pulk" (asterisks added).

Fact: A vandalized version of an article in an offline release cannot be corrected

As a result, we could potentially send a page full of obscenities to 100,000 schools.

Fact: Manual article checking is difficult and slow

For the "Wikipedia for schools" collection, a whitelist greatly helped in version selection, but some examples of "deep vandalism" still crept in. Extensive checking and rewriting by volunteers was used, since this collection was specifically aimed at children. A similar whitelist approach was used with the German Wikipedia 1.0 releases. With the en:Wikipedia 1.0 releases, where a whitelist was not used, a search was performed for "bad words" and vandalized versions were then identified and corrected manually. This approach is extremely tedious and labour-intensive, and delayed the release by six months.

Fact: There are projects like WikiTrust and Flagged Revisions that should be able to automate version selection in the future

See http://wikitrust.soe.ucsc.edu/

User:Walkerma has had detailed discussions with Luca de Alfaro from the WikiTrust project, and automated version selection is very likely to be possible. This approach would allow every article version to be given a trust value, based on the sum of the contributions in it, which would allow us to avoid picking vandalized versions. On January 10th, 2010, this prospect was reconfirmed as very likely by Prof. de Alfaro, and an article version dump was completed on that date.
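
Assuming per-revision trust values of the kind WikiTrust computes, version selection could then be as simple as the following Python sketch; the data layout and the threshold are assumptions.

  def pick_version(revisions, threshold=0.8):
      """revisions: list of (rev_id, timestamp, trust), trust in [0, 1]."""
      trusted = [r for r in revisions if r[2] >= threshold]
      if not trusted:
          return None  # no trusted version; fall back to manual review
      return max(trusted, key=lambda r: r[1])  # newest trusted revision

  revs = [(101, "2009-12-01", 0.95), (102, "2010-01-05", 0.40),
          (103, "2010-01-09", 0.91)]
  print(pick_version(revs))  # (103, '2010-01-09', 0.91)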

An alternative approach would be to use Flagged Revisions. In the German Wikipedia, and others where the Flagged Revisions extension is well established, this may prove to be a popular method for version selection.

Fact: Rapid version selection is needed in order to provide regular selection updates

See below. Clearly, we cannot do monthly updates if each version selection requires six months of manual checking.

Assertion: Releases of offline selections will need to be carefully structured

This assumes that we begin to produce a variety of article/page selections from Wikipedia.

Sub assertion: We should use a clear system of identifiers for each release and update

For example, if we make a selection of "Top 1000 biographies" for 2011, we could use simple, clear release numbers, such as "Top 1000 Biographies, Version 2011.01" for January 2011, with perhaps monthly updates reading "2011.02" for February, etc. The details can be discussed, but a standard scheme should be used - preferably the same one across all output formats. One number should indicate the selection of articles (in this case, perhaps updated once a year); the other number should indicate the selection of article versions (presumably updated once a month). Note that automated selection of both articles and article versions is a prerequisite for this type of regular, structured series of releases.
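
For illustration, such identifiers are trivial to generate mechanically; a tiny Python sketch:

  def release_id(name, selection_year, version_month):
      # One component for the article selection (yearly), one for the
      # article version selection (monthly).
      return "%s, Version %d.%02d" % (name, selection_year, version_month)

  print(release_id("Top 1000 Biographies", 2011, 1))  # ... Version 2011.01
  print(release_id("Top 1000 Biographies", 2011, 2))  # ... Version 2011.02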

Sub assertion: We should offer frequent version updates

If WikiTrust or Flagged Revisions allows us to produce lists of low-vandalism versions rapidly, we should use this to update the article version selection often.

Fact: Even for offline use, end-users still want to have a current selection

See this example.

Fact: Content can become stale

Elections, wars, terrorist attacks and "acts of God" all happen with great regularity, and version updates should help keep the content "fresh".