Task force/Recommendations/Offline 1

From Strategic Planning

Outline for Recommendation #1 - content reuse from WMF projects

Goal

Make reuse of content from WMF projects simpler, in order to support and grow the infrastructure for offline projects.

Strategy

  1. Provide parsed, semantically annotated XML of article text.
  2. Create and maintain a per-project article XML DTD standard, with documentation and a stylesheet.
  3. Write and publish a MediaWiki parser specification.
  4. Write a reference parser implementation, plus a companion writer that outputs HTML with media, to assist conversion to other formats.
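
To make item 1 concrete, the following is a minimal sketch of what a re-user could do with parsed, semantically annotated article XML. The element and attribute names (`article`, `infobox`, `field`, `section`, `paragraph`, `link`) and the sample content are assumptions made up for illustration; no such format is defined by this recommendation or by WMF.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated article. Every element and attribute name here is
# an illustrative assumption, not an existing WMF format.
ANNOTATED_ARTICLE = """\
<article title="Example City" project="wikipedia" lang="en">
  <infobox type="settlement">
    <field name="country">Exampleland</field>
    <field name="population">12345</field>
  </infobox>
  <section level="1" title="Introduction">
    <paragraph>Example City is the capital of <link target="Exampleland">Exampleland</link>.</paragraph>
  </section>
</article>
"""


def summarize(xml_text):
    """Pull structured fields, link targets and plain text out of one article."""
    root = ET.fromstring(xml_text)
    fields = {field.get("name"): field.text for field in root.iter("field")}
    links = [link.get("target") for link in root.iter("link")]
    # itertext() flattens inline markup such as <link> into plain text.
    paragraphs = ["".join(p.itertext()) for p in root.iter("paragraph")]
    return {
        "title": root.get("title"),
        "fields": fields,
        "links": links,
        "paragraphs": paragraphs,
    }


if __name__ == "__main__":
    print(summarize(ANNOTATED_ARTICLE))
```

With markup of this kind, a standard XML library is enough to extract infobox values, link targets and readable text; no wikisyntax parser is needed on the re-user's side.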

Assertion: Low accessibility of WMF content hinders widespread reuse

Sub assertion: In many countries, access to the Internet is limited

In many countries the Internet may be unavailable or very slow - see Regional_bandwidth for details. There is clearly a demand for our content - see this article as a recent example.

Sub assertion: Working with the current dumps[1] is hard

The current dumps are not XML at the page level

  • XML dumps include raw wikisyntax text, with no further parsing.
    • Text does not include expanded templates.
    • There is no MediaWiki parser specification.
      • Thus, there are no third-party parsers/tools.[2]
  • XML dumps could be parsed to include project-specific markup.
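
The points above can be illustrated with a small sketch that reads a page from a dump-like document. The sample is a heavily simplified stand-in for a real dump (only `page`, `title`, `revision` and `text` are shown, and the content is invented); the helper names are hypothetical. The code ignores the XML namespace so that it does not depend on a particular export schema version.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, invented sample in the general shape of a MediaWiki XML export.
SAMPLE_DUMP = """\
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example City</title>
    <revision>
      <text>{{Infobox settlement|country=Exampleland}}
'''Example City''' is the capital of [[Exampleland]].</text>
    </revision>
  </page>
</mediawiki>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def pages(xml_text):
    """Yield (title, raw wikitext) for every page element in the document."""
    root = ET.fromstring(xml_text)
    for page in root:
        if local_name(page.tag) != "page":
            continue
        title, text = None, None
        for element in page.iter():
            name = local_name(element.tag)
            if name == "title":
                title = element.text
            elif name == "text":
                text = element.text or ""
        yield title, text


# Naive pattern for the name at the start of a template call. This is an
# approximation for illustration, not a wikisyntax parser.
TEMPLATE_CALL = re.compile(r"\{\{\s*([^|{}]+)")


def unexpanded_templates(wikitext):
    """List template names that still appear as raw calls in the text."""
    return [name.strip() for name in TEMPLATE_CALL.findall(wikitext)]


if __name__ == "__main__":
    for title, text in pages(SAMPLE_DUMP):
        print(title, unexpanded_templates(text))
```

The XML layer only wraps the page; everything inside `text` is still raw wikisyntax, including the template call, so a re-user has to interpret it without a specification to work from.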

The static HTML tree dumps[3] are no substitute for a primary source

  • Relevant data is lost
  • No semantic encoding of the data
  • They should be generated by a parser for those that need them - see below

Sub assertion: Structured data is valuable to content readers AND data re-users

Fact: Metadata, such as semantic ontologies, is supported on the Internet

Fact: WMF content elements which are more structured/standardized are more targeted by content re-users

  • Number one content element mentioned in interviews is infobox templates.
    • Corollary: one of the most common MediaWiki support requests in IRC is importing/using/creating en.WP infoboxes. (Thus infoboxes also represent a cost to WMF in technical support.)
  • Sister projects are sometimes differentiated only by their structured article forms.
    • Commons media articles contain prescribed content and do not contain proscribed elements.
    • Wiktionary articles have extensive allowed sections in prescribed layouts, categorized in tight but specific structures.
    • Wikinews articles have standard forms, simple categorization, and are designed for the first paragraph to serve as the article summary[4].
  • Specifications/guidelines for articles are helpful for editors and data re-users, if they are enforced.
    • Even extremely broad guides, if they are actually implemented, will dramatically reduce malformed data.
    • Layout and structure guidelines encourage automated formatting and improvements.
    • Written structures are a first step toward cross-language standardization.
    • Offline releases require indexes, which require structured data.
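
As an illustration of why infoboxes attract re-users and why they are awkward to extract today, here is a minimal sketch that splits a single template call into its name and named parameters. The function name and the sample call are hypothetical, and the splitting rules are a simplification of wikisyntax: only `{{ }}` and `[[ ]]` nesting is tracked, and positional parameters are ignored.

```python
def split_template(call):
    """Split '{{Name|a=1|b=2}}' into (name, params), honouring simple nesting.

    Sketch only: tracks '{{ }}' and '[[ ]]' depth so that pipes inside
    nested templates or links do not split a parameter.
    """
    if not (call.startswith("{{") and call.endswith("}}")):
        raise ValueError("not a template call")
    body = call[2:-2]
    parts, current, depth = [], [], 0
    i = 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            # A top-level pipe ends the current part.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))

    name = parts[0].strip()
    params = {}
    for part in parts[1:]:
        key, separator, value = part.partition("=")
        if separator:  # skip positional parameters in this sketch
            params[key.strip()] = value.strip()
    return name, params


if __name__ == "__main__":
    sample = (
        "{{Infobox settlement"
        "|country=[[Exampleland|the country]]"
        "|population={{formatnum:12345}}}}"
    )
    print(split_template(sample))
```

Even after the split, the values are still wikisyntax (a piped link and a nested call), so building an index from them requires further parsing. Structured output produced at the source would remove that step for every re-user.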

Sub assertion: Matching our content to standard library systems would help

Systems like the Dewey Decimal System allow libraries to classify books and other materials. A system such as the Universal Decimal Classification (UDC) might prove an appropriate way to organize content. This would facilitate index generation for offline use and matching with non-Wikipedia resources. This is not essential, but should be explored as an option.

Fact: UDC and related systems are widely used for classifying knowledge, and are suitable for computer-based resources

See http://www.udcc.org/about.htm.
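
A minimal sketch of index generation from classification codes follows. It assumes each article already carries a UDC-style code; the article titles and their code assignments are invented for illustration, and only two main-class labels are included. How codes would actually be assigned to WMF content is left open by this recommendation.

```python
from collections import defaultdict

# Labels for two UDC main classes; a real index would cover all of them.
MAIN_CLASSES = {
    "5": "Mathematics and natural sciences",
    "9": "Geography, biography, history",
}

# Invented articles with invented code assignments (assumption).
ARTICLES = [
    ("Example river", "91"),
    ("Example theorem", "51"),
    ("Example mountain", "91"),
]


def build_index(articles):
    """Group article titles by the first digit (main class) of their code."""
    index = defaultdict(list)
    for title, code in articles:
        index[code[0]].append(title)
    return {cls: sorted(titles) for cls, titles in sorted(index.items())}


def render(index):
    """Render the index as indented plain text, one main class per block."""
    lines = []
    for cls, titles in index.items():
        lines.append(f"{cls} {MAIN_CLASSES.get(cls, 'Unlabelled class')}")
        lines.extend(f"  {title}" for title in titles)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(build_index(ARTICLES)))
```

Because the grouping key is a shared classification code rather than a project-specific category name, the same index structure could in principle be matched against library catalogues or other non-Wikipedia resources.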