Task force/Recommendations/Offline 1

This is a recommendation as submitted by the offline task force.
Please provide input and suggestions on Talk:Task force/Recommendations/Offline 1

Outline for Recommendation #1 - content reuse from WMF projects

Goal

Make reuse of content from WMF projects simpler, in order to support and grow the infrastructure for offline projects.

Strategy

Provide parsed, semantically annotated XML of article text.
Create and maintain an article XML DTD standard and documentation, plus a stylesheet, per project.
Write and publish a MediaWiki parser specification.
Write a reference parser implementation, and a companion writer that outputs HTML with media, to assist conversion to other formats.
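
As a rough illustration of the first and last points, the sketch below serializes an already-parsed article into annotated XML and then writes HTML from the same tree. The element and attribute names (`article`, `infobox`, `field`, `section`) and the sample article are hypothetical; this recommendation does not define a format, so they only stand in for whatever the DTD would specify.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical parsed article: a real parser would produce this from wikitext.
parsed = {
    "title": "Example City",
    "infobox": {"country": "Exampleland", "population": "12345"},
    "sections": [("Overview", "Example City is a city in Exampleland.")],
}

def to_annotated_xml(article):
    """Build an XML tree; element names are illustrative, not a real DTD."""
    root = ET.Element("article", {"title": article["title"]})
    box = ET.SubElement(root, "infobox")
    for name, value in article["infobox"].items():
        field = ET.SubElement(box, "field", {"name": name})
        field.text = value
    for heading, body in article["sections"]:
        section = ET.SubElement(root, "section", {"heading": heading})
        section.text = body
    return root

def to_html(root):
    """Companion writer: render the same tree as minimal HTML."""
    parts = [f"<h1>{escape(root.get('title'))}</h1>"]
    for section in root.findall("section"):
        parts.append(f"<h2>{escape(section.get('heading'))}</h2>")
        parts.append(f"<p>{escape(section.text)}</p>")
    return "\n".join(parts)

tree = to_annotated_xml(parsed)
xml_text = ET.tostring(tree, encoding="unicode")
html_text = to_html(tree)
```

Keeping the XML tree as the single source and deriving HTML from it is one possible design; other output formats could be added as further writers over the same tree.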

Assertion: Low accessibility of WMF content hinders widespread reuse

Sub assertion: In many countries, access to the Internet is limited

In many countries the Internet may be unavailable or very slow - see Regional_bandwidth for details. There is clearly a demand for our content - see this article as a recent example.

Sub assertion: Working with the current dumps[1] is hard

The current dumps are not XML at the page level

XML dumps include raw wiki syntax text, with no further parsing.
- Text does not include expanded templates.
- There is no MediaWiki parser specification.
  - Thus, there are no third-party parsers/tools.[2]
XML dumps could be parsed to include project-specific markup.
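
The points above can be seen in a small sketch that reads one page from a dump-like XML fragment. The fragment is a simplified stand-in written for this example (real dumps carry an XML namespace and many more elements), and the page content is made up.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for one page of an XML dump (real dumps carry an XML
# namespace and many more elements; this sample is an assumption).
SAMPLE = """<mediawiki>
  <page>
    <title>Example City</title>
    <revision>
      <text>{{Infobox settlement|name=Example City}}
'''Example City''' is a [[city]].</text>
    </revision>
  </page>
</mediawiki>"""

root = ET.fromstring(SAMPLE)
for page in root.iter("page"):
    title = page.findtext("title")
    wikitext = page.findtext("revision/text")
    # The template call and link markup arrive unexpanded and unparsed.
    has_raw_template = "{{" in wikitext
    has_raw_link = "[[" in wikitext
```

The XML layer only wraps the page; everything inside `text` is still an opaque string that a re-user has to parse separately.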

The static HTML tree dumps[3] are not an alternative to a primary source

Relevant data is lost
No semantic encoding of the data
They should be generated by a parser for those that need them - see below

Sub assertion: Structured data is valuable to content readers AND data re-users

Fact: Metadata, such as semantic ontologies, are supported on the Internet

Fact: WMF content elements which are more structured/standardized are more targeted by content re-users

The number one content element mentioned in interviews is infobox templates.
- Corollary: one of the most common MediaWiki support requests on IRC is importing/using/creating en.WP infoboxes. (Thus infoboxes also represent a cost to WMF in technical support.)
Sister projects are sometimes differentiated only by structured article forms.
- Commons media articles contain prescribed content, do not contain proscribed elements.
- Wiktionary articles have extensive allowed sections in prescribed layouts, categorized in tight but specific structures.
- Wikinews articles have standard forms, simple categorization, and are designed for the first paragraph to serve as the article summary[4].
Specifications/guidelines for articles are helpful for editors and data re-users, if they are enforced.
- Even extremely broad guides, if they are actually implemented, will dramatically reduce malformed data.
- Layout and structure guidelines encourage automated formatting and improvements.
- Written structures are a first step toward cross-language standardization.
- Offline releases require indexes, which require structured data.
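
To illustrate why infoboxes attract re-users and why enforced structure matters, here is a minimal sketch that pulls key/value pairs out of an infobox in raw wiki syntax. The wikitext sample, its field names, and the helper `naive_infobox_fields` are hypothetical, and the approach is deliberately naive: it assumes one field per line and no nested templates.

```python
import re

# Hypothetical wikitext; the template and field names are made up for the sketch.
WIKITEXT = """{{Infobox settlement
| name = Example City
| country = Exampleland
| population = 12345
}}
'''Example City''' is a city."""

def naive_infobox_fields(text):
    """Extract `| key = value` lines from the first {{Infobox ...}} block.

    Assumes one field per line and no nested templates, which real
    articles frequently violate.
    """
    match = re.search(r"\{\{Infobox[^\n]*\n(.*?)\n\}\}", text, re.DOTALL)
    if match is None:
        return {}
    fields = {}
    for line in match.group(1).splitlines():
        field = re.match(r"\|\s*([^=|]+?)\s*=\s*(.*)", line)
        if field:
            fields[field.group(1)] = field.group(2).strip()
    return fields

fields = naive_infobox_fields(WIKITEXT)
```

A sketch like this works only while articles follow a predictable layout, which is the practical argument for written and enforced structure guidelines.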

Sub assertion: Matching our content to standard library systems would help

Systems like the Dewey Decimal System allow libraries to classify books and other materials. A system such as the Universal Decimal Classification (UDC) might prove an appropriate way to organize content. This would facilitate things like index generation for offline use and matching with non-Wikipedia resources. This is not essential, but should be explored as an option.
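
As a sketch of how classification codes could drive index generation, the example below groups article titles under a class code. The article list and the code assigned to each title are illustrative assumptions made for this example, not an existing mapping of WMF content.

```python
from collections import defaultdict

# Illustrative article-to-class assignments. The codes follow the general
# shape of decimal classification; the specific mapping is an assumption.
ARTICLES = [
    ("Algebra", "5"),
    ("Botany", "5"),
    ("Bridge engineering", "6"),
    ("Painting", "7"),
]

def build_index(articles):
    """Group article titles under their class code, sorted for display."""
    index = defaultdict(list)
    for title, code in articles:
        index[code].append(title)
    return {code: sorted(titles) for code, titles in sorted(index.items())}

index = build_index(ARTICLES)
```

Because the codes are shared with library catalogues, the same keys could in principle be used to match articles against non-Wikipedia resources.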

Fact: UDC and related systems are widely used for classifying knowledge, and are suitable for computer-based resources

See http://www.udcc.org/about.htm.