Task force/Recommendations/Offline 1
This is a recommendation as submitted by the offline task force. Please provide input and suggestions on Talk:Task force/Recommendations/Offline 1.
Outline for Recommendation #1 - content reuse from WMF projects
Goal
Make reuse of content from WMF projects simpler, in order to support and grow the infrastructure for offline projects.
Strategy
- Provide parsed, semantically annotated XML of article text.
- Create and maintain an article XML DTD standard and documentation, plus a stylesheet, per project.
- Write and publish a MediaWiki parser specification.
- Write a reference parser implementation, with a companion writer that outputs HTML with media, to assist conversion to other formats.
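As a rough illustration of what "semantically annotated XML" could mean for a template call, the sketch below turns one flat infobox call into an XML element. The element and attribute names (`template`, `param`, `name`) and the function `template_to_xml` are hypothetical and are not part of any proposed DTD; the regular expression only handles the simplest case and is no substitute for a real parser.

```python
import re
import xml.etree.ElementTree as ET


def template_to_xml(wikitext):
    """Convert the first flat {{Name|key=value|...}} call into an XML element.

    Toy only: nested templates, positional parameters and piped links
    are not handled. Element names are illustrative assumptions.
    """
    match = re.search(r"\{\{([^|{}]+)\|([^{}]*)\}\}", wikitext)
    if match is None:
        return None
    element = ET.Element("template", {"name": match.group(1).strip()})
    for part in match.group(2).split("|"):
        key, sep, value = part.partition("=")
        if not sep:
            continue  # skip positional parameters in this sketch
        param = ET.SubElement(element, "param", {"name": key.strip()})
        param.text = value.strip()
    return element


xml_element = template_to_xml(
    "{{Infobox settlement|name=Example City|population=12345}}"
)
xml_string = ET.tostring(xml_element, encoding="unicode")
print(xml_string)
```

Once the parameters are separate elements, a re-user can read a field such as the population without understanding wikisyntax at all.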
Assertion: Low accessibility of WMF content hinders widespread reuse
Sub assertion: In many countries, access to the Internet is limited
In many countries the Internet is unavailable or very slow - see Regional_bandwidth for details. There is clearly a demand for our content - see this article as a recent example.
Sub assertion: Working with the current dumps[1] is hard
The current dumps are not XML at the page level
- XML dumps include raw wikisyntax text, with no further parsing.
- Text does not include expanded templates.
- There is no MediaWiki parser specification.
- Thus, there are no third-party parsers/tools.[2]
- XML dumps could be parsed to include project-specific markup.
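The points above can be seen in a small experiment. The sketch below reads a simplified, made-up sample shaped like a MediaWiki XML export (real dumps declare an XML namespace and carry more metadata per revision) and shows that the XML structure stops at the text element: everything inside is one string of raw wikisyntax. The sample content and the helper `page_texts` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical sample in the shape of a MediaWiki XML export.
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Example City</title>
    <revision>
      <text>{{Infobox settlement|name=Example City|population=12345}}
'''Example City''' is a [[city]] in [[Exampleland]].</text>
    </revision>
  </page>
</mediawiki>"""


def page_texts(dump_xml):
    """Return {title: raw text} for every page in the dump string."""
    root = ET.fromstring(dump_xml)
    return {
        page.findtext("title"): page.findtext("revision/text")
        for page in root.iter("page")
    }


pages = page_texts(SAMPLE_DUMP)
raw = pages["Example City"]
# Templates, links and bold markup are still unparsed wikitext here.
print(raw)
```

Getting the infobox fields or the link targets out of `raw` requires a wikitext parser, which is exactly the piece that has no specification.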
The Static HTML tree dumps[3] are not a substitute for a primary source
- Relevant data is lost
- No semantic encoding of the data
- They should be generated by a parser for those that need them - see below
Sub assertion: Structured data is valuable to content readers AND data re-users
Fact: Metadata, such as semantic ontologies, is supported on the Internet
Fact: The more structured or standardized a WMF content element is, the more often content re-users target it
- The content element mentioned most often in interviews is infobox templates.
- Corollary: one of the most common MediaWiki support requests in IRC is importing/using/creating en.WP infoboxes. (Thus infoboxes also represent a cost to WMF in technical support.)
- Sister projects are sometimes differentiated only by structured article forms
- Commons media articles contain prescribed content and do not contain proscribed elements.
- Wiktionary articles have extensive allowed sections in prescribed layouts, categorized in tight but specific structures.
- Wikinews articles have standard forms, simple categorization, and are designed for the first paragraph to serve as the article summary[4].
- Specifications and guidelines for articles are helpful for editors and data re-users, if they are enforced.
- Even extremely broad guides, if they are actually implemented, will dramatically reduce malformed data.
- Layout and structure guidelines encourage automated formatting and improvements.
- Written structures are a first step toward cross-language standardization.
- Offline releases require indexes, which require structured data.
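To make the last point concrete, here is a minimal sketch of index generation from structured records. It assumes each article is already available as a record with a title and a list of categories, which is what a parsed dump might provide; the record layout, the sample data and the function name `build_category_index` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical structured records, as a parsed dump might provide them.
ARTICLES = [
    {"title": "Example City", "categories": ["Cities", "Exampleland"]},
    {"title": "Sample River", "categories": ["Rivers", "Exampleland"]},
    {"title": "Another City", "categories": ["Cities"]},
]


def build_category_index(articles):
    """Map each category to an alphabetical list of article titles."""
    index = defaultdict(list)
    for article in articles:
        for category in article["categories"]:
            index[category].append(article["title"])
    # Sort categories and titles so the index is stable across runs.
    return {
        category: sorted(titles)
        for category, titles in sorted(index.items())
    }


category_index = build_category_index(ARTICLES)
for category, titles in category_index.items():
    print(category, "->", ", ".join(titles))
```

With raw wikisyntax, the same index would first require extracting category links from free text, which is where malformed or inconsistent markup causes trouble.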
Sub assertion: Matching our content to standard library systems would help
Systems like the Dewey Decimal System allow libraries to classify books and other materials. A system such as the Universal Decimal Classification (UDC) might prove an appropriate way to organize content. This would facilitate tasks such as index generation for offline use and matching with non-Wikipedia resources. This is not essential, but should be explored as an option.