Task force/Offline/Amgine notes

24 Nov 2009

http://pastie.org/715006


Wikipedia

Wiktionary

25 Nov 2009

Interview w/Kelson re: OpenZim

OpenZim is a storage-and-retrieval database/compression format that requires an external reader. It works exclusively from HTML output, such as that produced by htmlDump.

Follow-up question regarding redlinks: if linked articles are not part of the repository, how the links are handled is up to the reader software. It's possible to import an article subset into a MW install with the noredlinks skin extension, avoiding the issue.
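
A minimal sketch of reading one article back out of a ZIM file, for context on what "requiring an external reader" means in practice. This assumes the python-libzim reader bindings (which postdate these notes); the class and method names are as I recall them and may differ between versions, and the entry path layout depends on how the dump was built.

    # Read one article out of a ZIM archive (python-libzim; names assumed, not verified).
    from libzim.reader import Archive

    archive = Archive("wikipedia_en_subset.zim")       # placeholder dump file name
    print(archive.entry_count, "entries in archive")

    # Older ZIM dumps prefix article paths with an "A/" namespace; newer ones do not.
    entry = archive.get_entry_by_path("A/Earth")       # placeholder article path
    html = bytes(entry.get_item().content).decode("utf-8")
    print(html[:200])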

Interview w/Tomaszf re: WMF dumps

Is aware of and tracking

  • openZim
  • WikiPok
  • Patrick Collison
  • nano note
  • wiki reader
  • working on data partnerships with (separately) MIT and Stanford

Tomaszf is unable/unwilling to provide guidance on developing prioritization guidelines.

Interview w/Pm27 re: Okawix

  • Okawix is live-mirroring via RC bots.
  • Okawix converts from the database to zeno format (possibly soon to openZim) for the dump.
  • Currently testing a NAS implementation for OrphFund.
  • Planning to add an XUL application running within Mozilla for cellphones.

Thoughts on offline mission/questions

  • How many people?
    Essentially unknowable imo.
  • Which delivery mechanisms are likely to drive an increase in offline readership?
    • Ubiquitous platforms - e.g. the cellphone is the most common electronic gadget in many markets, therefore WMF content on cellphones is a priority.
    • Ease of implementation for publishers - e.g. data dumps available with publisher-targeted content in the formats republishers want.
      • Contextually-tagged XML
      • HTML output (zeno, openZim)
      • RC livestream (Okawix uses the RC IRC bots, but is this stable/robust? A minimal feed-reader sketch follows this list.)
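
A minimal sketch of tailing the recent-changes IRC feed that the RC bots read, assuming the usual irc.wikimedia.org setup. The nickname is a placeholder, and a real mirror would need reconnect and flood handling, which is exactly the stability/robustness question above.

    # Tail the recent-changes IRC feed (one PRIVMSG per edit).
    import socket

    HOST, PORT, CHANNEL = "irc.wikimedia.org", 6667, "#en.wikipedia"

    sock = socket.create_connection((HOST, PORT))
    sock.sendall(b"NICK rc-listener-demo\r\n")                    # placeholder nick
    sock.sendall(b"USER rc-listener-demo 0 * :offline mirror demo\r\n")
    sock.sendall(("JOIN %s\r\n" % CHANNEL).encode())

    buf = b""
    while True:
        data = sock.recv(4096)
        if not data:
            break                                                 # server closed the connection
        buf += data
        while b"\r\n" in buf:
            line, buf = buf.split(b"\r\n", 1)
            if line.startswith(b"PING"):
                sock.sendall(b"PONG" + line[4:] + b"\r\n")        # keep-alive
            elif b"PRIVMSG" in line:
                # Each message carries one edit: page title, diff URL, user, summary.
                print(line.decode("utf-8", "replace"))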

Interview with Andrew of Oz/Wikt

  • Current work:
    • Uses Wiktionary dumps in a range of applications.
    • Has built a Firefox extension which can read from offline compressed dumps.
    • Currently parsing content out of en.Wikipedia infoboxen related to language/linguistics. (example: http://toolserver.org/~hippietrail/scriptinfobox-14.txt)
  • "the database dumps require a lot of things which are not 100% possible to recreate to render into usable form"
    • Example issue: templates are not expanded.
      • Directly related: the parser has no spec and cannot be run outside the MediaWiki software.
      • Direct result: templates cannot be expanded outside MediaWiki.
    • "It would be really nice if dumps were made available with templates already expanded. You never know what a template does until you expand it. and as often as not it uses other templates so you still don't know what it does until you also expand those. It's a recursive process."
    • Template expansion came up four separate times during this interview. (A sketch of expanding templates via the live API follows this list.)
      • Parser specification - because without it there is no way to build a parser which can expand templates.
      • The output of the MediaWiki parser is unusual in part because it isn't a real parser, it never fails with an error, and it has accreted over time.
        • "There are quirky heuristics and special cases all through the parser. The ones for french punctuation are a famous one."
        • "There is no description of the parser so that it can be independently reproduced in other programming languages."
  • Working with dumps is expensive in terms of computing, networking, and storage.
    • As an independent developer working with only a netbook, he cannot realistically download a dump and import it into a local MW installation; this led to the development of the Firefox remote dump reader.
    • Working remotely is also less than ideal: the toolserver setup does not allow server-intensive tasks, nor is the full content available.
  • Current project to parse content from en.Wiktionary, extracting every dictionary field or attribute into a database.
    • Allow querying on a huge range of variables and words
    • Allow relational querying as well (parent/child sections)
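
Since the dumps contain unexpanded wikitext and the parser cannot run outside MediaWiki, one stopgap (an illustration of the recursion problem, not something Andrew proposed) is to ask a live wiki to do the expansion through the standard api.php expandtemplates module. Sketch below; the example template is arbitrary, and older MediaWiki versions return the result under a "*" key instead of "wikitext".

    # Ask a live MediaWiki install to expand templates, recursively, on our behalf.
    import json
    import urllib.parse
    import urllib.request

    def expand(wikitext, api="https://en.wiktionary.org/w/api.php"):
        params = urllib.parse.urlencode({
            "action": "expandtemplates",
            "prop": "wikitext",
            "text": wikitext,
            "format": "json",
        })
        req = urllib.request.Request(api + "?" + params,
                                     headers={"User-Agent": "offline-notes-demo/0.1"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        return data["expandtemplates"]["wikitext"]

    # Expansion is recursive: any templates used inside this template are expanded too.
    print(expand("{{l|en|house}}"))       # arbitrary example template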

Suggestions

  • "a parser spec is the #1 item"
  • "i would dump in more formats."
    • Got any specific ones?
    • "one with full HTML but probably without the "wrapping" page. just the generated part as you would see for action=render or action=print"
    • "and another is flat text that people can use without having to handle either wikitext or HTML, including mediawiki interface elements such as tables of contents and edit links etc"
    • "but my personal pet wish is for a minimally formatted dump which preserves only generic block level elements and converts inline elements to plain flat text. this would preserve a minimal amount of sentence/paragraph context for applications that want to analyse how language is used"
      • Note: this latter format would be used to build corpora on which linguistic usage and frequency studies could be based. (A sketch of fetching the action=render form suggested above follows this list.)
  • (in discussion regarding a dictionary-specific dump of Wiktionary) Make Wiktionary content more regular
    • Templates are easier to parse than prose text, but are harder for contributors to work with.
    • With prose text it is very hard to get contributors to write in a regular way that is easy to parse.
    • If you do build a parser to manage one language's templates, you'll need a different parser for each language.
  • Wiktionary needs a voice among the developers.
    • "well the most obvious thing is that nobody who is a developer or sysadmin at WMF is a wiktionarian so the foundation has little idea what we need. when we go to the trouble of learning sql and php and make a mediawiki extension it never gets installed"
    • "wiktionary needs a voice inside wmf. at least one person who cares to represent us."
    • "otherwise all we can do is offline processing and javascript extensions or wait patiently and faithfully for a few more years."