Task force/Offline/Amgine notes


    24 Nov 2009




    25 Nov 2009

    Interview w/Kelson re: OpenZim

    OpenZim is a compressed storage and retrieval database format which requires an external reader application. OpenZim works exclusively from HTML output, such as that produced by htmlDump.
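    As a point of reference, a minimal sketch of what "requiring an external reader" means in practice. It assumes the python-libzim reader bindings; the file name and article path are placeholders rather than details from the interview.

        # Open a ZIM file and pull one article's stored HTML.
        # Assumes python-libzim (pip install libzim); the file name and entry
        # path below are placeholders, and path layout varies by ZIM build.
        from libzim.reader import Archive

        archive = Archive("wikipedia_en_all_nopic.zim")
        print(f"{archive.entry_count} entries in this archive")

        entry = archive.get_entry_by_path("A/Earth")
        html = bytes(entry.get_item().content).decode("utf-8")
        print(html[:200])   # the pre-rendered HTML, exactly as it was dumped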

    Follow-up questions regarding redlinks - if files are not part of the repository, how they are handled is up to the reader software. It's possible to import an article subset into a MW install with the noredlinks skin extension, avoiding the issue.

    Interview w/Tomaszf re: WMF dumps

    Is aware of and tracking:

    • openZim
    • WikiPok
    • Patrick Collison
    • nano note
    • wiki reader
    • working on data partnerships with (separately) MIT and Stanford

    Tomaszf is unable/unwilling to provide guidance regarding the development of prioritization guidelines.

    Interview w/Pm27 re: Okawix

    • Okawix is live-mirroring via RC bots (a sketch of reading that feed follows this list).
    • Okawix moves from database -> zeno (possibly soon to openZim) as the dump.
    • Currently in the process of testing a NAS implementation for OrphFund
    • Planning addition of an XUL application within Mozilla for cellphones.
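    A rough sketch of the recent-changes feed that such live mirroring relies on: irc.wikimedia.org publishes one IRC message per edit, which a bot can read with nothing more than a socket. The nickname and channel below are placeholders.

        # Read the public recent-changes IRC feed (one PRIVMSG per edit).
        # irc.wikimedia.org is read-only; nick and channel are placeholders.
        import socket

        HOST, PORT, CHANNEL = "irc.wikimedia.org", 6667, "#en.wikipedia"

        sock = socket.create_connection((HOST, PORT))
        sock.sendall(b"NICK rc-notes-demo\r\n")
        sock.sendall(b"USER rc-notes-demo 0 * :offline notes demo\r\n")

        buffer = b""
        while True:
            data = sock.recv(4096)
            if not data:
                break                                             # connection closed
            buffer += data
            *lines, buffer = buffer.split(b"\r\n")
            for line in lines:
                if line.startswith(b"PING"):
                    sock.sendall(b"PONG" + line[4:] + b"\r\n")    # keep-alive
                elif b" 001 " in line:
                    sock.sendall(f"JOIN {CHANNEL}\r\n".encode())  # join after the welcome
                elif b"PRIVMSG" in line:
                    # one message per edit: page title, diff URL, flags, user, summary
                    print(line.decode("utf-8", errors="replace"))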

    Thoughts on offline mission/questions

    • How many people?
      Essentially unknowable imo.
    • Delivery mechanisms likely to drive increase in offline readership?
      • Ubiquitous platform - e.g. the cellphone is the most common electronic gadget in many markets, therefore WMF content on cellphones is a priority.
      • Ease of implementation for publishers - e.g. data dumps available with publisher-targeted content in republisher-desired formats (a dump-reading sketch follows this list).
        • Contextually-tagged XML
        • HTML output (zeno, openZim)
        • RC livestream (Okawix uses the RC IRC bots, but is this stable/robust?)
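    For context on the formats above, the sketch below walks a standard pages-articles XML dump and shows what republishers get today: page titles plus raw wikitext, with templates still unexpanded. The file name is a placeholder and the schema namespace varies by dump version.

        # Walk a pages-articles dump: titles plus raw, unexpanded wikitext.
        import bz2
        import xml.etree.ElementTree as ET

        NS = "{http://www.mediawiki.org/xml/export-0.10/}"   # depends on dump version

        with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as dump:
            for _, elem in ET.iterparse(dump):
                if elem.tag == NS + "page":
                    title = elem.findtext(NS + "title")
                    wikitext = elem.findtext(f"{NS}revision/{NS}text") or ""
                    print(title, "->", wikitext[:80].replace("\n", " "))
                    elem.clear()   # keep memory bounded on a multi-GB dump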

    Interview with Andrew of Oz/Wikt

    • Currently working with Wiktionary dumps in a range of applications
      • Has built a Firefox extension which can read from offline compressed dumps.
      • Currently working on parsing content out of en.Wikipedia infoboxen related to language/linguistics. (example: http://toolserver.org/~hippietrail/scriptinfobox-14.txt)
    • "the database dumps require a lot of things which are not 100% possible to recreate to render into usable form"
      • Example issue: templates are not expanded.
        • Directly related: the parser does not have a spec and cannot be run outside the MW software.
        • Direct result: templates cannot be expanded outside MediaWiki (see the expandtemplates sketch further below for a partial workaround).
      • "It would be really nice if dumps were made available with templates already expanded. You never know what a template does until you expand it. and as often as not it uses other templates so you still don't know what it does until you also expand those. It's a recursive process."
      • Template expansion came up 4 separate times during this interview.
        • Parser specification - because without it there is no way to build a parser which can expand templates.
        • The output of the MediaWiki parser is unusual in part because it is not a real parser, has no failure mode for errors, and has accreted over time.
          • "There are quirky heuristics and special cases all through the parser. The ones for french punctuation are a famous one."
          • "There is no description of the parser so that it can be independently reproduced in other programming languages."
    • Working with dumps is expensive in computing/networking/storage considerations.
      • As an independent developer working with only a netbook, it is impossible for him to consider downloading a dump and importing it into a local MW installation, which led to the development of the Firefox remote dump reader.
      • Working remotely is also less than ideal: the toolserver setup does not allow server-intensive tasks, nor is the full content available.
    • Current project to parse content from en.Wiktionary, extracting every dictionary field or attribute into a database (an illustrative schema sketch follows this list).
      • Allow querying on a huge range of variables and words
      • Allow relational querying as well (parent/child sections)
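    Purely as illustration of that extraction target (none of these table or column names come from the interview): a relational layout in which each heading keeps a link to its parent section, so queries can follow the page structure as well as filter by language or part of speech.

        # Illustrative sketch only: a possible relational target for the
        # Wiktionary extraction project. All names here are invented.
        import sqlite3

        conn = sqlite3.connect("wiktionary.db")
        conn.executescript("""
        CREATE TABLE IF NOT EXISTS section (
            id        INTEGER PRIMARY KEY,
            parent_id INTEGER REFERENCES section(id),   -- parent/child queries
            word      TEXT NOT NULL,                     -- page title
            language  TEXT,                              -- e.g. 'English'
            heading   TEXT                               -- e.g. 'Noun', 'Etymology'
        );
        CREATE TABLE IF NOT EXISTS sense (
            id         INTEGER PRIMARY KEY,
            section_id INTEGER REFERENCES section(id),
            gloss      TEXT
        );
        CREATE INDEX IF NOT EXISTS idx_word_lang ON section(word, language);
        """)

        # Example relational query: every English noun gloss, via the section tree.
        rows = conn.execute("""
            SELECT s.word, g.gloss
            FROM section s JOIN sense g ON g.section_id = s.id
            WHERE s.language = 'English' AND s.heading = 'Noun'
        """).fetchall()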


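    One partial workaround for the template-expansion complaints above (not something proposed in the interview) is to ask a live MediaWiki installation to expand the templates itself, via the API's action=expandtemplates. The sketch uses the requests library against today's en.wikipedia.org API; the sample wikitext is only an illustration.

        # Have a live wiki expand templates, since no external parser can.
        import requests

        API = "https://en.wikipedia.org/w/api.php"

        def expand(wikitext, title="Sandbox"):
            """Return the wikitext with templates recursively expanded by the wiki."""
            resp = requests.get(API, params={
                "action": "expandtemplates",
                "format": "json",
                "prop": "wikitext",
                "title": title,
                "text": wikitext,
            }, headers={"User-Agent": "offline-taskforce-notes-demo/0.1"})
            resp.raise_for_status()
            return resp.json()["expandtemplates"]["wikitext"]

        print(expand("{{convert|100|km|mi}}"))   # e.g. "100 kilometres (62 mi)"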
    • "a parser spec is the #1 item"
    • "i would dump in more formats."
      • Got any specific ones?
      • "one with full HTML but probably without the "wrapping" page. just the generated part as you would see for action=render or action=print"
      • "and another is flat text that people can use without having to handle either wikitext or HTML, including mediawiki interface elements such as tables of contents and edit links etc"
      • "but my personal pet wish is for a minimally formatted dump which preserves only generic block level elements and converts inline elements to plain flat text. this would preserve a minimal amount of sentence/paragraph context for applications that want to analyse how language is used"
        • Note: this latter format would be used to develop a corpus on which linguistic usage and frequency studies could be based (a flat-text extraction sketch closes these notes).
    • (in discussion regarding a dictionary-specific dump of Wiktionary) Make Wiktionary content more regular
      • Templates are easier to parse than prose text, but are harder for contributors to work with.
      • It is very hard to get contributors to write prose text in a regular way which is easy to parse.
      • If you do build a parser to manage one language's templates, you'll need a different parser for each language.
    • Wiktionary needs a voice among the developers.
      • "well the most obvious thing is that nobody who is a developer or sysadmin at WMF is a wiktionarian so the foundation has little idea what we need. when we go to the trouble of learning sql and php and make a mediawiki extension it never gets installed"
      • "wiktionary needs a voice inside wmf. at least one person who cares to represent us."
      • "otherwise all we can do is offline processing and javascript extensions or wait patiently and faithfully for a few more years."
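    Finally, a hedged sketch of the "minimally formatted" output wished for above, assuming one starts from rendered HTML (e.g. action=render output): inline markup is flattened to plain text while block-level boundaries are kept, so paragraph and sentence context survives for corpus work. The tag lists and sample input are assumptions made for the sketch.

        # Flatten rendered HTML: drop inline markup, keep block boundaries.
        from html.parser import HTMLParser

        BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6",
                      "blockquote", "pre", "td", "th"}
        SKIP_TAGS = {"script", "style", "table"}   # tables add little to a text corpus

        class FlatText(HTMLParser):
            def __init__(self):
                super().__init__()
                self.blocks, self.current, self.skip_depth = [], [], 0

            def handle_starttag(self, tag, attrs):
                if tag in SKIP_TAGS:
                    self.skip_depth += 1

            def handle_endtag(self, tag):
                if tag in SKIP_TAGS and self.skip_depth:
                    self.skip_depth -= 1
                elif tag in BLOCK_TAGS:
                    text = "".join(self.current).strip()
                    if text:
                        self.blocks.append(text)   # one line of flat text per block
                    self.current = []

            def handle_data(self, data):
                if not self.skip_depth:
                    self.current.append(data)

        parser = FlatText()
        parser.feed("<p>A <b>word</b> is a unit of <a href='/wiki/Language'>language</a>.</p>")
        print("\n".join(parser.blocks))   # -> A word is a unit of language.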