Task force/Offline/Amgine notes

24 Nov 2009

http://pastie.org/715006


Wikipedia

Wiktionary

25 Nov 2009

Interview w/Kelson re: OpenZim

OpenZim is a storage-and-retrieval database/compression format that requires an external reader. It works exclusively from HTML output, such as that produced by htmlDump.

Follow-up question regarding redlinks: if linked articles are not part of the repository, how the links are handled is up to the reader software. It's possible to import an article subset into a MW install with the noredlinks skin extension, avoiding the issue.
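
A minimal sketch of reading one article back out of a ZIM file, for context on what "requiring an external reader" means in practice. This assumes the python-libzim reader bindings (which postdate these notes); the class and method names are as I recall them and may differ between versions, and the entry path layout depends on how the dump was built.

    # Read one article out of a ZIM archive (python-libzim; names assumed, not verified).
    from libzim.reader import Archive

    archive = Archive("wikipedia_en_subset.zim")       # placeholder dump file name
    print(archive.entry_count, "entries in archive")

    # Older ZIM dumps prefix article paths with an "A/" namespace; newer ones do not.
    entry = archive.get_entry_by_path("A/Earth")       # placeholder article path
    html = bytes(entry.get_item().content).decode("utf-8")
    print(html[:200])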

Interview w/Tomaszf re: WMF dumps

Is aware of and tracking

  • openZim
  • WikiPok
  • Patrick Collison
  • nano note
  • wiki reader
  • working on data partnerships with (separately) MIT and Stanford

Tomaszf is unable/unwilling to provide guidance on developing prioritization guidelines.

Interview w/Pm27 re: Okawix

  • Okawix is live-mirroring via RC bots.
  • Okawix converts from the database to zeno format (possibly soon to openZim) for the dump.
  • Currently testing a NAS implementation for OrphFund.
  • Planning to add an XUL application running within Mozilla for cellphones.

Thoughts on offline mission/questions

  • How many people?
    Essentially unknowable imo.
  • Which delivery mechanisms are likely to drive an increase in offline readership?
    • Ubiquitous platforms - e.g. the cellphone is the most common electronic gadget in many markets, therefore WMF content on cellphones is a priority.
    • Ease of implementation for publishers - e.g. data dumps available with publisher-targeted content in the formats republishers want.
      • Contextually-tagged XML
      • HTML output (zeno, openZim)
      • RC livestream (Okawix uses the RC IRC bots, but is this stable/robust? A minimal feed-reader sketch follows this list.)
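
A minimal sketch of tailing the recent-changes IRC feed that the RC bots read, assuming the usual irc.wikimedia.org setup. The nickname is a placeholder, and a real mirror would need reconnect and flood handling, which is exactly the stability/robustness question above.

    # Tail the recent-changes IRC feed (one PRIVMSG per edit).
    import socket

    HOST, PORT, CHANNEL = "irc.wikimedia.org", 6667, "#en.wikipedia"

    sock = socket.create_connection((HOST, PORT))
    sock.sendall(b"NICK rc-listener-demo\r\n")                    # placeholder nick
    sock.sendall(b"USER rc-listener-demo 0 * :offline mirror demo\r\n")
    sock.sendall(("JOIN %s\r\n" % CHANNEL).encode())

    buf = b""
    while True:
        data = sock.recv(4096)
        if not data:
            break                                                 # server closed the connection
        buf += data
        while b"\r\n" in buf:
            line, buf = buf.split(b"\r\n", 1)
            if line.startswith(b"PING"):
                sock.sendall(b"PONG" + line[4:] + b"\r\n")        # keep-alive
            elif b"PRIVMSG" in line:
                # Each message carries one edit: page title, diff URL, user, summary.
                print(line.decode("utf-8", "replace"))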

Interview with Andrew of Oz/Wikt

  • Current work:
    • Uses Wiktionary dumps in a range of applications.
    • Has built a Firefox extension which can read from offline compressed dumps.
    • Currently parsing content out of en.Wikipedia infoboxen related to language/linguistics. (example: http://toolserver.org/~hippietrail/scriptinfobox-14.txt)
  • "the database dumps require a lot of things which are not 100% possible to recreate to render into usable form"
    • Example issue: templates are not expanded.
      • Directly related: the parser has no spec and cannot be run outside the MediaWiki software.
      • Direct result: templates cannot be expanded outside MediaWiki.
    • "It would be really nice if dumps were made available with templates already expanded. You never know what a template does until you expand it. and as often as not it uses other templates so you still don't know what it does until you also expand those. It's a recursive process."
    • Template expansion came up four separate times during this interview. (A sketch of expanding templates via the live API follows this list.)
      • Parser specification - because without it there is no way to build a parser which can expand templates.
      • The output of the MediaWiki parser is unusual in part because it isn't a real parser, it never fails with an error, and it has accreted over time.
        • "There are quirky heuristics and special cases all through the parser. The ones for french punctuation are a famous one."
        • "There is no description of the parser so that it can be independently reproduced in other programming languages."
  • Working with dumps is expensive in terms of computing, networking, and storage.
    • As an independent developer working with only a netbook, he cannot realistically download a dump and import it into a local MW installation; this led to the development of the Firefox remote dump reader.
    • Working remotely is also less than ideal: the toolserver setup does not allow server-intensive tasks, nor is the full content available.
  • Current project to parse content from en.Wiktionary, extracting every dictionary field or attribute into a database.
    • Allow querying on a huge range of variables and words
    • Allow relational querying as well (parent/child sections)
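
Since the dumps contain unexpanded wikitext and the parser cannot run outside MediaWiki, one stopgap (an illustration of the recursion problem, not something Andrew proposed) is to ask a live wiki to do the expansion through the standard api.php expandtemplates module. Sketch below; the example template is arbitrary, and older MediaWiki versions return the result under a "*" key instead of "wikitext".

    # Ask a live MediaWiki install to expand templates, recursively, on our behalf.
    import json
    import urllib.parse
    import urllib.request

    def expand(wikitext, api="https://en.wiktionary.org/w/api.php"):
        params = urllib.parse.urlencode({
            "action": "expandtemplates",
            "prop": "wikitext",
            "text": wikitext,
            "format": "json",
        })
        req = urllib.request.Request(api + "?" + params,
                                     headers={"User-Agent": "offline-notes-demo/0.1"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        return data["expandtemplates"]["wikitext"]

    # Expansion is recursive: any templates used inside this template are expanded too.
    print(expand("{{l|en|house}}"))       # arbitrary example template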

Suggestions

  • "a parser spec is the #1 item"
  • "i would dump in more formats."
    • Got any specific ones?
    • "one with full HTML but probably without the "wrapping" page. just the generated part as you would see for action=render or action=print"
    • "and another is flat text that people can use without having to handle either wikitext or HTML, including mediawiki interface elements such as tables of contents and edit links etc"
    • "but my personal pet wish is for a minimally formatted dump which preserves only generic block level elements and converts inline elements to plain flat text. this would preserve a minimal amount of sentence/paragraph context for applications that want to analyse how language is used"
      • Note: this latter format would be used to build corpora on which linguistic usage and frequency studies could be based. (A sketch of fetching the action=render form suggested above follows this list.)
  • (in discussion regarding a dictionary-specific dump of Wiktionary) Make Wiktionary content more regular
    • Templates are easier to parse than prose text, but are harder for contributors to work with.
    • With prose text it is very hard to get contributors to write in a regular way that is easy to parse.
    • If you do build a parser to manage one language's templates, you'll need a different parser for each language.
  • Wiktionary needs a voice among the developers.
    • "well the most obvious thing is that nobody who is a developer or sysadmin at WMF is a wiktionarian so the foundation has little idea what we need. when we go to the trouble of learning sql and php and make a mediawiki extension it never gets installed"
    • "wiktionary needs a voice inside wmf. at least one person who cares to represent us."
    • "otherwise all we can do is offline processing and javascript extensions or wait patiently and faithfully for a few more years."