Task force/Offline/Amgine notes
24 Nov 2009
Wikipedia
Wiktionary
25 Nov 2009
Interview w/Kelson re: OpenZim
OpenZim is a storage-and-retrieval database/compression format that requires an external reader. It works exclusively from HTML output, such as from htmlDump.
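For concreteness, a minimal sketch of pulling one article out of a ZIM file with the python-libzim bindings (which postdate these notes); the file name and article path are hypothetical, and the path layout varies with the file's namespace scheme.

```python
from libzim.reader import Archive

# Hypothetical dump file; any ZIM built from HTML output would do.
archive = Archive("wikipedia_en_subset.zim")
print(archive.entry_count, "entries")

# The stored payload is HTML rendered at dump time, so the reader never
# needs a wikitext parser. Entry paths depend on the file's namespace layout.
entry = archive.get_entry_by_path("A/Earth")
html = bytes(entry.get_item().content).decode("utf-8")
print(html[:200])
```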
Follow-up questions regarding redlinks - if files are not part of the repository, how they are handled is up to the reader software. It's possible to import an article subset into a MW install with the noredlinks skin extension, avoiding the issue.
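A hypothetical sketch of that reader-side decision (the function and names are made up): a link renders live only when its target shipped with the subset, and otherwise degrades to plain text, which is the same effect the noredlinks extension gives a MW install.

```python
def render_link(target: str, label: str, local_titles: set) -> str:
    """Render one wiki link against an offline article subset."""
    if target in local_titles:
        # Target is in the shipped subset: render a working link.
        return f'<a href="/wiki/{target}">{label}</a>'
    # Target missing from the subset: degrade to plain text
    # rather than a dead redlink.
    return label

print(render_link("Earth", "Earth", {"Earth"}))   # -> working link
print(render_link("Pluto", "Pluto", {"Earth"}))   # -> plain "Pluto"
```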
Interview w/Tomaszf re: WMF dumps
Is aware of and tracking
- openZim
- WikiPok
- Patrick Collison
- nano note
- wiki reader
- working on data partnerships with MIT and Stanford (separately)
Tomaszf is unable/unwilling to provide guidance on developing prioritization guidelines.
Interview w/Pm27 re: Okawix
- Okawix is live-mirroring via RC bots.
- Okawix moves from database -> zeno (possibly soon to openZim) as the dump.
- Currently in-process of testing a NAS implementation for OrphFund
- Planning addition of an XUL application within Mozilla for cellphones.
- random links blog
Thoughts on offline mission/questions
- How many people?
- Essentially unknowable imo.
- Delivery mechanisms likely to drive increase in offline readership?
- Ubiquitous platform - e.g. the cellphone is the most common electronic gadget in many markets, therefore WMF content on cellphones is a priority.
- Ease of implementation for publishers - e.g. data dumps available including publisher-targeted content in republisher-desired formats.
- Contextually-tagged xml
- HTML output (zeno, openZim)
- RC livestream (Okawix uses the RC IRC bots, but is this stable/robust? a connection sketch follows this list)
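For reference, a sketch of tailing that feed: irc.wikimedia.org broadcasts one message per edit, one channel per wiki. The nick below is arbitrary, and real consumers also strip the mIRC colour codes before parsing each message.

```python
import socket

sock = socket.create_connection(("irc.wikimedia.org", 6667))
sock.sendall(b"NICK offline-notes-bot\r\n")        # arbitrary nick
sock.sendall(b"USER offline-notes-bot 0 * :notes\r\n")
sock.sendall(b"JOIN #en.wikipedia\r\n")            # one channel per wiki

buffer = b""
while True:
    data = sock.recv(4096)
    if not data:
        break                                      # server closed the feed
    buffer += data
    *lines, buffer = buffer.split(b"\r\n")
    for line in lines:
        if line.startswith(b"PING"):               # keep-alive
            sock.sendall(b"PONG" + line[4:] + b"\r\n")
        elif b"PRIVMSG" in line:                   # one edit per message
            print(line.decode("utf-8", errors="replace"))
```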
Interview with Andrew of Oz/Wikt
- Currently working with
- Wiktionary dumps in a range of applications
- Has built a Firefox extension which can read from offline compressed dumps.
- Currently working on parsing content out of en.Wikipedia infoboxen related to language/linguistics. (example: http://toolserver.org/~hippietrail/scriptinfobox-14.txt)
- "the database dumps require a lot of things which are not 100% possible to recreate to render into usable form"
- Example issue: templates are not expanded.
- Directly related: the parser has no spec and cannot be run outside the MW software.
- Direct result: templates cannot be expanded.
- "It would be really nice if dumps were made available with templates already expanded. You never know what a template does until you expand it. and as often as not it uses other templates so you still don't know what it does until you also expand those. It's a recursive process."
- Template expansion came up 4 separate times during this interview.
- Parser specification - because without it there is no way to build a parser which can expand templates. (see the expandtemplates sketch after this list)
- The output of the MediaWiki parser is unusual in part because it isn't a real parser: it never fails on bad input, and it has accreted over time.
- "There are quirky heuristics and special cases all through the parser. The ones for french punctuation are a famous one."
- "There is no description of the parser so that it can be independently reproduced in other programming languages."
- Working with dumps is expensive in terms of computing, networking, and storage.
- As an independent developer working with only a netbook, it's impossible for him to consider downloading a dump and importing it into a local MW installation, which led to the development of the Firefox remote-dump reader.
- Working remotely is also less than ideal: the toolserver setup does not allow server-intensive tasks, nor is the full content available.
- Current project to parse content from en.Wiktionary, extracting every dictionary field or attribute into a database. (a schema sketch follows this list)
- Allow querying on a huge range of variables and words
- Allow relational querying as well (parent/child sections)
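On the template-expansion issue above: absent a parser spec or pre-expanded dumps, the available workaround is the live API's expandtemplates module, which does the recursive expansion server-side. A minimal sketch (prop=wikitext is the modern parameter name; the sample template is arbitrary):

```python
import requests

resp = requests.get(
    "https://en.wiktionary.org/w/api.php",
    params={
        "action": "expandtemplates",   # server-side, recursive expansion
        "text": "{{en-noun}}",         # arbitrary sample template
        "title": "example",
        "prop": "wikitext",
        "format": "json",
    },
    timeout=30,
)
print(resp.json()["expandtemplates"]["wikitext"])
```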
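On the en.Wiktionary database project, a sketch of the kind of relational store that parent/child section querying implies; the schema and headings are assumptions, not his actual design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")     # a file path in real use
conn.execute("""
    CREATE TABLE section (
        id        INTEGER PRIMARY KEY,
        word      TEXT NOT NULL,                  -- page title
        heading   TEXT NOT NULL,                  -- e.g. 'English', 'Noun'
        parent_id INTEGER REFERENCES section(id)  -- enclosing section
    )
""")
conn.execute("INSERT INTO section VALUES (1, 'cat', 'English', NULL)")
conn.execute("INSERT INTO section VALUES (2, 'cat', 'Noun', 1)")

# Relational (parent/child) query: Noun sections nested under English.
rows = conn.execute("""
    SELECT child.word FROM section AS child
    JOIN section AS parent ON child.parent_id = parent.id
    WHERE child.heading = 'Noun' AND parent.heading = 'English'
""").fetchall()
print(rows)   # [('cat',)]
```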
Suggestions
- "a parser spec is the #1 item"
- "i would dump in more formats."
- Got any specific ones?
- "one with full HTML but probably without the "wrapping" page. just the generated part as you would see for action=render or action=print"
- "and another is flat text that people can use without having to handle either wikitext or HTML, including mediawiki interface elements such as tables of contents and edit links etc"
- "but my personal pet wish is for a minimally formatted dump which preserves only generic block level elements and converts inline elements to plain flat text. this would preserve a minimal amount of sentence/paragraph context for applications that want to analyse how language is used"
- Note: this latter format would be used to develop a corpus on which linguistic usage and frequency studies could be based. (a sketch follows at the end of these suggestions)
- (in discussion regarding a dictionary-specific dump of Wiktionary) Make Wiktionary content more regular
- Templates are easier to parse than prose text, but are harder for contributors to work with.
- It is very hard to get contributors to write prose text in a regular way which is easy to parse.
- If you do build a parser to manage one language's templates, you'll need a different parser for each language.
- Wiktionary needs a voice among the developers.
- "well the most obvious thing is that nobody who is a developer or sysadmin at WMF is a wiktionarian so the foundation has little idea what we need. when we go to the trouble of learning sql and php and make a mediawiki extension it never gets installed"
- "wiktionary needs a voice inside wmf. at least one person who cares to represent us."
- "otherwise all we can do is offline processing and javascript extensions or wait patiently and faithfully for a few more years."