Task force/Offline/IRC/2009-12-22

From Strategic Planning
20:01 < walkerma> Shall we start?  Ah - Amgine - glad to see you!  wizzy - I'm really glad you're here
20:02 < Amgine_> Okay, I'm just leaving the hotel, so I won't be able to participate.
20:02 < Amgine_> Just wanted to let y'all know that...
20:02 < walkerma> OK, thanks - will you be able to participate later?
20:02 < Amgine_> <will read the log tomorrow if I have internet.>
20:03 < Amgine_> 18 hours road time today.
20:03 -!- Amgine_ [n=Amgine@wikinews/Amgine] has quit [Client Quit]
20:03 < walkerma> OK, I understand. Kelson is involved with the OpenZIM format and also with the French offline project
20:03 -!- hejko [n=hejko@dslb-084-058-089-137.pools.arcor-ip.net] has joined #wikimedia-strategy
20:04 < walkerma> hejko is one of the founders of PediaPress
20:04 < Kelson> hi
20:04 < hejko> hi all
20:04 < walkerma> http://pediapress.com/
20:04 -!- FT2 [n=FT2@wikipedia/ft2] has quit [Nick collision from services.]
20:04 < walkerma> Hi!  Patrice is a founder of WikiPock
20:04 -!- FT2-away [n=FT2@wikipedia/ft2] has joined #wikimedia-strategy
20:04 -!- yannf [n=yannf@wikipedia/yannf] has quit ["Like the Truth, I am elsewhere..."]
20:05 < walkerma> http://www.wikipock.com/
20:05 -!- Netsplit crichton.freenode.net <-> irc.freenode.net quits: Philippe|Wiki
20:06 < walkerma> OK, since Patrice is here, I'd like to begin by asking him how WikiPock might be used in developing countries
20:06 < Patrice> .
20:07 < walkerma> We've talked a lot about how cellphones are ubiquitous now in many places, but good internet is not
20:07 < walkerma> Any thoughts?
20:07 < hejko> Patrice: we have no idea how the specifications of the next generation low cost mobile phones will be. do you have any insight here?
20:08 -!- yannf [n=yannf@wikipedia/yannf] has joined #wikimedia-strategy
20:09 -!- Netsplit over, joins: Philippe|Wiki
20:09 < Philippe|Wiki> apologies for my tardiness, I was caught on the other side of a channel split.
20:09 < wizzy> I would also like to ask hejko and Patrice what their starting point for collections is - HTML, or database dumps ?
20:09 -!- peteforsyth [n=petefors@wikipedia/peteforsyth] has joined #wikimedia-strategy
20:09 < hejko> MW-API :)
20:10 -!- FT2 [n=FT2@wikipedia/ft2] has joined #wikimedia-strategy
20:10 < wizzy> hejko: URL ?
20:10 < hejko> we use wiki text and a python library to parse it into a document tree
20:11 < hejko> http://en.wikipedia.org/w/api.php
20:11 -!- pm27 [i=507dac3c@gateway/web/freenode/session] has joined #wikimedia-strategy
20:11 < hejko> http://code.pediapress.com/
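[Sketch, not part of the log] hejko's starting point is the MediaWiki API rather than HTML or database dumps. Below is a minimal sketch of how a client might build a request for a page's current wikitext, assuming the standard `action=query` / `prop=revisions` parameters. The function name `wikitext_query_url` is hypothetical, and this is not PediaPress's code.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://en.wikipedia.org/w/api.php"  # URL quoted in the log


def wikitext_query_url(title: str) -> str:
    """Build a query URL asking the MediaWiki API for one page's latest wikitext."""
    params = {
        "action": "query",       # read-only query module
        "prop": "revisions",     # ask for revision data
        "rvprop": "content",     # include the wikitext itself
        "titles": title,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


print(wikitext_query_url("Outline of water"))
```

The returned wikitext would then still need parsing into a document tree, which is the step hejko describes doing with a Python library.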
20:11 < pm27> hello all
20:12 < walkerma> Hi pm27!  pm27 is president of Linterweb, which produced en:Version0.5 and is producing 0.7, on Okawix
20:12 < Patrice> we just release a symbian version for nokia phones
20:12 < walkerma> We were just asking Patrice two questions at once
20:12 < pm27> president is a very big word :)
20:13 < walkerma> http://www.linterweb.fr/
20:14 < walkerma> Patrice - are you there?
20:14 < wizzy> walkerma: were you saying last week that wikipock does not render tables ?
20:15 < walkerma> wizzy - I was saying that they DO render tables, but the version I have doesn't
20:15 < pm27> http://blog.wikiwix.com/fr/category/okawix/ for information what we do in offline
20:15 < Patrice> yes, i'm here
20:15 < walkerma> and Patrice can tell us about their Version 2 system
20:15 < Patrice> Our V2 data format is in beta testing now
20:16 < walkerma> wizzy distributes Wikipedia to schools and villages in South Africa.  He's also a proud owner of a nice cellphone, like many South Africans
20:16 < Patrice> The new format is quicker and more compressed
20:16 < Patrice> 3.1 million articles under 4GB
20:16 < pm27> which format do you use Patrice ?
20:16 < pm27> ZIM ?
20:17 < wizzy> Patrice: do you have search ? how do you do it ?
20:17 < Patrice> no. we are using a proprietary format (soon open source). We started 18 months ago working on the technology, at that time Zim was not available
20:18 < pm27> I ask this question because it s very hard to have all the content Patrice
20:18 < Patrice> what do you mean by that (all the content)?
20:19 < pm27> see http://www.okawix.com/?page=torrent&lang=en
20:19 < pm27> all language and all project
20:21 < Patrice> is this url working for you ?
20:21 < pm27> I manage the project okawix
20:22 < wizzy> yes, that URL works for me
20:22 < walkerma> pm27: It seems to work for me - though I haven't actually completed a download
20:23 -!- Netsplit crichton.freenode.net <-> irc.freenode.net quits: Az1568_
20:23 < walkerma> Patrice - can you tell us about searches on WikiPock?
20:23 < Patrice> about WikiPock and the rendering table issue. WikiPock format V2 will render tables
20:23 < pm27> Patrice it s the torrent link
20:24 < Patrice> Our goal was to search and read Wikipedia offline on mobile phones when cpu and memory are very limited
20:24 < wizzy> pm27: that will download a ZIM file that the okawix reader can read ?
20:24 < Patrice> filesystems on blackberry or symbian OS do not perform that well
20:25 < pm27> not yet wizzy, but we are working on it
20:25 < Patrice> so we really spent a lot of time optimizing the technology for mobiles
20:25 < wizzy> what does it download then ?
20:26 < pm27> wizzy: zeno format
20:26 < Patrice> of course the technology works also for Mac/PC
20:26 < pm27> http://blog.wikiwix.com/en/2009/12/07/okawix-et-openzim/ wizzy
20:26 -!- FT2 [n=FT2@wikipedia/ft2] has quit [Read error: 110 (Connection timed out)]
20:26 < wizzy> and that has a builtin index for search ? (zeno)
20:27 < Patrice> i talked with openZim about the problems and the optimizations we had to implement to work around the mobile phones' limitations
20:27 -!- Netsplit over, joins: Az1568_
20:27 < pm27> it s our own search engine, the smae as use for wikiwix
20:28 < pm27> same
20:28 < Patrice> we do not use full text search, the index will not run/fit on mobiles
20:29 < pm27> the idea is to make a dump of our online index, ready to use in offline mode
20:29 < Patrice> we only use title index
20:30 < wizzy> Have you considered doing title and first paragraph search ?
20:30 < pm27> like the wikireader of openmoko Patrice
20:30 < Patrice> that's a good idea,
20:30 < Patrice> did not think about that!
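[Sketch, not part of the log] wizzy's suggestion sits between a title-only index and full-text search. Below is a minimal sketch of indexing only the title and first paragraph, assuming articles are plain text with paragraphs separated by blank lines. `build_index` and `search` are hypothetical names, and this is not how WikiPock or Okawix implement search.

```python
import re
from collections import defaultdict


def build_index(articles):
    """Map each lowercase word in a title or first paragraph to the set of titles."""
    index = defaultdict(set)
    for title, text in articles.items():
        first_paragraph = text.split("\n\n", 1)[0]  # ignore everything after it
        for word in re.findall(r"\w+", (title + " " + first_paragraph).lower()):
            index[word].add(title)
    return index


def search(index, query):
    """Return the titles that contain every query word, sorted alphabetically."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


articles = {
    "Water": "Water is a chemical substance made of hydrogen and oxygen.\n\nIt covers most of the Earth.",
    "Paris": "Paris is the capital of France.\n\nIt lies on the Seine; water transport was important.",
}
index = build_index(articles)
print(search(index, "water"))
print(search(index, "capital france"))
```

Because only the first paragraph is indexed, the index stays far smaller than a full-text one, at the cost of missing terms that appear only later in an article.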
20:31 < pm27> Patrice: a full text search engine is possible
20:32 < Patrice> the next generation of microSD will provide 64GB and up to 1TB... so it will be possible to add full text search
20:32 < pm27> it s joke ?
20:33 < wizzy> Patrice: your limitation is CPU, or space for the index ?
20:33 < pm27> that is the mean than in African scholl they need more PC :)
20:34 < pm27> school
20:34 < Patrice> well, both are limitations
20:36 < wizzy> pm27: does your torrent download include pictures ?
20:36 < Patrice> no pictures of cours!
20:36 < Patrice> of course
20:36 < pm27> no but it s possible with the software to add the pictures
20:37 < wizzy> and is it from the current wikipedia ? or an earlier snapshot ?
20:37 < Patrice> martin proposed to build a stripdown version (top 30K articles) with images. That's something possible
20:37 < pm27> we update every two months
20:37 < Patrice> we produce a snapshot every 3/4 months
20:38 < wizzy> Patrice: how do you choose your top 30k ?
20:38 < walkerma> Patrice: Yes - I think in 2010 we can start to make 30k selections quite regularly
20:39 < Patrice> from http://toolserver.org/~cbm/release-data/2008-9-13/HTML/
20:40 -!- fajro [n=fajro@Wikimedia/Fajro] has joined #wikimedia-strategy
20:40 < Patrice> but martin told me there is a new bot in the works for the top 30K articles.
20:40 < pm27> Patrice you come from Paris ?
20:41 < Patrice> that's right. we started wikipock in Paris but then we relocated the company to San Francisco
20:42 < pm27> that's too bad, I'm from Paris :)
20:42 < walkerma> Since we're discussing the 30k selection, could I seize the opportunity to move onto our "official" agenda item - how to select articles and versions?
20:43 < wizzy> is there a new bot ? what is new about it ?
20:43 < walkerma> As most of you know, it's a big interest of mine
20:43 < walkerma> wizzy: http://en.wikipedia.org/wiki/xxxxxxxxxxxxxxx/Second_generation
20:44 < walkerma> Can I ask people here to keep this under their "hats" until it is officially announced?
20:44 < walkerma> It's still being tested
20:44 < walkerma> http://toolserver.org/~enwp10/
20:44 < Philippe|Wiki> ahem...
20:44 < Philippe|Wiki> i would remind you that it's a public log
20:44 < Philippe|Wiki> :-)
20:46 < walkerma> Philippe - I understand, it's public knowledge that we are testing it, and it was even in the SignPost - all I ask is that we don't make big announcements on our projects
20:46 < Philippe|Wiki> Ah, okay then. :)
20:46 < Philippe|Wiki> thanks for clarification.
20:46 < Patrice> 2TB memory cards: http://www.sdcard.org/home
20:46 < pm27> Patrice: a full search engine for en.wikipedia would need just 1.5 G
20:46 < walkerma> Perhaps the URL could be kept out of the log just in case (not sure if it's OK)
20:47 < Philippe|Wiki> walkerma, I'm comfortable with that.
20:47 < hejko> walkerma: will the tool provide a XML file with all assessment scares and the hit score?
20:47 < hejko> scores
20:47 -!- Huib|AFK is now known as Huib
20:47 < Patrice> I  1.5 G the size of the index only, right?
20:47 < walkerma> Maybe scares is the right word - some porn articles have quite scary scores!
20:47 < Patrice> Is 1.5 G the size of the index only, right
20:47 < pm27> yes of course Patrice
20:48 < walkerma> hejko: Not sure, but I think it does
20:48 < walkerma> It will be updating and recalculating importance scores much more regularly than the old system
20:49 < walkerma> The quality assessments are done much faster
20:49 < Patrice> i have not found on http://toolserver.org/~cbm/release-data/2008-9-13/HTML/ the top 30K articles in CSV format. how can i get that?
20:49 < walkerma> You can also easily now pull out a selection, say, of articles that appear under France and under chemistry
20:50 < walkerma> Kelson: Can you answer Patrice there?
20:50 < hejko> great, this combined with articles as members of the Outline Of Knowledge category tree should allow to create tools which offer to create custom selections automatically.
20:50 < hejko> http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Outline_of_Knowledge
20:50 < Kelson> walkerma: hmmm... no idea... Carl should know about that ;)
20:51 -!- Netsplit crichton.freenode.net <-> irc.freenode.net quits: Az1568_
20:51 < walkerma> OK, thanks.  Carl is the developer who wrote the new bot, and the old SelectionBot too
20:52 < walkerma> Anyway - on en:WP and on fr:WP, the Wikipedia communities have WikiProjects that tag article talk pages with their Project template
20:52 < walkerma> That template includes a quality rating, and sometimes an importance rating as well
20:53 < walkerma> The bot compiles the data - from over 2 million assessed articles in the case of en:WP - and puts it in a searchable form, a form that can be organised for offline use
20:54 < pm27> Hi Kelson
20:54 < Kelson> pm27: yo
20:54 < pm27> Patrice, are you in Paris at the moment ?
20:54 < walkerma> It requires >1 thousand WikiProjects to keep on assessing articles, but they love using the bot because it allows them to see what they have in their subject area
20:54 < walkerma> Though some assessments get out of date
20:55 < Patrice> pm27: Not at the moment, but I'll be in Paris in mid-January. We can have lunch together.
20:56 < walkerma> Take a look at http://toolserver.org/xxxxxxxxxxx/list2.fcgi so you can see how to generate a selection on the fly (Philippe - please remove that URL from the log for now!)
20:56 < wizzy> who maintains the Outline pages ? Seems a quite thankless job
20:56 < Philippe|Wiki> walkerma: yep, noted.
20:56 -!- Netsplit over, joins: Az1568_
20:56 < walkerma> wizzy: I don't understand your question - what do you mean by Outline pages?
20:57 < wizzy> http://en.wikipedia.org/wiki/Outline_of_water for instance
20:57 < walkerma> Though it's true that much of the 1.0 work is thankless - especially the Version selection!
20:57 < wizzy> heh
20:58 < wizzy> walkerma: nice page at list2.fcgi 
20:58 < walkerma> According to the talk page, it's http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Outline_of_Knowledge
20:58 < pm27> Patrice, in private
20:59 < walkerma> When the old 1.0 bot was introduced, it made a big change in how articles were organised.  I think this new bot will likewise have a big impact, though how the community uses it, we'll have to wait & see
21:00 -!- Netsplit crichton.freenode.net <-> irc.freenode.net quits: Az1568_
21:01 < walkerma> But it may make it VERY easy for a WikiProject to produce a selection in a few minutes - say on French chemists - then make that selection available via PediaPress, Okawix and WikiPock the same day?
21:01 < wizzy> that would be great
21:02 < walkerma> What I'd like to ask is two questions:
21:02 < hejko> maybe the script at list2.fcgi could offer to emit book outlines that are compatible with the book tool's stored books format
21:03 < walkerma> 1. Are there other ways for making article selections, ways we haven't thought of, that might work
21:03 < walkerma> 2. Could tools like this be adapted to other projects where size is an issue?  Or is WP the only project where the size is really a limiting factor?
21:04 < walkerma> hejko: Point taken
21:05 < Kelson> walkerma: I have a fully automated solution to sort articles for every WMF project based on the importance with a method similar to Carl's one
21:06 < wizzy> to answer 1), we mentioned before that quality metadata is hard to find, so doing things like What-links-here gets out of hand
21:06 < walkerma> Kelson: That sounds great! Can you tell us about it?
21:06 < hejko> walkerma: regarding 1) I'd like to combine the WP1.0 scores with a centrality algorithm to create closed (?) collections on major topics.  http://en.wikipedia.org/wiki/Centrality
21:07 < Kelson> walkerma: sure? I think I did that in one of the last IRC meetings, is it necessary to do it again?
21:07 < walkerma> Sounds very interesting
21:07 < Kelson> walkerma: I propose that I write a small doc... and we can speak about it during the next meeting if someone has additional questions ?
21:07 < walkerma> OK
21:08 < Kelson> walkerma: but all the dumps I have done are based on these scripts
21:09 < walkerma> I'd like us to weight the links-in according to the score of every article linking in, as a further refinement of the current algorithm
21:09 < walkerma> But that's a detail, probably
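[Sketch, not part of the log] walkerma's refinement replaces a plain count of incoming links with a sum of the scores of the linking articles. Below is a minimal sketch, assuming each article already has a base score and that the link graph is available as a mapping. The names `links`, `base_scores` and `weighted_inlink_scores` are hypothetical, and this is not the selection bot's actual algorithm.

```python
def weighted_inlink_scores(links, base_scores):
    """For each article, sum the base scores of the articles that link to it."""
    totals = {title: 0.0 for title in base_scores}
    for source, targets in links.items():
        for target in set(targets):  # count each linking article once
            if target in totals:
                totals[target] += base_scores.get(source, 0.0)
    return totals


# Toy link graph: A links to B and C, B links to C.
links = {"A": ["B", "C"], "B": ["C"], "C": []}
base_scores = {"A": 3.0, "B": 1.0, "C": 2.0}
print(weighted_inlink_scores(links, base_scores))
```

A link from a high-scoring article then counts for more than a link from an obscure one, which is also the intuition behind the centrality measures hejko mentions.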
21:10 < walkerma> Kelson: We should probably talk on the phone soon (though I'm away for Christmas)
21:11 < walkerma> Patrice:You've worked with WikiQuote and Wiktionary on the cellphone - is the size of these a limiting factor, or not?
21:12 < Kelson> walkerma: I'm at home during the next two weeks
21:12 < Patrice> No, size is not really an issue for Wikiquote and Wiktionary
21:13 < walkerma> The other big issue in creating collections is selecting the right article VERSIONS
21:13 < walkerma> With Version 0.7, we found in our 30,000 articles many vandalised versions in our collection
21:14 < walkerma> wizzy wrote a nice script to help us find them, but it took about 100 hours of my time over six months - very boring work - to manually locate them
21:14 < walkerma> and to replace them
21:14 < wizzy> you might be able to do a history analysis to find a stable version ?
21:14 < walkerma> Clearly that isn't sustainable, and also it means our collections are stale
21:15 < walkerma> wizzy - yes, that would certainly help
21:15 < wizzy> also a job for the 'bot ?
21:15 < walkerma> At Wikimania I spoke extensively with Luca de Alfaro, who wrote the WikiTrust extension for WP
21:16 < walkerma> wizzy: Can you explain how that might work?
21:16 < hejko> what we did in the past was: taking the full dump and create a list of frequent editors. we then for each article selected the last version that was edited by a frequent editor.
21:17 < hejko> I think in the future the wikitrust project will help to select the right version.
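[Sketch, not part of the log] hejko's earlier approach has two steps: count edits per editor across the dump, then for each article take the newest revision made by a frequent editor. Below is a minimal sketch, assuming revision histories are already extracted as (revision id, editor) pairs ordered oldest to newest. The threshold `min_edits` and all names are hypothetical, and this is not PediaPress's code.

```python
from collections import Counter


def last_revision_by_frequent_editor(histories, min_edits):
    """Pick, per article, the newest revision whose editor has at least min_edits edits."""
    # Step 1: count edits per editor over the whole dump.
    edit_counts = Counter(
        editor for revs in histories.values() for _, editor in revs
    )
    frequent = {editor for editor, n in edit_counts.items() if n >= min_edits}

    # Step 2: walk each history backwards to the newest frequent-editor revision.
    chosen = {}
    for title, revs in histories.items():
        chosen[title] = next(
            (rev_id for rev_id, editor in reversed(revs) if editor in frequent),
            None,  # no frequent editor ever touched this article
        )
    return chosen


histories = {
    "Water": [(1, "alice"), (2, "bob"), (3, "vandal")],
    "Paris": [(4, "alice"), (5, "bob"), (6, "alice")],
}
print(last_revision_by_frequent_editor(histories, min_edits=2))
```

In the toy data the one-off edit to "Water" is skipped in favour of the previous revision, while "Paris" keeps its latest revision.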
21:17 < Philippe|Wiki> interesting... i wondered how that article selection was done
21:18 -!- aude [n=chatzill@wikipedia/Aude] has quit ["ChatZilla 0.9.85 [Firefox 3.5.6/20091201220228]"]
21:18 < Kelson> yes... I also currently see only wikitrust to help us quickly build a good and sustainable solution
21:18 < walkerma> http://wikitrust.soe.ucsc.edu/
21:18 < wizzy> it could look at the history, and find a version that was left alone for a few days, or reverted ?
21:19 < walkerma> I have drafted an outline grant proposal to collaborate with Luca on writing code - to choose the "most trustworthy recent version" of each article in a selection
21:19 < walkerma> But he needs a complete dump that includes full article history
21:20 < wizzy> couldn't a bot peek into the history, make a judgement, and link it on the talk page ?
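[Sketch, not part of the log] wizzy's history analysis can be read as: find the newest revision that stayed current for a few days before someone changed it. Below is a minimal sketch, assuming the history is a list of (revision id, timestamp) pairs ordered oldest to newest. The three-day threshold and the function name are assumptions, and revert detection is left out.

```python
from datetime import datetime, timedelta


def latest_stable_revision(revisions, snapshot_time, min_age=timedelta(days=3)):
    """Return the newest revision that remained current for at least min_age."""
    # A revision stops being current when the next one is saved;
    # the last revision is current until the snapshot is taken.
    ends = [ts for _, ts in revisions[1:]] + [snapshot_time]
    for (rev_id, start), end in reversed(list(zip(revisions, ends))):
        if end - start >= min_age:
            return rev_id
    return None  # nothing survived long enough


revs = [
    (10, datetime(2009, 12, 1)),
    (11, datetime(2009, 12, 10)),
    (12, datetime(2009, 12, 21, 23, 0)),  # edited one hour before the snapshot
]
print(latest_stable_revision(revs, snapshot_time=datetime(2009, 12, 22)))
```

Here revision 12 is too fresh to trust, so the sketch falls back to revision 11, which stood for more than eleven days.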
21:20 -!- Netsplit over, joins: Az1568_
21:22 < walkerma> WikiTrust assigns a score to each author, secretly, that shows how often that author has been reverted.  An author with a lot of unreverted edits builds up a high score of trust
21:22 < walkerma> Someone who is a vandal will normally be very obvious with such a scoring, as they will have very low trust
21:23 < walkerma> The actual text of each article is marked up with its own trust rating based on who contributed the text
21:24 < Philippe|Wiki> Hmmm, I'm sure there's an obvious reason, but... why is the score secret?
21:24 < walkerma> We think we could come up with an overall score for each version, based on adding up those trust scores for all the text - then find the most "trusted version"
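[Sketch, not part of the log] The idea walkerma outlines is to combine per-word trust values into one score per version and pick the best recent version. Below is a minimal sketch, assuming each version comes with a list of per-word trust numbers. It averages rather than sums so that longer versions are not favoured, which is an assumption of this sketch. The names are hypothetical and this is not WikiTrust's API or algorithm.

```python
def version_trust(word_trust):
    """Average per-word trust for one version; an empty version scores 0."""
    return sum(word_trust) / len(word_trust) if word_trust else 0.0


def most_trusted_version(versions, recent=5):
    """Among the `recent` newest versions, return the id with the highest score.

    versions: [(rev_id, [trust per word, ...]), ...] ordered oldest to newest.
    Ties go to the newer version.
    """
    if not versions:
        return None
    candidates = versions[-recent:]
    # Iterate newest first so max() keeps the newer version on a tie.
    best = max(reversed(candidates), key=lambda v: version_trust(v[1]))
    return best[0]


versions = [
    (1, [9, 9, 8]),
    (2, [9, 9, 8, 0, 0]),  # freshly added, untrusted text lowers the score
    (3, [9, 9, 8, 7]),
]
print(most_trusted_version(versions))
print(most_trusted_version(versions, recent=2))
```

The `recent` window is what keeps the result a "most trustworthy recent version" rather than an arbitrarily old one.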
21:24 < walkerma> Philippe - That was the subject of the discussion at WikiMania!
21:24 < Philippe|Wiki> For instance, if I'm evaluating things, I'd like to see  that walkerma is a 99 and hejko is a 4, for instance.  (Sorry, hejko)
21:26 < walkerma> I forget all the reasoning, but there are some very good reasons why it needs to be kept from general reading - though there was debate about whether or not admins should see the scores
21:26 < Philippe|Wiki> I just ask because the Quality task force is kicking around a similar issue :)
21:26 < Philippe|Wiki> As is Community Health, I believe
21:27 < walkerma> But you can indirectly find your score simply by looking at the marked up version of an article and seeing what colour your contribution is marked with?
21:27 < Philippe|Wiki> but here's a dumb thought, maybe I should RTFM? :)  You gave a nice URL above.
21:27 < walkerma> !
21:28 < walkerma> Philippe: Luca is limited in what he can do with this now, but I'm pretty sure the WMF plans to make the WikiTrust extension official in the next few months
21:28 < Philippe|Wiki> Thanks, walkerma :)
21:29 -!- Huib is now known as Huib|BezigeBij
21:29 < walkerma> And I think it's really a very nice system.  Luca is a professor at one of the leading computer science depts in the US, so I'm sure he knows his stuff!
21:29 < Philippe|Wiki> Absolutely. :)  I have a high level of wikitrust for him. :)
21:30 < walkerma> (Though that status was before the budget crisis in California :( )
21:30 < Philippe|Wiki> fair point.
21:30 < walkerma> wizzy has mentioned simpler ways of doing article selection
21:31 < walkerma> But what about making selections that are safe for children?  That's much harder, I think
21:31 < walkerma> Do people here have any ideas/thoughts on how we can approach the article selection issue, besides using WikiTrust?
21:32 < walkerma> BTW, I will have to go soon
21:32 < Philippe|Wiki> If I may.... while it's fascinating, I think article selection for particular usages is tactical rather than strategic:)
21:32 < wizzy> wikitrust looks great. I just tried it
21:33 < walkerma> Philippe - if we can't get selections that are suitable for children, that is a big issue that affects our whole strategy, I think
21:34 < walkerma> especially as we have picked schools as a major conduit
21:34 < Philippe|Wiki> Fair point :)
21:34 < Philippe|Wiki> I stand corrected
21:35 < walkerma> wizzy: I think you've mentioned embarrassing issues on your blog - kids finding articles on porn.  How much of an issue is that in South African culture
21:35 < walkerma> ?
21:37 < wizzy> I don't think it is culture-specific
21:37 < walkerma> OK
21:38 < walkerma> I think often Wikipedians underestimate the importance of such issues in the general public
21:38 < wizzy> This came up when I put a snapshot of the whole en wikipedia (when it was 'only' 20 Gig or so, with pics)
21:39 < walkerma> Because many Wikipedians - esp in the US - are quite libertarian, and many don't have kids.  I have two young daughters, so it's a concern for me
21:39 < wizzy> yes, it is definitely a problem
21:40 < wizzy> perhaps the pics need to be rated nsfw
21:40 < walkerma> Do you think blacklisting articles and perhaps whole categories is the appropriate solution?
21:40 < wizzy> I think it is pics that are the problem mostly - but how to build a bomb might also count
21:41 < walkerma> Yes - pictures are a real problem!  There is the famous story of when a kid was writing his history paper for school on Cortes, and someone had replaced Cortes' picture with an image of a giant penis
21:41 < walkerma> The article text was fine, I'm sure...!
21:41 < wizzy> My nephew in the 'states got into serious trouble for downloading the Anarchists Cookbook
21:43 < walkerma> OK, I will need to go
21:44 < Philippe|Wiki> Is there someone who was here for the full meeting that can send me a full log?  Otherwise, I can go with mine, which is missing about the first three minutes because of the netsplit.
21:44 < wizzy> Strategybot has it ?
21:44 < walkerma> I think we all need to think about these issues, because although they're not as "sexy" as some of the technical issues, any offline collection must ask the question - "What should we put in it"?
21:44 < Philippe|Wiki> If he's been behaving, yes :-)
21:44 < Philippe|Wiki> I should check that first though, wizzy
21:44 < walkerma> I have a complete log, I think - I can email that to you
21:44 < wizzy> I have a log too
21:44 < Philippe|Wiki> that would be great, walkerma, just in case... because I know I saw the bot on both sides of the netsplit.
21:45 < Philippe|Wiki> philippe@wikimedia.org plz :)
21:45 < wizzy> thanks walkerma 
21:46 < walkerma> Let's aim to meet at the same time next week - tentatively.  I will be at my in-laws, but I'll post something on the wiki
21:46 < Philippe|Wiki> I will be unable to be here, but you don't need me :)
21:46 < walkerma> And Philippe - now my grades are in I will start posting more summaries!
21:46 < Philippe|Wiki> I'm on vacation the first two days of the week next week.
21:46 < Philippe|Wiki> Thanks, walkerma :)
21:46 < walkerma> Many people will be travelling so it will be a small meeting, I think, but we'll pick things up in Jan again - near to our deadline
21:46 < walkerma> Must go
21:47 < Philippe|Wiki> Thanks, all
21:47 < Philippe|Wiki> :)
21:47 < walkerma> Thanks all, and thanks Patrice for joining us!
21:47 -!- walkerma [n=chatzill@cpe-74-71-218-114.twcny.res.rr.com] has quit ["ChatZilla 0.9.86 [Firefox 3.0.16/2009120208]"]
21:47 -!- peteforsyth [n=petefors@wikipedia/peteforsyth] has quit []
21:48 -!- Philippe|Wiki [n=Philippe@wikimedia/Philippe] has quit []
21:49 < pm27> but i really don't know what we need a task force ?
21:50 < pm27> There is one solution of the storage of the data
21:50 < pm27> one solution to generating dump
21:51 < pm27> one solution for the search engine
21:51 < pm27> ...
21:52 < pm27> just waiting the fondation mozilla and not find alternative solution to use wikipedia one a phone
21:52 < pm27> Kelson: ?
21:53 < pm27> and also for mobile phone there is online solution so wait
21:56 < pm27> there is a lot of work make by some person so just waiting
21:56 < pm27> http://download.wikipedia.org/dvd.html