Task force/Offline/IRC/2009-12-22
20:01 < walkerma> Shall we start? Ah - Amgine - glad to see you! wizzy - I'm really glad you're here
20:02 < Amgine_> Okay, I'm just leaving the hotel, so I won't be able to participate.
20:02 < Amgine_> Just wanted to let y'all know that...
20:02 < walkerma> OK, thanks - will you be able to participate later?
20:02 < Amgine_> <will read the log tomorrow if I have internet.>
20:03 < Amgine_> 18 hours road time today.
20:03 -!- Amgine_ [n=Amgine@wikinews/Amgine] has quit [Client Quit]
20:03 < walkerma> OK, I understand. Kelson is involved with the OpenZIM format and also with the French offline project
20:03 -!- hejko [n=hejko@dslb-084-058-089-137.pools.arcor-ip.net] has joined #wikimedia-strategy
20:04 < walkerma> hejko is one of the founders of PediaPress
20:04 < Kelson> hi
20:04 < hejko> hi all
20:04 < walkerma> http://pediapress.com/
20:04 -!- FT2 [n=FT2@wikipedia/ft2] has quit [Nick collision from services.]
20:04 < walkerma> Hi! Patrice is a founder of WikiPock
20:04 -!- FT2-away [n=FT2@wikipedia/ft2] has joined #wikimedia-strategy
20:04 -!- yannf [n=yannf@wikipedia/yannf] has quit ["Like the Truth, I am elsewhere..."]
20:05 < walkerma> http://www.wikipock.com/
20:05 -!- Netsplit crichton.freenode.net <-> irc.freenode.net quits: Philippe|Wiki
20:06 < walkerma> OK, since Patrice is here, I'd like to begin by asking him how WikiPock might be used in developing countries
20:06 < Patrice> .
20:07 < walkerma> We've talked a lot about how cellphones are ubiquitous now in many places, but good internet is not
20:07 < walkerma> Any thoughts?
20:07 < hejko> Patrice: we have no idea what the specifications of the next generation of low cost mobile phones will be. do you have any insight here?
20:08 -!- yannf [n=yannf@wikipedia/yannf] has joined #wikimedia-strategy
20:09 -!- Netsplit over, joins: Philippe|Wiki
20:09 < Philippe|Wiki> apologies for my tardiness, I was caught on the other side of a channel split.
20:09 < wizzy> I would also like to ask hejko and Patrice what their starting point for collections is - HTML, or database dumps ?
20:09 -!- peteforsyth [n=petefors@wikipedia/peteforsyth] has joined #wikimedia-strategy
20:09 < hejko> MW-API :)
20:10 -!- FT2 [n=FT2@wikipedia/ft2] has joined #wikimedia-strategy
20:10 < wizzy> hejko: URL ?
20:10 < hejko> we use wiki text and a python library to parse it into a document tree
20:11 < hejko> http://en.wikipedia.org/w/api.php
20:11 -!- pm27 [i=507dac3c@gateway/web/freenode/session] has joined #wikimedia-strategy
20:11 < hejko> http://code.pediapress.com/
20:11 < pm27> hello all
20:12 < walkerma> Hi pm27! pm27 is president of Linterweb, which produced en:Version0.5 and is producing 0.7, on Okawix
20:12 < Patrice> we just released a symbian version for nokia phones
20:12 < walkerma> We were just asking Patrice two questions at once
20:12 < pm27> president is a very big word :)
20:13 < walkerma> http://www.linterweb.fr/
20:14 < walkerma> Patrice - are you there?
20:14 < wizzy> walkerma: were you saying last week that wikipock does not render tables ?
20:15 < walkerma> wizzy - I was saying that they DO render tables, but the version I have doesn't
20:15 < pm27> http://blog.wikiwix.com/fr/category/okawix/ for information on what we do offline
20:15 < Patrice> yes, i'm here
20:15 < walkerma> and Patrice can tell us about their Version 2 system
20:15 < Patrice> Our V2 data format is in Beta testing now
20:16 < walkerma> wizzy distributes Wikipedia to schools and villages in South Africa. He's also a proud owner of a nice cellphone, like many South Africans
20:16 < Patrice> The new format is quicker and more compressed
20:16 < Patrice> 3.1 million articles under 4GB
20:16 < pm27> which format do you use Patrice ?
20:16 < pm27> ZIM ?
20:17 < wizzy> Patrice: do you have search ? how do you do it ?
20:17 < Patrice> no. we are using a proprietary format (soon open source). We started working on the technology 18 months ago; at that time ZIM was not available
20:18 < pm27> I ask this question because it's very hard to have all the content Patrice
20:18 < Patrice> what do you mean by that (all the content)?
20:19 < pm27> see http://www.okawix.com/?page=torrent&lang=en
20:19 < pm27> all languages and all projects
20:21 < Patrice> is this url working for you ?
20:21 < pm27> I manage the okawix project
20:22 < wizzy> yes, that URL works for me
20:22 < walkerma> pm27: It seems to work for me - though I haven't actually completed a download
20:23 -!- Netsplit crichton.freenode.net <-> irc.freenode.net quits: Az1568_
20:23 < walkerma> Patrice - can you tell us about searches on WikiPock?
20:23 < Patrice> about WikiPock and the table rendering issue: WikiPock format V2 will render tables
20:23 < pm27> Patrice it's the torrent link
20:24 < Patrice> Our goal was to search and read Wikipedia offline on mobile phones when cpu and memory are very limited
20:24 < wizzy> pm27: that will download a ZIM file that the okawix reader can read ?
20:24 < Patrice> filesystems on blackberry or symbian OS do not perform that well
20:25 < pm27> not yet wizzy, but we are working on it
20:25 < Patrice> so we really spent a lot of time optimizing the technology for mobiles
20:25 < wizzy> what does it download then ?
20:26 < pm27> wizzy: zeno format
20:26 < Patrice> of course the technology also works for Mac/PC
20:26 < pm27> http://blog.wikiwix.com/en/2009/12/07/okawix-et-openzim/ wizzy
20:26 -!- FT2 [n=FT2@wikipedia/ft2] has quit [Read error: 110 (Connection timed out)]
20:26 < wizzy> and that has a builtin index for search ? (zeno)
20:27 < Patrice> i talked with openZim about the problems and optimizations we had to implement to work around the mobile phones' limitations
20:27 -!- Netsplit over, joins: Az1568_
20:27 < pm27> it's our own search engine, the smae as used for wikiwix
20:28 < pm27> same
20:28 < Patrice> we do not use full text search, the index will not run/fit on mobiles
20:29 < pm27> the idea is to make a dump of our online index, ready to use in offline mode
20:29 < Patrice> we only use a title index
20:30 < wizzy> Have you considered doing title and first paragraph search ?
20:30 < pm27> like the wikireader of openmoko Patrice
20:30 < Patrice> that's a good idea,
20:30 < Patrice> did not think about that!
20:31 < pm27> Patrice: a full text search engine is possible
20:32 < Patrice> the next generation of microSD will provide 64GB and up to 1TB... so it will be possible to add full text search
20:32 < pm27> is it a joke ?
20:33 < wizzy> Patrice: your limitation is CPU, or space for the index ?
20:33 < pm27> that means that in African scholl they need more PCs :)
20:34 < pm27> school
20:34 < Patrice> well both are limitations
20:36 < wizzy> pm27: does your torrent download include pictures ?
20:36 < Patrice> no pictures of cours!
20:36 < Patrice> of course
20:36 < pm27> no, but it's possible with the software to add the pictures
20:37 < wizzy> and is it from the current wikipedia ? or an earlier snapshot ?
20:37 < Patrice> martin proposed to build a stripped-down version (top 30K articles) with images. That's something possible
20:37 < pm27> we update every two months
20:37 < Patrice> we produce a snapshot every 3/4 months
20:38 < wizzy> Patrice: how do you choose your top 30k ?
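The title-only index Patrice describes, and the "title and first paragraph" variant wizzy suggests, can be illustrated with a minimal Python sketch. This is not WikiPock's or Okawix's implementation (neither format is described in the log); the function names and the in-memory data layout are assumptions chosen for illustration only.

```python
import bisect
import re
from collections import defaultdict


def build_title_index(titles):
    """Return (lowercased title, original title) pairs sorted for binary search."""
    return sorted((title.lower(), title) for title in titles)


def prefix_search(index, prefix, limit=10):
    """Return up to `limit` titles that start with `prefix` (case-insensitive)."""
    key = prefix.lower()
    start = bisect.bisect_left(index, (key, ""))
    hits = []
    for lowered, original in index[start:start + limit]:
        if not lowered.startswith(key):
            break
        hits.append(original)
    return hits


def build_lead_index(articles):
    """Map each word of the title and first paragraph to the titles containing it.

    `articles` is a hypothetical dict of title -> plain text, with paragraphs
    separated by blank lines.
    """
    index = defaultdict(set)
    for title, text in articles.items():
        lead = text.split("\n\n", 1)[0]
        for word in re.findall(r"\w+", (title + " " + lead).lower()):
            index[word].add(title)
    return index


def lead_search(index, query):
    """Return sorted titles whose title or first paragraph contains every query word."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    return sorted(set.intersection(*(index.get(word, set()) for word in words)))
```

The lead index only stores words from the title and first paragraph, so it stays far smaller than a full-text index, which is the trade-off raised in the discussion about CPU and storage limits on phones.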
20:38 < walkerma> Patrice: Yes - I think in 2010 we can start to make 30k selections quite regularly
20:39 < Patrice> from http://toolserver.org/~cbm/release-data/2008-9-13/HTML/
20:40 -!- fajro [n=fajro@Wikimedia/Fajro] has joined #wikimedia-strategy
20:40 < Patrice> but martin told me there is a new bot in the works for the top 30K articles.
20:40 < pm27> Patrice you come from Paris ?
20:41 < Patrice> that's right. we started wikipock in Paris but then we relocated the company to San Francisco
20:42 < pm27> that's too bad, I'm from Paris :)
20:42 < walkerma> Since we're discussing the 30k selection, could I seize the opportunity to move onto our "official" agenda item - how to select articles and versions?
20:43 < wizzy> is there a new bot ? what is new about it ?
20:43 < walkerma> As most of you know, it's a big interest of mine
20:43 < walkerma> wizzy: http://en.wikipedia.org/wiki/xxxxxxxxxxxxxxx/Second_generation
20:44 < walkerma> Can I ask people here to keep this under their "hats" until it is officially announced?
20:44 < walkerma> It's still being tested
20:44 < walkerma> http://toolserver.org/~enwp10/
20:44 < Philippe|Wiki> ahem...
20:44 < Philippe|Wiki> i would remind you that it's a public log
20:44 < Philippe|Wiki> :-)
20:46 < walkerma> Philippe - I understand, it's public knowledge that we are testing it, and it was even in the SignPost - all I ask is that we don't make big announcements on our projects
20:46 < Philippe|Wiki> Ah, okay then. :)
20:46 < Philippe|Wiki> thanks for clarification.
20:46 < Patrice> 2TB memory cards: http://www.sdcard.org/home
20:46 < pm27> Patrice: a full search engine for en.wikipedia would need just 1.5 G
20:46 < walkerma> Perhaps the URL could be kept out of the log just in case (not sure if it's OK)
20:47 < Philippe|Wiki> walkerma, I'm comfortable with that.
20:47 < hejko> walkerma: will the tool provide an XML file with all assessment scares and the hit score?
20:47 < hejko> scores
20:47 -!- Huib|AFK is now known as Huib
20:47 < Patrice> I 1.5 G the size of the index only, right?
20:47 < walkerma> Maybe scares is the right word - some porn articles have quite scary scores!
20:47 < Patrice> Is 1.5 G the size of the index only, right
20:47 < pm27> yes of course Patrice
20:48 < walkerma> hejko: Not sure, but I think it does
20:48 < walkerma> It will be updating and recalculating importance scores much more regularly than the old system
20:49 < walkerma> The quality assessments are done much faster
20:49 < Patrice> i have not found on http://toolserver.org/~cbm/release-data/2008-9-13/HTML/ the top 30K articles in CSV format. how can i get that?
20:49 < walkerma> You can also easily now pull out a selection, say, of articles that appear under France and under chemistry
20:50 < walkerma> Kelson: Can you answer Patrice there?
20:50 < hejko> great, this combined with articles as members of the Outline Of Knowledge category tree should allow us to create tools which offer to create custom selections automatically.
20:50 < hejko> http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Outline_of_Knowledge
20:50 < Kelson> walkerma: hmmm... no idea... Carl should know about that ;)
20:51 -!- Netsplit crichton.freenode.net <-> irc.freenode.net quits: Az1568_
20:51 < walkerma> OK, thanks. Carl is the developer who wrote the new bot, and the old SelectionBot too
20:52 < walkerma> Anyway - on en:WP and on fr:WP, the Wikipedia communities have WikiProjects that tag article talk pages with their Project template
20:52 < walkerma> That template includes a quality rating, and sometimes an importance rating as well
20:53 < walkerma> The bot compiles the data - from over 2 million assessed articles in the case of en:WP - and puts it in a searchable form, a form that can be organised for offline use
20:54 < pm27> Hi Kelson
20:54 < Kelson> pm27: yo
20:54 < pm27> Patrice, are you in Paris at the moment ?
20:54 < walkerma> It requires >1 thousand WikiProjects to keep on assessing articles, but they love using the bot because it allows them to see what they have in their subject area
20:54 < walkerma> Though some assessments get out of date
20:55 < Patrice> pm27: Not at the moment, but I'll be passing through Paris in mid-January. We can have lunch together.
20:56 < walkerma> Take a look at http://toolserver.org/xxxxxxxxxxx/list2.fcgi so you can see how to generate a selection on the fly (Philippe - please remove that URL from the log for now!)
20:56 < wizzy> who maintains the Outline pages ? Seems a quite thankless job
20:56 < Philippe|Wiki> walkerma: yep, noted.
20:56 -!- Netsplit over, joins: Az1568_
20:56 < walkerma> wizzy: I don't understand your question - what do you mean by Outline pages?
20:57 < wizzy> http://en.wikipedia.org/wiki/Outline_of_water for instance
20:57 < walkerma> Though it's true that much of the 1.0 work is thankless - especially the Version selection!
20:57 < wizzy> heh
20:58 < wizzy> walkerma: nice page at list2.fcgi
20:58 < walkerma> According to the talk page, it's http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Outline_of_Knowledge
20:58 < pm27> Patrice, in private
20:59 < walkerma> When the old 1.0 bot was introduced, it made a big change in how articles were organised. I think this new bot will likewise have a big impact, though how the community uses it, we'll have to wait & see
21:00 -!- Netsplit crichton.freenode.net <-> irc.freenode.net quits: Az1568_
21:01 < walkerma> But it may make it VERY easy for a WikiProject to produce a selection in a few minutes - say on French chemists - then make that selection available via PediaPress, Okawix and WikiPock the same day?
21:01 < wizzy> that would be great
21:02 < walkerma> What I'd like to ask is two questions:
21:02 < hejko> maybe the script at list2.fcgi could offer to emit book outlines that are compatible with the book tool's stored books format
21:03 < walkerma> 1. Are there other ways for making article selections, ways we haven't thought of, that might work
21:03 < walkerma> 2. Could tools like this be adapted to other projects where size is an issue? Or is WP the only project where the size is really a limiting factor?
21:04 < walkerma> hejko: Point taken
21:05 < Kelson> walkerma: I have a fully automated solution to sort articles for every WMF project based on importance, with a method similar to Carl's
21:06 < wizzy> to answer 1), we mentioned before that quality metadata is hard to find, so doing things like What-links-here gets out of hand
21:06 < walkerma> Kelson: That sounds great! Can you tell us about it?
21:06 < hejko> walkerma: regarding 1) I'd like to combine the WP1.0 scores with a centrality algorithm to create closed (?) collections on major topics. http://en.wikipedia.org/wiki/Centrality
21:07 < Kelson> walkerma: sure? I think I did that in one of the last IRC meetings, is it necessary to repeat it?
21:07 < walkerma> Sounds very interesting
21:07 < Kelson> walkerma: I propose that I write a small doc... and we can talk about it during the next meeting if someone has additional questions ?
21:07 < walkerma> OK
21:08 < Kelson> walkerma: but all the dumps I have done are based on these scripts
21:09 < walkerma> I'd like us to weight the links-in according to the score of every article linking in, as a further refinement of the current algorithm
21:09 < walkerma> But that's a detail, probably
21:10 < walkerma> Kelson: We should probably talk on the phone soon (though I'm away for Christmas)
21:11 < walkerma> Patrice: You've worked with WikiQuote and Wiktionary on the cellphone - is the size of these a limiting factor, or not?
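walkerma's proposed refinement, weighting each incoming link by the score of the article it comes from, could look roughly like the Python sketch below. It does not reproduce the WP 1.0 bot's actual formula, which is not given in the log; the `links` and `scores` structures and the function name are hypothetical.

```python
from collections import defaultdict


def weighted_inlinks(links, scores, default=0.0):
    """Sum, for each target article, the scores of the articles linking to it.

    `links` maps a source title to the titles it links to; `scores` maps a
    title to its existing score. Both layouts are assumptions for this sketch.
    """
    totals = defaultdict(float)
    for source, targets in links.items():
        weight = scores.get(source, default)
        for target in set(targets):  # count each linking article only once
            if target != source:     # ignore self-links
                totals[target] += weight
    return dict(totals)
```

A plain links-in count treats a link from a stub the same as a link from a highly rated article; here a link from a high-scoring page contributes more. Applying the idea repeatedly would move it toward the centrality measures hejko mentions.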
21:12 < Kelson> walkerma: I'm at home during the next two weeks
21:12 < Patrice> No, size is not really an issue for wikiquote and wiktionary
21:13 < walkerma> The other big issue in creating collections is selecting the right article VERSIONS
21:13 < walkerma> With Version 0.7, we found in our 30,000 articles many vandalised versions in our collection
21:14 < walkerma> wizzy wrote a nice script to help us find them, but it took about 100 hours of my time over six months - very boring work - to manually locate them
21:14 < walkerma> and to replace them
21:14 < wizzy> you might be able to do a history analysis to find a stable version ?
21:14 < walkerma> Clearly that isn't sustainable, and also it means our collections are stale
21:15 < walkerma> wizzy - yes, that would certainly help
21:15 < wizzy> also a job for the 'bot ?
21:15 < walkerma> At Wikimania I spoke extensively with Luca de Alfaro, who wrote the WikiTrust extension for WP
21:16 < walkerma> wizzy: Can you explain how that might work?
21:16 < hejko> what we did in the past was: take the full dump and create a list of frequent editors. we then, for each article, selected the last version that was edited by a frequent editor.
21:17 < hejko> I think in the future the wikitrust project will help to select the right version.
21:17 < Philippe|Wiki> interesting... i wondered how that article selection was done
21:18 -!- aude [n=chatzill@wikipedia/Aude] has quit ["ChatZilla 0.9.85 [Firefox 3.5.6/20091201220228]"]
21:18 < Kelson> yes... I also currently see only wikitrust to help us quickly build a good and sustainable solution
21:18 < walkerma> http://wikitrust.soe.ucsc.edu/
21:18 < wizzy> it could look at the history, and find a version that was left alone for a few days, or reverted ?
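The two version-selection heuristics described above (hejko's "last version by a frequent editor" and wizzy's "version that was left alone for a few days") can be sketched in Python as follows. This is an illustration under assumptions, not PediaPress's code: the revision records (dicts with `id`, `user` and a Unix `timestamp`, ordered oldest to newest) and the thresholds are invented for the example, and wizzy's "or reverted" case is not handled.

```python
from collections import Counter

SECONDS_PER_DAY = 86400


def frequent_editors(histories, min_edits=3):
    """Return the set of users with at least `min_edits` revisions across all articles."""
    counts = Counter(rev["user"] for revs in histories.values() for rev in revs)
    return {user for user, n in counts.items() if n >= min_edits}


def last_revision_by_frequent_editor(revisions, trusted):
    """Return the newest revision made by a user in `trusted`, or None."""
    for rev in reversed(revisions):
        if rev["user"] in trusted:
            return rev
    return None


def last_stable_revision(revisions, now, min_age_days=3):
    """Return the newest revision that stood unchanged for `min_age_days`.

    A revision's lifetime runs until the next revision, or until `now`
    for the current one.
    """
    min_seconds = min_age_days * SECONDS_PER_DAY
    for i in range(len(revisions) - 1, -1, -1):
        end = revisions[i + 1]["timestamp"] if i + 1 < len(revisions) else now
        if end - revisions[i]["timestamp"] >= min_seconds:
            return revisions[i]
    return None
```

Both heuristics only need revision metadata, not article text, which is why they are cheap to run over a full history dump. Neither looks at what an edit actually changed.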
21:19 < walkerma> I have drafted an outline grant proposal to collaborate with Luca on writing code - to choose the "most trustworthy recent version" of each article in a selection
21:19 < walkerma> But he needs a complete dump that includes full article history
21:20 < wizzy> couldn't a bot peek into the history, make a judgement, and link it on the talk page ?
21:20 -!- Netsplit over, joins: Az1568_
21:22 < walkerma> WikiTrust assigns a score to each author, secretly, that shows how often that author has been reverted. An author with a lot of unreverted edits builds up a high score of trust
21:22 < walkerma> Someone who is a vandal will normally be very obvious with such a scoring, as they will have very low trust
21:23 < walkerma> The actual text of each article is marked up with its own trust rating based on who contributed the text
21:24 < Philippe|Wiki> Hmmm, I'm sure there's an obvious reason, but... why is the score secret?
21:24 < walkerma> We think we could come up with an overall score for each version, based on adding up those trust scores for all the text - then find the most "trusted version"
21:24 < walkerma> Philippe - That was the subject of the discussion at WikiMania!
21:24 < Philippe|Wiki> For instance, if I'm evaluating things, I'd like to see that walkerma is a 99 and hejko is a 4, for instance. (Sorry, hejko)
21:26 < walkerma> I forget all the reasoning, but there are some very good reasons why it needs to be kept from general reading - though there was debate about whether or not admins should see the scores
21:26 < Philippe|Wiki> I just ask because the Quality task force is kicking around a similar issue :)
21:26 < Philippe|Wiki> As is Community Health, I believe
21:27 < walkerma> But you can indirectly find your score simply by looking at the marked up version of an article and seeing what colour your contribution is marked with?
21:27 < Philippe|Wiki> but here's a dumb thought, maybe I should RTFM? :) You gave a nice URL above.
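The "overall score for each version" idea walkerma outlines can be sketched in Python as below. This is not WikiTrust's algorithm or API; it assumes, purely for illustration, that each revision already carries a list of `(word, trust)` pairs. By default the sketch averages the per-word trust instead of adding it up, because a plain sum favours longer versions; `normalize=False` gives the summed variant described in the discussion.

```python
def version_trust(tokens, normalize=True):
    """Score one revision from its (word, trust) pairs.

    With normalize=True the result is the mean trust per word; with
    normalize=False it is the plain sum of the per-word trust values.
    """
    if not tokens:
        return 0.0
    total = sum(trust for _word, trust in tokens)
    return total / len(tokens) if normalize else float(total)


def most_trusted_recent(revisions, window=5, normalize=True):
    """Return the best-scoring revision among the last `window` revisions.

    `revisions` is ordered oldest to newest. Candidates are scanned newest
    first, and max() keeps the first maximum it sees, so ties go to the
    newer revision.
    """
    recent = revisions[-window:]
    if not recent:
        return None
    return max(reversed(recent), key=lambda rev: version_trust(rev["tokens"], normalize))
```

Restricting the search to a window of recent revisions keeps the selection from drifting back to a very old but well-trusted version, which matters given the concern about stale collections.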
21:27 < walkerma> !
21:28 < walkerma> Philippe: Luca is limited in what he can do with this now, but I'm pretty sure the WMF plans to make the WikiTrust extension official in the next few months
21:28 < Philippe|Wiki> Thanks, walkerma :)
21:29 -!- Huib is now known as Huib|BezigeBij
21:29 < walkerma> And I think it's really a very nice system. Luca is a professor at one of the leading computer science depts in the US, so I'm sure he knows his stuff!
21:29 < Philippe|Wiki> Absolutely. :) I have a high level of wikitrust for him. :)
21:30 < walkerma> (Though that status was before the budget crisis in California :( )
21:30 < Philippe|Wiki> fair point.
21:30 < walkerma> wizzy has mentioned simpler ways of doing article selection
21:31 < walkerma> But what about making selections that are safe for children? That's much harder, I think
21:31 < walkerma> Do people here have any ideas/thoughts on how we can approach the article selection issue, besides using WikiTrust?
21:32 < walkerma> BTW, I will have to go soon
21:32 < Philippe|Wiki> If I may.... while it's fascinating, I think article selection for particular usages is tactical rather than strategic:)
21:32 < wizzy> wikitrust looks great. I just tried it
21:33 < walkerma> Philippe - if we can't get selections that are suitable for children, that is a big issue that affects our whole strategy, I think
21:34 < walkerma> especially as we have picked schools as a major conduit
21:34 < Philippe|Wiki> Fair point :)
21:34 < Philippe|Wiki> I stand corrected
21:35 < walkerma> wizzy: I think you've mentioned embarrassing issues on your blog - kids finding articles on porn. How much of an issue is that in South African culture
21:35 < walkerma> ?
21:37 < wizzy> I don't think it is culture-specific
21:37 < walkerma> OK
21:38 < walkerma> I think often Wikipedians underestimate the importance of such issues in the general public
21:38 < wizzy> This came up when I put a snapshot of the whole en wikipedia (when it was 'only' 20 Gig or so, with pics)
21:39 < walkerma> Because many Wikipedians - esp in the US - are quite libertarian, and many don't have kids. I have two young daughters, so it's a concern for me
21:39 < wizzy> yes, it is definitely a problem
21:40 < wizzy> perhaps the pics need to be rated nsfw
21:40 < walkerma> Do you think blacklisting articles and perhaps whole categories is the appropriate solution?
21:40 < wizzy> I think it is pics that are the problem mostly - but how to build a bomb might also count
21:41 < walkerma> Yes - pictures are a real problem! There is the famous story of when a kid was writing his history paper for school on Cortes, and someone had replaced Cortes' picture with an image of a giant penis
21:41 < walkerma> The article text was fine, I'm sure...!
21:41 < wizzy> My nephew in the 'states got into serious trouble for downloading the Anarchists Cookbook
21:43 < walkerma> OK, I will need to go
21:44 < Philippe|Wiki> Is there someone who was here for the full meeting that can send me a full log? Otherwise, I can go with mine, which is missing about the first three minutes because of the netsplit.
21:44 < wizzy> Strategybot has it ?
21:44 < walkerma> I think we all need to think about these issues, because although they're not as "sexy" as some of the technical issues, any offline collection must ask the question - "What should we put in it"?
21:44 < Philippe|Wiki> If he's been behaving, yes :-)
21:44 < Philippe|Wiki> I should check that first though, wizzy
21:44 < walkerma> I have a complete log, I think - I can email that to you
21:44 < wizzy> I have a log too
21:44 < Philippe|Wiki> that would be great, walkerma, just in case... because I know I saw the bot on both sides of the netsplit.
21:45 < Philippe|Wiki> philippe@wikimedia.org plz :)
21:45 < wizzy> thanks walkerma
21:46 < walkerma> Let's aim to meet at the same time next week - tentatively. I will be at my in-laws, but I'll post something on the wiki
21:46 < Philippe|Wiki> I will be unable to be here, but you don't need me :)
21:46 < walkerma> And Philippe - now my grades are in I will start posting more summaries!
21:46 < Philippe|Wiki> I'm on vacation the first two days of the week next week.
21:46 < Philippe|Wiki> Thanks, walkerma :)
21:46 < walkerma> Many people will be travelling so it will be a small meeting, I think, but we'll pick things up in Jan again - near to our deadline
21:46 < walkerma> Must go
21:47 < Philippe|Wiki> Thanks, all
21:47 < Philippe|Wiki> :)
21:47 < walkerma> Thanks all, and thanks Patrice for joining us!
21:47 -!- walkerma [n=chatzill@cpe-74-71-218-114.twcny.res.rr.com] has quit ["ChatZilla 0.9.86 [Firefox 3.0.16/2009120208]"]
21:47 -!- peteforsyth [n=petefors@wikipedia/peteforsyth] has quit []
21:48 -!- Philippe|Wiki [n=Philippe@wikimedia/Philippe] has quit []
21:49 < pm27> but i really don't know why we need a task force ?
21:50 < pm27> There is one solution for the storage of the data
21:50 < pm27> one solution for generating dumps
21:51 < pm27> one solution for the search engine
21:51 < pm27> ...
21:52 < pm27> just waiting for the Mozilla Foundation and not finding an alternative solution to use wikipedia on a phone
21:52 < pm27> Kelson: ?
21:53 < pm27> and also for mobile phones there is an online solution, so wait
21:56 < pm27> there is a lot of work done by some people, so just waiting
21:56 < pm27> http://download.wikipedia.org/dvd.html