Talk:Wikilytics

From Strategic Planning

Contents

Thread title | Replies | Last modified
Problem with transforming | 4 | 19:46, 6 April 2011
Problem (?) with extracting | 8 | 19:51, 5 April 2011
Troubleshooting | 25 | 12:44, 5 April 2011
Plugin documentation | 0 | 17:18, 2 April 2011
FYI | 0 | 21:38, 1 April 2011
Store Wikipedia dump file | 1 | 17:59, 1 April 2011
Extract Wikipedia dump file | 3 | 17:58, 1 April 2011
Questions about the software. | 1 | 23:53, 3 December 2010

Problem with transforming

The python manage.py transform command gives me this message:

Start transforming dataset
wikilytics huwiki_editors_raw
38018
{u'date': datetime.datetime(2003, 9, 13, 7, 14, 41), u'article': 684328, u'ns': 505}
Traceback (most recent call last):
  File "manage.py", line 461, in <module>
    main()
  File "manage.py", line 457, in main
    args.func(rts, logger)
  File "manage.py", line 328, in transformer_launcher
    transformer.transform_editors_single_launcher(rts)
  File "C:\editor_trends\etl\transformer.py", line 313, in transform_editors_single_launcher
    editor()
  File "C:\editor_trends\etl\transformer.py", line 80, in __call__
    character_count = determine_edit_volume(edits, first_year, final_year)
  File "C:\editor_trends\etl\transformer.py", line 226, in determine_edit_volume
    if edit['delta'] < 0:
KeyError: 'delta'
Samat13:52, 5 April 2011

Yes, you need to redo the store and transform phases, because I have made significant changes in the last couple of days (more variables have been added).

So open the mongo shell, enter use wikilytics, then enter show collections, and then enter:

db.huwiki_editors_raw.drop()
db.huwiki_editors_dataset.drop()
db.huwiki_articles_raw.drop()

Drdee13:57, 5 April 2011
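
For anyone who would rather script this cleanup than type it into the mongo shell, here is a minimal Python sketch of the same idea. The helper names (stale_collections, drop_stale) and the list of suffixes are assumptions based only on the collection names in this thread; db is assumed to behave like a pymongo Database object, and list_collection_names() needs a reasonably recent pymongo.

```python
# Hypothetical helper, not part of editor_trends: drop the per-wiki
# collections so the store and transform phases can be rerun cleanly.

# Suffixes taken from the collection names mentioned in this thread.
SUFFIXES = ("editors_raw", "editors_dataset", "articles_raw")


def stale_collections(existing, wiki):
    """Return the names in `existing` that belong to `wiki` (e.g. 'huwiki')."""
    wanted = {"%s_%s" % (wiki, suffix) for suffix in SUFFIXES}
    return sorted(name for name in existing if name in wanted)


def drop_stale(db, wiki):
    """Drop the stale collections of `wiki` and return their names.

    `db` is assumed to behave like a pymongo Database, i.e. to offer
    list_collection_names() and drop_collection(name).
    """
    names = stale_collections(db.list_collection_names(), wiki)
    for name in names:
        db.drop_collection(name)
    return names
```

With mongod running, this would be called roughly as drop_stale(pymongo.MongoClient()['wikilytics'], 'huwiki'). Collections that do not match the wiki prefix are left alone.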

I get the following error when trying to run manage.py transform:

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\wikimedia\editor_trends>manage.py transform
Final settings after parsing command line arguments:
         Project: Wikipedia
 Input directory: c:\wikimedia
Output directory: c:\wikimedia\hu\wiki and subdirectories
        Language: Hungarian / Magyar / hu
Start transforming dataset
wikilytics huwiki_editors_raw
Traceback (most recent call last):
  File "C:\wikimedia\editor_trends\manage.py", line 461, in <module>
    main()
  File "C:\wikimedia\editor_trends\manage.py", line 457, in main
    args.func(rts, logger)
  File "C:\wikimedia\editor_trends\manage.py", line 328, in transformer_launcher
    transformer.transform_editors_single_launcher(rts)
  File "C:\wikimedia\editor_trends\etl\transformer.py", line 310, in transform_editors_single_launcher
    ids = db.retrieve_distinct_keys(rts.dbname, rts.editors_raw, 'editor')
  File "C:\wikimedia\editor_trends\database\db.py", line 144, in retrieve_distinct_keys
    ids = retrieve_distinct_keys_mapreduce(editors, field)
  File "C:\wikimedia\editor_trends\database\db.py", line 156, in retrieve_distinct_keys_mapreduce
    cursor = collection.map_reduce(map, reduce)
  File "build\bdist.win-amd64\egg\pymongo\collection.py", line 943, in map_reduce
  File "build\bdist.win-amd64\egg\pymongo\database.py", line 293, in command
  File "build\bdist.win-amd64\egg\pymongo\helpers.py", line 119, in _check_command_response
pymongo.errors.OperationFailure: command SON([('mapreduce', u'huwiki_editors_raw'), ('map', Code('function () { emit(this.editor, 1)};', {})), ('reduce', Code('function()', {}))]) failed: db assertion failure
Bdamokos09:56, 6 April 2011

Hi Bdamokos,

I cannot replicate this. Are you sure that the store phase finished successfully and that mongo is still running?

Drdee19:46, 6 April 2011
 
 

If you run into new problems, please let me know. Best, Diederik

Drdee19:44, 6 April 2011
 

Problem (?) with extracting

The command python manage.py extract gives this message:

Extracting data from XML
c:\wikimedia c:\wikimedia\hu\wiki
c:\wikimedia\hu\wiki\huwiki-latest-stub-meta-history.xml
c:\wikimedia\hu\wiki\huwiki-latest-stub-meta-history.xml.gz
Process Process-1:
Traceback (most recent call last):
  File "C:\Program Files\Python 2.7\lib\multiprocessing\process.py", line 232, in _bootstrap
    self.run()
  File "C:\Program Files\Python 2.7\lib\multiprocessing\process.py", line 88, in run
    self._target(*self._args, **self._kwargs)
  File "C:\editor_trends\etl\enricher.py", line 697, in stream_raw_xml
    fh = file_utils.create_streaming_buffer(filename)
  File "C:\editor_trends\utils\file_utils.py", line 222, in create_streaming_buffer
    fh = create_txt_filehandle(path, None, 'r', 'utf-8')
  File "C:\editor_trends\utils\file_utils.py", line 207, in create_txt_filehandle
    path = os.path.join(location, filename)
  File "C:\Program Files\Python 2.7\lib\ntpath.py", line 73, in join
    elif isabs(b):
  File "C:\Program Files\Python 2.7\lib\ntpath.py", line 57, in isabs
    s = splitdrive(s)[1]
  File "C:\Program Files\Python 2.7\lib\ntpath.py", line 125, in splitdrive
    if p[1:2] == ':':
TypeError: 'NoneType' object is not subscriptable

After this message the dump appears to be extracted fine, but the process stops at this line and nothing more happens:

Finished parsing bz2 archives

Following a Ctrl+C I got this message:

Traceback (most recent call last):
  File "manage.py", line 461, in <module>
    main()
  File "manage.py", line 457, in main
    args.func(rts, logger)
  File "manage.py", line 281, in extract_launcher
    enricher.launcher(rts)
  File "C:\editor_trends\etl\enricher.py", line 823, in launcher
    multiprocessor_launcher(function, dataset, storage, locks, rts)
  File "C:\editor_trends\etl\enricher.py", line 777, in multiprocessor_launcher
    input_queue.join()
  File "C:\Program Files\Python 2.7\lib\multiprocessing\queues.py", line 316, in join
    self._cond.wait()
  File "C:\Program Files\Python 2.7\lib\multiprocessing\synchronize.py", line 220, in wait
    self._wait_semaphore.acquire(True, timeout)
KeyboardInterrupt
Samat13:49, 5 April 2011
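
Reading the first traceback above, create_streaming_buffer calls create_txt_filehandle(path, None, 'r', 'utf-8'), so os.path.join receives None as its second argument, and that is what ntpath fails on. Below is a minimal sketch of a guard, assuming the first argument is already a complete path in that situation. build_path is a hypothetical name, not a function in file_utils.

```python
import os


def build_path(location, filename=None):
    """Join `location` and `filename`, tolerating a missing filename.

    Hypothetical stand-in for the join inside create_txt_filehandle: when
    the caller already passes a full path as `location` and None as
    `filename`, return the path unchanged instead of calling os.path.join.
    """
    if location is None:
        raise ValueError("location must not be None")
    if filename is None:
        return location
    return os.path.join(location, filename)
```

Whether returning the location unchanged is the right behaviour depends on what the caller intended; the point is only that None never reaches os.path.join.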

What OS are you using? And can you email me the log file from /logs/?

Drdee13:51, 5 April 2011

Win7 64bit. Sure.

Samat13:53, 5 April 2011

OK, I am making a lot of changes (as you have noticed) and I am really trying to get to a stable version as soon as possible. Thanks for your patience.

Drdee13:55, 5 April 2011

Hmm, I cannot replicate this. Could you please update to the most recent version on SVN and try again?

Drdee15:38, 5 April 2011

Hmm. I used the latest revision (85439; I update editor_trends through SVN before every run). I have just reproduced it:

Processing of huwiki-latest-stub-meta-history.xml took 0:21:48.918000
Number of articles: 512102
Number of revisions: 8256238
Finished parsing bz2 archives

[waited indefinitely, then pressed Ctrl+C]

Traceback (most recent call last):
  File "manage.py", line 461, in <module>
    main()
  File "manage.py", line 457, in main
    args.func(rts, logger)
  File "manage.py", line 405, in all_launcher
    res = function(rts, logger)
  File "manage.py", line 281, in extract_launcher
    enricher.launcher(rts)
  File "C:\editor_trends\etl\enricher.py", line 823, in launcher
    multiprocessor_launcher(function, dataset, storage, locks, rts)
  File "C:\editor_trends\etl\enricher.py", line 777, in multiprocessor_launcher
    input_queue.join()
  File "C:\Program Files\Python 2.7\lib\multiprocessing\queues.py", line 316, in join
    self._cond.wait()
  File "C:\Program Files\Python 2.7\lib\multiprocessing\synchronize.py", line 220, in wait
    self._wait_semaphore.acquire(True, timeout)
KeyboardInterrupt
Samat16:39, 5 April 2011
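
The hang at input_queue.join() is consistent with a worker dying before it acknowledges its item: join() on a joinable queue only returns once task_done() has been called for every item that was put on it. The sketch below illustrates the pattern with threads and queue.Queue from the standard library, which follows the same join()/task_done() contract as multiprocessing.JoinableQueue. It is an illustration under that assumption, written for Python 3, and not the enricher code; worker and run_jobs are made-up names.

```python
import queue
import threading


def worker(jobs, handler, errors):
    """Process jobs until a None sentinel arrives.

    task_done() sits in a finally block, so a job that raises is still
    acknowledged and jobs.join() in the parent cannot block forever.
    """
    while True:
        job = jobs.get()
        try:
            if job is None:
                return
            handler(job)
        except Exception as exc:  # record the failure instead of dying silently
            errors.append((job, exc))
        finally:
            jobs.task_done()


def run_jobs(items, handler, n_workers=2):
    """Run handler over items with n_workers threads; return the failures."""
    jobs = queue.Queue()
    errors = []
    threads = [
        threading.Thread(target=worker, args=(jobs, handler, errors))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        jobs.put(item)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    jobs.join()  # returns because every get() is matched by a task_done()
    for thread in threads:
        thread.join()
    return errors
```

If the worker raised without the finally clause, the count of unfinished tasks would never reach zero and join() would wait until interrupted, which matches the Ctrl+C traceback above.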
 
 
 
 
 

Troubleshooting

If you are running into any problems while using the Editor Trends Study/Software then please start a topic here.

Drdee19:43, 2 November 2010

Hi!

Thank you for this software. I tried it for the Hungarian Wikipedia in a Win7 Enterprise environment and could not finish the process. I followed the instructions step by step, and when I ran the command python manage.py export I got this message (everything worked fine until this step):

manage: error: invalid choice: 'export' (choose from 'sort', 'all', 'config', 'show_languages', 'transform', 'django', 'download', 'dataset', 'extract', 'store')

I also tried the python manage.py dataset command and got this message:

Traceback (most recent call last):
  File "manage.py", line 450, in <module>
    main()
  File "manage.py", line 423, in main
    rts = runtime_settings.RunTimeSettings(project, language, args)
  File "c:\Program Files\Python 2.7\Scripts\editor_trends\classes\runtime_settings.py", line 62, in __init__
    self.targets = self.split_keywords(self.get_value('charts'))
  File "c:\Program Files\Python 2.7\Scripts\editor_trends\classes\runtime_settings.py", line 115, in split_keywords
    keywords = keywords.split(',')
AttributeError: 'function' object has no attribute 'split'

After this I tried the python manage.py -l Hungarian all command and got this message at the end:

Starting dataset_launcher
Start exporting dataset

Processing time: 0:00:00.010000
Traceback (most recent call last):
  File "manage.py", line 450, in <module>
    main()
  File "manage.py", line 446, in main
    args.func(rts, logger)
  File "manage.py", line 257, in all_launcher
    res = function(rts, logger)
  File "manage.py", line 205, in dataset_launcher
    log.log_to_mongo(properties, 'dataset', 'export', stopwatch, event='finish')
NameError: global name 'properties' is not defined

Could you please help me find out what the problem might be?

Samat09:56, 8 March 2011
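
The AttributeError above means that split_keywords was handed a function object rather than a string. A defensive version might look like the sketch below. It is a guess at the intent, written for Python 3, and not the code in runtime_settings.py; the handling of None and of non-string values is an assumption.

```python
def split_keywords(keywords):
    """Split a comma-separated option into a list of stripped names.

    Hypothetical rewrite: None yields an empty list, and anything that is
    not a string (for example a function object passed in by mistake) is
    rejected with a clear message instead of failing on .split().
    """
    if keywords is None:
        return []
    if not isinstance(keywords, str):
        raise TypeError(
            "expected a comma-separated string, got %s" % type(keywords).__name__
        )
    return [part.strip() for part in keywords.split(",") if part.strip()]
```

Failing early with the type name in the message points at the caller that supplied the wrong value, rather than at the split itself.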

Dear Samat,

I think it has to do with the export function having been renamed to dataset. But that command is throwing an error, which I will fix. I will let you know when there is an update (hopefully soon). Thanks for reporting this. Best, Diederik

Drdee15:43, 10 March 2011

Dear Diederik,

Thank you for your answer. I'm waiting for the update.

Best regards,

Samat19:35, 10 March 2011

Sorry for the delay. Please download the most recent version from Subversion and give it a spin. Let me know if it works. The documentation needs to be updated as well.

Drdee22:24, 25 March 2011

Dear Diederik,

I have tried this updated version, but I am afraid it still doesn't work properly.

After python manage.py dataset I got this message:

Traceback (most recent call last):
  File "manage.py", line 449, in <module>
    main()
  File "manage.py", line 422, in main
    rts = runtime_settings.RunTimeSettings(project, language, args)
  File "C:\editor_trends\classes\runtime_settings.py", line 62, in __init__
    self.targets = self.split_keywords(self.get_value('charts'))
  File "C:\editor_trends\classes\runtime_settings.py", line 115, in split_keywords
    keywords = keywords.split(',')
AttributeError: 'function' object has no attribute 'split'

When I tried python manage.py -l Hungarian all, I got this message and could not find the resulting CSV (where should I look for it?):

Starting dataset_launcher
Start exporting dataset

Processing time: 0:00:00.010000
Function dataset_launcher does not return a status,                 implement NOW

Could you please check the code again? Thank you, cheers,

Samat17:28, 28 March 2011
 
 
 
 
 

Plugin documentation

Hi, I have started documenting the plugins at http://meta.wikimedia.org/wiki/Wikilytics_Plugins. It is still in progress. Diederik

Drdee17:18, 2 April 2011

FYI

Since this software is being used for projects other than Editor Trends Study now, Diederik (drdee) requested that we move "Editor Trends Study/Software" to a more permanent home on Meta under the name Wikilytics. However, I'm still unsure about export/import of LiquidThreads to a MediaWiki site that lacks LT, so I'm leaving the Talk page alone for now.

Steven Walling at work21:38, 1 April 2011

Store Wikipedia dump file

Not sure if this is related to recent issues with the extraction phase but I'm seeing problems in the store phase after extraction and sorting seemed to have finished without any major errors:

rfaulkner@wmf128:~/trunk/projects/editor_trends$ python manage.py -l Polish store

Wikilytics is (c) 2010-2011 by the Wikimedia Foundation.
Written by Diederik van Liere (dvanliere@gmail.com).
This software comes with ABSOLUTELY NO WARRANTY. This is 
    free software, and you are welcome to distribute it under certain 
    conditions.
See the README.1ST file for more information.

Final settings after parsing command line arguments:
         Project: Wikipedia
 Input directory: /home/rfaulkner/wikimedia/pl/wiki
Output directory: /home/rfaulkner/wikimedia/pl/wiki and subdirectories
        Language: Polish / Polski / pl
Start storing data in MongoDB
Storing article titles...
/home/rfaulkner/wikimedia/pl/wiki
2       False   AWK
Traceback (most recent call last):
  File "manage.py", line 583, in <module>
    main()
  File "manage.py", line 579, in main
    args.func(rts, logger)
  File "manage.py", line 306, in store_launcher
    store.launcher(rts)
  File "/home/rfaulkner/trunk/projects/editor_trends/etl/store.py", line 106, in launcher
    store_articles(rts)
  File "/home/rfaulkner/trunk/projects/editor_trends/etl/store.py", line 96, in store_articles
    collection.insert({'id':id, 'title':title})
UnboundLocalError: local variable 'id' referenced before assignment

Any clues as to what may be happening here? I recall there may have been an issue with XML parsing in cElementTree::iterparse. Would this be related?

Renklauf23:18, 29 March 2011

This should be fixed in the current SVN repository.

Drdee17:59, 1 April 2011
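
For reference, an UnboundLocalError of this kind typically arises when a variable such as id is only assigned inside a branch or loop body that did not run for the current line. The sketch below shows one way to parse a tab-separated line defensively. The three-column layout (id, flag, title) is inferred from the "2 False AWK" line in the output above and may not match the real file format; parse_article_line and parse_articles are hypothetical names.

```python
def parse_article_line(line):
    """Return {'id': int, 'title': str} or None if the line is malformed.

    Assumes three tab-separated columns: article id, a boolean flag, and
    the title. Every name is bound before use, so a short or garbled line
    is skipped instead of raising UnboundLocalError further down.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return None
    raw_id, _flag, title = fields
    try:
        article_id = int(raw_id)
    except ValueError:
        return None
    return {"id": article_id, "title": title}


def parse_articles(lines):
    """Parse many lines, silently dropping the malformed ones."""
    parsed = (parse_article_line(line) for line in lines)
    return [doc for doc in parsed if doc is not None]
```

Using article_id rather than id also avoids shadowing the Python built-in of the same name.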
 

Extract Wikipedia dump file

Hi Diederik,


I had to manually unzip the dump. When I tried:

python manage.py -l Polish extract

The following resulted:

Final settings after parsing command line arguments:
         Project: Wikipedia
 Input directory: /home/rfaulkner/wikimedia/en/wiki
Output directory: /home/rfaulkner/wikimedia/en/wiki and subdirectories
        Language: English / English / en
Extracting data from XML
/home/rfaulkner/wikimedia/en/wiki
Checking if dump file has been extracted...
Dump file enwiki-latest-stub-meta-history.xml.gz has not yet been extracted...
Unzipping zip file

Processing time: 0:00:00.028703
Launching process...
Launching process...
Launching process...
Launching process...
4
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
    self.run()
  File "/usr/lib/python2.6/multiprocessing/process.py", line 88, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rfaulkner/trunk/projects/editor_trends/etl/extracter.py", line 286, in parse_dumpfile
    filesize = file_utils.determine_filesize(location, filename)
  File "/home/rfaulkner/trunk/projects/editor_trends/utils/file_utils.py", line 243, in determine_filesize
3

There are no more jobs in the queue left.
2

There are no more jobs in the queue left.
1

There are no more jobs in the queue left.
    return os.path.getsize(path)
  File "/usr/lib/python2.6/genericpath.py", line 49, in getsize
    return os.stat(filename).st_size
OSError: [Errno 2] No such file or directory: '/home/rfaulkner/wikimedia/en/wiki/enwiki-latest-stub-meta-history.xml'
^CTraceback (most recent call last):
  File "manage.py", line 583, in <module>
    main()
  File "manage.py", line 579, in main
    args.func(rts, logger)
  File "manage.py", line 263, in extract_launcher
    extracter.launcher(properties)
  File "/home/rfaulkner/trunk/projects/editor_trends/etl/extracter.py", line 439, in launcher
    tasks.join()
  File "/usr/lib/python2.6/multiprocessing/queues.py", line 316, in join
    self._cond.wait()
  File "/usr/lib/python2.6/multiprocessing/synchronize.py", line 212, in wait
    self._wait_semaphore.acquire(True, timeout)
KeyboardInterrupt

I manually ran the "extract" and it appears to be running with a progress bar, but I'm not sure what all of the output means. Also, what exactly is happening in this step? A bit more description here would be helpful.

Renklauf21:29, 29 March 2011

For some reason it is unzipping the English version instead of the Polish version. I'll have a look.

Drdee23:54, 29 March 2011

Sorry, that exception was from another command, where I wasn't specifying the Polish language. However, I still had to extract the Polish dump manually. I can reproduce that to get the actual error.

Renklauf09:07, 30 March 2011

Can you try it on the latest version and, if it still throws an error, send me the log? See editor_trends/logs. Thanks!

Drdee17:58, 1 April 2011
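
Until the automatic unzip step works reliably, the missing .xml file can be produced up front. Here is a minimal sketch, assuming the dump is a gzip file whose name ends in .gz and that the extracted file should sit next to it. ensure_extracted is a hypothetical helper written for Python 3, not part of file_utils.

```python
import gzip
import os
import shutil


def ensure_extracted(gz_path):
    """Decompress gz_path next to itself unless that was already done.

    Returns the path of the extracted file. Raises FileNotFoundError when
    the archive itself is missing, which is easier to diagnose than a
    later os.path.getsize() failure inside a worker process.
    """
    if not gz_path.endswith(".gz"):
        raise ValueError("expected a .gz file: %r" % gz_path)
    xml_path = gz_path[:-len(".gz")]
    if os.path.exists(xml_path):
        return xml_path
    if not os.path.exists(gz_path):
        raise FileNotFoundError(gz_path)
    # Stream in chunks so a multi-gigabyte dump never has to fit in memory.
    with gzip.open(gz_path, "rb") as src, open(xml_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return xml_path
```

Calling this once in the parent process, before any workers are launched, means a missing dump is reported immediately instead of surfacing as an OSError in one worker while the others drain the queue.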
 
 
 

Questions about the software.

I posted some questions and comments here.

John Broughton19:50, 2 December 2010

Dear John, Thanks, I will get back to you asap. Best, Diederik

Drdee23:53, 3 December 2010