Extract Wikipedia dump file

Hi Diederik,

I had to manually unzip the dump. When I tried:

python manage.py -l Polish extract

The following resulted:

Final settings after parsing command line arguments:
         Project: Wikipedia
 Input directory: /home/rfaulkner/wikimedia/en/wiki
Output directory: /home/rfaulkner/wikimedia/en/wiki and subdirectories
        Language: English / English / en
Extracting data from XML
Checking if dump file has been extracted...
Dump file enwiki-latest-stub-meta-history.xml.gz has not yet been extracted...
Unzipping zip file

Processing time: 0:00:00.028703
Launching process...
Launching process...
Launching process...
Launching process...
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
  File "/usr/lib/python2.6/multiprocessing/process.py", line 88, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rfaulkner/trunk/projects/editor_trends/etl/extracter.py", line 286, in parse_dumpfile
    filesize = file_utils.determine_filesize(location, filename)
  File "/home/rfaulkner/trunk/projects/editor_trends/utils/file_utils.py", line 243, in determine_filesize

There are no more jobs in the queue left.

There are no more jobs in the queue left.

There are no more jobs in the queue left.
    return os.path.getsize(path)
  File "/usr/lib/python2.6/genericpath.py", line 49, in getsize
    return os.stat(filename).st_size
OSError: [Errno 2] No such file or directory: '/home/rfaulkner/wikimedia/en/wiki/enwiki-latest-stub-meta-history.xml'
^CTraceback (most recent call last):
  File "manage.py", line 583, in <module>
  File "manage.py", line 579, in main
    args.func(rts, logger)
  File "manage.py", line 263, in extract_launcher
  File "/home/rfaulkner/trunk/projects/editor_trends/etl/extracter.py", line 439, in launcher
  File "/usr/lib/python2.6/multiprocessing/queues.py", line 316, in join
  File "/usr/lib/python2.6/multiprocessing/synchronize.py", line 212, in wait
    self._wait_semaphore.acquire(True, timeout)

I manually ran the "extract" and it appears to be running with a progress bar, but I'm not sure what all of the output means. Also, what exactly is happening in this step? A bit more description under here would be helpful.

Renklauf 21:29, 29 March 2011

For some reason it is unzipping the English version instead of the Polish version. I'll have a look.

Drdee 23:54, 29 March 2011

Sorry, that exception was from another command where I wasn't specifying the Polish language. However, I still had to manually extract the Polish dump. I can replicate that to get the actual error.
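For reference, the manual extract amounts to roughly the sketch below. It is only an illustration using the standard library: gunzip_dump is a hypothetical helper, not something in editor_trends, and it assumes the dump is a plain gzip file and a Python version where gzip.open works in a with statement (2.7 or later).

```python
import gzip
import os
import shutil


def gunzip_dump(gz_path):
    """Decompress gz_path next to itself and return the path of the output.

    Hypothetical helper for illustration; not part of editor_trends.
    """
    if not gz_path.endswith('.gz'):
        raise ValueError('expected a .gz file: %s' % gz_path)
    # Strip the .gz suffix, e.g. foo.xml.gz -> foo.xml
    xml_path = gz_path[:-len('.gz')]
    # Copy in chunks so a very large dump is never held in memory at once.
    with gzip.open(gz_path, 'rb') as source:
        with open(xml_path, 'wb') as target:
            shutil.copyfileobj(source, target)
    return xml_path
```

Calling gunzip_dump on the .xml.gz in the input directory should leave the .xml file beside it, which is the file that os.path.getsize could not find in the traceback above.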

Renklauf 09:07, 30 March 2011

Can you try it on the latest version? If it still throws an error, can you send me the log (see editor_trends/logs)? Thanks!

Drdee 17:58, 1 April 2011