Extract Wikipedia dump file

Hi Diederik,


I had to manually unzip the dump. When I tried:

python manage.py -l Polish extract

The following resulted:

Final settings after parsing command line arguments:
         Project: Wikipedia
 Input directory: /home/rfaulkner/wikimedia/en/wiki
Output directory: /home/rfaulkner/wikimedia/en/wiki and subdirectories
        Language: English / English / en
Extracting data from XML
/home/rfaulkner/wikimedia/en/wiki
Checking if dump file has been extracted...
Dump file enwiki-latest-stub-meta-history.xml.gz has not yet been extracted...
Unzipping zip file

Processing time: 0:00:00.028703
Launching process...
Launching process...
Launching process...
Launching process...
4
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
    self.run()
  File "/usr/lib/python2.6/multiprocessing/process.py", line 88, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rfaulkner/trunk/projects/editor_trends/etl/extracter.py", line 286, in parse_dumpfile
    filesize = file_utils.determine_filesize(location, filename)
  File "/home/rfaulkner/trunk/projects/editor_trends/utils/file_utils.py", line 243, in determine_filesize
3

There are no more jobs in the queue left.
2

There are no more jobs in the queue left.
1

There are no more jobs in the queue left.
    return os.path.getsize(path)
  File "/usr/lib/python2.6/genericpath.py", line 49, in getsize
    return os.stat(filename).st_size
OSError: [Errno 2] No such file or directory: '/home/rfaulkner/wikimedia/en/wiki/enwiki-latest-stub-meta-history.xml'
^CTraceback (most recent call last):
  File "manage.py", line 583, in <module>
    main()
  File "manage.py", line 579, in main
    args.func(rts, logger)
  File "manage.py", line 263, in extract_launcher
    extracter.launcher(properties)
  File "/home/rfaulkner/trunk/projects/editor_trends/etl/extracter.py", line 439, in launcher
    tasks.join()
  File "/usr/lib/python2.6/multiprocessing/queues.py", line 316, in join
    self._cond.wait()
  File "/usr/lib/python2.6/multiprocessing/synchronize.py", line 212, in wait
    self._wait_semaphore.acquire(True, timeout)
KeyboardInterrupt

I manually ran the "extract" step, and it appears to be running with a progress bar, but I'm not sure what all of the output means. Also, what exactly happens in this step? A bit more description here would be helpful.
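
For reference, the manual unzip workaround can be sketched in Python roughly as follows. This is only a minimal sketch, not the editor_trends implementation: the helper names `gunzip_dump` and `safe_filesize` are hypothetical, and it assumes the dump is a plain gzip file whose name ends in `.gz`. The second helper shows one way to avoid the `OSError` in the traceback above when the extracted `.xml` file is missing.

```python
import gzip
import os
import shutil


def gunzip_dump(gz_path):
    """Decompress gz_path next to itself and return the output path.

    Hypothetical helper, not part of editor_trends.
    """
    if not gz_path.endswith(".gz"):
        raise ValueError("expected a .gz file: %s" % gz_path)
    xml_path = gz_path[:-len(".gz")]
    # Copy in chunks so a multi-gigabyte dump is never held in memory.
    with gzip.open(gz_path, "rb") as src, open(xml_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return xml_path


def safe_filesize(path):
    """Return the file size in bytes, or None if the file does not exist."""
    try:
        return os.path.getsize(path)
    except OSError:
        return None
```

Calling `gunzip_dump` on the `.xml.gz` file in the input directory should leave the `.xml` file beside it, which is the path the traceback reports as missing.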

Renklauf 21:29, 29 March 2011

For some reason it is unzipping the English version instead of the Polish version. I'll have a look.

Drdee 23:54, 29 March 2011

Sorry, that exception was from another command, one where I wasn't specifying the Polish language. However, I still had to extract the Polish dump manually. I can replicate that to get the actual error.

Renklauf 09:07, 30 March 2011

Can you try it on the latest version? If it still throws an error, can you send me the log? See editor_trends/logs. Thanks!

Drdee 17:58, 1 April 2011