Extract Wikipedia dump file
From Talk:Wikilytics
Hi Diederik,
I had to manually unzip the dump. When I tried:
python manage.py -l Polish extract
The following resulted:
Final settings after parsing command line arguments:
Project: Wikipedia
Input directory: /home/rfaulkner/wikimedia/en/wiki
Output directory: /home/rfaulkner/wikimedia/en/wiki and subdirectories
Language: English / English / en
Extracting data from XML /home/rfaulkner/wikimedia/en/wiki
Checking if dump file has been extracted...
Dump file enwiki-latest-stub-meta-history.xml.gz has not yet been extracted...
Unzipping zip file
Processing time: 0:00:00.028703
Launching process...
Launching process...
Launching process...
Launching process...
4
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
    self.run()
  File "/usr/lib/python2.6/multiprocessing/process.py", line 88, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rfaulkner/trunk/projects/editor_trends/etl/extracter.py", line 286, in parse_dumpfile
    filesize = file_utils.determine_filesize(location, filename)
  File "/home/rfaulkner/trunk/projects/editor_trends/utils/file_utils.py", line 243, in determine_filesize
3 There are no more jobs in the queue left.
2 There are no more jobs in the queue left.
1 There are no more jobs in the queue left.
    return os.path.getsize(path)
  File "/usr/lib/python2.6/genericpath.py", line 49, in getsize
    return os.stat(filename).st_size
OSError: [Errno 2] No such file or directory: '/home/rfaulkner/wikimedia/en/wiki/enwiki-latest-stub-meta-history.xml'
^CTraceback (most recent call last):
  File "manage.py", line 583, in <module>
    main()
  File "manage.py", line 579, in main
    args.func(rts, logger)
  File "manage.py", line 263, in extract_launcher
    extracter.launcher(properties)
  File "/home/rfaulkner/trunk/projects/editor_trends/etl/extracter.py", line 439, in launcher
    tasks.join()
  File "/usr/lib/python2.6/multiprocessing/queues.py", line 316, in join
    self._cond.wait()
  File "/usr/lib/python2.6/multiprocessing/synchronize.py", line 212, in wait
    self._wait_semaphore.acquire(True, timeout)
KeyboardInterrupt
I manually ran the "extract" and it appears to be running with a progress bar, but I'm not sure what all of the output means. Also, what exactly is happening in this step? A bit more description under here would be helpful.
For some reason it is unzipping the English version instead of the Polish version. I'll have a look.
Sorry, that exception was from another command, one where I wasn't specifying the Polish language. However, I still had to manually extract the Polish dump. I can replicate that to get the actual error.
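For anyone hitting the same OSError, the manual unzip mentioned above can be sketched in Python roughly as follows. This is only a workaround sketch, not how extracter.py does it: the helper name gunzip_dump is hypothetical, it assumes the dump is an ordinary gzip file ending in .gz, and it uses Python 3 syntax rather than the Python 2.6 shown in the traceback.

```python
import gzip
import os
import shutil


def gunzip_dump(gz_path):
    """Decompress a .gz dump next to itself and return the path of the result.

    Hypothetical helper, not part of editor_trends.
    """
    if not gz_path.endswith('.gz'):
        raise ValueError('expected a .gz file, got %r' % gz_path)
    xml_path = gz_path[:-len('.gz')]
    # Skip the work if a decompressed file is already present.
    if not os.path.exists(xml_path):
        # Stream in chunks so a large dump is never held in memory at once.
        with gzip.open(gz_path, 'rb') as src, open(xml_path, 'wb') as dst:
            shutil.copyfileobj(src, dst)
    return xml_path


# Example, using the path from the traceback above:
# gunzip_dump('/home/rfaulkner/wikimedia/en/wiki/enwiki-latest-stub-meta-history.xml.gz')
```

With the .xml file in place, the os.path.getsize call in determine_filesize should at least find a file to stat; whether the rest of the extract step then runs cleanly is a separate question.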