Problem (?) with extracting
The command python manage.py extract gives this message:
Extracting data from XML c:\wikimedia c:\wikimedia\hu\wiki c:\wikimedia\hu\wiki\huwiki-latest-stub-meta-history.xml c:\wikimedia\hu\wiki\huwiki-latest-stub-meta-history.xml.gz
Process Process-1:
Traceback (most recent call last):
  File "C:\Program Files\Python 2.7\lib\multiprocessing\process.py", line 232, in _bootstrap
    self.run()
  File "C:\Program Files\Python 2.7\lib\multiprocessing\process.py", line 88, in run
    self._target(*self._args, **self._kwargs)
  File "C:\editor_trends\etl\enricher.py", line 697, in stream_raw_xml
    fh = file_utils.create_streaming_buffer(filename)
  File "C:\editor_trends\utils\file_utils.py", line 222, in create_streaming_buffer
    fh = create_txt_filehandle(path, None, 'r', 'utf-8')
  File "C:\editor_trends\utils\file_utils.py", line 207, in create_txt_filehandle
    path = os.path.join(location, filename)
  File "C:\Program Files\Python 2.7\lib\ntpath.py", line 73, in join
    elif isabs(b):
  File "C:\Program Files\Python 2.7\lib\ntpath.py", line 57, in isabs
    s = splitdrive(s)[1]
  File "C:\Program Files\Python 2.7\lib\ntpath.py", line 125, in splitdrive
    if p[1:2] == ':':
TypeError: 'NoneType' object is not subscriptable
After this message the dump seems to be extracted fine, but the process stops at this line and nothing more happens:
Finished parsing bz2 archives
Following a Ctrl+C I got this message:
Traceback (most recent call last):
  File "manage.py", line 461, in <module>
    main()
  File "manage.py", line 457, in main
    args.func(rts, logger)
  File "manage.py", line 281, in extract_launcher
    enricher.launcher(rts)
  File "C:\editor_trends\etl\enricher.py", line 823, in launcher
    multiprocessor_launcher(function, dataset, storage, locks, rts)
  File "C:\editor_trends\etl\enricher.py", line 777, in multiprocessor_launcher
    input_queue.join()
  File "C:\Program Files\Python 2.7\lib\multiprocessing\queues.py", line 316, in join
    self._cond.wait()
  File "C:\Program Files\Python 2.7\lib\multiprocessing\synchronize.py", line 220, in wait
    self._wait_semaphore.acquire(True, timeout)
KeyboardInterrupt
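For reference, the first traceback ends in os.path.join(location, filename) being called with filename set to None, which is the second argument that create_streaming_buffer passes to create_txt_filehandle. A minimal sketch of that failure and one possible guard is below. safe_join is a hypothetical helper used only for illustration, not a function in editor_trends, and the paths are placeholders.

```python
import os


def safe_join(location, filename=None):
    """Hypothetical helper: join only when a filename is actually given."""
    if filename is None:
        # Assumption: 'location' already holds the full path in this case.
        return location
    return os.path.join(location, filename)


# os.path.join does not accept None as a path component. The traceback
# above shows a TypeError; AttributeError is caught as well because some
# Python 2 platforms raise that instead.
try:
    os.path.join('wikimedia', None)
    join_failed = False
except (TypeError, AttributeError):
    join_failed = True

full_path = safe_join('wikimedia', 'huwiki.xml')  # normal two-part join
same_path = safe_join('wikimedia', None)          # falls back to location
```

Whether returning location unchanged is the right behaviour depends on what create_streaming_buffer actually passes in, which this thread does not show.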
What OS are you using? And can you email me the log file from /logs/?
Win7 64bit. Sure.
OK, I am making a lot of changes (as you have noted) and I am really trying to get a stable version out as soon as possible. Thanks for your patience.
Hmm, I cannot replicate this. Could you please update to the most recent version on SVN and try again?
Hmm. I used the latest revision (85439; I update editor_trends through SVN before every run). I have just reproduced it.
Processing of huwiki-latest-stub-meta-history.xml took 0:21:48.918000
Number of articles: 512102
Number of revisions: 8256238
Finished parsing bz2 archives
[waiting infinite amount of time then CTRL+C]
Traceback (most recent call last):
  File "manage.py", line 461, in <module>
    main()
  File "manage.py", line 457, in main
    args.func(rts, logger)
  File "manage.py", line 405, in all_launcher
    res = function(rts, logger)
  File "manage.py", line 281, in extract_launcher
    enricher.launcher(rts)
  File "C:\editor_trends\etl\enricher.py", line 823, in launcher
    multiprocessor_launcher(function, dataset, storage, locks, rts)
  File "C:\editor_trends\etl\enricher.py", line 777, in multiprocessor_launcher
    input_queue.join()
  File "C:\Program Files\Python 2.7\lib\multiprocessing\queues.py", line 316, in join
    self._cond.wait()
  File "C:\Program Files\Python 2.7\lib\multiprocessing\synchronize.py", line 220, in wait
    self._wait_semaphore.acquire(True, timeout)
KeyboardInterrupt
So it did finish extracting; the problem is that the main process never returned from input_queue.join(). How many processors does your computer have? In the meantime, you can continue with the sort, store and transform phases.
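One possible explanation, consistent with the two tracebacks but not confirmed in this thread, is that a worker died on the TypeError before calling task_done() for the item it had taken, so input_queue.join() keeps waiting. The sketch below shows the usual guard: call task_done() in a finally block so the queue is released even when a task fails. It uses threads and the standard queue module as a stand-in, since multiprocessing.JoinableQueue follows the same task_done()/join() contract. process, worker, run and STOP are hypothetical names, not editor_trends code.

```python
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2.7

STOP = object()  # sentinel telling the worker to exit


def process(filename):
    # Hypothetical stand-in for the real per-file extraction work.
    if filename is None:
        raise TypeError('filename is None')
    return filename.upper()


def worker(input_queue, results, errors):
    while True:
        item = input_queue.get()
        try:
            if item is STOP:
                return
            results.append(process(item))
        except Exception as exc:
            # Record the failure instead of letting the worker die.
            errors.append(exc)
        finally:
            # Always mark the item as done, otherwise join() never returns.
            input_queue.task_done()


def run(filenames):
    input_queue = queue.Queue()
    results, errors = [], []
    for name in filenames:
        input_queue.put(name)
    input_queue.put(STOP)
    thread = threading.Thread(target=worker,
                              args=(input_queue, results, errors))
    thread.daemon = True
    thread.start()
    input_queue.join()  # returns once every item has been marked done
    return results, errors


results, errors = run(['a.xml', None, 'b.xml'])
```

With this pattern a bad input shows up in errors rather than as a hang, which would also make the console output more useful for debugging.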
I have updated the code with some extra debugging info. Can you update your code, rerun the extract phase and then copy the output on the console and email it to me?
Try both of these options to see whether the result differs:
python manage.py extract
python manage.py all -e download