Problem (?) with extracting

The command python manage.py extract gives this message:

Extracting data from XML
c:\wikimedia c:\wikimedia\hu\wiki
c:\wikimedia\hu\wiki\huwiki-latest-stub-meta-history.xml
c:\wikimedia\hu\wiki\huwiki-latest-stub-meta-history.xml.gz
Process Process-1:
Traceback (most recent call last):
  File "C:\Program Files\Python 2.7\lib\multiprocessing\process.py", line 232, in _bootstrap
    self.run()
  File "C:\Program Files\Python 2.7\lib\multiprocessing\process.py", line 88, in run
    self._target(*self._args, **self._kwargs)
  File "C:\editor_trends\etl\enricher.py", line 697, in stream_raw_xml
    fh = file_utils.create_streaming_buffer(filename)
  File "C:\editor_trends\utils\file_utils.py", line 222, in create_streaming_buffer
    fh = create_txt_filehandle(path, None, 'r', 'utf-8')
  File "C:\editor_trends\utils\file_utils.py", line 207, in create_txt_filehandle
    path = os.path.join(location, filename)
  File "C:\Program Files\Python 2.7\lib\ntpath.py", line 73, in join
    elif isabs(b):
  File "C:\Program Files\Python 2.7\lib\ntpath.py", line 57, in isabs
    s = splitdrive(s)[1]
  File "C:\Program Files\Python 2.7\lib\ntpath.py", line 125, in splitdrive
    if p[1:2] == ':':
TypeError: 'NoneType' object is not subscriptable

After this message the dump seems to be extracted fine, but the program stops at this line and nothing more happens:

Finished parsing bz2 archives

Following a Ctrl+C I got this message:

Traceback (most recent call last):
  File "manage.py", line 461, in <module>
    main()
  File "manage.py", line 457, in main
    args.func(rts, logger)
  File "manage.py", line 281, in extract_launcher
    enricher.launcher(rts)
  File "C:\editor_trends\etl\enricher.py", line 823, in launcher
    multiprocessor_launcher(function, dataset, storage, locks, rts)
  File "C:\editor_trends\etl\enricher.py", line 777, in multiprocessor_launcher
    input_queue.join()
  File "C:\Program Files\Python 2.7\lib\multiprocessing\queues.py", line 316, in join
    self._cond.wait()
  File "C:\Program Files\Python 2.7\lib\multiprocessing\synchronize.py", line 220, in wait
    self._wait_semaphore.acquire(True, timeout)
KeyboardInterrupt
Samat 13:49, 5 April 2011
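
A note on the first traceback above: the call create_txt_filehandle(path, None, 'r', 'utf-8') passes None as its second argument, and os.path.join then fails on that None. Below is a minimal sketch of a guard, assuming the second parameter is the filename and that the first argument already holds the full path when no filename is given. safe_join is a hypothetical helper for illustration, not part of editor_trends, and the sketch is written for current Python rather than 2.7.

```python
import os


def safe_join(location, filename=None):
    """Join location and filename, tolerating a missing filename.

    Hypothetical helper: assumes that when filename is None, location
    already holds the full path to the file.
    """
    if location is None:
        raise ValueError("location must not be None")
    if filename is None:
        # Nothing to join; return the path unchanged instead of
        # handing None to os.path.join, which raises TypeError.
        return location
    return os.path.join(location, filename)
```

Whether returning the location unchanged is the right behaviour depends on what create_streaming_buffer intends, so treat this as one possible reading of the traceback.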

What OS are you using? Also, can you email me the log file from /logs/?

Drdee 13:51, 5 April 2011

Win7 64bit. Sure.

Samat 13:53, 5 April 2011

OK, I am making a lot of changes (as you have noted) and I am really trying to get a stable version out as soon as possible. Thanks for your patience.

Drdee 13:55, 5 April 2011

Hmm, I cannot replicate this. Could you please update to the most recent version on SVN and try again?

Drdee 15:38, 5 April 2011

Hmm. I used the latest revision (85439; I update editor_trends through SVN before every run). I have just reproduced it again:

Processing of huwiki-latest-stub-meta-history.xml took 0:21:48.918000
Number of articles: 512102
Number of revisions: 8256238
Finished parsing bz2 archives

[waited indefinitely, then pressed Ctrl+C]

Traceback (most recent call last):
  File "manage.py", line 461, in <module>
    main()
  File "manage.py", line 457, in main
    args.func(rts, logger)
  File "manage.py", line 405, in all_launcher
    res = function(rts, logger)
  File "manage.py", line 281, in extract_launcher
    enricher.launcher(rts)
  File "C:\editor_trends\etl\enricher.py", line 823, in launcher
    multiprocessor_launcher(function, dataset, storage, locks, rts)
  File "C:\editor_trends\etl\enricher.py", line 777, in multiprocessor_launcher
    input_queue.join()
  File "C:\Program Files\Python 2.7\lib\multiprocessing\queues.py", line 316, in join
    self._cond.wait()
  File "C:\Program Files\Python 2.7\lib\multiprocessing\synchronize.py", line 220, in wait
    self._wait_semaphore.acquire(True, timeout)
KeyboardInterrupt
Samat 16:39, 5 April 2011

So it did finish extracting; the problem is that it did not exit the queue. How many processors does your computer have? In the meantime, you can continue with the sort, store and transform phases.

Drdee 17:00, 5 April 2011
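
The hang at input_queue.join() would be consistent with a worker dying before it acknowledges its item: a joinable queue's join() blocks until task_done() has been called once for every item that was put on the queue. Below is a minimal sketch of the pattern, using threads and the standard queue module instead of multiprocessing to keep it short. The names worker, run and handle, and the None sentinel, are illustrative assumptions and not the editor_trends code.

```python
import queue
import threading


def worker(tasks, results, handle):
    """Consume tasks until a None sentinel arrives."""
    while True:
        item = tasks.get()
        try:
            if item is None:
                return
            results.append(handle(item))
        except Exception as exc:
            # Record the failure instead of letting the worker die silently.
            results.append(exc)
        finally:
            # Always acknowledge the item; otherwise tasks.join() never returns.
            tasks.task_done()


def run(items, handle):
    """Process items on one worker thread and wait for the queue to drain."""
    tasks = queue.Queue()
    results = []
    thread = threading.Thread(target=worker, args=(tasks, results, handle))
    thread.start()
    for item in items:
        tasks.put(item)
    tasks.put(None)  # sentinel: tells the worker to stop
    tasks.join()     # returns only once every put() has a matching task_done()
    thread.join()
    return results
```

If task_done() were called only on the success path, a single exception in handle would leave join() waiting forever, which matches the symptom of having to press Ctrl+C.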

The processor has 4 physical cores; with Hyper-Threading the program can use 8 logical cores. It worked fine a few days ago.

Samat 17:13, 5 April 2011
 

I have updated the code with some extra debugging info. Can you update your code, rerun the extract phase and then copy the output on the console and email it to me?

Try both of these options to see if the result differs:

python manage.py extract

python manage.py all -e download

Drdee 19:51, 5 April 2011