Threading problems with httplib

David Fendrich david at aitellu.com
Sun May 11 08:13:11 EDT 2003


Background:
I am implementing a sort of webspider using Python 2.2, Linux 2.4.19 and
PyQt 3.5. I am basically using a Qt interface to control a master thread
that in turn controls about 30 spider threads that simultaneously download
pages from different servers. The spider threads are very simple, so I just
use the "thread" module - not the higher-level "threading". The downloading
threads make use of robotparser and urllib (not urllib2, as the online docs
frightened me off - not a single example in twenty pages of reference!).
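Roughly, the layout is a pool of workers pulling URLs from a shared queue. A minimal sketch of that pattern (written here against a current Python with threading/queue rather than the 2.2-era thread module, and with fetch() as a stand-in for the real urllib/robotparser download):

```python
import queue
import threading

NUM_WORKERS = 4  # the real spider uses about 30

def fetch(url):
    # Placeholder for the actual robotparser check + urllib download.
    return "page for %s" % url

def worker(jobs, results):
    # Drain the job queue until it is empty, then exit.
    while True:
        try:
            url = jobs.get_nowait()
        except queue.Empty:
            return
        results.put((url, fetch(url)))

jobs = queue.Queue()
results = queue.Queue()
for u in ["http://a.example/", "http://b.example/"]:
    jobs.put(u)

threads = [threading.Thread(target=worker, args=(jobs, results))
           for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Collect whatever the workers produced.
pages = dict(results.get() for _ in range(results.qsize()))
print(len(pages))  # 2
```

The master thread in my program plays the role of the queue-filling code above; the Qt interface only talks to the master.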

Problem:
All the program threads, including the interface, regularly lock up for long
periods of time (perhaps 30s) and then everything goes back to normal. The
CPU load is just a few percent, but everything simply locks.

Hypothesis:
I think that the problem is related to this:
http://mail.python.org/pipermail/python-dev/2002-September/028555.html
("mysterious hangs in socket code")
Does anybody know if that issue was ever resolved?
I think that httplib (which is used by both urllib and robotparser) is not
thread-safe - I sporadically and non-deterministically get odd error
messages from deep inside the code when I don't surround those calls with
try/except - and that it holds the GIL around code which is idle and waiting
to time out. According to the post mentioned above, it might be a problem
specific to Linux and DNS resolution.
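One workaround I am considering is capping every blocking socket operation, so a stalled read cannot wedge a thread for the full system timeout. Sketched here against a later Python - socket.setdefaulttimeout does not exist in 2.2, where a third-party timeout-socket wrapper would be needed instead (the timeout value and the fetch() helper are illustrative):

```python
import socket
import urllib.request

# Cap every blocking socket operation process-wide, so a stalled
# server cannot hang a downloading thread indefinitely.
socket.setdefaulttimeout(10)  # seconds; tune for the spider

def fetch(url):
    # Download one page, treating timeouts and network errors
    # as "no page" rather than letting them escape the thread.
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    except OSError:  # covers socket.timeout on current Pythons
        return None
```

Note that this would not help if the hang really is in DNS resolution with the GIL held, since the timeout only applies once a socket exists.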

Does anyone know of any work-arounds? Is there a thread safe
httplib/urllib/robotparser? Am I doing anything wrong?

I would be _really_ grateful for any replies. This code must work today.
/David




