urllib2 and threading
Paul Rubin
Fri May 1 03:27:01 EDT 2009
robean <st1999 at gmail.com> writes:
> reach the urls with urllib2. The actual program will involve fairly
> elaborate scraping and parsing (I'm using Beautiful Soup for that) but
> the example shown here is simplified and just confirms the url of the
> site visited.
Keep in mind Beautiful Soup is pretty slow, so if you're doing a lot
of pages and have multiple CPUs, you probably want parallel processes
rather than threads.
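The reason is that parsing is CPU-bound, and in CPython only one thread runs Python bytecode at a time, so extra threads don't buy extra cores for that part. Here is a rough sketch of the process-based approach, not a drop-in solution. It is written for Python 3, and the standard library's html.parser stands in for Beautiful Soup so the example has no third-party dependency. The names LinkCounter, count_links and parse_all are made up for illustration.

```python
import multiprocessing
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Hypothetical stand-in for the real parsing: counts <a> tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.count += 1


def count_links(html):
    """CPU-bound work done on one page; must be a module-level function
    so that multiprocessing can hand it to the worker processes."""
    parser = LinkCounter()
    parser.feed(html)
    return parser.count


def parse_all(pages, processes=None):
    """Parse every page in a pool of worker processes.

    processes=None lets the pool use one worker per CPU.
    Results come back in the same order as the input pages.
    """
    with multiprocessing.Pool(processes) as pool:
        return pool.map(count_links, pages)
```

You would call parse_all(list_of_html_strings) from under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that start workers by spawning a fresh interpreter. Downloading the pages is I/O-bound and can still be done with threads; only the parsing step benefits from separate processes.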
> wrong? I am new to both threading and urllib2, so its possible that
> the SNAFU is quite obvious..
> ...
> ulock = threading.Lock()
Without looking at the code for more than a few seconds, using an
explicit lock like that is generally not a good sign. The usual
Python style is to send all inter-thread communications through
Queues. You'd dump all your URLs into a queue and have a bunch of
worker threads getting items off the queue and processing them. This
really avoids a lot of lock-related headaches. The price is that you
sometimes use more threads than strictly necessary. Unless it's a LOT
of extra threads, it's usually not worth the hassle of messing with
locks.
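The queue-and-workers pattern described above can be sketched roughly as follows. This is a minimal illustration under some assumptions, not the original poster's program: it is written for Python 3, where the Queue module is named queue, and the names run_workers, handle and num_workers are invented for the example. The handle argument is whatever you want done with each URL, passed in so the sketch itself needs no network access.

```python
import queue
import threading


def run_workers(urls, handle, num_workers=4):
    """Apply handle(url) to every URL using worker threads.

    Returns a dict mapping each URL to its result, or to the exception
    raised while handling it. No explicit locks are used: the two
    queues carry all communication between threads.
    """
    tasks = queue.Queue()
    results = queue.Queue()

    def worker():
        while True:
            url = tasks.get()
            if url is None:  # sentinel: no more work for this thread
                break
            try:
                results.put((url, handle(url)))
            except Exception as exc:
                # Report the failure instead of killing the worker.
                results.put((url, exc))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()

    # All workers have finished, so draining the queue here is safe.
    out = {}
    while not results.empty():
        url, value = results.get()
        out[url] = value
    return out
```

To confirm the URL of each site visited, handle could be something like `lambda u: urllib.request.urlopen(u).geturl()` in Python 3, or the same call through urllib2.urlopen in Python 2. Each worker only touches its own local variables and the two queues, which is why no ulock is needed.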