urllib2 and threading
robean
st1999 at gmail.com
Fri May 1 11:09:06 EDT 2009
Thanks for your reply. Obviously you make several good points about
Beautiful Soup and Queue. But here's the problem: even if I do nothing
whatsoever with the threads beyond just visiting the urls with
urllib2, the program chokes. If I replace
else:
    ulock.acquire()
    print page.geturl()  # obviously, do something more useful here, eventually
    page.close()
    ulock.release()
with
else:
    pass
urllib2 starts raising URLErrors after the first 3-5 urls have been
visited. Do you have any sense of what in the threads is corrupting
urllib2's behavior? Many thanks,
Robean
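
P.S. In case it helps anyone reproduce this, here is roughly the
stripped-down version I'm describing (the url list is a placeholder
for the real list of sites):

import threading
import urllib2

urls = ["http://www.python.org/", "http://www.example.com/"]  # placeholders

ulock = threading.Lock()

def visit(url):
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError:
        pass  # these are the errors that start appearing after 3-5 urls
    else:
        ulock.acquire()
        print page.geturl()  # obviously, do something more useful here, eventually
        page.close()
        ulock.release()

threads = [threading.Thread(target=visit, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()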
On May 1, 12:27 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> robean <st1... at gmail.com> writes:
> > reach the urls with urllib2. The actual program will involve fairly
> > elaborate scraping and parsing (I'm using Beautiful Soup for that) but
> > the example shown here is simplified and just confirms the url of the
> > site visited.
>
> Keep in mind Beautiful Soup is pretty slow, so if you're doing a lot
> of pages and have multiple CPUs, you probably want parallel processes
> rather than threads.
>
> > wrong? I am new to both threading and urllib2, so it's possible that
> > the SNAFU is quite obvious...
> > ...
> > ulock = threading.Lock()
>
> Without looking at the code for more than a few seconds, using an
> explicit lock like that is generally not a good sign. The usual
> Python style is to send all inter-thread communications through
> Queues. You'd dump all your urls into a queue and have a bunch of
> worker threads getting items off the queue and processing them. This
> really avoids a lot of lock-related headache. The price is that you
> sometimes use more threads than strictly necessary. Unless it's a LOT
> of extra threads, it's usually not worth the hassle of messing with
> locks.
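
For concreteness, here is a minimal sketch of the Queue-based worker
pattern described above (the worker count, url list, and daemon-thread
setup are illustrative choices, not from the original post):

import threading
import urllib2
from Queue import Queue

NUM_WORKERS = 5  # illustrative; tune for the workload

def worker(q):
    while True:
        url = q.get()
        try:
            page = urllib2.urlopen(url)
            data = page.read()
            page.close()
            # hand `data` to Beautiful Soup here, or push it onto a
            # second Queue for a separate parsing stage
        except urllib2.URLError:
            pass  # or log/retry as appropriate
        finally:
            q.task_done()

q = Queue()
for i in range(NUM_WORKERS):
    t = threading.Thread(target=worker, args=(q,))
    t.setDaemon(True)  # workers exit when the main thread does
    t.start()

for url in ["http://www.python.org/"]:  # dump all the urls into the queue
    q.put(url)
q.join()  # block until every queued url has been processed

No explicit Lock is needed here: Queue handles the synchronization
internally, and each worker only touches its own local variables. If
parsing turns out to be the bottleneck, the same producer/consumer
shape carries over to the multiprocessing module, with processes in
place of threads.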