[Tutor] Threads
Terry Carroll
carroll at tjc.com
Wed Nov 17 19:49:29 CET 2004
Thanks to everyone for the ideas on using threads for my scraper app.
I was unable to sleep last night (my wife snores), so having nothing
better to do, I got up and played with Python. I was able to convert my
serial download program to a threaded app pretty quickly.
My first step was to discard the use of the list of lists, in which I was
storing the URLs from which to download, in favor of a Queue object, and
then continuing to process the entries the same way. Once that was done,
I found it pretty straightforward to take the consumer part of the program
and turn it into a thread.
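
A minimal sketch of that arrangement, written for present-day Python 3 (where the 2004-era Queue module is named queue). The fetch function, the example URLs, and the use of daemon threads with work.join() are assumptions made for illustration; they are not taken from the original program.

```python
import queue
import threading

def fetch(url):
    # Hypothetical stand-in for the real download; returns a fake payload.
    return f"contents of {url}"

def consumer(work, results):
    # Each thread loops forever, pulling one URL at a time off the shared queue.
    while True:
        url = work.get()
        try:
            results.append((url, fetch(url)))  # list.append is thread-safe in CPython
        finally:
            work.task_done()  # tell the queue this entry has been handled

def download_all(urls, num_threads=4):
    work = queue.Queue()
    results = []
    # Daemon threads are an assumption here: they simply die when the program exits.
    for _ in range(num_threads):
        threading.Thread(target=consumer, args=(work, results), daemon=True).start()
    for url in urls:
        work.put(url)
    work.join()  # block until every queued URL has been marked done
    return results
```

Calling download_all with a list of 20 URLs returns 20 (url, payload) pairs, in whatever order the threads happened to finish.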
Great results. The serial approach took about 21 minutes to process 20
files: about a minute to generate the list of files, and then about a
minute for each file. With my present threaded approach, using 4 threads,
that's cut down to about 6 minutes: one minute to generate the list, and
then five minutes for the threads to download five files each. Of course,
increasing the number of threads made it even faster. I went up to six,
but feel I'm being abusive if I do more than about 4.
I plan to go back and rework the part that generates the queue. As
written, it first generates a list of URLs of pages to process, and then
processes each of those pages; each page in turn has a URL pointing to the
file I want to download. I'm going to rework this so that each page is
processed, and its queue entry made, as soon as it is identified, rather
than after all 20 have been found. This would allow my consumer threads to
begin work much earlier, rather than waiting the minute or so to build the
queue in its entirety first.
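
One way the planned rework might look, as a sketch only: the consumers are started first, and the producer enqueues each entry the moment it is known. The names file_url_for and run_pipeline are hypothetical, the page parsing is faked, and shutdown here uses one None per consumer purely to keep the example short.

```python
import queue
import threading

def file_url_for(page_url):
    # Hypothetical: stands in for fetching the page and extracting its download link.
    return page_url + "/file.dat"

def consumer(work, done):
    while True:
        item = work.get()
        if item is None:   # one None per consumer signals the end of input
            break
        done.append(item)  # a real consumer would download the file here

def run_pipeline(page_urls, num_threads=4):
    work = queue.Queue()
    done = []
    threads = [threading.Thread(target=consumer, args=(work, done))
               for _ in range(num_threads)]
    for t in threads:
        t.start()          # consumers are already waiting before any page is processed
    for page in page_urls:
        work.put(file_url_for(page))  # enqueue each entry as soon as it is known
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return done
```

Because the consumers block in work.get() until something arrives, the first download can start as soon as the first page has been processed.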
By the way, the shutdown method I chose was as we discussed yesterday: add
an element to the queue with a shutdown flag on it. When a thread popped
this element off the queue, it requeued it for the next thread to discover
and shut down. Worked like a champ first time.
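
The requeueing shutdown described above can be sketched as follows. SHUTDOWN and run are hypothetical names, and the work items are plain integers rather than URLs; the original code is not shown in this message, so this is only one plausible shape for it.

```python
import queue
import threading

SHUTDOWN = object()  # hypothetical sentinel standing in for the "shutdown flag" element

def consumer(work, processed):
    while True:
        item = work.get()
        if item is SHUTDOWN:
            work.put(SHUTDOWN)  # put it back so the next thread discovers it too
            break
        processed.append(item)  # a real consumer would download here

def run(items, num_threads=4):
    work = queue.Queue()
    processed = []
    for item in items:
        work.put(item)
    work.put(SHUTDOWN)  # a single sentinel, queued after all real work
    threads = [threading.Thread(target=consumer, args=(work, processed))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Number of entries still on the queue once every thread has exited.
    return processed, work.qsize()
```

Since the queue is first-in, first-out, the sentinel is only reached after every real entry has been taken, and a single sentinel is enough for any number of threads. It is left sitting on the queue at the end.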
I'm not a programmer any more, so having something work the first time is
a pretty big deal for me these days!