jeff at drinktomi.com
Thu Sep 11 23:30:48 CEST 2008
On Sep 11, 2008, at 7:56 AM, Oleg Oltar wrote:
> I need to open about 1200 urls from my database, and to check that
> those urls really exist.
> I used urllib2.urlopen for it, but it's a little bit slow. I thought
> it might be a good idea to do it in threads, so it could add
> some performance to my code.
> Unfortunately I can't get started with the threading module. There
> are no simple examples in the python docs. Not sure how to start.
> What I want to have:
> for my list of urls I want to start a few threads, each of which
> should do some validation.
> My code without threads:
The first thing to do is to break your program into phases. Right now
all your logic is intermixed. This isn't a good thing when working
with threads. You want to isolate the parts that need parallelization
from the parts that don't.
A general stylistic note: Don't start your code with utility and helper
functions. You're telling a story. Start with the big overarching function
that orchestrates all the bits and pieces. That's the plot of your story.
The little helper functions are like details about a character. They get in
the way until you've figured out what's going on.
It seems to me like you have three distinct parts:
- get the list of urls from the database
- check the urls
- report the status of each url
So your main method will look something like:
    urls = urls_from_database()
    checked_urls = check(urls)
    report(checked_urls)
The only method that needs to be parallelized is check(urls).
One approach to threading is to start one thread for every
URL, but if you have too many URLs you could bring your
machine to its knees. Doing 30 things at once may be fine,
but doing 1000 is quite possibly a problem, so you have to
limit the number of threads.
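For contrast, a minimal sketch of the one-thread-per-URL approach might
look like this (check_url here is a stand-in for whatever validation you
do):

    from threading import Thread

    def check_all_at_once(urls):
        # One thread per url: fine for 30 urls, a problem for 1200.
        threads = [Thread(target=check_url, args=(url,)) for url in urls]
        for worker in threads:
            worker.start()
        # Wait for every thread to finish. Note that the results would
        # still need to be collected somewhere thread-safe.
        for worker in threads:
            worker.join()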
The classic way of doing this uses a pool of tasks and a number
of worker threads that pull jobs out of the pool. Your master
thread fills the pool, starts a number of worker threads, and then
waits for them to finish.
Each worker thread pulls a job from the pool, performs the job, and
writes the result to another pool. When the job pool is empty the work
is done, and the master thread collects the results.
Your check method might be something like this:
    from Queue import Queue
    from threading import Thread

    def check(urls):
        unchecked_urls = Queue()   # a queue full of url strings
        checked_urls = Queue()     # a queue full of (url string, is_good) tuples
        fill_job_pool(unchecked_urls, urls)
        start_worker_threads(unchecked_urls, checked_urls, number_workers=30)
        # Waits until all the jobs have been emptied from the queue.
        unchecked_urls.join()
        results = []
        while not checked_urls.empty():
            results.append(checked_urls.get())
        return results

    def fill_job_pool(unchecked_urls, urls):
        for url in urls:
            unchecked_urls.put(url)

    def start_worker_threads(unchecked_urls, checked_urls, number_workers):
        for x in range(0, number_workers):
            # Creates a thread object that will call
            # worker_thread(unchecked_urls, checked_urls) when it is started.
            worker = Thread(target=worker_thread, args=(unchecked_urls, checked_urls))
            # Python will terminate even if this thread is still alive. This
            # means that the thread doesn't need to kill itself.
            worker.setDaemon(True)
            # Start the worker thread.
            worker.start()

    def worker_thread(job_pool, result_pool):
        while True:
            url = job_pool.get()
            is_good = check_url(url)
            result = (url, is_good)
            result_pool.put(result)
            job_pool.task_done()

    def check_url(url):
        # YOUR URL CHECK HERE
        return True
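For the check_url stub, a minimal sketch using urllib2 (as in your
original code) might look like this; treating any error as a bad url is
an assumption, and you may want finer-grained error handling:

    import urllib2

    def check_url(url):
        # Consider the url good if the server answers at all; any
        # network or HTTP error marks it as bad.
        try:
            urllib2.urlopen(url)
            return True
        except (urllib2.URLError, ValueError):
            return False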
Once you plug this into your program, you'll start finding ways to
shorten the whole thing. Instead of passing around arrays of urls
and results you can pass the queues around directly. In addition you
can run the report function as another thread, printing jobs from the
result pool as they're completed. These changes will make the code
more elegant, but the solution here gets at the heart of the problem.
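A rough sketch of that reporting thread, reusing the daemon-thread
pattern from above (the output format is just an assumption):

    def start_report_thread(checked_urls):
        reporter = Thread(target=report_thread, args=(checked_urls,))
        # A daemon thread, so it dies when the main program exits.
        reporter.setDaemon(True)
        reporter.start()

    def report_thread(result_pool):
        # Print each result as soon as a worker produces it.
        while True:
            url, is_good = result_pool.get()
            if is_good:
                print url, 'is good'
            else:
                print url, 'is BAD'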
- Jeff Younker - jeff at drinktomi.com -