jeff at drinktomi.com
Thu Sep 11 23:30:48 CEST 2008
On Sep 11, 2008, at 7:56 AM, Oleg Oltar wrote:
> I need to open about 1200 urls from my database, and to check that
> those urls really exist.
> I used urllib2.urlopen for it, but it's a little bit slow. I thought
> it might be a good idea to do it in threads, so it could add
> some performance to my code.
> Unfortunately I can't get started with the threading module. There
> are no simple examples in the python docs. Not sure how to start.
> What I want to have:
> for my list of urls I want to start a few threads, each of which
> should do some validation.
> My code without threads:
The first thing to do is to break your program into phases. Right now
all your logic is intermixed. This isn't a good thing when working
with threads. You want to isolate the parts that need parallelization
from the parts that don't.
A general stylistic note: Don't start your code with utility and helper
functions. You're telling a story. Start with the big overarching function
that orchestrates all the bits and pieces. That's the plot of your story.
The little helper functions are like details about a character. They get in
the way until you've figured out what's going on.
It seems to me like you have three distinct parts:
- get the list of urls from the database
- check the urls
- report the status of each url
So your main method will look something like:
    urls = urls_from_database()
    checked_urls = check(urls)
    report(checked_urls)
The only method that needs to be parallelized is check(urls).
One approach to threading is to start one thread for every
URL, but if you have too many URLs you could bring your
machine to its knees. Doing 30 things at once may be fine,
but doing 1000 is quite possibly a problem, so you have to
limit the number of threads.
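For contrast, a minimal sketch of the one-thread-per-URL approach might
look like this (check_url here is a stand-in for whatever validation you
do):

    from threading import Thread

    def check_all_at_once(urls):
        # One thread per url: fine for 30 urls, a problem for 1200.
        threads = [Thread(target=check_url, args=(url,)) for url in urls]
        for worker in threads:
            worker.start()
        # Wait for every thread to finish. Note that the results would
        # still need to be collected somewhere thread-safe.
        for worker in threads:
            worker.join()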
The classic way of doing this uses a pool of tasks and a number
of worker threads that pull jobs out of the pool. Your master
thread fills the pool, starts a number of worker threads, and then
waits for them to finish.
Each worker thread pulls a job from the pool, performs the job, and
writes the result to another pool. When the job pool is empty the work
is done, and the master thread collects the results.
Your check method might be something like this:
    from Queue import Queue
    from threading import Thread

    def check(urls):
        unchecked_urls = Queue()   # a queue full of url strings
        checked_urls = Queue()     # a queue full of (url string, is_good) tuples
        fill_job_pool(unchecked_urls, urls)
        start_worker_threads(unchecked_urls, checked_urls, number_workers=30)
        # Waits until all the jobs have been emptied from the queue.
        unchecked_urls.join()
        results = []
        while not checked_urls.empty():
            results.append(checked_urls.get())
        return results

    def fill_job_pool(unchecked_urls, urls):
        for url in urls:
            unchecked_urls.put(url)

    def start_worker_threads(unchecked_urls, checked_urls, number_workers):
        for x in range(0, number_workers):
            # Creates a thread object that will call
            # worker_thread(unchecked_urls, checked_urls) when it is started.
            worker = Thread(target=worker_thread, args=(unchecked_urls, checked_urls))
            # Python will terminate even if this thread is still alive. This
            # means that the thread doesn't need to kill itself.
            worker.setDaemon(True)
            # Start the worker thread.
            worker.start()

    def worker_thread(job_pool, result_pool):
        while True:
            url = job_pool.get()
            is_good = check_url(url)
            result = (url, is_good)
            result_pool.put(result)
            job_pool.task_done()

    def check_url(url):
        # YOUR URL CHECK HERE
        return True
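For the check_url stub, a minimal sketch using urllib2 (as in your
original code) might look like this; treating any error as a bad url is
an assumption, and you may want finer-grained error handling:

    import urllib2

    def check_url(url):
        # Consider the url good if the server answers at all; any
        # network or HTTP error marks it as bad.
        try:
            urllib2.urlopen(url)
            return True
        except (urllib2.URLError, ValueError):
            return False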
Once you plug this into your program, you'll start finding ways to
shorten the whole thing. Instead of passing around arrays of urls
and results you can pass the queues around directly. In addition you
can run the report function as another thread, printing jobs from the
result pool as they're completed. These changes will make the code
more elegant, but the solution here gets at the heart of the problem.
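A rough sketch of that reporting thread, reusing the daemon-thread
pattern from above (the output format is just an assumption):

    def start_report_thread(checked_urls):
        reporter = Thread(target=report_thread, args=(checked_urls,))
        # A daemon thread, so it dies when the main program exits.
        reporter.setDaemon(True)
        reporter.start()

    def report_thread(result_pool):
        # Print each result as soon as a worker produces it.
        while True:
            url, is_good = result_pool.get()
            if is_good:
                print url, 'is good'
            else:
                print url, 'is BAD'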
- Jeff Younker - jeff at drinktomi.com -