[concurrency] Inside the Python GIL

Piet van Oostrum piet at cs.uu.nl
Mon Jun 15 21:57:51 CEST 2009


>>>>> Aahz <aahz at pythoncraft.com> (A) wrote:

>A> On Fri, Jun 12, 2009, Jeremy Hylton wrote:
>>> 
>>> I'm not sure I understand how to distinguish between I/O bound threads
>>> and CPU bound threads.  If you've got a relatively simple
>>> multi-threaded application like an HTTP fetcher with a thread pool
>>> fetching a lot of urls, you're probably going to end up having more
>>> than one thread  with input to process at any instant.  There's a ton
>>> of Python code that executes when that happens.  You've got a urllib
>>> addinfourl wrapper, a httplib HTTPResponse (with read & _safe_read)
>>> and a socket _fileobject.  Heaven help you if you are using readline.
>>> So I could imagine even this trivial I/O bound program having lots of
>>> CPU contention.

>A> You could imagine, but have you tested it?  ;-)  Back in the 1.5.2 days,
>A> I helped write a web crawler where the sweet spot was around twenty or
>A> thirty threads.  That clearly indicates a significant I/O bottleneck.

I have written a small script to test this. It fires up a number of
threads (or runs unthreaded) that each fetch a number of web pages
(random Google searches, to be precise). It then measures things like
the CPU percentage (using the psutil module, but you could also do it
with the ps command, of course). You can also choose to do some CPU
processing on the pages, such as HTML parsing or hash calculation, and
to write some information to a file.
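
In essence the measurement boils down to something like this (a minimal
sketch, using the same psutil calls as the attached program):

    import os, time, psutil

    process = psutil.Process(os.getpid())
    start = time.time()
    # ... fetch and process the pages ...
    print "CPU times:", process.get_cpu_times()
    print "Elapsed: %.2f secs." % (time.time() - start)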

I noticed some 5-15% CPU utilisation on my 2-core MacBook when at home
on a 4 Mb/s ADSL line, so the workload is apparently I/O bound. I guess
on the high-speed university network the CPU load may be a bit higher;
I'll test that tomorrow at work.

And with respect to readline: I don't think there are problems with
that in newer Python versions. My program has an option to use readline
instead of read, and I see no significant difference.

Anyway, here is the program.

-------------- next part --------------
#!/usr/bin/env python

# Author: Piet van Oostrum <piet at cs.uu.nl>
# This software is free (no rights reserved).

""" This program tries to test the speed of fetching web pages and doing
some processing on them in a multithreaded environment. The main purpose is
to see how much CPU time it uses so that we might draw some conclusions
about the effectiveness of using threads in Python. Normally OS threads
should help to get greater throughput, but Python's GIL may hinder this.
The web pages will be the results of some Google searches.

You call this program with the following command line args:

    - number of pages to be fetched
    - number of threads to be used.
      0 means do everything in main thread
      > 0  means start that many threads
    - flags:
      r = use readline instead of read
      h = calculate SHA1 and MD5 hashes of the pages
      p = do some HTML parsing on the pages
      w = write some information to the logfile (length and/or calculated hashes)
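
For example, calling the program with the arguments '100 10 rhw' fetches
100 pages using 10 threads, reading with readline, calculating hashes and
writing the results to the logfile.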
"""

import sys
import os
from random import random
import urllib2
import hashlib
import psutil
process = psutil.Process(os.getpid())
import time
start_time = time.time()

def usage(help):
    progname = sys.argv[0]
    if help:
        print __doc__
    else:
        print >> sys.stderr, """Usage:
        %s npages nthreads flags
        For more help: %s help
        """ % (progname, progname)
    sys.exit(1)

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.ntags = 0
        self.depth = 0
        self.maxdepth = 0

    def handle_starttag(self, tag, attrs):
        self.ntags += 1
        self.depth += 1
        if self.depth > self.maxdepth:
            self.maxdepth = self.depth

    def handle_endtag(self, tag):
        self.depth -= 1

class DummyLock(object):
    '''Dummy lock class, only used as a context manager
    (therefore no acquire and release necessary).
    '''
    def __enter__(self):
        pass
    def __exit__(self, et, ev, tb):
        pass

# get some search terms

words = """acutely alarmclock anaesthesia antitypical arteries autochthones
bargain bestowal blondes brazen butterfingers buttermilk captions cedarwood
cherries circumference codification compliments contagious cotangent
crucified daiquiri defence deplete diagrams discontinue dixieland ducts
elastomers endodontist epistemic evaporator extravert fertilizer flicker
fortuitous futurology geometry godzilla grovel handwriter hemlock hologram
hydrologic ikebana incite ingrowth internally islamization jungle kurdish
leftmost lipstick lymphocyte manufactory melancholia nests nonharmonic
obscene opus overabundant pagesize partaker percolator philosophy pirouette
policy preacher primogenital protuberance pyrite rangers reconvert reindeer
reroute rhapsody rudeness saturday scurry servant sidewalk slurry soul
sprawl still subentry supersede temper thorny tortilla trichome twine
undercover unload unwed velcro vocation wheel wrong zoologic""".split()

nwords = len(words)
google = "http://www.google.nl/search?q="
logfile = "testthreads.log"
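# chunk size for doc.read() when readline is not used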
BUFSIZE = 1024

try:
    if sys.argv[1].strip().lower() == 'help':
        usage(True)
    npages = int(sys.argv[1])
    nthreads = int(sys.argv[2])
    if len(sys.argv) < 4:
        flags = ''
    else:
        flags = sys.argv[3]
except (ValueError, IndexError):
    usage(False)

use_readline = 'r' in flags
do_hash = 'h' in flags
do_parse = 'p' in flags
do_write = 'w' in flags

user_agent = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_4_11; en) AppleWebKit/525.28.3 (KHTML, like Gecko)"
headers = { 'User-Agent' : user_agent }
    
def doit(np, lock):
    '''Fetch np web pages.
    lock will be used for exclusive access to the log file.
    Global variables do_hash and do_write will determine the behaviour.
    '''
    for i in range(np):
        url = google + "+".join((words[int(nwords * random())] for w in range(3)))
        req = urllib2.Request(url, None, headers)
        doc = urllib2.urlopen(req)
        docsize = 0
        if do_hash:
            h1 = hashlib.sha1()
            h2 = hashlib.md5()
        if do_parse:
            parser = MyHTMLParser()
            
        while True:
            if use_readline:
                data = doc.readline()
            else:
                data = doc.read(BUFSIZE)
            if not data:
                break
            docsize += len(data)
            if do_hash:
                h1.update(data)
                h2.update(data)
            if do_parse:
                parser.feed(data)

        if do_parse:
            parser.close()
        if do_write:
            with lock:
                log = open(logfile, 'a')
                print >>log, "URL: %s, size: %d" % (url, docsize)
                if do_hash:
                    print >>log, "sha1:", h1.hexdigest()
                    print >>log, "md5:", h2.hexdigest()
                if do_parse:
                    print >>log, "Read %d tags, max depth: %d" % \
                                  (parser.ntags, parser.maxdepth)
                log.close()

def start_thread(np, lock):
    '''Start a new thread fetching np pages, using lock for
    exclusive access to the logfile.
    The thread is put in the running_threads list.
    '''
    thr = threading.Thread(target = doit, args = (np, lock))
    thr.start()
    running_threads.append(thr)

running_threads = []
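# Use a dummy lock by default; it is replaced by a real threading.Lock
# below when threads have to write to the logfile.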
lock = DummyLock()

if nthreads == 0:
    doit(npages, lock)
else:
    import threading
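    # Divide the pages as evenly as possible over the threads;
    # the first thread takes any remainder.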
    np = npages//nthreads
    np1 = npages - np*(nthreads - 1)
    if do_write:
        lock = threading.Lock()

    start_thread(np1, lock)
    for i in range(1, nthreads):
        start_thread(np, lock)

# Wait for all threads to finish

for thr in running_threads:
    thr.join()

print "CPU time (system): %.2f, (user): %.2f secs." % process.get_cpu_times()
print "Elapsed time: %.2f secs." % (time.time() - start_time)
print "CPU utilisation: %.2f %%" % process.get_cpu_percent()

-------------- next part --------------

-- 
Piet van Oostrum <piet at cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org


