[Tutor] Threads

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Wed Nov 17 01:03:55 CET 2004



On Tue, 16 Nov 2004, orbitz wrote:

> urllib is blocking, so you can't really use it with non blocking code.
> the urlopen function could take awhile, and then even if data is on the
> socket then it will still block for the read most likely which is not
> going to help you. One is going to have to use a non blocking url api in
> order to make the most of their time.


Hi Orbitz,


Hmmm!  Yes, you're right: the sockets block by default.  But before we
try to read() a block of data, select() can tell us which sockets have
data waiting, so we can skip the ones where a read() would block.
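
To make that concrete, here's a minimal sketch of asking select() which
sockets are readable.  It doesn't touch urllib at all:
socket.socketpair() just hands us connected sockets to play with, and
ready_to_read() is a name I made up for illustration.  It's written so
that it also runs on current Python.

```python
import select
import socket


def ready_to_read(socks, timeout=0.0):
    """Return the sockets in socks that select() reports as readable."""
    readable, _, _ = select.select(socks, [], [], timeout)
    return readable


# Two connected pairs; only the first pair gets any data written to it.
a, b = socket.socketpair()
c, d = socket.socketpair()
b.sendall(b"hello")

# Ask which of our two reading ends has data waiting.
ready = ready_to_read([a, c], timeout=0.5)

# recv() on a socket that select() marked readable returns right away.
data = a.recv(5)

for s in (a, b, c, d):
    s.close()
```

Here only a shows up in ready, since nothing was ever written toward c;
a recv() on c would have blocked.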


The real-world situation is actually a bit complicated.  Let's do a test
to make things more explicit and measurable.


For this example, let's say that we have the following 'hello.py' CGI:

###
#!/usr/bin/python
import time
import sys
print "Content-type: text/plain\n\n"
sys.stdout.flush()

print "hello world"
time.sleep(5)
print "goodbye world"
###


I'll be accessing this cgi from the url "http://localhost/~dyoo/hello.py".
I'm also using Apache 2.0 as my web server.  Big note: there's a flush()
after the Content-type header.  This is intentional, and will be
significant later on in this post.



I then wrote the following two test programs:

###
## test1.py
from grab_pages import PageGrabber
from StringIO import StringIO
pg = PageGrabber()
f1, f2, f3 = StringIO(), StringIO(), StringIO()
pg.add("http://localhost/~dyoo/hello.py", f1)
pg.add("http://localhost/~dyoo/hello.py", f2)
pg.add("http://localhost/~dyoo/hello.py", f3)
pg.writeOutAllPages()
print f1.getvalue()
print f2.getvalue()
print f3.getvalue()
###


###
## test2.py
import urllib
print urllib.urlopen("http://localhost/~dyoo/hello.py").read()
print urllib.urlopen("http://localhost/~dyoo/hello.py").read()
print urllib.urlopen("http://localhost/~dyoo/hello.py").read()
###


test1 uses the PageGrabber class we wrote earlier, and test2 makes three
plain urllib.urlopen() calls, one after another.


If we start timing the performance of test1.py and test2.py, we do see a
difference between the two, since test1 will try to grab the pages in
parallel, while test2 will do it serially:


###
[dyoo@shoebox dyoo]$ time python test1.py

hello world
goodbye world


hello world
goodbye world


hello world
goodbye world


real	0m5.106s
user	0m0.043s
sys	0m0.011s

[dyoo@shoebox dyoo]$ time python test2.py

hello world
goodbye world


hello world
goodbye world


hello world
goodbye world


real	0m15.107s
user	0m0.044s
sys	0m0.007s
###


So for this particular example, we're getting good results: test1 takes
about 5 seconds, while test2 takes 15.  So the select() code is doing
pretty ok so far, and does show improvement over the straightforward
approach.  Isn't this wonderful?  *grin*
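
If you don't have Apache handy, the same effect can be reproduced with a
small self-contained sketch.  To be clear, this is not the PageGrabber
code from earlier: slow_writer(), read_all_with_select() and demo() are
hypothetical names, threads writing into socket pairs stand in for the
CGI, and DELAY is shortened from the CGI's 5 seconds so it runs quickly.

```python
import select
import socket
import threading
import time

DELAY = 0.3  # stand-in for the 5-second sleep in hello.py


def slow_writer(sock):
    """Play the role of hello.py: write, pause, write again, close."""
    sock.sendall(b"hello world\n")
    time.sleep(DELAY)
    sock.sendall(b"goodbye world\n")
    sock.close()


def read_all_with_select(socks):
    """Drain every socket, reading only those select() marks readable."""
    chunks = dict((s, []) for s in socks)
    pending = list(socks)
    while pending:
        readable, _, _ = select.select(pending, [], [])
        for s in readable:
            data = s.recv(4096)
            if data:
                chunks[s].append(data)
            else:
                # An empty read means the writer closed its end.
                pending.remove(s)
                s.close()
    return [b"".join(chunks[s]) for s in socks]


def demo(n=3):
    """Start n slow writers and time how long reading all of them takes."""
    readers = []
    for _ in range(n):
        r, w = socket.socketpair()
        t = threading.Thread(target=slow_writer, args=(w,))
        t.daemon = True
        t.start()
        readers.append(r)
    start = time.time()
    pages = read_all_with_select(readers)
    return pages, time.time() - start


if __name__ == "__main__":
    pages, elapsed = demo()
    print("%d pages in %.2f seconds" % (len(pages), elapsed))
```

With three writers, the elapsed time should come out near one DELAY
rather than three, which is the same shape as the 5-second versus
15-second result above.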


Well, there's bad news.


The problem is that, as you highlighted, the urllib.urlopen() function
itself can block, and that's actually a very bad problem in practice.  In
particular, it blocks until it sees the end of the HTTP headers, since it
depends on Python's 'httplib' module.

If we take the flush() out of our hello.py CGI:

###
#!/usr/bin/python
import time
import sys
print "Content-type: text/plain\n\n"
print "hello world"
time.sleep(5)
print "goodbye world"
###


then suddenly things go horribly awry:

###
[dyoo@shoebox dyoo]$ time python test1.py

hello world
goodbye world


hello world
goodbye world


hello world
goodbye world


real	0m15.113s
user	0m0.047s
sys	0m0.006s
###

And suddenly, we do no better than with the serial version!


What's happening is that the web server is buffering the output of its CGI
programs.  Without the sys.stdout.flush(), it's likely that the web server
doesn't send out anything until the whole program is complete.  But
because urllib.urlopen() returns only after seeing the header block from
the HTTP response, it actually ends up waiting until the whole program's
done.
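
Here's a rough sketch of that header wait, using httplib directly (it's
called http.client in Python 3) against a throwaway one-shot server
instead of Apache.  serve_once(), time_getresponse() and DELAY are
illustrative names and values of my own, and the hand-rolled server only
mimics the two cases: headers sent right away, or headers held back
until the body is ready.

```python
import socket
import threading
import time

try:
    from http.client import HTTPConnection  # Python 3 name
except ImportError:
    from httplib import HTTPConnection       # Python 2 name

DELAY = 0.5  # stand-in for the CGI's 5-second sleep
HEAD = b"HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\n"
BODY = b"hello world\n"


def serve_once(listener, headers_early):
    """Answer one request, sending the headers before or after the pause."""
    conn, _ = listener.accept()
    conn.recv(65536)  # read and ignore the request
    if headers_early:
        conn.sendall(HEAD)          # like hello.py with the flush()
        time.sleep(DELAY)
        conn.sendall(BODY)
    else:
        time.sleep(DELAY)           # like a fully buffered response
        conn.sendall(HEAD + BODY)
    conn.close()


def time_getresponse(headers_early):
    """Return (seconds spent waiting for the headers, response body)."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    t = threading.Thread(target=serve_once, args=(listener, headers_early))
    t.daemon = True
    t.start()

    client = HTTPConnection("127.0.0.1", port)
    start = time.time()
    client.request("GET", "/")
    response = client.getresponse()  # blocks until the headers are in
    waited = time.time() - start
    body = response.read()
    client.close()
    t.join()
    listener.close()
    return waited, body


if __name__ == "__main__":
    for early in (True, False):
        waited, _ = time_getresponse(early)
        print("headers early=%s: waited %.2f seconds" % (early, waited))
```

When the headers go out early, getresponse() comes back almost at once
and only the later read() has to wait.  When they're held back, the
whole delay is spent inside getresponse(), before we ever get a socket
that select() could watch.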


Not all CGIs have been carefully written to output their HTTP headers in a
timely manner, so urllib.urlopen()'s blocking behavior is a show-stopper.
This highlights the need for a framework that's built with nonblocking,
event-driven code as a pervasive concept.  Like... Twisted!  *grin*

Does anyone want to cook up an example with Twisted to show how the
page-grabbing example might work?



Hope this helps!


