[Tutor] Threads
orbitz
orbitz at ezabel.com
Wed Nov 17 02:49:06 CET 2004
Not only are things like waiting for headers a major issue; so are simply
resolving the name and connecting. What if your DNS goes down
mid-download? It could take a long time to time out while trying to
reach your DNS server, and none of your sockets will be touched, select() or
not. So if we are going to use blocking sockets, we might as well go all
the way.
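To make that concrete: in modern Python, marking a socket non-blocking does nothing for name resolution, because resolution happens in a separate library call before there is any socket to select() on. A minimal sketch (using "localhost" so it runs without a real DNS server):

```python
import socket

s = socket.socket()
s.setblocking(False)   # only this socket's own reads/writes become non-blocking

# Name resolution is a separate, plain blocking library call: there is no
# socket yet for select() to watch while it runs, so a slow or dead DNS
# server stalls the whole program right here, regardless of the flag above.
infos = socket.getaddrinfo("localhost", 80, proto=socket.IPPROTO_TCP)
print(infos[0][4])   # the first resolved (address, port) pair
s.close()
```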
Here is a simple Twisted example that downloads three sites, prints them
to stdout, and exits. It probably won't make much sense yet, but at least
it's 100% non-blocking. :)
from twisted.web import client
from twisted.internet import reactor
from urllib2 import urlparse

URLS = ['http://www.google.com/', 'http://www.yahoo.com/',
        'http://www.python.org/']
num_downloaded = 0

def _handlePage(result):
    """The result is the contents of the webpage."""
    global num_downloaded
    print result
    num_downloaded += 1
    if num_downloaded == len(URLS):
        reactor.stop()

for i in URLS:
    parsed = urlparse.urlsplit(i)
    f = client.HTTPClientFactory(parsed[2])
    f.host = parsed[1]
    f.deferred.addCallback(_handlePage)
    reactor.connectTCP(parsed[1], 80, f)
reactor.run()
All this does is download each page, print it out, and, once all of the
URLs have been processed, stop the program (reactor.stop). It does not
handle errors or other exceptional situations.
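For comparison, here is a rough present-day sketch of the same fan-out-then-stop pattern using only the standard library's asyncio (which did not exist when this was written); the fetch itself is faked with a sleep so the example runs without any network:

```python
import asyncio

URLS = ['http://www.google.com/', 'http://www.yahoo.com/',
        'http://www.python.org/']
results = []

async def fake_download(url):
    # Stand-in for a real non-blocking fetch: it just yields to the
    # event loop briefly and returns a made-up page body.
    await asyncio.sleep(0.01)
    return "contents of " + url

async def main():
    # gather() runs all the downloads concurrently and collects their
    # results, playing the role of the per-URL deferreds plus the
    # num_downloaded counter check in the Twisted version.
    pages = await asyncio.gather(*(fake_download(u) for u in URLS))
    results.extend(pages)

asyncio.run(main())
```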
Danny Yoo wrote:
>On Tue, 16 Nov 2004, orbitz wrote:
>
>
>
>>urllib is blocking, so you can't really use it with non-blocking code.
>>The urlopen function could take a while, and even if data is on the
>>socket, the read will most likely still block, which is not going to
>>help you. One has to use a non-blocking URL API in order to make the
>>most of one's time.
>>
>>
>
>
>Hi Orbitz,
>
>
>Hmmm! Yes, you're right: the sockets block by default. But, when we try
>to read() a block of data, select() can tell us which ones will
>immediately block and which ones won't.
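[A quick illustration of that point, with a connected socket pair standing in for a real network connection; not from the original post:]

```python
import select
import socket

# A connected pair of sockets stands in for a real network connection.
a, b = socket.socketpair()

# Nothing has been written yet, so a read() on `a` would block;
# select() with a zero timeout reports it as not ready.
ready_before, _, _ = select.select([a], [], [], 0)

b.send(b"hello")

# Now there is data waiting, so select() reports `a` as readable,
# and the recv() below returns immediately instead of blocking.
ready_after, _, _ = select.select([a], [], [], 1.0)
data = a.recv(1024)

a.close()
b.close()
```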
>
>
>The real-world situation is actually a bit complicated. Let's do a test
>to make things more explicit and measurable.
>
>
>For this example, let's say that we have the following 'hello.py' CGI:
>
>###
>#!/usr/bin/python
>import time
>import sys
>print "Content-type: text/plain\n\n"
>sys.stdout.flush()
>
>print "hello world";
>time.sleep(5)
>print "goodbye world"
>###
>
>
>I'll be accessing this cgi from the url "http://localhost/~dyoo/hello.py".
>I'm also using Apache 2.0 as my web server. Big note: there's a flush()
>after the content-stream header. This is intentional, and will be
>significant later on in this post.
>
>
>
>I then wrote the following two test programs:
>
>###
>## test1.py
>from grab_pages import PageGrabber
>from StringIO import StringIO
>pg = PageGrabber()
>f1, f2, f3 = StringIO(), StringIO(), StringIO()
>pg.add("http://localhost/~dyoo/hello.py", f1)
>pg.add("http://localhost/~dyoo/hello.py", f2)
>pg.add("http://localhost/~dyoo/hello.py", f3)
>pg.writeOutAllPages()
>print f1.getvalue()
>print f2.getvalue()
>print f3.getvalue()
>###
>
>
>###
>## test2.py
>import urllib
>print urllib.urlopen("http://localhost/~dyoo/hello.py").read()
>print urllib.urlopen("http://localhost/~dyoo/hello.py").read()
>print urllib.urlopen("http://localhost/~dyoo/hello.py").read()
>###
>
>
>test1 uses the PageGrabber class we wrote earlier, and test2 uses a
>straightforward approach.
>
>
>If we start timing the performance of test1.py and test2.py, we do see a
>difference between the two, since test1 will try to grab the pages in
>parallel, while test2 will do it serially:
>
>
>###
>[dyoo at shoebox dyoo]$ time python test1.py
>
>hello world
>goodbye world
>
>
>hello world
>goodbye world
>
>
>hello world
>goodbye world
>
>
>real 0m5.106s
>user 0m0.043s
>sys 0m0.011s
>
>[dyoo at shoebox dyoo]$ time python test2.py
>
>hello world
>goodbye world
>
>
>hello world
>goodbye world
>
>
>hello world
>goodbye world
>
>
>real 0m15.107s
>user 0m0.044s
>sys 0m0.007s
>###
>
>
>So for this particular example, we're getting good results: test1 takes
>about 5 seconds, while test2 takes 15. So the select() code is doing
>pretty ok so far, and does show improvement over the straightforward
>approach. Isn't this wonderful? *grin*
>
>
>Well, there's bad news.
>
>
>The problem is that, as you highlighted, the urllib.urlopen() function
>itself can block, and that's actually a very bad problem in practice. In
>particular, it blocks until it sees the end of the HTTP headers, since it
>depends on Python's 'httplib' module.
>
>If we take out the flush() out of our hello.py CGI:
>
>###
>#!/usr/bin/python
>import time
>import sys
>print "Content-type: text/plain\n\n"
>print "hello world";
>time.sleep(5)
>print "goodbye world"
>###
>
>
>then suddenly things go horribly awry:
>
>###
>[dyoo at shoebox dyoo]$ time python test1.py
>
>hello world
>goodbye world
>
>
>hello world
>goodbye world
>
>
>hello world
>goodbye world
>
>
>real 0m15.113s
>user 0m0.047s
>sys 0m0.006s
>###
>
>And suddenly, we do no better than with the serial version!
>
>
>What's happening is that the web server is buffering the output of its CGI
>programs. Without the sys.stdout.flush(), it's likely that the web server
>doesn't send out anything until the whole program is complete. But
>because urllib.urlopen() returns only after seeing the header block from
>the HTTP response, it actually ends up waiting until the whole program's
>done.
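[That buffering effect can be seen without Apache at all. Here is a small sketch, POSIX-only since select() on pipes does not work on Windows, where a subprocess plays the slow CGI and the parent checks whether any output is visible before the child finishes:]

```python
import select
import subprocess
import sys

def header_visible_early(flush):
    # The child plays the CGI: it prints a header, optionally flushes,
    # then "works" for a second before printing the body.
    code = (
        "import sys, time\n"
        "print('Content-type: text/plain')\n"
        + ("sys.stdout.flush()\n" if flush else "")
        + "time.sleep(1)\n"
        "print('hello world')\n"
    )
    p = subprocess.Popen([sys.executable, "-c", code],
                         stdout=subprocess.PIPE)
    # Wait up to half a second: is anything readable before the child exits?
    # Without the flush, the child's output sits in its stdio buffer until
    # the process finishes, so nothing arrives in time.
    ready, _, _ = select.select([p.stdout], [], [], 0.5)
    p.stdout.read()   # drain whatever eventually arrives
    p.wait()
    return bool(ready)
```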
>
>
>Not all CGIs have been carefully written to output their HTTP headers in a
>timely manner, so urllib.urlopen()'s blocking behavior is a show-stopper.
>This highlights the need for a framework that's built with nonblocking,
>event-driven code as a pervasive concept. Like... Twisted! *grin*
>
>Does anyone want to cook up an example with Twisted to show how the
>page-grabbing example might work?
>
>
>
>Hope this helps!
>
>
>
>