[Tutor] Threads
orbitz
orbitz at ezabel.com
Wed Nov 17 02:49:06 CET 2004
Not only are things like waiting for headers a major issue; so are simply
resolving the name and connecting. What if your DNS goes down
mid-download? It could take a long time to time out while trying to
reach your DNS server, and none of your sockets will be touched, select() or
not. So if we are going to use blocking sockets, we might as well go all
the way.
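To make that concrete: in modern Python, marking a socket non-blocking does nothing for name resolution, because resolution happens in a separate library call before there is any socket to select() on. A minimal sketch (using "localhost" so it runs without a real DNS server):

```python
import socket

s = socket.socket()
s.setblocking(False)   # only this socket's own reads/writes become non-blocking

# Name resolution is a separate, plain blocking library call: there is no
# socket yet for select() to watch while it runs, so a slow or dead DNS
# server stalls the whole program right here, regardless of the flag above.
infos = socket.getaddrinfo("localhost", 80, proto=socket.IPPROTO_TCP)
print(infos[0][4])   # the first resolved (address, port) pair
s.close()
```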
Here is a simple Twisted example that downloads three sites, prints them
to stdout, and exits. It probably won't make much sense yet, but at least
it's 100% non-blocking. :)
from twisted.web import client
from twisted.internet import reactor
from urllib2 import urlparse

URLS = ['http://www.google.com/', 'http://www.yahoo.com/',
        'http://www.python.org/']
num_downloaded = 0

def _handlePage(result):
    """The result is the contents of the webpage."""
    global num_downloaded
    print result
    num_downloaded += 1
    if num_downloaded == len(URLS):
        reactor.stop()

for i in URLS:
    parsed = urlparse.urlsplit(i)
    f = client.HTTPClientFactory(parsed[2])
    f.host = parsed[1]
    f.deferred.addCallback(_handlePage)
    reactor.connectTCP(parsed[1], 80, f)
reactor.run()
All this does is download each page, print it out, and, once all of the
URLs have been processed, stop the program (reactor.stop). It does not
handle errors or other exceptional situations.
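For comparison, here is a rough present-day sketch of the same fan-out-then-stop pattern using only the standard library's asyncio (which did not exist when this was written); the fetch itself is faked with a sleep so the example runs without any network:

```python
import asyncio

URLS = ['http://www.google.com/', 'http://www.yahoo.com/',
        'http://www.python.org/']
results = []

async def fake_download(url):
    # Stand-in for a real non-blocking fetch: it just yields to the
    # event loop briefly and returns a made-up page body.
    await asyncio.sleep(0.01)
    return "contents of " + url

async def main():
    # gather() runs all the downloads concurrently and collects their
    # results, playing the role of the per-URL deferreds plus the
    # num_downloaded counter check in the Twisted version.
    pages = await asyncio.gather(*(fake_download(u) for u in URLS))
    results.extend(pages)

asyncio.run(main())
```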
Danny Yoo wrote:
>On Tue, 16 Nov 2004, orbitz wrote:
>
>
>
>>urllib is blocking, so you can't really use it with non-blocking code.
>>The urlopen function could take a while, and even if data is on the
>>socket, the read will most likely still block, which is not going to
>>help you. One has to use a non-blocking URL API in order to make the
>>most of one's time.
>>
>>
>
>
>Hi Orbitz,
>
>
>Hmmm! Yes, you're right: the sockets block by default. But, when we try
>to read() a block of data, select() can tell us which ones will
>immediately block and which ones won't.
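[A quick illustration of that point, with a connected socket pair standing in for a real network connection; not from the original post:]

```python
import select
import socket

# A connected pair of sockets stands in for a real network connection.
a, b = socket.socketpair()

# Nothing has been written yet, so a read() on `a` would block;
# select() with a zero timeout reports it as not ready.
ready_before, _, _ = select.select([a], [], [], 0)

b.send(b"hello")

# Now there is data waiting, so select() reports `a` as readable,
# and the recv() below returns immediately instead of blocking.
ready_after, _, _ = select.select([a], [], [], 1.0)
data = a.recv(1024)

a.close()
b.close()
```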
>
>
>The real-world situation is actually a bit complicated. Let's do a test
>to make things more explicit and measurable.
>
>
>For this example, let's say that we have the following 'hello.py' CGI:
>
>###
>#!/usr/bin/python
>import time
>import sys
>print "Content-type: text/plain\n\n"
>sys.stdout.flush()
>
>print "hello world";
>time.sleep(5)
>print "goodbye world"
>###
>
>
>I'll be accessing this cgi from the url "http://localhost/~dyoo/hello.py".
>I'm also using Apache 2.0 as my web server. Big note: there's a flush()
>after the content-stream header. This is intentional, and will be
>significant later on in this post.
>
>
>
>I then wrote the following two test programs:
>
>###
>## test1.py
>from grab_pages import PageGrabber
>from StringIO import StringIO
>pg = PageGrabber()
>f1, f2, f3 = StringIO(), StringIO(), StringIO()
>pg.add("http://localhost/~dyoo/hello.py", f1)
>pg.add("http://localhost/~dyoo/hello.py", f2)
>pg.add("http://localhost/~dyoo/hello.py", f3)
>pg.writeOutAllPages()
>print f1.getvalue()
>print f2.getvalue()
>print f3.getvalue()
>###
>
>
>###
>## test2.py
>import urllib
>print urllib.urlopen("http://localhost/~dyoo/hello.py").read()
>print urllib.urlopen("http://localhost/~dyoo/hello.py").read()
>print urllib.urlopen("http://localhost/~dyoo/hello.py").read()
>###
>
>
>test1 uses the PageGrabber class we wrote earlier, and test2 uses a
>straightforward approach.
>
>
>If we start timing the performance of test1.py and test2.py, we do see a
>difference between the two, since test1 will try to grab the pages in
>parallel, while test2 will do it serially:
>
>
>###
>[dyoo at shoebox dyoo]$ time python test1.py
>
>hello world
>goodbye world
>
>
>hello world
>goodbye world
>
>
>hello world
>goodbye world
>
>
>real 0m5.106s
>user 0m0.043s
>sys 0m0.011s
>
>[dyoo at shoebox dyoo]$ time python test2.py
>
>hello world
>goodbye world
>
>
>hello world
>goodbye world
>
>
>hello world
>goodbye world
>
>
>real 0m15.107s
>user 0m0.044s
>sys 0m0.007s
>###
>
>
>So for this particular example, we're getting good results: test1 takes
>about 5 seconds, while test2 takes 15. So the select() code is doing
>pretty ok so far, and does show improvement over the straightforward
>approach. Isn't this wonderful? *grin*
>
>
>Well, there's bad news.
>
>
>The problem is that, as you highlighted, the urllib.urlopen() function
>itself can block, and that's actually a very bad problem in practice. In
>particular, it blocks until it sees the end of the HTTP headers, since it
>depends on Python's 'httplib' module.
>
>If we take out the flush() out of our hello.py CGI:
>
>###
>#!/usr/bin/python
>import time
>import sys
>print "Content-type: text/plain\n\n"
>print "hello world";
>time.sleep(5)
>print "goodbye world"
>###
>
>
>then suddenly things go horribly awry:
>
>###
>[dyoo at shoebox dyoo]$ time python test1.py
>
>hello world
>goodbye world
>
>
>hello world
>goodbye world
>
>
>hello world
>goodbye world
>
>
>real 0m15.113s
>user 0m0.047s
>sys 0m0.006s
>###
>
>And suddenly, we do no better than with the serial version!
>
>
>What's happening is that the web server is buffering the output of its CGI
>programs. Without the sys.stdout.flush(), it's likely that the web server
>doesn't send out anything until the whole program is complete. But
>because urllib.urlopen() returns only after seeing the header block from
>the HTTP response, it actually ends up waiting until the whole program's
>done.
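[That buffering effect can be seen without Apache at all. Here is a small sketch, POSIX-only since select() on pipes does not work on Windows, where a subprocess plays the slow CGI and the parent checks whether any output is visible before the child finishes:]

```python
import select
import subprocess
import sys

def header_visible_early(flush):
    # The child plays the CGI: it prints a header, optionally flushes,
    # then "works" for a second before printing the body.
    code = (
        "import sys, time\n"
        "print('Content-type: text/plain')\n"
        + ("sys.stdout.flush()\n" if flush else "")
        + "time.sleep(1)\n"
        "print('hello world')\n"
    )
    p = subprocess.Popen([sys.executable, "-c", code],
                         stdout=subprocess.PIPE)
    # Wait up to half a second: is anything readable before the child exits?
    # Without the flush, the child's output sits in its stdio buffer until
    # the process finishes, so nothing arrives in time.
    ready, _, _ = select.select([p.stdout], [], [], 0.5)
    p.stdout.read()   # drain whatever eventually arrives
    p.wait()
    return bool(ready)
```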
>
>
>Not all CGIs have been carefully written to output their HTTP headers in a
>timely manner, so urllib.urlopen()'s blocking behavior is a show-stopper.
>This highlights the need for a framework that's built with nonblocking,
>event-driven code as a pervasive concept. Like... Twisted! *grin*
>
>Does anyone want to cook up an example with Twisted to show how the
>page-grabbing example might work?
>
>
>
>Hope this helps!
>
>
>
>