[Tutor] Threads
orbitz
orbitz at ezabel.com
Wed Nov 17 03:06:44 CET 2004
My apologies: that should be len(URLS), not len(URLS) - 1.
orbitz wrote:
> Not only are things like waiting for headers a major issue, but so
> are simply resolving the name and connecting. What if your DNS goes
> down in mid-download? It could take a long time to time out while
> trying to connect to your DNS, and none of your sockets will be
> touched, select or not. So if we are going to use blocking sockets,
> we might as well go all the way.
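The DNS point deserves emphasis: name resolution happens inside the resolver library, not on a socket you own, so select() can't watch it. A minimal sketch of one stdlib workaround is to push the blocking gethostbyname() call into a helper thread so the caller can enforce its own timeout (the function name `resolve_with_timeout` is mine, not from any library):

```python
import socket
import threading

def resolve_with_timeout(hostname, timeout):
    """Run the blocking gethostbyname() call in a helper thread so the
    caller can give up after `timeout` seconds.  select() cannot help
    here: resolution happens inside the resolver library, not on any
    socket we could watch."""
    result = []
    def worker():
        try:
            result.append(socket.gethostbyname(hostname))
        except socket.error:
            result.append(None)
    t = threading.Thread(target=worker)
    t.daemon = True          # don't keep the process alive on timeout
    t.start()
    t.join(timeout)
    return result[0] if result else None

print(resolve_with_timeout("localhost", 5.0))   # -> '127.0.0.1'
```

Note the caveat: the worker thread itself is still stuck in the blocking call after a timeout; we have only protected the caller, not freed the resource.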
>
> Here is a simple Twisted example that downloads 3 sites, prints them
> to stdout, and exits. It probably won't make much sense yet, but at
> least it's 100% non-blocking :)
>
> from twisted.web import client
> from twisted.internet import reactor
>
> from urllib2 import urlparse
>
> def _handlePage(result):
>     """The result is the contents of the webpage"""
>     global num_downloaded
>     print result
>     num_downloaded += 1
>     if num_downloaded == len(URLS):
>         reactor.stop()
>
> URLS = ['http://www.google.com/', 'http://www.yahoo.com/',
>         'http://www.python.org/']
> num_downloaded = 0
>
> for i in URLS:
>     parsed = urlparse.urlsplit(i)
>     f = client.HTTPClientFactory(parsed[2])
>     f.host = parsed[1]
>     f.deferred.addCallback(_handlePage)
>     reactor.connectTCP(parsed[1], 80, f)
>
> reactor.run()
>
>
> All this does is download each page, print it out, and, once all of
> the URLs have been processed, stop the program (reactor.stop). It
> does not handle errors or any other exceptional situations.
>
> Danny Yoo wrote:
>
>> On Tue, 16 Nov 2004, orbitz wrote:
>>
>>
>>
>>> urllib is blocking, so you can't really use it with non-blocking
>>> code. The urlopen function could take a while, and even if data is
>>> on the socket, the read will most likely still block, which is not
>>> going to help you. One is going to have to use a non-blocking URL
>>> API in order to make the most of their time.
>>>
>>
>>
>>
>> Hi Orbitz,
>>
>>
>> Hmmm! Yes, you're right: the sockets block by default. But when we
>> try to read() a block of data, select() can tell us which sockets
>> would immediately block and which ones wouldn't.
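A quick way to see that behavior without touching the network is a connected socket pair (a sketch using socketpair(), which stands in here for a real client/server connection):

```python
import select
import socket

# Two connected sockets stand in for a client/server connection.
a, b = socket.socketpair()

# Nothing has been written yet, so a zero-timeout select() (a pure
# poll) reports that reading from `a` would block:
readable, _, _ = select.select([a], [], [], 0)
print(readable)            # -> []

# Once the peer writes, select() flags `a` as safe to read:
b.sendall(b"hello")
readable, _, _ = select.select([a], [], [], 1.0)
print(a in readable)       # -> True
print(a.recv(5))           # does not block now

a.close()
b.close()
```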
>>
>>
>> The real-world situation is actually a bit complicated. Let's do a test
>> to make things more explicit and measurable.
>>
>>
>> For this example, let's say that we have the following 'hello.py' CGI:
>>
>> ###
>> #!/usr/bin/python
>> import time
>> import sys
>> print "Content-type: text/plain\n\n"
>> sys.stdout.flush()
>>
>> print "hello world"
>> time.sleep(5)
>> print "goodbye world"
>> ###
>>
>>
>> I'll be accessing this CGI from the URL
>> "http://localhost/~dyoo/hello.py". I'm also using Apache 2.0 as my
>> web server. Big note: there's a flush() after the Content-type
>> header. This is intentional, and will be significant later on in
>> this post.
>>
>>
>>
>> I then wrote the following two test programs:
>>
>> ###
>> ## test1.py
>> from grab_pages import PageGrabber
>> from StringIO import StringIO
>> pg = PageGrabber()
>> f1, f2, f3 = StringIO(), StringIO(), StringIO()
>> pg.add("http://localhost/~dyoo/hello.py", f1)
>> pg.add("http://localhost/~dyoo/hello.py", f2)
>> pg.add("http://localhost/~dyoo/hello.py", f3)
>> pg.writeOutAllPages()
>> print f1.getvalue()
>> print f2.getvalue()
>> print f3.getvalue()
>> ###
>>
>>
>> ###
>> ## test2.py
>> import urllib
>> print urllib.urlopen("http://localhost/~dyoo/hello.py").read()
>> print urllib.urlopen("http://localhost/~dyoo/hello.py").read()
>> print urllib.urlopen("http://localhost/~dyoo/hello.py").read()
>> ###
>>
>>
>> test1 uses the PageGrabber class we wrote earlier, and test2 uses a
>> straightforward approach.
>>
>>
>> If we start timing the performance of test1.py and test2.py, we do
>> see a difference between the two, since test1 will try to grab the
>> pages in parallel, while test2 will do it serially:
>>
>>
>> ###
>> [dyoo at shoebox dyoo]$ time python test1.py
>>
>> hello world
>> goodbye world
>>
>>
>> hello world
>> goodbye world
>>
>>
>> hello world
>> goodbye world
>>
>>
>> real 0m5.106s
>> user 0m0.043s
>> sys 0m0.011s
>>
>> [dyoo at shoebox dyoo]$ time python test2.py
>>
>> hello world
>> goodbye world
>>
>>
>> hello world
>> goodbye world
>>
>>
>> hello world
>> goodbye world
>>
>>
>> real 0m15.107s
>> user 0m0.044s
>> sys 0m0.007s
>> ###
>>
>>
>> So for this particular example, we're getting good results: test1 takes
>> about 5 seconds, while test2 takes 15. So the select() code is doing
>> pretty ok so far, and does show improvement over the straightforward
>> approach. Isn't this wonderful? *grin*
>>
>>
>> Well, there's bad news.
>>
>>
>> The problem is that, as you highlighted, the urllib.urlopen()
>> function itself can block, and that's actually a very bad problem
>> in practice. In particular, it blocks until it sees the end of the
>> HTTP headers, since it depends on Python's 'httplib' module.
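That claim is easy to verify without Apache. The sketch below starts a toy HTTP server on a local port that delays its header block by one second (my stand-in for a CGI that doesn't flush), then times how long urlopen() itself stays blocked, separately from the later read():

```python
import socket
import threading
import time
from urllib.request import urlopen   # 'urllib2' on the Python of 2004

# A toy HTTP server that accepts one connection and waits a second
# before sending *any* headers -- mimicking a CGI that doesn't flush.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

def serve_once():
    conn, _ = srv.accept()
    conn.recv(1024)                  # consume the request
    time.sleep(1)                    # delay the header block
    conn.sendall(b"HTTP/1.0 200 OK\r\n"
                 b"Content-Type: text/plain\r\n"
                 b"\r\n"
                 b"hello")
    conn.close()

threading.Thread(target=serve_once).start()

start = time.time()
resp = urlopen("http://127.0.0.1:%d/" % port)   # blocks on headers...
t_open = time.time() - start
body = resp.read()

print(body)                  # -> b'hello'
print(t_open > 0.9)          # -> True: urlopen() waited for the headers
```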
>>
>> If we take the flush() out of our hello.py CGI:
>>
>> ###
>> #!/usr/bin/python
>> import time
>> import sys
>> print "Content-type: text/plain\n\n"
>> print "hello world"
>> time.sleep(5)
>> print "goodbye world"
>> ###
>>
>>
>> then suddenly things go horribly awry:
>>
>> ###
>> [dyoo at shoebox dyoo]$ time python test1.py
>>
>> hello world
>> goodbye world
>>
>>
>> hello world
>> goodbye world
>>
>>
>> hello world
>> goodbye world
>>
>>
>> real 0m15.113s
>> user 0m0.047s
>> sys 0m0.006s
>> ###
>>
>> And suddenly, we do no better than with the serial version!
>>
>>
>> What's happening is that the web server is buffering the output of
>> its CGI programs. Without the sys.stdout.flush(), it's likely that
>> the web server doesn't send out anything until the whole program is
>> complete. But because urllib.urlopen() returns only after seeing
>> the header block of the HTTP response, it actually ends up waiting
>> until the whole program is done.
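The same buffering effect shows up with plain pipes, no web server required. A sketch: the child process below writes two lines without flushing; because its stdout is a pipe rather than a terminal, the output is block-buffered, and the first line doesn't reach the parent until the child exits a second later.

```python
import subprocess
import sys
import time

# A child that prints, sleeps, prints again -- and never flushes.
child_src = (
    "import sys, time\n"
    "sys.stdout.write('hello world\\n')\n"
    "time.sleep(1)\n"
    "sys.stdout.write('goodbye world\\n')\n"
)

p = subprocess.Popen([sys.executable, "-c", child_src],
                     stdout=subprocess.PIPE)
start = time.time()
first = p.stdout.readline()      # blocks: the pipe is block-buffered
elapsed = time.time() - start
p.wait()

print(first)                     # -> b'hello world\n'
print(elapsed > 0.9)             # -> True: nothing arrived until exit
```

Adding a sys.stdout.flush() after the first write in child_src makes the first line arrive immediately, which is exactly the role the flush() plays in the hello.py CGI above.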
>>
>>
>> Not all CGIs have been carefully written to output their HTTP
>> headers in a timely manner, so urllib.urlopen()'s blocking behavior
>> is a show-stopper. This highlights the need for a framework that's
>> built with nonblocking, event-driven code as a pervasive concept.
>> Like... Twisted! *grin*
>>
>> Does anyone want to cook up an example with Twisted to show how the
>> page-grabbing example might work?
>>
>>
>>
>> Hope this helps!
>>
>>
>>
>>
>
> _______________________________________________
> Tutor maillist - Tutor at python.org
> http://mail.python.org/mailman/listinfo/tutor
>