[Chicago] how to use multithread to download?

Dale Sedivec dale at codefu.org
Fri Jun 17 17:15:32 CEST 2011


2011/6/17 守株待兔 <1248283536 at qq.com>:
> I have written a program to download an online book:
> http://www.network-theory.co.uk/docs/pytut/
>
> import time
> import urllib
> import lxml.html
> import os
> time1=time.time()
> os.mkdir('/tmp/python')
> down='http://www.network-theory.co.uk/docs/pytut/'
> file=urllib.urlopen(down).read()
> root=lxml.html.fromstring(file)
> tnodes = root.xpath("//div[@class='main']//ul/li/a")
> for x in tnodes:
>   url='http://www.network-theory.co.uk/docs/pytut/'+x.get('href')
>   name=x.text
>   myfile=open('/tmp/python/'+name,'w')
>   page=urllib.urlopen(url).read()
>   myfile.write(page)
>   myfile.close()
> time2=time.time()
> print time2-time1
>
> It's slow. Would you mind revising it to use multithreading?

Are you sure the person running this site would welcome a flurry of
parallel hits from you, especially to download a book they're giving
away in the first place?  My initial reaction is that, as a matter of
politeness, you should not parallelize this task.  Your bottleneck is
almost certainly the HTTP request/response round trip; nothing on your
side is especially CPU- or I/O-intensive.  And I'd be surprised if
that page has more than 150 links, so it can't take _that_ long to
download them sequentially, right?

Approaching this solely as a hypothetical exercise in parallel
processing with Python, I'd use something like multiprocessing.Pool
from the standard library (Python 2.6 or later): probably Pool.map
calling a small function that fetches and stores each URL (i.e. most
of the body of that loop), perhaps with a smallish chunksize.  Note
that this will actually use separate processes, not threads, but I
don't see how that would matter in this case.
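As a rough sketch of that approach (written for Python 3, where
urllib.urlopen became urllib.request.urlopen; the helper names
make_jobs and fetch, the pool size of 4, and the chunksize of 2 are
all my own arbitrary choices, not anything blessed by the library):

```python
import os
from multiprocessing import Pool
from urllib.request import urlopen

BASE = 'http://www.network-theory.co.uk/docs/pytut/'
DEST = '/tmp/python'

def make_jobs(base, dest, links):
    # Turn (href, name) pairs scraped from the index page into
    # (full URL, local file path) download jobs.
    return [(base + href, os.path.join(dest, name))
            for href, name in links]

def fetch(job):
    # Worker run in a pool process: download one URL, write it to disk.
    url, path = job
    data = urlopen(url).read()
    with open(path, 'wb') as f:
        f.write(data)

if __name__ == '__main__':
    # links would be filled in with (href, name) pairs from the
    # lxml scrape shown in the original program.
    links = []
    jobs = make_jobs(BASE, DEST, links)
    with Pool(4) as pool:                       # four worker processes
        pool.map(fetch, jobs, chunksize=2)      # smallish chunksize
```

If you really did want threads instead of processes,
multiprocessing.dummy.Pool offers the same interface backed by
threads, which is arguably a better fit for network-bound work like
this anyway.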

But please don't use this knowledge to download this book in parallel
unless you know the people that run that site wouldn't mind.

Dale

