Downloading binary files - Python3

Stefan Behnel stefan_ml at behnel.de
Sat Mar 21 11:12:02 EDT 2009


Anders Eriksson wrote:
> I have made a short program that given an url will download all referenced
> files on that url.
> 
> It works, but I'm thinking it could use some optimization since it's very
> slow.

What's slow about it? Is downloading each file slow, is it the overhead of
connecting to the server before the download, or is it more the feeling
that the overall process could use your bandwidth better?


> I create a list of tuples where each tuple consists of the URL of the file
> and the path where I want to save it, e.g. (http://somewhere.com/foo.mp3,
> c:\Music\foo.mp3)
> 
> The downloading part (which is the part I need help with) looks like this:
> def GetFiles():
>     """do the actual copying of files"""
>     for url,path in hreflist:
>         print(url,end=" ")
>         srcdata = urlopen(url).read()
>         dstfile = open(path,mode='wb')
>         dstfile.write(srcdata)
>         dstfile.close()
>         print("Done!")
> 
> hreflist is the list of tuples.
> 
> At the moment the print(url,end=" ") output does not appear before the
> actual download; instead it appears at the same time as print("Done!").
> I would like it to work the way I intended.
> 
> Is downloading a binary file using: srcdata = urlopen(url).read()
> the best way? Is there some other way that would speed up the downloading?

Yes, there is. Instead of running the downloads in a sequential loop, put the code
for downloading one file into a function and start one thread per file,
each of which runs that function (see the threading module). That way, each
thread can happily sit and wait for data coming from its server, without
preventing other threads from receiving data from their server at the same
time. That should get your bandwidth usage up.
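
A minimal sketch of the one-thread-per-file idea, assuming the same list of
(url, path) tuples as in the original post. The names download_one and
get_files are made up for illustration, and error handling is left out: an
exception in a worker thread is only reported as a traceback, it does not
stop the other downloads.

```python
import threading
from urllib.request import urlopen


def download_one(url, path):
    """Fetch one URL and write the response body to path."""
    with urlopen(url) as response:
        data = response.read()
    with open(path, mode="wb") as dstfile:
        dstfile.write(data)
    # One complete line per file, so output from different threads
    # is less likely to get mixed up mid-line.
    print(url + " Done!")


def get_files(hreflist):
    """Start one thread per (url, path) pair, then wait for all of them."""
    threads = [
        threading.Thread(target=download_one, args=(url, path))
        for url, path in hreflist
    ]
    for thread in threads:
        thread.start()
    # join() blocks until the corresponding download has finished.
    for thread in threads:
        thread.join()
```

Calling get_files(hreflist) then returns once every thread has finished.
The threads spend most of their time waiting on the network, which is why
they can overlap even though only one of them runs Python code at a time.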

Take care not to run too many threads against the same server (which may
get upset and block your requests, depending on the site), and limit the
number of threads when you download a large number of files. Running too
many threads can slow things down again, but you'll see that when you try.
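
One possible way to cap the number of threads is a thread pool. The sketch
below assumes a Python version that ships concurrent.futures (3.2 or later)
and the same list of (url, path) tuples as in the original post. The names
download_one and get_files and the default of 4 workers are arbitrary
choices for illustration, not a recommendation for any particular server.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def download_one(url, path):
    """Fetch one URL, write the response body to path, return the path."""
    with urlopen(url) as response:
        data = response.read()
    with open(path, mode="wb") as dstfile:
        dstfile.write(data)
    return path


def get_files(hreflist, max_workers=4):
    """Download all (url, path) pairs using at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(download_one, url, path) for url, path in hreflist
        ]
        # result() waits for each download and re-raises any exception
        # that occurred in the worker thread.
        return [future.result() for future in futures]
```

With this variant, the right value for max_workers is something to find by
experiment: raise it until the throughput stops improving or the server
starts refusing requests.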

Stefan


