urllib2 performance on windows, usb connection

Fri Feb 6 19:26:40 EST 2009

dq wrote:
 > dq wrote:
 >> MRAB wrote:
 >>> dq wrote:
 >>>> Martin v. Löwis wrote:
 >>>>>> So does anyone know what the deal is with this?  Why is the
 >>>>>> same code so much slower on Windows?  Hope someone can tell
 >>>>>> me before a holy war erupts :-)
 >>>>>
 >>>>> Only the holy war can give an answer here. It certainly has
 >>>>> *nothing* to do with Python; Python calls the operating
 >>>>> system functions to read from the network and write to the
 >>>>> disk almost directly. So it must be the operating system
 >>>>> itself that slows it down.
 >>>>>
 >>>>> To investigate further, you might drop the write operation,
 >>>>> and measure only source.read(). If that is slower, then, for
 >>>>> some reason, the network speed is bad on Windows. Maybe you
 >>>>> have the network interfaces misconfigured? Maybe you are
 >>>>> using wireless on Windows, but cable on Linux? Maybe you have
 >>>>> some network filtering software running on Windows? Maybe
 >>>>> it's just that Windows sucks?-)
 >>>>>
 >>>>> If the network read speed is fine, but writing slows down, I
 >>>>> ask the same questions. Perhaps you have some virus scanner
 >>>>> installed that filters all write operations? Maybe Windows
 >>>>> sucks?
 >>>>>
 >>>>> Regards, Martin
 >>>>>
 >>>>
 >>>> Thanks for the ideas, Martin.  I ran a couple of experiments to
 >>>> find the culprit, by downloading the same 20 MB file from the
 >>>> same fast server. I compared:
 >>>>
 >>>> 1.  DL to HD vs USB iPod.
 >>>> 2.  AV on-access protection on vs. off
 >>>> 3.  "source.read()" only vs. "file.write( source.read() )"
 >>>>
 >>>> The culprit is definitely the write speed on the iPod.  That
 >>>> is, everything runs plenty fast (~1 MB/s down) as long as I'm
 >>>> not writing directly to the iPod.  This is kind of odd, because
 >>>> if I copy the file over from the HD to the iPod using windows
 >>>> (drag-n-drop), it takes about a second or two, so about 10
 >>>> MB/s.
 >>>>
 >>>> So the problem is definitely partially Windows, but it also
 >>>> seems that Python's file.write() function is not without blame.
 >>>> It's the combination of Windows, iPod and Python's data stream
 >>>> that is slowing me down.
 >>>>
 >>>> I'm not really sure what I can do about this.  I'll experiment
 >>>> a little more and see if there's any way around this
 >>>> bottleneck.  If anyone has run into a problem like this, I'd
 >>>> love to hear about it...
 >>>>
 >>> You could try copying the file to the iPod using the command
 >>> line, or copying data from disk to iPod in, say, C, anything but
 >>> Python. This would allow you to identify whether Python itself
 >>> has anything to do with it.
 >>
 >> Well, I think I've partially identified the problem.  target.write(
 >> source.read() ) runs perfectly fast, copies 20 megs in about a
 >> second, from HD to iPod.  However, if I run the same code in a
 >> while loop, using a certain block size, say target.write(
 >> source.read(4096) ), it takes forever (or at least I'm still timing
 >> it while I write this post).
 >>
 >> The mismatch seems to be between urllib2's block size and the write
 >> speed of the iPod.  I might try to tweak this a little in the code
 >> and see if it has any effect.
 >>
 >> Oh, there we go:   20 megs in 135.8 seconds.  Yeah... I might want
 >> to try to improve that...
 >
 > After some tweaking of the block size, I managed to get the DL speed
 > up to about 900 KB/s.  It's still not quite Ubuntu, but it's a good
 > order of magnitude better.  The new DL code is pretty much this:
 >
 > """
 > import urllib2
 >
 > blocksize = 2 ** 16    # plus or minus a power of 2
 > source = urllib2.urlopen( 'url://string' )
 > target = open( pathname, 'wb')
 > fullsize = float( source.info()['Content-Length'] )
 > DLd = 0
 > while DLd < fullsize:
 >     target.write( source.read( blocksize ) )
 >     DLd = DLd + blocksize
 >     # optional:  write some DL progress info
 >     # somewhere, e.g. stdout
 > target.close()
 > source.close()
 > """
 >
I'd like to suggest that the amount you add to 'DLd' be the actual
size of the returned block, in case read() doesn't return all you
asked for (that might not be guaranteed, and the chances are that the
final block will be shorter, unless 'fullsize' happens to be a multiple
of 'blocksize').

If read() returns less than you asked for, the while-loop might finish
before all the data has been downloaded, and if you just add
'blocksize' each time, 'DLd' might end up greater than 'fullsize',
i.e. apparently more than 100% downloaded!
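
Something along these lines is what I have in mind. It is only a
sketch: the helper name copy_in_blocks and the on_progress callback are
made up for illustration, and it assumes 'source' and 'target' are any
binary file-like objects (such as the result of urllib2.urlopen() and a
file opened with 'wb'). The demonstration at the bottom uses in-memory
buffers so it runs without a network connection.

```python
import io


def copy_in_blocks(source, target, blocksize=2 ** 16, on_progress=None):
    """Copy source to target in chunks; return the number of bytes copied.

    Hypothetical helper, not part of urllib2 or any other library.
    """
    copied = 0
    while True:
        block = source.read(blocksize)
        if not block:            # an empty read means end of stream
            break
        target.write(block)
        copied += len(block)     # count what was actually returned
        if on_progress is not None:
            on_progress(copied)  # e.g. print a percentage to stdout
    return copied


# In-memory demonstration: the size is deliberately not a multiple
# of the block size, so the final block is a short one.
data = b"x" * 150000
src = io.BytesIO(data)
dst = io.BytesIO()
progress = []
total = copy_in_blocks(src, dst, on_progress=progress.append)
```

Ending the loop on an empty read, rather than on a byte count compared
with 'Content-Length', also keeps it working if the server doesn't send
that header. If you don't need the progress reports,
shutil.copyfileobj( source, target, blocksize ) from the standard
library does much the same job.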