[python-win32] File I/O on Windows XP
Dave Angel
davea at ieee.org
Mon Jun 15 15:23:21 CEST 2009
Tony Cappellini wrote:
> I was trying to see if I could speed up processing huge files (in the
> 10's of Gigabytes)
> by passing various values to the readline() method of the file object.
>
> No matter what I passed to readline(), every call was slower than
> passing no argument at all.
>
> I've used a SATA bus analyzer to see what kind of block sizes Windows
> was using for the reads, and typically for non-AHCI reads, the block
> sizes were surprisingly small. 8 blocks were typical.
>
> It's somewhat difficult to separate I/O traffic that windows is
> routinely doing even when you're not reading a specific file, so
> putting the file on a secondary drive will eliminate most of that
> traffic.
>
> Can anyone explain why passing any positive value to
> readline() makes the file processing slower instead of faster?
>
>
Critical question: what version of Python? Print out sys.version for us.
Have you actually measured to be sure the bulk of the time is not spent
in processing the lines? And if you have, have you also measured the
time spent in a dummy loop of simple read() calls, to make sure it isn't
mostly the disk time?
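To separate those costs, a minimal timing harness along these lines might help (the function name and file path here are hypothetical, just a sketch):

```python
import time

def time_reads(path, size=None):
    """Time one full pass over a file with readline(),
    optionally passing a size argument to each call."""
    start = time.time()
    count = 0
    with open(path, "rb") as f:
        while True:
            line = f.readline() if size is None else f.readline(size)
            if not line:
                break
            count += 1
    return time.time() - start, count
```

Running it once with no size and once with various sizes, on the same file, would show whether the size argument itself is what slows things down.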
I'll assume you've somehow narrowed it down to readline() itself. If
you haven't, this is mostly a waste of time. (e.g., if half the time is
spent waiting for the disk, you're a candidate for multi-threading, and
get_line() specifically allows threads for overlapping I/O.)
file.readline is apparently written in C; you could look up the sources
to it. I could make a wild guess and say that any tight loop that's
checking two termination conditions instead of one would be slower. If
it were my code, I might read into a buffer, append a newline to the
buffer, and search the buffer for successive newlines. If the user
gives me a count, it would slow down that inner loop.
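That buffer-and-split idea might be sketched like this (a simplified illustration of the approach, not Python's actual implementation):

```python
def iter_lines(f, bufsize=8192):
    """Yield lines by reading fixed-size chunks and splitting on
    newlines manually; only one termination condition per chunk."""
    leftover = b""
    while True:
        chunk = f.read(bufsize)
        if not chunk:
            if leftover:
                yield leftover  # final line with no trailing newline
            return
        pieces = (leftover + chunk).split(b"\n")
        leftover = pieces.pop()  # last piece may be a partial line
        for piece in pieces:
            yield piece + b"\n"
```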
But I'm guessing your performance is being hampered by the newline
logic. For example, specifying "U" (universal newlines) on the open call
will slow things down. What are you doing to make that quicker? Are the
files in ASCII? Are they in Unix or Windows mode (the "b" flag on open)?
Are you interpreting the file as UTF-8? Are you in Python 2.x or 3.x?
For example, if you're opening a Windows-style text file, with crlf at
the end of each line, you might try timing readline() with the "b" flag
on, even though that'll give you \r\n instead of \n at the end of each
line. Depending on how you're parsing the line, that might not slow
down your other logic at all. So if it speeds up readline() it might be
worth it.
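For instance, the parsing side can be made ending-agnostic so binary mode costs nothing downstream (a hypothetical helper, just to show the rstrip trick):

```python
def parse_binary_lines(path):
    """Read in binary mode; rstrip(b"\r\n") tolerates either
    \n or \r\n endings, so the parsing logic is unchanged."""
    results = []
    with open(path, "rb") as f:
        for raw in f:
            results.append(raw.rstrip(b"\r\n"))
    return results
```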
Have you investigated using readlines(), with a specified buffer size?
Obviously you can't (on Win32 anyway) do a single call with the default
argument. But that call is mentioned (in fileinput.py) as "a
significant speedup." And elsewhere I think the docs imply that "for
line in myfile" would be fastest.
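A readlines() loop with an explicit sizehint, along the lines fileinput.py uses, might look like this (the 64 KB buffer size is just a guess to tune against your drive):

```python
def count_lines(path, bufsize=64 * 1024):
    """Process a huge file in batches of lines fetched with
    readlines(sizehint), so memory use stays bounded."""
    total = 0
    with open(path, "rb") as f:
        while True:
            lines = f.readlines(bufsize)
            if not lines:
                break
            total += len(lines)
    return total
```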
Looking at the source:
in Objects\fileobject.c, function file_readline() seems to be the one
we want; it delegates to get_line(). get_line() has a very simple loop
if you're not in universal newline mode.
Another interesting tidbit in that file:
/* A larger buffer size may actually decrease performance. */
#define READAHEAD_BUFSIZE 8192