[python-win32] File I/O on Windows XP
Dave Angel
davea at ieee.org
Mon Jun 15 15:23:21 CEST 2009
Tony Cappellini wrote:
> I was trying to see if I could speed up processing huge files (in the
> 10's of Gigabytes)
> by passing various values to the readline() method of the file object.
>
> No matter what I passed to readline(), every call was slower than
> passing no argument at all.
>
> I've used a SATA bus analyzer to see what kind of block sizes Windows
> was using for the reads, and typically for non-AHCI reads, the block
> sizes were surprisingly small. 8 blocks were typical.
>
> It's somewhat difficult to separate I/O traffic that windows is
> routinely doing even when you're not reading a specific file, so
> putting the file on a secondary drive will eliminate most of that
> traffic.
>
> Can anyone explain why passing any positive value to
> readline() makes the file processing slower instead of faster?
>
>
Critical question: what version of Python? Print out sys.version for us.
Have you actually measured to be sure the bulk of the time is not spent
in processing the lines? And if you have, have you also measured the
time spent in a dummy loop of simple read() calls, to make sure it isn't
mostly the disk time?
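To separate those costs, a minimal timing harness along these lines might help (the function name and file path here are hypothetical, just a sketch):

```python
import time

def time_reads(path, size=None):
    """Time one full pass over a file with readline(),
    optionally passing a size argument to each call."""
    start = time.time()
    count = 0
    with open(path, "rb") as f:
        while True:
            line = f.readline() if size is None else f.readline(size)
            if not line:
                break
            count += 1
    return time.time() - start, count
```

Running it once with no size and once with various sizes, on the same file, would show whether the size argument itself is what slows things down.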
I'll assume you've somehow narrowed it down to readline() itself. If
you haven't, this is mostly a waste of time. (e.g., if half the time is
spent waiting for the disk, you're a candidate for multi-threading, and
get_line() specifically allows threads for overlapping I/O.)
file.readline is apparently written in C; you could look up the sources
to it. I could make a wild guess and say that any tight loop that's
checking two termination conditions instead of one would be slower. If
it were my code, I might read into a buffer, append a newline to the
buffer, and search the buffer for successive newlines. If the user
gives me a count, it would slow down that inner loop.
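That buffer-and-split idea might be sketched like this (a simplified illustration of the approach, not Python's actual implementation):

```python
def iter_lines(f, bufsize=8192):
    """Yield lines by reading fixed-size chunks and splitting on
    newlines manually; only one termination condition per chunk."""
    leftover = b""
    while True:
        chunk = f.read(bufsize)
        if not chunk:
            if leftover:
                yield leftover  # final line with no trailing newline
            return
        pieces = (leftover + chunk).split(b"\n")
        leftover = pieces.pop()  # last piece may be a partial line
        for piece in pieces:
            yield piece + b"\n"
```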
But I'm guessing your performance is being hampered by the newline
logic. For example, specifying "U" (universal newlines) on the open call
will slow things down. What are you doing to make that quicker? Are the
files in ASCII? Are they in Unix or Windows mode (the "b" flag on open)?
Are you interpreting the file as UTF-8? Are you in Python 2.x or 3.x?
For example, if you're opening a Windows-style text file, with crlf at
the end of each line, you might try timing readline() with the "b" flag
on, even though that'll give you \r\n instead of \n at the end of each
line. Depending on how you're parsing the line, that might not slow
down your other logic at all. So if it speeds up readline() it might be
worth it.
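For instance, the parsing side can be made ending-agnostic so binary mode costs nothing downstream (a hypothetical helper, just to show the rstrip trick):

```python
def parse_binary_lines(path):
    """Read in binary mode; rstrip(b"\r\n") tolerates either
    \n or \r\n endings, so the parsing logic is unchanged."""
    results = []
    with open(path, "rb") as f:
        for raw in f:
            results.append(raw.rstrip(b"\r\n"))
    return results
```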
Have you investigated using readlines(), with a specified buffer size?
Obviously you can't (on Win32 anyway) do a single call with the default
argument. But that call is mentioned (in fileinput.py) as "a
significant speedup." And elsewhere I think the docs imply that "for
line in myfile" would be fastest.
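A readlines() loop with an explicit sizehint, along the lines fileinput.py uses, might look like this (the 64 KB buffer size is just a guess to tune against your drive):

```python
def count_lines(path, bufsize=64 * 1024):
    """Process a huge file in batches of lines fetched with
    readlines(sizehint), so memory use stays bounded."""
    total = 0
    with open(path, "rb") as f:
        while True:
            lines = f.readlines(bufsize)
            if not lines:
                break
            total += len(lines)
    return total
```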
Looking at the source:
in Objects\fileobject.c, function file_readline() seems to be the one
we want; it delegates to get_line(). get_line() has a very simple loop
if you're not in universal newline mode.
Another interesting tidbit in that file:
/* A larger buffer size may actually decrease performance. */
#define READAHEAD_BUFSIZE 8192