[Python-ideas] duck typing for io write methods

Wolfgang Maier wolfgang.maier at biologie.uni-freiburg.de
Sun Jun 16 09:41:25 CEST 2013


Steven D'Aprano <steve at ...> writes:

> 
> On 16/06/13 15:42, Wolfgang Maier wrote:
> > Victor Stinner <victor.stinner <at> ...> writes:
> >>
> >> I don't think that converting bytes to str is the bottleneck when you
> >> read a long text file... (Reading data from disk is known to be
> >> *slow*.)
> >>
> >  From the io module docs (Python 3.3):
> > "Text I/O over a binary storage (such as a file) is significantly slower
> > than binary I/O over the same storage, because it requires conversions
> > between unicode and binary data using a character codec. This can become
> > noticeable handling huge amounts of text data like large log files."
> 
> "this can become noticeable" != "this is the bottleneck in your code".
> 
> In my recent tests on my PC (Python 3.3 on a 1GB machine), I have found
> that when reading medium-sized pure ASCII files, the text IO objects are
> as little as 2-3 times slower than binary, which may be unnoticeable for
> a real-world application. (Who cares about the difference between 0.03
> second versus 0.01 second in a script that takes a total of 0.2 second
> to run?)
> 
> On the other hand, given a 400MB avi file, reading it as UTF-8 with
> errors='ignore' is up to EIGHTY times slower than reading it as a binary
> file. (Hardly surprising, given the vast number of UTF-8 errors that are
> likely to be found.)
> 
> My gut feeling is that if your file is actually ASCII, and you read it
> line-by-line rather than all at once, there may be a small speedup from
> reading it as a binary file and working with bytes directly, but probably
> not as much as you expect. But I wouldn't be confident without actually
> profiling your code. As always, if you optimize based on a wild guess as
> to what you need to optimize, then you risk wasting your time and effort,
> or worse, risk actually ending up with even slower code.
> 

Well, yes, here is some real timing data from my machine.
In initial tests I had found text IO to be quite a bit slower than binary
IO, but it turned out that this is only true when the file is small enough
to fit in the OS's IO cache, which of course it will be if you time the same
code repeatedly. In that case I found, much like Steven, a speed difference
of roughly 100% (text IO taking about twice as long). However, when I
repeated the test with a file larger than my system's memory (which
effectively wipes out the cache between repeated trials), text and binary
mode were *equal* (and both about 20x slower than with caching).
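
For anyone who wants to reproduce this, a minimal sketch along the following
lines should do. The file name is just a placeholder; any plain-ASCII file
considerably larger than RAM will defeat the cache between runs:

import time

FILENAME = "bigfile.txt"  # placeholder: any ASCII file larger than RAM

def read_binary(name):
    # iterate line by line over the raw bytes
    n = 0
    with open(name, "rb") as f:
        for line in f:
            n += 1
    return n

def read_text(name):
    # same iteration, but decoding to str through the text IO layer
    n = 0
    with open(name, "r", encoding="ascii") as f:
        for line in f:
            n += 1
    return n

for func in (read_binary, read_text):
    start = time.time()
    lines = func(FILENAME)
    print(func.__name__, lines, "lines in", time.time() - start, "seconds")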

Summary: *Victor is right* (except in the cached case, but then, as Steven
says, under those conditions the speed difference isn't that important
anyway). Oh, and I tested under both Python 3.2 and 3.3, and their behavior
is identical.

Best,
Wolfgang


