[Python-ideas] duck typing for io write methods

Steven D'Aprano steve at pearwood.info
Sun Jun 16 07:57:47 CEST 2013


On 16/06/13 15:42, Wolfgang Maier wrote:
> Victor Stinner <victor.stinner at ...> writes:
>>
>> I don't think that converting bytes to str is the bottleneck when you
>> read a long text file... (Reading data from disk is known to be
>> *slow*.)
>>
>  From the io module docs (Python 3.3):
> "Text I/O over a binary storage (such as a file) is significantly slower
> than binary I/O over the same storage, because it requires conversions
> between unicode and binary data using a character codec. This can become
> noticeable handling huge amounts of text data like large log files."


"this can become noticeable" != "this is the bottleneck in your code".

In my recent tests on my PC (Python 3.3 on a 1GB machine), I have found that when reading medium-sized pure ASCII files, the text IO objects are as little as 2-3 times slower than binary, which may be unnoticeable in a real-world application. (Who cares about the difference between 0.03 seconds versus 0.01 seconds in a script that takes a total of 0.2 seconds to run?)
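The comparison above can be reproduced with a minimal sketch like the following, which times a full read of the same ASCII file in text mode versus binary mode. The file contents and size are assumptions for illustration, not the files used in my tests:

```python
import os
import tempfile
import time

# Create a medium-sized pure-ASCII file (roughly 10 MB) to measure.
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt",
                                 delete=False) as f:
    f.write("some plain ascii text, one line of it\n" * 300_000)
    path = f.name

def time_read(mode):
    """Read the whole file in the given mode, returning (seconds, length)."""
    start = time.perf_counter()
    with open(path, mode) as fh:
        data = fh.read()
    return time.perf_counter() - start, len(data)

text_time, text_len = time_read("r")   # text mode: bytes decoded to str
bin_time, bin_len = time_read("rb")    # binary mode: raw bytes, no decoding

print("text: %.4fs  binary: %.4fs (ratio %.1fx)"
      % (text_time, bin_time, text_time / bin_time))

os.remove(path)
```

The exact ratio will vary with the machine, the codec, and the file size; the point is only that for ASCII data the decoding overhead is a small constant factor, not an order of magnitude.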

On the other hand, given a 400MB avi file, reading it as UTF-8 with errors='ignore' is up to EIGHTY times slower than reading it as a binary file. (Hardly surprising, given the vast number of UTF-8 errors that are likely to be found.)
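The mechanism behind that slowdown is easy to see in isolation: decoding arbitrary binary data as UTF-8 with errors='ignore' forces the codec to detect and skip every invalid byte sequence. A small sketch (using a synthetic 1 MB blob rather than a real .avi file, which is an assumption for illustration):

```python
import time

# Arbitrary binary data: every byte value, repeated to ~1 MB.
# Roughly half of these bytes are invalid as UTF-8.
blob = bytes(range(256)) * 4096

start = time.perf_counter()
text = blob.decode("utf-8", errors="ignore")  # skip every invalid sequence
elapsed = time.perf_counter() - start

print("decoded %d bytes -> %d chars in %.4fs"
      % (len(blob), len(text), elapsed))
```

Every ignored error is extra work the binary read never has to do, which is why the gap widens so dramatically on non-text data.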

My gut feeling is that if your file is actually ASCII, and you read it line-by-line rather than all at once, there may be a small speedup from reading it as a binary file and working with bytes directly, but probably not as much as you expect. But I wouldn't be confident without actually profiling your code. As always, if you optimize based on a wild guess as to what you need to optimize, then you risk wasting your time and effort, or worse, risk actually ending up with even slower code.
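When in doubt, profile before rewriting. A generic cProfile sketch of the kind of measurement I mean (the read_lines function and its input are placeholders, not code from this thread):

```python
import cProfile
import io
import pstats

def read_lines(text):
    # Stand-in for a line-by-line processing loop over file-like input.
    total = 0
    for line in io.StringIO(text):
        total += len(line)
    return total

profiler = cProfile.Profile()
profiler.enable()
read_lines("spam\n" * 100_000)
profiler.disable()

# Show the five most expensive calls by cumulative time.
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(5)
```

If the decode step doesn't show up near the top of the profile, switching to binary reads won't buy you anything.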



-- 
Steven

