[Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader

Victor Stinner victor.stinner at haypocalc.com
Fri May 27 15:29:15 CEST 2011


On Friday 27 May 2011 at 10:17:29, M.-A. Lemburg wrote:
> > I think that the readahead algorithm is much faster than trying to
> > avoid partial input, and it's not a problem to have partial input if you
> > use an incremental decoder.
> 
> Depends on where you're coming from. For non-seekable streams
> such as sockets or pipes, readahead is not going to work.

I don't see how StreamReader/StreamWriter can do a better job than 
TextIOWrapper for non-seekable streams.
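
For what it's worth, reading ahead does not require seeking: an incremental 
decoder handles partial multibyte sequences on any stream. A minimal sketch 
(the encoding and the byte-at-a-time feeding are arbitrary choices, just for 
illustration):

    import codecs

    decoder = codecs.getincrementaldecoder("utf-8")("strict")
    data = "héllo wörld".encode("utf-8")
    # Feed the decoder one byte at a time: incomplete multibyte
    # sequences are buffered internally and decoded once complete.
    chars = []
    for i in range(len(data)):
        chars.append(decoder.decode(data[i:i+1], final=False))
    chars.append(decoder.decode(b"", final=True))
    print("".join(chars))  # héllo wörld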

> > TextIOWrapper implements this optimization using its readahead
> > algorithm.
> 
> It does yes, but the above was an optimization specific
> to single character encodings, not all encodings and
> TextIOWrapper doesn't know anything about specific characteristics
> of the underlying encodings (except perhaps a few special
> cases).

Please give me numbers: how fast are your suggested optimizations? Are they 
faster than readahead? Your whole argument rests on theoretical claims.

> > Do you mean that you would like to reimplement codecs in C?
> 
> As use of Unicode codecs increases in Python applications,
> this would certainly be an approach to consider, yes.

I am not sure that StreamReader is, or can be, faster than TextIOWrapper even 
if it is reimplemented in C (see the updated benchmark below, codecs vs _pyio).

> > test_io.read(): 3991.0 ms
> > test_codecs.read(): 1736.9 ms
> > -> codecs 130% FASTER than io
> 
> No surprise here. It's also a very common use case
> to read the whole file in one go and the bigger
> the file, the more impact this has.

Oh, I now understand why codecs is always faster than _pyio (or even io): it's 
because of IncrementalNewlineDecoder. To be fair, the read(-1) case should be 
tested without IncrementalNewlineDecoder, e.g. with newline='\n'.

newline='' cannot be used for the read(-1) test: even though newline='' 
indicates that we don't want to translate newlines, read(-1) still goes 
through the IncrementalNewlineDecoder (which is slower than not calling it at 
all). We may optimize this specific case in TextIOWrapper.
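
The difference can be seen directly with TextIOWrapper; a small sketch (the 
sample data is arbitrary), relying on the CPython detail that newline='\n' 
lets TextIOWrapper skip the IncrementalNewlineDecoder entirely:

    import io

    data = b"some text\n" * 10000

    # newline='' keeps newlines untranslated, but the data still goes
    # through IncrementalNewlineDecoder so that readline() can split
    # on any newline style ('\n', '\r', '\r\n').
    f1 = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", newline="")

    # newline='\n' promises the file only uses '\n', so TextIOWrapper
    # can decode without wrapping the codec in IncrementalNewlineDecoder.
    f2 = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", newline="\n")

    assert f1.read() == f2.read()  # same result, different code path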

> > (3) Decode Lib/test/cjkencodings/gb18030.txt (501 characters) from
> > gb18030
> > 
> > test_io.readline(): 38.9 ms
> > test_codecs.readline(): 15.1 ms
> > -> codecs 157% FASTER than io
> > 
> > test_io.read(1): 369.8 ms
> > test_codecs.read(1): 302.2 ms
> > -> codecs 22% FASTER than io
> > 
> > test_io.read(100): 258.2 ms
> > test_codecs.read(100): 155.1 ms
> > -> codecs 67% FASTER than io
> > 
> > test_io.read(): 1803.2 ms
> > test_codecs.read(): 1002.9 ms
> > -> codecs 80% FASTER than io
> 
> These results are interesting since gb18030 is a shift
> encoding which keeps state in the encoded data stream, so
> the strategy chosen by TextIOWrapper doesn't work out that
> well.

In the 4 tests, TextIOWrapper only calls the decoder *once*, on the whole 
content of the file. The file size is 864 bytes, which is smaller than the 
TextIOWrapper chunk size (2048 bytes).
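
(The chunk size is easy to check; _CHUNK_SIZE is a CPython implementation 
detail, not a public API:)

    with open("Lib/test/cjkencodings/gb18030.txt",
              encoding="gb18030", newline="\n") as f:
        # Number of bytes TextIOWrapper reads from the buffered layer
        # per call to the decoder (CPython implementation detail).
        print(f._CHUNK_SIZE)  # 2048 at the time of writing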

The StreamReader of the gb18030 codec is implemented in C, not in Python 
(using multibytecodec.c). So to be fair, the test on this encoding should be 
done using io, not _pyio.

Moreover, the multibytecodec module doesn't support universal newlines! It 
only supports '\n' newlines. So to be more fair, the test should use 
newline='\n'.

That's one more reason to use TextIOWrapper instead of StreamReader: it has 
the same behaviour (universal newlines) for all encodings. Or is it yet 
another bug in StreamReader?
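
The TextIOWrapper side of this is easy to demonstrate: it translates all 
three newline styles by default, whatever the encoding (a small sketch with 
arbitrary data):

    import io

    data = b"one\r\ntwo\rthree\n"
    # Default newline=None: universal newline translation, regardless
    # of the underlying codec.
    f = io.TextIOWrapper(io.BytesIO(data), encoding="gb18030")
    print(f.readlines())  # ['one\n', 'two\n', 'three\n']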

> I am still -1 on deprecating the StreamReader/Writer parts of
> the codec APIs. I've given numerous reasons why these are
> useful, what their intention is, and why they were added to Python 1.6.

codecs.open() now uses TextIOWrapper, so there is no good reason to keep 
StreamReader or StreamWriter. You did not give me any use case where 
StreamReader or StreamWriter should be used instead of TextIOWrapper. You only 
listed theoretical optimizations.

You have until the release of Python 3.3 to prove that StreamReader and/or 
StreamWriter can be faster than TextIOWrapper. If you can prove it with a 
patch and a benchmark, I will be OK with reverting my commit.

> Since such a deprecation would change an important documented API,
> please write a PEP outlining your reasoning, including my comments,
> use cases and possibilities for optimizations.

OK, I will write a PEP explaining why StreamReader and StreamWriter are 
deprecated.

-----------

I wrote a new benchmarking script which tries to compare codecs to io/_pyio 
more closely (it changes the newline value and uses io for gb18030). It 
should be a little more reliable because each test now runs 5 times (keeping 
the smallest time), but it's still not really reliable... The script is 
attached to this mail.
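
The timing strategy is the usual best-of-N pattern; a minimal sketch of the 
idea (bench.py is the real script, the names here are illustrative):

    import time

    def bench(func, loops, repeat=5):
        # Run the test `repeat` times and keep the fastest run, to
        # reduce noise from other processes.
        best = None
        for _ in range(repeat):
            start = time.time()
            for _ in range(loops):
                func()
            elapsed = time.time() - start
            if best is None or elapsed < best:
                best = elapsed
        return best * 1e3  # milliseconds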



(1) Decode Objects/unicodeobject.c (317334 characters) from utf-8

_pyio.readline(): 1078.4 ms (8 loops, newline: '')
codecs.readline(): 983.0 ms (8 loops, newline: '')
-> codecs 10% FASTER than _pyio

_pyio.read(1): 3503.5 ms (2 loops, newline: '')
codecs.read(1): 6626.7 ms (2 loops, newline: '')
-> codecs 89% slower than _pyio

_pyio.read(100): 2076.2 ms (80 loops, newline: '')
codecs.read(100): 2870.8 ms (80 loops, newline: '')
-> codecs 38% slower than _pyio

_pyio.read(): 1698.0 ms (800 loops, newline: '\n')
codecs.read(): 1686.4 ms (800 loops, newline: '\n')
-> codecs 1% FASTER than _pyio

(2) Decode Lib/test/cjkencodings/gb18030.txt (501 characters) from gb18030

io.readline(): 5.1 ms (80 loops, newline: '\n')
codecs.readline(): 6.8 ms (80 loops, newline: '\n')
-> codecs 34% slower than io

io.read(1): 5.6 ms (20 loops, newline: '\n')
codecs.read(1): 45.5 ms (20 loops, newline: '\n')
-> codecs 705% slower than io

io.read(100): 54.2 ms (800 loops, newline: '\n')
codecs.read(100): 56.7 ms (800 loops, newline: '\n')
-> codecs 5% slower than io

io.read(): 395.8 ms (8000 loops, newline: '\n')
codecs.read(): 309.2 ms (8000 loops, newline: '\n')
-> codecs 28% FASTER than io

(3) Decode README (6613 characters) from ascii

_pyio.readline(): 385.9 ms (160 loops, newline: '')
codecs.readline(): 384.5 ms (160 loops, newline: '')
-> codecs 0% FASTER than _pyio

_pyio.read(1): 1473.6 ms (40 loops, newline: '')
codecs.read(1): 1913.9 ms (40 loops, newline: '')
-> codecs 30% slower than _pyio

_pyio.read(100): 1081.0 ms (1600 loops, newline: '')
codecs.read(100): 1325.6 ms (1600 loops, newline: '')
-> codecs 23% slower than _pyio

_pyio.read(): 1570.9 ms (16000 loops, newline: '\n')
codecs.read(): 1518.8 ms (16000 loops, newline: '\n')
-> codecs 3% FASTER than _pyio

codecs is still faster in 4 cases:
* ascii, read(): 3% faster than _pyio
* utf-8, readline(): 10% faster than _pyio
* utf-8, read(): 1% faster than _pyio
* gb18030, read(): 28% faster than io (!)

The last one is interesting and should be analyzed.

----

Even if it's not a fair comparison, here is the benchmark using io for ASCII 
and UTF-8 (gb18030 already used io, for the reasons explained above):

(1) Decode Objects/unicodeobject.c (317334 characters) from utf-8

io.readline(): 52.0 ms (8 loops, newline: '')
codecs.readline(): 1001.0 ms (8 loops, newline: '')
-> codecs 1825% slower than io

io.read(1): 265.7 ms (2 loops, newline: '')
codecs.read(1): 6734.5 ms (2 loops, newline: '')
-> codecs 2434% slower than io

io.read(100): 269.4 ms (80 loops, newline: '')
codecs.read(100): 2881.6 ms (80 loops, newline: '')
-> codecs 970% slower than io

io.read(): 1628.9 ms (800 loops, newline: '\n')
codecs.read(): 1692.8 ms (800 loops, newline: '\n')
-> codecs 4% slower than io

(3) Decode README (6613 characters) from ascii

io.readline(): 25.7 ms (160 loops, newline: '')
codecs.readline(): 415.5 ms (160 loops, newline: '')
-> codecs 1516% slower than io

io.read(1): 153.3 ms (40 loops, newline: '')
codecs.read(1): 2243.6 ms (40 loops, newline: '')
-> codecs 1363% slower than io

io.read(100): 210.2 ms (1600 loops, newline: '')
codecs.read(100): 1521.9 ms (1600 loops, newline: '')
-> codecs 624% slower than io

io.read(): 1100.1 ms (16000 loops, newline: '\n')
codecs.read(): 1501.1 ms (16000 loops, newline: '\n')
-> codecs 36% slower than io

So if you compare codecs to io (and not _pyio), codecs is only faster (28%) 
in one case: reading the whole content of the file with a multibyte codec.

Note that the codecs module is 2434% slower than io when reading a UTF-8 file 
character by character (which is stupid, don't do that! :-)), and 1825% slower 
when reading line by line.

Victor
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bench.py
Type: text/x-python
Size: 3326 bytes
URL: <http://mail.python.org/pipermail/python-dev/attachments/20110527/8b9511f7/attachment.py>

