[Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader

Wed May 25 13:10:51 CEST 2011

On Wednesday 25 May 2011 at 11:38 +0200, M.-A. Lemburg wrote:
> You are missing the point: we have StreamReader and StreamWriter APIs
> on codecs to allow each codec to implement more efficient ways of
> encoding and decoding streams.
> 
> Examples of such optimizations are reading the stream in
> chunks that can be decoded in one piece, or writing to the stream
> in a way that doesn't generate encoding state problems on the
> receiving end by ending transmission half-way through a
> shift block.
> 
> ...
> 
> We don't have many such specialized implementations in the stdlib,
> but this doesn't mean that there's no use for them. It
> just means that developers and users are simply unaware of the
> possibilities opened by these stateful stream APIs.

Does at least one codec implement such an optimization in its
StreamReader or StreamWriter class? And can't we implement such
optimizations in incremental encoders and decoders (or in TextIOWrapper)?

I checked all multibyte codecs (UTF and CJK codecs) and I don't see any
such optimization. UTF codecs handle the BOM, but don't have anything
that looks like an optimization. CJK codecs use multibytecodec,
MultibyteStreamReader and MultibyteStreamWriter, which don't appear to be
optimized. But maybe I missed something?

TextIOWrapper has an advanced buffering algorithm that prefetches
(reads ahead) some bytes at each read to speed up small reads. Such an
algorithm is difficult to implement, but it's done and it works.
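
As an illustration of the incremental decoder question above, here is a
minimal sketch (not taken from the stdlib sources) showing that the
decoder returned by codecs.getincrementaldecoder() keeps its state across
chunk boundaries, so a multibyte sequence split between two reads is
still decoded correctly. The helper name decode_in_chunks, the sample
text and the chunk size are arbitrary choices made for this example.

```python
import codecs


def decode_in_chunks(data, encoding, chunk_size):
    """Decode bytes chunk by chunk with a stateful incremental decoder."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # Incomplete trailing bytes are kept inside the decoder until
        # the rest of the sequence arrives with a later chunk.
        pieces.append(decoder.decode(chunk))
    # final=True flushes the decoder and raises on truncated input.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


text = "caf\u00e9 \u4e2d\u6587"
# chunk_size=1 forces every multibyte sequence to be split.
assert decode_in_chunks(text.encode("utf-8"), "utf-8", 1) == text
assert decode_in_chunks(text.encode("gb18030"), "gb18030", 1) == text
```

Any chunking optimization built on top of this would only have to choose
where to cut the input; the decoder itself takes care of the pending
bytes.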

--

OK, let's stop talking about theoretical optimizations, and let's run a
benchmark comparing the codecs and io modules on reading files!

I tested Python 3.3 (70370:178d367c9733) compiled in release mode (gcc
-O3) on a Pentium4 @ 3 GHz with 2 GB of memory. I manually tuned the
number of loops to ensure that the fastest test takes at least one
second. I only ran my benchmark once. See the attached bench.py file.
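
For readers who cannot fetch the attachment, the following is a rough
sketch of how such a comparison could be set up for the readline() case.
It is not the attached bench.py: the function names, the loop count and
the temporary sample file are assumptions made for this example. It uses
codecs.getreader() to obtain the codec's StreamReader class directly and
wraps a binary file with it.

```python
import codecs
import io
import os
import tempfile
import time


def read_lines_io(path, encoding):
    # io.open() returns a TextIOWrapper for text mode.
    with io.open(path, encoding=encoding) as f:
        while f.readline():
            pass


def read_lines_codecs(path, encoding):
    # codecs.getreader() returns the codec's StreamReader class.
    with open(path, "rb") as raw:
        reader = codecs.getreader(encoding)(raw)
        while reader.readline():
            pass


def bench(func, path, encoding, loops):
    """Return the total time in milliseconds for `loops` runs."""
    start = time.perf_counter()
    for _ in range(loops):
        func(path, encoding)
    return (time.perf_counter() - start) * 1000.0


def compare(path, encoding, loops=10):
    return {
        "io": bench(read_lines_io, path, encoding, loops),
        "codecs": bench(read_lines_codecs, path, encoding, loops),
    }


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("some line of text\n" * 1000)
    try:
        print(compare(tmp.name, "utf-8"))
    finally:
        os.unlink(tmp.name)
```

Timings from a sketch like this depend on the machine, the Python
version and the file, so they should not be compared directly with the
numbers below.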

(1) Decode Objects/unicodeobject.c (317336 characters) from utf-8

test_io.readline(): 89.6 ms
test_codecs.readline(): 1272.8 ms
-> codecs 1320% slower than io

test_io.read(1): 1728.9 ms
test_codecs.read(1): 36395.0 ms
-> codecs 2005% slower than io

test_io.read(100): 460.7 ms
test_codecs.read(100): 3897.0 ms
-> codecs 746% slower than io

test_io.read(-1): 1911.7 ms
test_codecs.read(-1): 1740.7 ms
-> codecs 10% FASTER than io

(2) Decode README (6613 characters) from ascii

test_io.readline(): 109.9 ms
test_codecs.readline(): 1023.8 ms
-> codecs 832% slower than io

test_io.read(1): 1560.4 ms
test_codecs.read(1): 29402.6 ms
-> codecs 1784% slower than io

test_io.read(100): 866.9 ms
test_codecs.read(100): 3699.5 ms
-> codecs 327% slower than io

test_io.read(-1): 5140.2 ms
test_codecs.read(-1): 4817.9 ms
-> codecs 7% FASTER than io

(3) Decode Lib/test/cjkencodings/gb18030.txt (501 characters) from
gb18030

test_io.readline(): 1193.7 ms
test_codecs.readline(): 1474.3 ms
-> codecs 24% slower than io

test_io.read(1): 3847.7 ms
test_codecs.read(1): 27103.9 ms
-> codecs 604% slower than io

test_io.read(100): 12839.5 ms
test_codecs.read(100): 13444.2 ms
-> codecs 5% slower than io

test_io.read(-1): 2183.3 ms
test_codecs.read(-1): 1906.1 ms
-> codecs 15% FASTER than io

The readahead code really does help read(1): codecs takes between 7 and
21 times as long as io. It also helps a more common use case,
readline(): codecs takes between 1.2 and 14 times as long as io.

codecs is always faster (between 1.07 and 1.15 times faster than io) at
reading the whole content of a file using read(-1). Maybe something
should be optimized in TextIOWrapper.read() ;-) But the gain is minor
compared to the gain on read(1) and readline()!
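
The "N% slower" figures in the results appear consistent with the usual
relative difference, (slow - fast) / fast. The sketch below spells out
that arithmetic on the utf-8 readline() timings; the function names are
made up for this example and the exact rounding used by bench.py is an
assumption.

```python
def percent_slower(slow_ms, fast_ms):
    """Relative difference of the slower timing, in percent."""
    return (slow_ms - fast_ms) / fast_ms * 100.0


def times_as_long(slow_ms, fast_ms):
    """Plain ratio of the two timings."""
    return slow_ms / fast_ms


# utf-8 readline(): io 89.6 ms, codecs 1272.8 ms
print(percent_slower(1272.8, 89.6))  # roughly 1320 (% slower)
print(times_as_long(1272.8, 89.6))   # roughly 14.2 (times as long)
```

The two numbers differ by one unit of the ratio: "1320% slower" and
"about 14 times as long" describe the same measurement.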

Please check my bench.py script and redo the benchmark on your own
computer!

Victor
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bench.py
Type: text/x-python
Size: 1867 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-dev/attachments/20110525/dadd9dd4/attachment.py>