[Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader

M.-A. Lemburg mal at egenix.com
Wed May 25 15:43:55 CEST 2011


Victor Stinner wrote:
> On Wednesday, 25 May 2011 at 11:38 +0200, M.-A. Lemburg wrote:
>> You are missing the point: we have StreamReader and StreamWriter APIs
>> on codecs to allow each codec to implement more efficient ways of
>> encoding and decoding streams.
>>
>> Examples of such optimizations are reading the stream in
>> chunks that can be decoded in one piece, or writing to the stream
>> in a way that doesn't generate encoding state problems on the
>> receiving end by ending transmission half-way through a
>> shift block.
>>
>> ...
>>
>> We don't have many such specialized implementations in the stdlib,
>> but this doesn't mean that there's no use for them. It
>> just means that developers and users are simply unaware of the
>> possibilities opened by these stateful stream APIs.
> 
> Does at least one codec implement such an optimization in its
> StreamReader or StreamWriter class? And couldn't we implement such
> optimizations in incremental encoders and decoders (or in TextIOWrapper)?

I don't see how, since you need control over the file API methods
in order to implement such optimizations. OTOH, adding lots of
special cases to TextIOWrapper isn't a good idea either, since these
optimizations would then only trigger for a small number of
codecs and completely leave out 3rd party codecs.

> I checked all multibyte codecs (the UTF and CJK codecs) and I don't see
> any such optimization. The UTF codecs handle the BOM, but don't have
> anything that looks like an optimization. The CJK codecs use multibytecodec,
> MultibyteStreamReader and MultibyteStreamWriter, which don't appear to be
> optimized either. But maybe I missed something?

No, you haven't missed any such per-codec optimizations. The base classes
implement general-purpose support for reading from streams in
chunks, but that support isn't optimized per codec.

For UTF-16, for example, it would make sense to always read data in
even-sized blocks, removing the trial-and-error decoding and extra
buffering currently done by the base classes. For UTF-32, the
blocks should have size % 4 == 0.
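
For instance (rough, untested sketch; the helper name is made up), a
UTF-16-LE reader could keep its reads aligned to the 2-byte code unit
and only ever carry over an unpaired high surrogate:

    def read_utf16le_chunks(stream, chunk_size=8192):
        # Keep reads aligned to the 2-byte code unit so that a block
        # can normally be decoded in one piece, without trial-and-error.
        chunk_size -= chunk_size % 2
        pending = b""
        while True:
            raw = stream.read(chunk_size)
            data = pending + raw
            if not data:
                return
            cut = len(data) - (len(data) % 2)
            # Don't split a surrogate pair: if the last complete code
            # unit is a high surrogate, hold it back for the next block.
            if raw and cut >= 2 and \
               0xD800 <= int.from_bytes(data[cut - 2:cut], "little") <= 0xDBFF:
                cut -= 2
            pending, chunk = data[cut:], data[:cut]
            if chunk:
                yield chunk.decode("utf-16-le")
            if not raw:
                if pending:
                    pending.decode("utf-16-le")  # truncated input: raise
                return

The same idea applies to UTF-32 with 4-byte units (and without the
surrogate special case).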

For UTF-8 (and other variable-length encodings) it would make
sense to look at the end of the (bytes) data read from the
stream to see whether a complete code point was read or not,
rather than simply running the decoder on the complete data
set, only to find that a few bytes at the end are missing.
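
Something along these lines (untested sketch, the name is made up) --
at most the last three bytes need to be inspected:

    def split_incomplete_utf8_tail(data):
        # Return (complete, tail) where tail is a trailing incomplete
        # UTF-8 sequence (at most 3 bytes); anything invalid is left
        # alone so the decoder reports it as usual.
        i = len(data)
        stop = max(0, len(data) - 3)
        while i > stop and data[i - 1] & 0xC0 == 0x80:
            i -= 1                     # skip continuation bytes (10xxxxxx)
        if i == 0:
            return data, b""
        lead = data[i - 1]
        if lead & 0x80 == 0:           # ASCII: nothing incomplete
            return data, b""
        if lead & 0xE0 == 0xC0:        # 110xxxxx: 2-byte sequence
            need = 2
        elif lead & 0xF0 == 0xE0:      # 1110xxxx: 3-byte sequence
            need = 3
        elif lead & 0xF8 == 0xF0:      # 11110xxx: 4-byte sequence
            need = 4
        else:
            return data, b""           # invalid lead byte
        if len(data) - (i - 1) < need:
            return data[:i - 1], data[i - 1:]
        return data, b""

The complete part can then be decoded in one go and the tail simply
prepended to the next chunk read from the stream.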

For single-byte character encodings, it would make sense to prefetch
data in big chunks and skip all the trial-and-error decoding
implemented by the base classes to address the above problem
with variable-length encodings.
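
E.g. for Latin-1 (sketch only), every chunk boundary is a character
boundary, so there is nothing to buffer or retry at all:

    def read_latin1_chunks(stream, chunk_size=64 * 1024):
        # Single-byte encoding: any chunk boundary is a character
        # boundary, so large prefetched blocks decode directly.
        while True:
            data = stream.read(chunk_size)
            if not data:
                return
            yield data.decode("latin-1")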

Finally, all this could be implemented in C, reducing the
Python call overhead dramatically.

> TextIOWrapper has an advanced buffering algorithm that prefetches
> (reads ahead) some bytes on each read to speed up small reads. It is
> difficult to implement such an algorithm, but it's done and it works.
> 
> --
> 
> Ok, let's stop talking about theoretical optimizations and run a
> benchmark comparing the codecs and io modules on reading files!

That's somewhat unfair: TextIOWrapper is implemented in C,
whereas the StreamReader/Writer subclasses used by the
codecs are written in Python.

A fair comparison would use the Python implementation of
TextIOWrapper.
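
E.g. something like this rough sketch ("testdata.txt" is just a
placeholder for whatever large UTF-8 file is used), with _pyio being
the pure Python implementation of the io module:

    import codecs, io, time, _pyio

    FILENAME = "testdata.txt"   # placeholder: any large UTF-8 text file

    def bench(label, open_file):
        start = time.time()
        with open_file() as f:
            while f.read(4096):
                pass
        print("%-30s %.3f s" % (label, time.time() - start))

    bench("codecs.open (Python)",
          lambda: codecs.open(FILENAME, "r", encoding="utf-8"))
    bench("_pyio.TextIOWrapper (Python)",
          lambda: _pyio.TextIOWrapper(open(FILENAME, "rb"), encoding="utf-8"))
    bench("io.TextIOWrapper (C)",
          lambda: io.TextIOWrapper(open(FILENAME, "rb"), encoding="utf-8"))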

-- 
Marc-Andre Lemburg
eGenix.com

