[Python-Dev] Draft PEP: Deprecate codecs.StreamReader and codecs.StreamWriter

M.-A. Lemburg mal at egenix.com
Thu Jul 7 10:07:38 CEST 2011


Victor Stinner wrote:
> Hi,
> 
> Last May, I proposed deprecating the open() function and the
> StreamWriter and StreamReader classes of the codecs module. I agreed
> to keep open() after the discussion on python-dev. Here is a more
> complete proposal as a PEP. It is a draft and I expect a lot of
> comments :)

The PEP's arguments for deprecating two essential codec design
components are very one-sided: they compare "issues" to "features".

Please add all the comments I've made on the subject to the PEP.
The most important missing point is a major difference:
TextIOWrapper does not work on a per-codec basis, but
only on a per-stream basis.
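
To make the per-codec versus per-stream distinction concrete, here is a
minimal sketch. The sample data and variable names are made up for
illustration; only the codecs and io calls are real APIs.

```python
import codecs
import io

data = "héllo wörld\n".encode("utf-8")

# Per-codec: the registry returns a StreamReader class owned by the codec,
# so each codec can ship its own stream handling.
info = codecs.lookup("utf-8")
reader = info.streamreader(io.BytesIO(data))
per_codec_text = reader.read()

# Per-stream: the same generic TextIOWrapper class is used for every
# encoding; the codec only contributes its incremental decoder.
wrapper = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", newline="")
per_stream_text = wrapper.read()
```

Both paths decode the same text here; the difference is which object
owns the stream logic and could therefore be specialised per codec.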

By removing the StreamReader and StreamWriter API parts of the
codec design, you essentially make it impossible to add
per-codec variations and optimizations that require full access
to the stream interface.

As mentioned before, many improvements are possible and lots of those
can build on TextIOWrapper and the incremental codec parts.
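
As one possible illustration of building on the incremental codec
parts, the sketch below shows a generic reader driven by an
IncrementalDecoder. The class name ChunkedReader and its chunk_size
parameter are hypothetical and not part of the codecs module.

```python
import codecs
import io


class ChunkedReader:
    """Hypothetical generic reader that delegates decoding to the
    codec's incremental decoder instead of a per-codec StreamReader."""

    def __init__(self, stream, encoding, errors="strict", chunk_size=4096):
        self.stream = stream
        self.decoder = codecs.getincrementaldecoder(encoding)(errors)
        self.chunk_size = chunk_size

    def read(self):
        parts = []
        while True:
            data = self.stream.read(self.chunk_size)
            if not data:
                # Flush the decoder; raises on a truncated sequence.
                parts.append(self.decoder.decode(b"", final=True))
                break
            # The decoder buffers incomplete multi-byte sequences itself.
            parts.append(self.decoder.decode(data))
        return "".join(parts)


sample = "héllo €uro"
# chunk_size=1 forces every multi-byte character to be split across reads.
decoded = ChunkedReader(io.BytesIO(sample.encode("utf-8")), "utf-8",
                        chunk_size=1).read()
```

A codec-specific subclass could still override read() where direct
access to the stream allows a faster path.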

That said, I'm not really up for a longer discussion on this. We've
already had the discussion and decided against removing those
parts of the codec API.

Redirecting codecs.open() to open() should be investigated.
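
A rough sketch of what such a redirection could look like follows. The
helper name open_via_builtin is hypothetical, and passing newline="" is
an assumption made here to mimic the lack of newline translation in
codecs.open(), which always opens the underlying file in binary mode.

```python
import builtins
import os
import tempfile


def open_via_builtin(filename, mode="r", encoding=None, errors="strict",
                     buffering=-1):
    """Hypothetical stand-in for codecs.open() built on builtin open()."""
    if encoding is None:
        # Without an encoding, codecs.open() already returns a plain file.
        return builtins.open(filename, mode, buffering)
    # Drop any "b" flag, since TextIOWrapper needs a text mode, and
    # disable newline translation to stay close to the codecs behaviour.
    return builtins.open(filename, mode.replace("b", ""), buffering,
                         encoding=encoding, errors=errors, newline="")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open_via_builtin(path, "w", encoding="utf-16") as f:
        f.write("abc\r\n")
    with open_via_builtin(path, "r", encoding="utf-16") as f:
        roundtrip = f.read()
```

The returned object is a TextIOWrapper, so code relying on the reader
and writer attributes of StreamReaderWriter would still need attention.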

For the issues you mention in the PEP, please open tickets
or add ticket references to the PEP.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com



> Victor
> 
> -----------------------
> 
> PEP: xxx
> Title: Deprecate codecs.StreamReader and codecs.StreamWriter
> Version: $Revision$
> Last-Modified: $Date$
> Author: Victor Stinner
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 28-May-2011
> Python-Version: 3.3
> 
> 
> Abstract
> ========
> 
> io.TextIOWrapper and codecs.StreamReaderWriter offer the same API
> [#f1]_. TextIOWrapper has more features and is faster than
> StreamReaderWriter. Duplicate code means that bugs should be fixed
> twice and that we may have subtle differences between the two
> implementations.
> 
> The codecs module was introduced in Python 2.0 (see PEP 100). The
> io module was introduced in Python 2.6 and 3.0 (see PEP 3116), and
> reimplemented in C in Python 2.7 and 3.1.
> 
> 
> Motivation
> ==========
> 
> When the Python I/O model was updated for 3.0, the concept of a
> "stream-with-known-encoding" was introduced in the form of
> io.TextIOWrapper. As this class is critical to the performance of
> text-based I/O in Python 3, the io module has an optimised C version
> which is used by CPython by default. Many corner cases in handling
> buffering, stateful codecs and universal newlines have been dealt with
> since the release of Python 3.0.
> 
> This new interface overlaps heavily with the legacy
> codecs.StreamReader, codecs.StreamWriter and codecs.StreamReaderWriter
> interfaces that were part of the original codec interface design in
> PEP 100. These interfaces are organised around the principle of an
> encoding with an associated stream (i.e. the reverse of the
> arrangement in the io module), so the original PEP 100 design
> required that codec
> writers provide appropriate StreamReader and StreamWriter
> implementations in addition to the core codec encode() and decode()
> methods. This places a heavy burden on codec authors providing these
> specialised implementations to correctly handle many of the corner
> cases that have now been dealt with by io.TextIOWrapper. While deeper
> integration between the codec and the stream allows for additional
> optimisations in theory, these optimisations have in practice either
> not been carried out or else the associated code duplication means
> that the corner cases that have been fixed in io.TextIOWrapper are
> still not handled correctly in the various StreamReader and
> StreamWriter implementations.
> 
> Accordingly, this PEP proposes that:
> 
> * codecs.open() be updated to delegate to the builtin open() in Python
>   3.3;
> * the legacy codecs.Stream* interfaces, including the streamreader and
>   streamwriter attributes of codecs.CodecInfo be deprecated in Python
>   3.3 and removed in Python 3.4.
> 
> 
> Rationale
> =========
> 
> StreamReader and StreamWriter issues
> ''''''''''''''''''''''''''''''''''''
> 
>  * StreamReader is unable to translate newlines.
>  * StreamReaderWriter handles reads using StreamReader and writes
>    using StreamWriter. These two classes may be inconsistent. To stay
>    consistent, flush() must be called after each write which slows
>    down interlaced read-write.
>  * StreamWriter doesn't support "line buffering" (flush if the input
>    text contains a newline).
>  * StreamReader classes of the CJK encodings (e.g. GB18030) don't
>    support universal newlines, only UNIX newlines ('\\n').
>  * StreamReader and StreamWriter are stateful codecs but don't expose
>    functions to control their state (getstate() or setstate()). Each
>    codec has to implement corner cases, see "Issue with stateful
>    codecs".
>  * StreamReader and StreamWriter are very similar to
>    IncrementalDecoder and IncrementalEncoder, so some code is
>    duplicated for stateful codecs (e.g. UTF-16).
>  * Each codec has to reimplement its own StreamReader and StreamWriter
>    class, even if it's trivial (just call the encoder/decoder).
>  * codecs.open(filename, "r") creates an io.TextIOWrapper object.
>  * No codec implements an optimized method in StreamReader or
>    StreamWriter based on the specificities of the codec.
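
The state-control point in the list above can be checked with a small
sketch. The byte split and variable names are illustrative; getstate()
and setstate() are the documented IncrementalDecoder methods.

```python
import codecs

euro = "€".encode("utf-8")           # three bytes: e2 82 ac

# Feed only the first two bytes; the decoder has to remember them.
first = codecs.getincrementaldecoder("utf-8")()
partial = first.decode(euro[:2])
saved_state = first.getstate()

# A fresh decoder can resume from the saved state.
second = codecs.getincrementaldecoder("utf-8")()
second.setstate(saved_state)
resumed = second.decode(euro[2:])

# The StreamReader base class defines no comparable accessor.
stream_reader_has_getstate = hasattr(codecs.StreamReader, "getstate")
```

This is the mechanism TextIOWrapper can rely on for tell() and seek(),
and the one a StreamReader base class would need to expose.
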
> 
> 
> TextIOWrapper features
> ''''''''''''''''''''''
> 
>  * TextIOWrapper supports any kind of newline, including translating
>    newlines (to UNIX newlines), to read and write.
>  * TextIOWrapper reuses incremental encoders and decoders (no
>    duplication of code).
>  * The io module (TextIOWrapper) is faster than the codecs module
>    (StreamReader). It is implemented in C, whereas codecs is
>    implemented in Python.
>  * TextIOWrapper has a readahead algorithm which speeds up small
>    reads: read character by character or line by line (io is 10x
>    through 25x faster than codecs on these operations).
>  * TextIOWrapper has a write buffer.
>  * TextIOWrapper.tell() is optimized.
>  * TextIOWrapper supports random access (read+write) using a single
>    class, which makes it possible to optimize interlaced read-write
>    (but no such optimization is implemented).
> 
> 
> Possible improvements of StreamReader and StreamWriter
> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
> 
> It would be possible to add functions to StreamReader and StreamWriter
> to give access to the state of the codec. It would then be possible to
> fix issues with stateful codecs in a base class instead of having to
> fix them in each stateful StreamReader and StreamWriter class.
> 
> It would be possible to change StreamReader and StreamWriter to make
> them use IncrementalDecoder and IncrementalEncoder.
> 
> A codec can implement variants which are optimized for the specific
> encoding or intercept certain stream methods to add functionality or
> improve the encoding/decoding performance. TextIOWrapper cannot
> implement such optimizations, but TextIOWrapper uses incremental
> encoders and decoders and uses read and write buffers, so the overhead
> of incomplete inputs is low or nil.
> 
> A lot more could be done for other variable length encoding codecs,
> e.g. UTF-8, since these often have problems near the end of a read due
> to missing bytes. The UTF-32-BE/LE codecs could simply multiply the
> character position by 4 to get the byte position.
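
The UTF-32 remark can be illustrated with a short sketch. The helper
name utf32_byte_offset and the sample text are made up; it assumes a
BOM-less UTF-32-LE byte string, where every code point takes 4 bytes.

```python
def utf32_byte_offset(char_index):
    """Byte offset of a character in BOM-less UTF-32-LE/BE data."""
    return 4 * char_index


text = "naïve €uro"
data = text.encode("utf-32-le")      # fixed width, no BOM

index = text.index("€")
start = utf32_byte_offset(index)
# Slice out exactly one code unit and decode it on its own.
char_at_offset = data[start:start + 4].decode("utf-32-le")
```

A codec-specific seek could use this arithmetic directly instead of
decoding from the start of the stream.
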
> 
> 
> Usage of StreamReader and StreamWriter
> ''''''''''''''''''''''''''''''''''''''
> 
> These classes are rarely used directly, but indirectly using
> codecs.open(). They are not used in the Python 3 standard library
> (except in the codecs module).
> 
> Some projects implement their own codec with StreamReader and
> StreamWriter, but don't use these classes.
> 
> 
> Backwards Compatibility
> =======================
> 
> Keep the public API, codecs.open
> ''''''''''''''''''''''''''''''''
> 
> codecs.open() can be replaced by the builtin open() function. open()
> has a similar API but also has more options.
> 
> codecs.open() was the only way to open a text file in Unicode mode
> until Python 2.6. Many Python 2 programs use this function. Removing
> codecs.open() implies more work to port programs from Python 2 to
> Python 3, especially projects using the same code base for the two
> Python versions (without using the 2to3 program).
> 
> codecs.open() is kept for backward compatibility with Python 2.
> 
> 
> Deprecate StreamReader and StreamWriter
> '''''''''''''''''''''''''''''''''''''''
> 
> Instantiating StreamReader or StreamWriter must raise a
> DeprecationWarning in Python 3.3. Implementing a subclass does not
> raise a DeprecationWarning.
> 
> codecs.open() will be changed to reuse the builtin open() function
> (TextIOWrapper).
> 
> EncodedFile(), StreamRecoder, StreamReader, StreamReaderWriter and
> StreamWriter will be removed in Python 3.4.
> 
> 
> Issue with stateful codecs
> ==========================
> 
> It is difficult to use a stateful codec correctly with a stream. Some
> cases are supported by the codecs module, while io has no remaining
> known bugs related to stateful codecs. The main difference between the
> codecs and the io module is that, for the codecs module, bugs have to
> be fixed in the StreamReader and/or StreamWriter classes of each
> codec, whereas for io they can be fixed once in io.TextIOWrapper.
> Here are some examples of issues with stateful codecs.
> 
> Stateful codecs
> '''''''''''''''
> 
> Python supports the following stateful codecs:
> 
>  * cp932
>  * cp949
>  * cp950
>  * euc_jis_2004
>  * euc_jisx0213
>  * euc_jp
>  * euc_kr
>  * gb18030
>  * gbk
>  * hz
>  * iso2022_jp
>  * iso2022_jp_1
>  * iso2022_jp_2
>  * iso2022_jp_2004
>  * iso2022_jp_3
>  * iso2022_jp_ext
>  * iso2022_kr
>  * shift_jis
>  * shift_jis_2004
>  * shift_jisx0213
>  * utf_8_sig
>  * utf_16
>  * utf_32
> 
> Read and seek(0)
> ''''''''''''''''
> 
> ::
> 
>     with open(filename, 'w+', encoding='utf_16') as f:
>         f.write('abc')
>         f.write('def')
>         f.seek(0)
>         assert f.read() == 'abcdef'
>         f.seek(0)
>         assert f.read() == 'abcdef'
> 
> The io and codecs modules support this use case correctly.
> 
> Write, seek(0) and seek(n)
> ''''''''''''''''''''''''''
> 
> ::
> 
>     with open(filename, 'w', encoding='utf_16') as f:
>         f.write('abc')
>         pos = f.tell()
>     with open(filename, 'r+', encoding='utf_16') as f:
>         f.seek(pos)
>         f.write('def')
>         f.seek(0)
>         f.write('###')
>     with open(filename, 'r', encoding='utf_16') as f:
>         assert f.read() == '###def'
> 
> The io module supports this use case, whereas codecs fails because it
> writes a new BOM on the second write.
> 
> Append mode
> '''''''''''
> 
> ::
> 
>     with open(filename, 'w', encoding='utf_16') as f:
>         f.write('abc')
>     with open(filename, 'a', encoding='utf_16') as f:
>         f.write('def')
>     with open(filename, 'r', encoding='utf_16') as f:
>         assert f.read() == 'abcdef'
> 
> The io module supports this use case, whereas codecs fails because it
> writes a new BOM on the second write.
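
The append-mode difference can be reproduced with a self-contained
sketch along these lines. The function names and temporary file names
are made up, and the codecs side uses codecs.getwriter() on a binary
file as an assumed equivalent of what codecs.open() does internally.

```python
import codecs
import os
import tempfile


def append_with_io(path):
    """Write, then append, through io.TextIOWrapper (builtin open)."""
    with open(path, "w", encoding="utf_16") as f:
        f.write("abc")
    with open(path, "a", encoding="utf_16") as f:
        f.write("def")
    with open(path, "r", encoding="utf_16") as f:
        return f.read()


def append_with_stream_writer(path):
    """Same sequence, but each write goes through a fresh StreamWriter."""
    for mode, text in (("wb", "abc"), ("ab", "def")):
        with open(path, mode) as raw:
            codecs.getwriter("utf_16")(raw).write(text)
    with open(path, "rb") as raw:
        return raw.read().decode("utf_16")


with tempfile.TemporaryDirectory() as tmp:
    io_text = append_with_io(os.path.join(tmp, "io.txt"))
    codecs_text = append_with_stream_writer(os.path.join(tmp, "codecs.txt"))
```

The second StreamWriter has no way of knowing that the file already
starts with a BOM, so the extra BOM shows up as a U+FEFF character in
the middle of the decoded text.
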
> 
> 
> Links
> =====
> 
>  * `PEP 100: Python Unicode Integration
>    <http://www.python.org/dev/peps/pep-0100/>`_
>  * `PEP 3116 <http://www.python.org/dev/peps/pep-3116/>`_
>  * `Issue #8796: Deprecate codecs.open()
>    <http://bugs.python.org/issue8796>`_
>  * `[python-dev] Deprecate codecs.open() and StreamWriter/StreamReader
>    <http://mail.python.org/pipermail/python-dev/2011-May/111591.html>`_
> 
> 
> Copyright
> =========
> 
> This document has been placed in the public domain.
> 
> 
> Footnotes
> =========
> 
> .. [#f1] StreamReaderWriter has two more attributes than
>          TextIOWrapper, reader and writer.
> 

