[Python-3000] Draft PEP for New IO system

Guido van Rossum guido at python.org
Tue Feb 27 22:37:29 CET 2007

On 2/27/07, Walter Dörwald <walter at livinglogic.de> wrote:
> Guido van Rossum wrote:
> > On 2/27/07, Walter Dörwald <walter at livinglogic.de> wrote:
> >> Guido van Rossum wrote:
> >>
> >> > The encoding/decoding behavior should be no different from that of the
> >> > encode() and decode() methods on unicode strings and byte arrays.
> >>
> >> Except that it must work in incremental mode. The new (in 2.5)
> >> incremental codecs should be usable for that.
> >
> > Thanks for reminding! Do the incremental codecs have internal state?
> They might have; however, in all *decoding* cases (except the CJK codecs,
> which I know nothing about) this state is just undecoded input. E.g. if the
> UTF-16-LE incremental decoder (which is a BufferedIncrementalDecoder)
> gets passed an odd number of bytes in the decode() call, it decodes as
> much as possible and keeps the last byte in a buffer, which will be
> reused on the next call to decode().
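> For example, a minimal sketch of this buffering (byte values chosen for
> illustration):
>     import codecs
>     decoder = codecs.lookup("utf-16-le").incrementaldecoder()
>     decoder.decode("a\x00a")   # three bytes: yields u"a", buffers "a"
>     decoder.decode("\x00")     # completes the pair: yields u"a"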
> AFAICR the only *encoder* that keeps state is the UTF-16 encoder: it has
> to remember whether a BOM has been output.
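> A minimal sketch of that (encoded bytes shown for a little-endian build):
>     import codecs
>     encoder = codecs.lookup("utf-16").incrementalencoder()
>     encoder.encode(u"a")   # "\xff\xfea\x00" -- BOM plus "a"
>     encoder.encode(u"b")   # "b\x00" -- no second BOM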
> I don't know whether the CJK codecs keep any state besides undecoded
> input when decoding. (A greedy UTF-7 incremental decoder, for example,
> might have to.)
> > I
> > wonder how this interacts with non-blocking reads.
> Non-blocking reads were the reason for implementing the incremental
> codecs: the codec decodes as much of the available input as possible and
> keeps the undecoded rest until the next decode() call.
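> Sketched as a loop over chunks (hypothetical chunk boundaries, as they
> might arrive from a non-blocking source):
>     import codecs
>     decoder = codecs.lookup("utf-8").incrementaldecoder()
>     parts = []
>     for chunk in ["\xe2\x82", "\xac!"]:       # U+20AC split across reads
>         parts.append(decoder.decode(chunk))   # first call yields u""
>     parts.append(decoder.decode("", True))    # final=True flushes at EOF
>     u"".join(parts)                           # u"\u20ac!"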
> > (I know
> > next-to-nothing about incremental codecs beyond that they exist. :-)
> The basic principle is that these codecs can encode strings and decode
> bytes in multiple chunks. If you want to encode a unicode string u in
> UTF-16 you can do it in one go:
>     s = u.encode("utf-16")
> or character by character:
>     encoder = codecs.lookup("utf-16").incrementalencoder()
>     s = "".join(encoder.encode(c) for c in u) + encoder.encode(u"", True)
> The incremental encoder makes sure that the result contains only one BOM.
> Decoding works in the same way:
>     decoder = codecs.lookup("utf-16").incrementaldecoder()
>     u = u"".join(decoder.decode(c) for c in s) + decoder.decode("", True)

Thanks for the explanations; it is a little bit clearer now!

> >> > Certainly no normalization of diacritics will be done; surrogate
> >> > handling depends on the encoding and whether the unicode string
> >> > implementation uses 16 or 32 bits per character.
> >> >
> >> > I agree that we need to be able to specify the error handling as well.
> >>
> >> Should it be possible to change the error handling during the lifetime
> >> of a stream? Then this change would have to be passed through to the
> >> underlying codec.
> >
> > Not unless you have a really good use case handy...
> Not for decoding, but for encoding: If you're outputting XML and use an
> encoding that can't encode all unicode characters, then it makes sense
> to switch to "xmlcharrefreplace" error handling during the output of
> text nodes (and back to "strict" for element names etc.).
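> Hypothetically (assuming the incremental encoder picks up a change to its
> errors attribute on the next encode() call; out and text are placeholder
> names), that could look like:
>     import codecs, sys
>     out = sys.stdout                          # placeholder output stream
>     text = u"caf\xe9 \u20ac"                  # placeholder text node
>     encoder = codecs.lookup("iso-8859-1").incrementalencoder("strict")
>     out.write(encoder.encode(u"<name>"))
>     encoder.errors = "xmlcharrefreplace"      # assumed to take effect
>     out.write(encoder.encode(text))           # u"\u20ac" -> "&#8364;"
>     encoder.errors = "strict"
>     out.write(encoder.encode(u"</name>"))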

So do the incremental codecs allow this switching?

--Guido van Rossum (home page: http://www.python.org/~guido/)
