[Python-3000] Draft PEP for New IO system

Walter Dörwald walter at livinglogic.de
Tue Feb 27 22:27:02 CET 2007

Guido van Rossum wrote:

> On 2/27/07, Walter Dörwald <walter at livinglogic.de> wrote:
>> Guido van Rossum wrote:
>> > The encoding/decoding behavior should be no different from that of the
>> > encode() and decode() methods on unicode strings and byte arrays.
>> Except that it must work in incremental mode. The new (in 2.5)
>> incremental codecs should be usable for that.
> Thanks for reminding! Do the incremental codecs have internal state?

They might. However, in all *decoding* cases (except the CJK codecs, 
about which I know nothing) this state is just undecoded input. E.g. if 
the UTF-16-LE incremental decoder (which is a BufferedIncrementalDecoder) 
is passed an odd number of bytes in the decode() call, it decodes as 
much as possible and keeps the last byte in a buffer, which is reused 
on the next call to decode().
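This buffering behavior can be observed directly (a small sketch in modern Python; the incremental codec API introduced in 2.5 is the same):

```python
import codecs

# Feed a UTF-16-LE incremental decoder an odd number of bytes, then the
# rest: the trailing half of a code unit is buffered, not lost.
decoder = codecs.lookup("utf-16-le").incrementaldecoder()
data = u"hi".encode("utf-16-le")   # b'h\x00i\x00'

part1 = decoder.decode(data[:3])   # 3 bytes: 'h' plus half of 'i'
part2 = decoder.decode(data[3:])   # the remaining byte completes 'i'
print(repr(part1), repr(part2))   # 'h' 'i'
```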

AFAICR the only *encoder* that keeps state is the UTF-16 encoder: it has 
to remember whether a BOM has been output.
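That BOM state is easy to see (again a sketch in modern Python):

```python
import codecs

# The UTF-16 incremental encoder remembers that it has already written
# the BOM, so only the first chunk carries it.
encoder = codecs.lookup("utf-16").incrementalencoder()
first = encoder.encode(u"a")    # BOM + encoded 'a'
second = encoder.encode(u"b")   # encoded 'b' only, no second BOM
print(repr(first), repr(second))
```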

I don't know whether the CJK codecs keep any state besides undecoded 
input when decoding. (E.g. a greedy UTF-7 incremental decoder might have to.)

> I
> wonder how this interacts with non-blocking reads.

Non-blocking reads were the reason for implementing the incremental 
codecs: the codec decodes as much of the available input as possible and 
keeps the undecoded rest until the next decode() call.
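For example (a sketch with hypothetical chunk boundaries, as a non-blocking read might deliver them):

```python
import codecs

# A multibyte character split across two "reads" still decodes correctly,
# because the decoder buffers the incomplete prefix.
decoder = codecs.lookup("utf-8").incrementaldecoder()
chunks = [b"\xe2\x82", b"\xac = euro"]   # the euro sign split mid-character
text = u"".join(decoder.decode(chunk) for chunk in chunks)
text += decoder.decode(b"", True)   # final=True flushes any buffered bytes
print(text)   # € = euro
```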

> (I know
> next-to-nothing about incremental codecs beyond that they exist. :-)

The basic principle is that these codecs can encode strings and decode 
bytes in multiple chunks. If you want to encode a unicode string u in 
UTF-16 you can do it in one go:
    s = u.encode("utf-16")
or character by character:
    encoder = codecs.lookup("utf-16").incrementalencoder()
    s = "".join(encoder.encode(c) for c in u) + encoder.encode(u"", True)
The incremental encoder makes sure that the result contains only one BOM.

Decoding works in the same way:
    decoder = codecs.lookup("utf-16").incrementaldecoder()
    u = u"".join(decoder.decode(c) for c in s) + decoder.decode("", True)
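The same round trip, rewritten so it runs on modern Python (byte strings must be joined with b"", and a bytes object is iterated in one-byte slices rather than character by character), is only a sketch of the snippets above:

```python
import codecs

# Encode character by character; only one BOM is emitted.
u = u"h\xe9llo"
encoder = codecs.lookup("utf-16").incrementalencoder()
s = b"".join(encoder.encode(c) for c in u) + encoder.encode(u"", True)

# Decode byte by byte; the decoder buffers incomplete code units.
decoder = codecs.lookup("utf-16").incrementaldecoder()
u2 = u"".join(decoder.decode(s[i:i + 1]) for i in range(len(s)))
u2 += decoder.decode(b"", True)
print(u2 == u)   # True
```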

>> > Certainly no normalization of diacritics will be done; surrogate
>> > handling depends on the encoding and whether the unicode string
>> > implementation uses 16 or 32 bits per character.
>> >
>> > I agree that we need to be able to specify the error handling as well.
>> Should it be possible to change the error handling during the lifetime
>> of a stream? Then this change would have to be passed through to the
>> underlying codec.
> Not unless you have a really good use case handy...

Not for decoding, but for encoding: if you're outputting XML and using 
an encoding that can't represent all unicode characters, it makes sense 
to switch to "xmlcharrefreplace" error handling for the output of 
text nodes (and back to "strict" for element names etc.).
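To illustrate the error handler itself (a sketch; the stream-level switching is what the use case would add):

```python
# Encoding a euro sign to ASCII fails under "strict" but becomes a
# numeric character reference under "xmlcharrefreplace".
text = u"price: \u20ac 5"
encoded = text.encode("ascii", "xmlcharrefreplace")
print(encoded)   # b'price: &#8364; 5'
```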
