[Python-Dev] Decoding incomplete unicode

Tue Aug 10 21:24:20 CEST 2004

OK, here are my current thoughts on the codec problem:

The optimal solution (ignoring backwards compatibility)
would look like this: codecs.lookup() would return the
following stuff (this could be done by replacing the
4-entry tuple with a real object):

* decode: The stateless decoding function
* encode: The stateless encoding function
* chunkdecoder: The stateful chunk decoder
* chunkencoder: The stateful chunk encoder
* streamreader: The stateful stream decoder
* streamwriter: The stateful stream encoder
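
A rough sketch of what such a lookup result could look like. It is
written for Python 3 (bytes/str in place of str/unicode). CodecEntry
and lookup_entry are hypothetical names, and the incrementaldecoder /
incrementalencoder attributes that later Python versions provide on
the codecs.lookup() result are used here only as stand-ins for the
chunk decoder and encoder described above:

```python
import codecs
from collections import namedtuple

# Hypothetical "real object" with the six entries listed above.
CodecEntry = namedtuple(
    "CodecEntry",
    "decode encode chunkdecoder chunkencoder streamreader streamwriter",
)


def lookup_entry(encoding):
    """Build a CodecEntry from the standard library's codec registry."""
    info = codecs.lookup(encoding)
    return CodecEntry(
        # Stateless functions return only the result object,
        # dropping the "consumed" count of the existing API.
        decode=lambda input, errors="strict": info.decode(input, errors)[0],
        encode=lambda input, errors="strict": info.encode(input, errors)[0],
        # Stand-ins for the proposed chunk decoder/encoder factories.
        chunkdecoder=info.incrementaldecoder,
        chunkencoder=info.incrementalencoder,
        streamreader=info.streamreader,
        streamwriter=info.streamwriter,
    )
```

With this, lookup_entry("utf-8").decode(data) yields just the decoded
object, while the stateful entries are factories that take errors
(and, for the stream variants, the stream).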

The functions and classes look like this:

Stateless decoder:
decode(input, errors='strict'):
     Function that decodes the (str) input object and returns
     a (unicode) output object. The decoder must decode the
     complete input without any remaining undecoded bytes.

Stateless encoder:
encode(input, errors='strict'):
     Function that encodes the complete (unicode) input object and
     returns a (str) output object.

Stateful chunk decoder:
chunkdecoder(errors='strict'):
     A factory function that returns a stateful decoder with the
     following method:

     decode(input, final=False):
         Decodes a chunk of input and returns the decoded unicode
         object. This method can be called multiple times and
         the state of the decoder will be kept between calls.
         This includes trailing incomplete byte sequences
         that will be retained until the next call to decode().
         When the argument final is true, this is the last call
         to decode() and trailing incomplete byte sequences will
         not be retained, but a UnicodeError will be raised.
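
A minimal sketch of such a chunk decoder for UTF-8, written for
Python 3 (bytes in, str out). UTF8ChunkDecoder is a hypothetical
name; the sketch assumes the low-level codecs.utf_8_decode()
function with its (input, errors, final) signature as found in
later Python versions:

```python
import codecs


class UTF8ChunkDecoder:
    """Stateful decoder that keeps incomplete byte sequences between calls."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self.buffer = b""  # undecoded bytes left over from the last call

    def decode(self, input, final=False):
        data = self.buffer + input
        # With final=False a trailing incomplete sequence is not consumed;
        # with final=True it raises UnicodeDecodeError (a UnicodeError).
        output, consumed = codecs.utf_8_decode(data, self.errors, final)
        self.buffer = data[consumed:]
        return output
```

Feeding the two bytes of a two-byte character in separate calls
returns an empty string first and the character on the second call.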

Stateful chunk encoder:
chunkencoder(errors='strict'):
     A factory function that returns a stateful encoder
     with the following method:
     encode(input, final=False):
         Encodes a chunk of input and returns the encoded
         str object. When final is true this is the last
         call to encode().
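
As an illustration of why an encoder needs state at all, here is a
small sketch of a UTF-16-LE chunk encoder that emits the byte order
mark only once. It is written for Python 3; UTF16LEChunkEncoder is
a hypothetical name, and final is accepted but has nothing to do in
this particular encoding (an encoding with shift states would use it
to flush):

```python
import codecs


class UTF16LEChunkEncoder:
    """Stateful encoder: writes the BOM with the first chunk only."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self.bom_written = False

    def encode(self, input, final=False):
        output = input.encode("utf-16-le", self.errors)
        if not self.bom_written:
            # State kept between calls: the BOM must appear exactly once.
            output = codecs.BOM_UTF16_LE + output
            self.bom_written = True
        # final is unused here: UTF-16 has no pending state to flush.
        return output
```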

Stateful stream decoder:
streamreader(stream, errors='strict'):
     A factory function that returns a stateful decoder
     for reading from the byte stream stream, with the
     following methods:

     read(size=-1, chars=-1, final=False):
         Read unicode characters from the stream. When data
         is read from the stream it should be done in chunks of
         size bytes. If size == -1 all the remaining data
         from the stream is read. chars specifies the number
         of characters to read from the stream. read() may return
         fewer than chars characters if there's not enough data
         available in the byte stream. If chars == -1 as many
         characters are read as are available in the stream.
         Transient errors are ignored and trailing incomplete
         byte sequences are retained when final is false. Otherwise
         a UnicodeError is raised in the case of incomplete byte
         sequences.
     readline(size=-1):
             ...
     next():
             ...
     __iter__():
             ...
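
The read() semantics above could be sketched roughly as follows for
UTF-8. This is Python 3 code under stated assumptions, not the
patch: UTF8StreamReaderSketch is a hypothetical name, the stream is
any object with a read() method returning bytes, and only read() is
shown:

```python
import codecs


class UTF8StreamReaderSketch:
    """Reads characters from a byte stream, buffering incomplete input."""

    def __init__(self, stream, errors="strict"):
        self.stream = stream
        self.errors = errors
        self.bytebuffer = b""  # undecoded trailing bytes
        self.charbuffer = ""   # decoded but not yet returned characters

    def read(self, size=-1, chars=-1, final=False):
        while True:
            # Stop as soon as enough characters are available.
            if chars >= 0 and len(self.charbuffer) >= chars:
                break
            newdata = self.stream.read() if size < 0 else self.stream.read(size)
            # After an unbounded read or an empty read nothing more is coming.
            exhausted = size < 0 or not newdata
            data = self.bytebuffer + newdata
            # Only treat the input as final when the caller says so
            # and the stream has no more data to offer.
            text, consumed = codecs.utf_8_decode(
                data, self.errors, final and exhausted
            )
            self.bytebuffer = data[consumed:]
            self.charbuffer += text
            if exhausted:
                break
        if chars < 0:
            result, self.charbuffer = self.charbuffer, ""
        else:
            result = self.charbuffer[:chars]
            self.charbuffer = self.charbuffer[chars:]
        return result
```

With final left false, a truncated sequence at the end of the stream
simply stays in the byte buffer; passing final=True on the last call
turns it into a UnicodeError.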

Stateful stream encoder:
streamwriter(stream, errors='strict'):
     A factory function that returns a stateful encoder
     for writing unicode data to the byte stream stream,
     with the following methods:

     write(data, final=False):
         Encodes the unicode object data and writes it
         to the stream. If final is true this is the last
         call to write().
     writelines(data):
         ...

I know that this is quite a departure from the current API, and
I'm not sure if we can get all of the functionality without
sacrificing backwards compatibility.

I don't particularly care about the "bytes consumed" return value
from the stateless codec. The codec should always have returned only
the encoded/decoded object, but I guess fixing this would break too
much code. And users who are only interested in the stateless
functionality will probably use unicode.encode/str.decode anyway.

For the stateful API it would be possible to combine the chunk and
stream decoder/encoder into one class with the following methods
(for the decoder):

     __init__(stream, errors='strict'):
         Like the current StreamReader constructor, but stream may be
         None, if only the chunk API is used.
     decode(input, final=False):
         Like the current StreamReader.decode(), i.e. it returns a
         (unicode, int) tuple. This does not keep the remaining bytes
         in a buffer; that is the job of the caller.
     feed(input, final=False):
         Decodes input and returns a decoded unicode object. This method
         calls decode() internally and manages the byte buffer.
     read(size=-1, chars=-1, final=False):
     readline(size=-1):
     next():
     __iter__():
         See above.
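
The split between decode() and feed() could look roughly like this
(Python 3, read methods omitted). FeedDecoderSketch and
UTF16LEFeedDecoder are hypothetical names, and the subclass assumes
codecs.utf_16_le_decode() accepts a final argument as it does in
later Python versions:

```python
import codecs


class FeedDecoderSketch:
    """Base class: subclasses implement decode(), feed() does the buffering."""

    def __init__(self, stream=None, errors="strict"):
        self.stream = stream  # may be None if only feed() is used
        self.errors = errors
        self.bytebuffer = b""

    def decode(self, input, final=False):
        # Must return a (decoded text, number of bytes consumed) tuple.
        raise NotImplementedError

    def feed(self, input, final=False):
        data = self.bytebuffer + input
        output, consumed = self.decode(data, final)
        # Keep whatever decode() did not consume for the next call.
        self.bytebuffer = data[consumed:]
        return output


class UTF16LEFeedDecoder(FeedDecoderSketch):
    """A codec implementer only has to supply decode()."""

    def decode(self, input, final=False):
        return codecs.utf_16_le_decode(input, self.errors, final)
```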

As before, implementers of decoders only need to implement decode().

To be able to support the final argument, the decoding functions
in _codecsmodule.c could get an additional argument. With this
they could be used for the stateless codecs too and we can reduce
the number of functions again.
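
Seen from Python, one final-aware function can serve both cases.
The sketch below is for Python 3 and assumes codecs.utf_8_decode()
with a final argument (as in later Python versions); the two wrapper
names are hypothetical:

```python
import codecs


def stateless_decode(input, errors="strict"):
    # Stateless use: the input is complete, so always pass final=True.
    return codecs.utf_8_decode(input, errors, True)


def chunk_decode_step(input, errors="strict", final=False):
    # Stateful use: the caller decides whether more input will follow.
    return codecs.utf_8_decode(input, errors, final)
```

Both wrappers share one underlying function; only the value passed
for final differs.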

Unfortunately, adding the final argument breaks all of the current
codecs, but dropping the final argument requires one of two
changes:
1) When the input stream is exhausted, the bytes read are parsed
    as if final=True. That's the way the CJK codecs currently
    handle it, but unfortunately this doesn't work with the feed
    decoder.
2) Simply ignore any remaining undecoded bytes at the end of the
    stream.
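
To make the difference between the two alternatives concrete, here
is a small Python 3 sketch. decode_chunks and its at_end parameter
are hypothetical, chosen only for this illustration; "strict" stands
for alternative 1 and "ignore" for alternative 2:

```python
import codecs


def decode_chunks(chunks, at_end):
    """Decode UTF-8 chunks, applying one of the two end-of-input policies."""
    buffer = b""
    parts = []
    for chunk in chunks:
        data = buffer + chunk
        # No final argument available: incomplete sequences stay buffered.
        text, consumed = codecs.utf_8_decode(data, "strict", False)
        buffer = data[consumed:]
        parts.append(text)
    if buffer and at_end == "strict":
        # Alternative 1: input exhausted, parse the rest as if final=True.
        # This raises UnicodeDecodeError for a truncated sequence.
        codecs.utf_8_decode(buffer, "strict", True)
    # Alternative 2 ("ignore"): leftover bytes are silently dropped.
    return "".join(parts)
```

Alternative 1 needs to know that the input is exhausted, which a
reader can tell from the stream but a feed-style decoder cannot.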

If we really have to drop the final argument, I'd prefer 2).

I've uploaded a second version of the patch. It implements
the final argument, adds the feed() method to StreamReader and
again merges the duplicate decoding functions in the codecs
module. Note that the patch isn't really finished (the final
argument isn't completely supported in the encoders and the
CJK and escape codecs are unchanged), but it should be sufficient
as a base for discussion.

Bye,
    Walter Dörwald