[Python-Dev] Decoding incomplete unicode

Wed Jul 28 11:19:09 CEST 2004

M.-A. Lemburg wrote:

> Martin v. Löwis wrote:
> 
>> M.-A. Lemburg wrote:
>>
>>> I like the idea, but don't think the implementation is
>>> the right way to do it. Instead, I'd suggest using a new
>>> error handling strategy "break" ( = break processing as
>>> soon as errors are found).
>>
>> Can you demonstrate this approach in a patch? I think it
>> is unimplementable: the codec cannot communicate to the
>> error callback that it ran out of data.
> 
> The codec can: the callback gets all the necessary information
> and can even manipulate the objects being worked on.
> 
> But you have a point: the current implementations of the
> various encode/decode functions don't provide interfaces to
> report back the number of bytes read at C level - the codecs
> module wrappers add these numbers assuming that all bytes
> were read.

This is the correct thing to do for the stateless decoders:
any incomplete byte sequence at the end of the input is an
error. But then it doesn't make sense to return the number
of bytes decoded for the stateless decoder, because this is
always the size of the input. For the stateful decoder this
is just some sort of state common to all decoders: the decoder
keeps the incomplete byte sequence to be used in the next call.
But then this state should be internal to the decoder and not
part of the public API. Hiding it would make the decode() method
more usable: When you want to implement an XML parser that
supports the xml.sax.xmlreader.IncrementalParser interface,
you have an API mismatch. The parser has to use the stateful
decoding API (i.e. read()), which means the input is in the
form of a byte stream, but this interface expects its input
as byte chunks passed to multiple calls to the feed() method.
If StreamReader.decode() simply returned the decoded unicode
object and kept the remaining undecoded bytes as internal
state, then StreamReader.decode() would be directly usable.
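
To make the idea concrete, here is a minimal sketch of a decode
method that keeps the undecoded bytes to itself. The names
FeedDecoder and feed() are hypothetical, and the sketch assumes
UTF-8 and the C level function codecs.utf_8_decode with its
"final" argument as available in Python 3; it is an illustration,
not a proposed implementation.

```python
import codecs


class FeedDecoder:
    """Hypothetical feed-style decoder: bytes in, text out, leftovers kept inside."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self._pending = b""  # incomplete trailing byte sequence from earlier calls

    def feed(self, chunk, final=False):
        data = self._pending + chunk
        # codecs.utf_8_decode returns (text, bytes_consumed). With final=False an
        # incomplete trailing sequence is left unconsumed instead of raising.
        text, consumed = codecs.utf_8_decode(data, self.errors, final)
        self._pending = data[consumed:]
        return text


decoder = FeedDecoder()
encoded = "Dörwald".encode("utf-8")
# Split in the middle of the two-byte sequence for "ö".
parts = [decoder.feed(encoded[:2]), decoder.feed(encoded[2:], final=True)]
```

The caller only ever sees decoded text; the number of bytes
consumed never leaves the object, which is what a parser's feed()
method would need.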

But this isn't really a "StreamReader" any more, so the best
solution would probably be to have a three level API:
* A stateless decoding function (what codecs.getdecoder
   returns now);
* A stateful "feed reader", which keeps internal state
   (including undecoded byte sequences) and gets passed byte
   chunks (should this feed reader have an error attribute or
   should this be an argument to the feed method?);
* A stateful stream reader that reads its input from a
   byte stream. The functionality for the stream reader could
   be implemented once using the underlying functionality of
   the feed reader (in fact we could implement something similar
   to sio's stacking streams: the stream reader would use
   the feed reader to wrap the byte input stream and add
   only a read() method. The line reading methods (readline(),
   readlines() and __iter__()) could be added by another stream
   filter).
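
The layering above could look roughly like the following sketch.
SketchStreamReader is a hypothetical name, and as a stand-in for
the feed reader level the sketch uses the incremental decoder
returned by codecs.getincrementaldecoder in Python 3, which is an
assumption of this example and not part of the proposal itself.

```python
import codecs
import io

# Level 1: stateless decoding function, returns (text, bytes_consumed).
stateless_decode = codecs.getdecoder("utf-8")
text, consumed = stateless_decode("äöü".encode("utf-8"))


class SketchStreamReader:
    """Hypothetical level 3 reader: adds only read() on top of a feed-style decoder."""

    def __init__(self, stream, encoding="utf-8", errors="strict"):
        self.stream = stream
        # Level 2 stand-in: buffers incomplete byte sequences internally.
        self._decoder = codecs.getincrementaldecoder(encoding)(errors)

    def read(self, size=-1):
        data = self.stream.read(size)
        # An empty read means end of input, so leftover bytes become an error.
        return self._decoder.decode(data, final=not data)


reader = SketchStreamReader(io.BytesIO("äöü".encode("utf-8")))
# Reading one byte at a time: every other call completes a character.
pieces = [reader.read(1) for _ in range(6)]
```

Note that read() can return an empty string before the end of the
input, so a line reading filter stacked on top would have to keep
calling read() until it sees an empty read from the byte stream.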

> The error callbacks could, however, raise an exception which
> includes all the needed information, including any state that
> may be needed in order to continue with coding operation.

This makes error callbacks effectively unusable with stateful
decoders.

> We may then need to allow additional keyword arguments on the
> encode/decode functions in order to preset a start state.

As those decoding functions are private to the decoder, that's
probably OK. I wouldn't want to see additional keyword arguments
on str.decode (which uses the stateless API anyway). BTW, that's
exactly what I did for codecs.utf_7_decode_stateful, but I'm not
really comfortable with the internal state of the UTF-7 decoder
being exposed on the Python level. It would be better to encapsulate
the state in a feed reader implemented in C, so that the state is
inaccessible from the Python level.

Bye,
    Walter Dörwald