[Python-Dev] Decoding incomplete unicode

Walter Dörwald walter at livinglogic.de
Thu Jul 29 22:30:43 CEST 2004


M.-A. Lemburg wrote:

> Walter Dörwald wrote:
> 
>> M.-A. Lemburg wrote:
>>
>>> Walter Dörwald wrote:
>>> [...]
>>> The reason why stateless encode and decode APIs return the
>>> number of input items consumed is to accommodate for error
>>> handling situations like these where you want to stop
>>> coding and leave the remaining work to another step.

But then this effectively turns into a stateful decoder anyway.
What would happen if stateless decoders suddenly started to
decode less than the complete string? Every user would
have to check whether decoder(foo)[1] == len(foo).
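
Roughly, every call site would then need something like the
following (decoder stands for any hypothetical stateless decoding
function that returns a (decoded, consumed) pair):

# decoder is a hypothetical stateless decoder returning
# (decoded, bytes_consumed)
decoded, consumed = decoder(foo)
if consumed != len(foo):
    # The caller has to carry the undecoded tail forward itself,
    # i.e. the caller ends up keeping the decoder's state.
    pending = foo[consumed:]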

>> [...]
>> I wonder whether the decode method is part of the public
>> API for StreamReader.
> 
> It is: StreamReader/Writer are "sub-classes" of the Codec
> class.
> 
> However, there's nothing stating that .read() or .write()
> *must* use these methods to do their work and that's
> intentional.

Any read() method can be implemented on top of a stateful
decode() method.
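
A minimal sketch of the idea, assuming a decode() method that
keeps any incomplete trailing bytes as internal state and simply
returns whatever it could decode from the data passed in:

class Reader:
    def __init__(self, stream, decode):
        self.stream = stream   # underlying byte stream
        self.decode = decode   # stateful: retains incomplete sequences

    def read(self, size=-1):
        data = self.stream.read(size)
        return self.decode(data)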

>>> I see two possibilities here:
>>>
>>> 1. you write a custom StreamReader/Writer implementation
>>>    for each of the codecs which takes care of keeping
>>>    state and encoding/decoding as much as possible
>>
>> But I'd like to reuse at least some of the functionality
>> from PyUnicode_DecodeUTF8() etc.
>>
>> Would a version of PyUnicode_DecodeUTF8() with an additional
>> PyUTF_DecoderState * be OK?
> 
> Before you start putting more work into this, let's first
> find a good workable approach.

I agree that we need a proper design for this that gives us
the most convenient codec API without breaking backwards
compatibility (at least not for codec users). Breaking
compatibility for codec implementers shouldn't be an issue.
I'll see if I can come up with something over the weekend.

>>> 2. you extend the existing stateless codec implementations
>>>    to allow communicating state on input and output; the
>>>    stateless operation would then be a special case
>>>
>>>> But this isn't really a "StreamReader" any more, so the best
>>>> solution would probably be to have a three level API:
>>>> * A stateless decoding function (what codecs.getdecoder
>>>>   returns now);
>>>> * A stateful "feed reader", which keeps internal state
>>>>   (including undecoded byte sequences) and gets passed byte
>>>>   chunks (should this feed reader have a error attribute or
>>>>   should this be an argument to the feed method?);
>>>> * A stateful stream reader that reads its input from a
>>>>   byte stream. The functionality for the stream reader could
>>>>   be implemented once using the underlying functionality of
>>>>   the feed reader (in fact we could implement something similar
>>>>   to sio's stacking streams: the stream reader would use
>>>>   the feed reader to wrap the byte input stream and add
>>>>   only a read() method. The line reading methods (readline(),
>>>>   readlines() and __iter__() could be added by another stream
>>>>   filter)
>>>
>>> Why make things more complicated ?
>>>
>>> If you absolutely need a feed interface, you can feed
>>> your data to a StringIO instance which is then read from
>>> by StreamReader.
>>
>> This doesn't work, because a StringIO has only one file position:
>>  >>> import cStringIO
>>  >>> s = cStringIO.StringIO()
>>  >>> s.write("x")
>>  >>> s.read()
>> ''
> 
> Ah, you wanted to do both feeding and reading at the same
> time ?!

There is no other way: you pass the feeder byte string chunks
and it returns chunks of decoded objects. With a StreamReader
the reader itself will read those chunks from the underlying
stream.

Implementing a stream reader interface on top of a feed interface
is trivial: basically our current decode method *is* the feed
interface; the only problem is that the user has to keep state
(the undecoded bytes that have to be passed to the next call to
decode()). Move that state into an attribute of the instance,
drop it from the return value, and you have a feed interface.
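
In other words (a rough sketch, assuming a decode function that
returns a (decoded, consumed) pair and stops before an incomplete
trailing byte sequence):

class FeedDecoder:
    def __init__(self, decode, errors="strict"):
        self.decode = decode   # returns (decoded, bytes_consumed)
        self.errors = errors
        self.pending = ""      # undecoded bytes kept as instance state

    def feed(self, data):
        data = self.pending + data
        decoded, consumed = self.decode(data, self.errors)
        self.pending = data[consumed:]
        return decoded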

>> But something like the Queue class from the tests in the patch
>> might work.
> 
> Right... I don't think that we need a third approach to
> codecs just to implement feed based parsers.

We already have most of the functionality in the decode method.
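
For illustration, the kind of Queue object mentioned above would
(unlike StringIO) append on write() and consume from the front on
read() -- a minimal sketch, not the actual class from the patch's
tests:

class ByteQueue:
    def __init__(self):
        self.buffer = ""

    def write(self, data):
        # Append new bytes at the end.
        self.buffer += data

    def read(self, size=-1):
        # Consume bytes from the front.
        if size < 0:
            size = len(self.buffer)
        data = self.buffer[:size]
        self.buffer = self.buffer[size:]
        return data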

>>>>> The error callbacks could, however, raise an exception which
>>>>> includes all the needed information, including any state that
>>>>> may be needed in order to continue with coding operation.
>>>>
>>>> This makes error callbacks effectively unusable with stateful
>>>> decoders.
>>>
>>> Could you explain ?
>>
>> If you have to call the decode function with errors='break',
>> you will only get the break error handling and nothing else.
> 
> Yes and ... ? What else do you want it to do ?

The user can pass any value for the errors argument to the
StreamReader constructor, and the StreamReader should always
honor that error handling strategy. For example:

import codecs, cStringIO

count = 0
def countandreplace(exc):
    # Count decoding errors and replace each one with U+FFFD.
    global count
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("can't handle error")
    count += 1
    return (u"\ufffd", exc.end)

codecs.register_error("countandreplace", countandreplace)

s = cStringIO.StringIO("\xc3foo\xffbar\xc3")

us = codecs.getreader("utf-8")(s, errors="countandreplace")

The first \xc3 and the \xff are real errors; the trailing
\xc3 might be a transient one. To handle this with the break
handler strategy, the StreamReader would have to call the
decode() method with errors="break" instead of
errors="countandreplace". The break handler would then have to
decide whether it's a transient error or a real one (presumably
from some info in the exception). If it's a real one, it would
have to call the original error handler, but it has no way of
knowing what the original error handler was. If it's a transient
error, it would have to communicate this fact to the caller,
which could be done by changing an attribute in the exception
object. But the decoding function still has to put the retained
bytes back into the StreamReader, so that part doesn't get any
simpler. Altogether I find this method rather convoluted,
especially since we have most of the machinery in place. What
is missing is the implementation of real stateful decoding
functions.
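
To make the above concrete, such a break handler would roughly
have to look like this (the retain attribute is hypothetical,
and the end-of-data check is only a heuristic -- which is exactly
the machinery I find convoluted):

def break_(exc):
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("can't handle error")
    if exc.end == len(exc.object):
        # Possibly a truncated multi-byte sequence: ask the caller
        # to retain these bytes for the next chunk (hypothetical
        # attribute on the exception).
        exc.retain = exc.object[exc.start:]
        return (u"", len(exc.object))
    # A real error in the middle of the data: the original error
    # handler should run here, but break has no way of knowing
    # which handler the user originally asked for.
    raise exc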

>>>>> We may then need to allow additional keyword arguments on the
>>>>> encode/decode functions in order to preset a start state.
>>>>
>>>> As those decoding functions are private to the decoder that's
>>>> probably OK. I wouldn't want to see additional keyword arguments
>>>> on str.decode (which uses the stateless API anyway). BTW, that's
>>>> exactly what I did for codecs.utf_7_decode_stateful, but I'm not
>>>> really comfortable with the internal state of the UTF-7 decoder
>>>> being exposed on the Python level. It would be better to encapsulate
>>>> the state in a feed reader implemented in C, so that the state is
>>>> inaccessible from the Python level.
>>>
>>> See above: possibility 1 would be the way to go then.
>>
>> I might give this a try.
> 
> Again, please wait until we have found a good solution
> to this.

OK.

Bye,
    Walter Dörwald



