[Python-Dev] Decoding incomplete unicode

Walter Dörwald walter at livinglogic.de
Thu Aug 26 22:10:40 CEST 2004


M.-A. Lemburg wrote:

>>> [...]
>>> def decode_stateful(data, state=None):
>>>     ... decode and modify state ...
>>>     return (decoded_data, length_consumed, state)
>>
>> Another option might be that the decode function changes
>> the state object in place.
> 
> Good idea.

But that's totally up to the implementor.
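
As a minimal sketch of the in-place variant (not part of the patch;
written for present-day Python 3, with UTF-8 picked only as an example
and DecodeState as a hypothetical state class), the decode function
could store the incomplete trailing bytes on the state object it was
given. codecs.utf_8_decode() is the low-level helper that the standard
encodings package itself uses; its third argument is the "final" flag.

```python
import codecs


class DecodeState:
    """Hypothetical state object: bytes left over from the previous call."""

    def __init__(self):
        self.pending = b""


def decode_stateful(data, state=None, errors="strict"):
    """Decode UTF-8, keeping an incomplete trailing sequence in `state`."""
    if state is None:
        state = DecodeState()
    buf = state.pending + data
    # final=False: an incomplete sequence at the end is not an error,
    # it is simply left unconsumed.
    text, consumed = codecs.utf_8_decode(buf, errors, False)
    state.pending = buf[consumed:]  # the state object is changed in place
    # Everything in `data` has been taken over, either as decoded text
    # or as pending bytes inside the state.
    return text, len(data), state
```

The caller keeps one state object per input and passes it to every
call; the returned state is the same object, so returning it is only a
convenience.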

>>> where the object type and contents of the state variable
>>> are defined per codec (e.g. could be a tuple, just a single
>>> integer or some other special object).
>>
>> If a tuple is passed and returned this makes it possible
>> from Python code to mangle the state. IMHO this should be
>> avoided if possible.
> 
>>> Otherwise we'll end up having different interface
>>> signatures for all codecs and extending them to accommodate
>>> future enhancements will become infeasible without
>>> introducing yet another set of APIs.
>>
>> We already have slightly different decoding functions:
>> utf_16_ex_decode() takes additional arguments.
> 
> Right - it was a step in the wrong direction. Let's not
> use a different path for the future.

utf_16_ex_decode() serves a purpose: it helps implement
the UTF-16 decoder, which has to switch to UTF-16-BE or
UTF-16-LE according to the BOM, so utf_16_ex_decode()
needs a way to communicate that back to the caller.
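
To illustrate why the extra return value is needed, here is a rough
sketch in present-day Python 3. It is not the real utf_16_ex_decode();
the function name and the -1/0/1 byte order convention used here are
assumptions made for the example. The caller passes the byte order
back in on the next call, so the BOM is only looked at once.

```python
import codecs


def utf16_decode_with_byteorder(data, byteorder=0, errors="strict"):
    """Return (text, consumed, byteorder).

    byteorder: 0 = not yet known, -1 = little endian, 1 = big endian.
    """
    consumed = 0
    if byteorder == 0:
        if len(data) < 2:
            # Not enough bytes to see the BOM yet; consume nothing.
            return "", 0, 0
        if data[:2] == codecs.BOM_UTF16_LE:
            byteorder = -1
        elif data[:2] == codecs.BOM_UTF16_BE:
            byteorder = 1
        else:
            raise UnicodeError("UTF-16 stream does not start with BOM")
        consumed = 2  # the BOM itself is not part of the decoded text
    if byteorder == -1:
        decode = codecs.utf_16_le_decode
    else:
        decode = codecs.utf_16_be_decode
    # final=False leaves an incomplete trailing code unit unconsumed.
    text, n = decode(data[consumed:], errors, False)
    return text, consumed + n, byteorder
```

Without the third element of the result, the caller would have no way
to know which of the two byte orders to use for the rest of the input.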

>>> Let's discuss this some more and implement it for Python 2.5.
>>> For Python 2.4, I think we can get away with what we already
>>> have:
>>
>>  > [...]
>>
>> OK, I've updated the patch.
>>
>>> [...]
>>> The buffer logic should only be used for streams
>>> that do not support the interface to push back already
>>> read bytes (e.g. .unread()).
>>>
>>> From a design perspective, keeping read data inside the
>>> codec is the wrong thing to do, simply because it leaves
>>> the input stream in an undefined state in case of an error
>>> and there's no way to correlate the stream's read position
>>> to the location of the error.
>>>
>>> With a pushback method on the stream, all the stream
>>> data will be stored on the stream, not the codec, so
>>> the above would no longer be a problem.
>>
>> On the other hand this requires a special stream. Data
>> already read is part of the codec state, so why not
>> put it into the codec?
> 
> Ideally, the codec should not store data,

I consider the remaining undecoded bytes to be part of
the codec state once they have been read from the stream.
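
A minimal sketch of that view, in present-day Python 3 and with UTF-8
as the example encoding: a reader that keeps the undecoded bytes in
its own buffer and works on any object with a read() method. The class
name BufferingReader and the attribute bytebuffer are made up for this
illustration and are not taken from the patch.

```python
import codecs
import io


class BufferingReader:
    """Hypothetical reader that stores undecoded bytes in the codec."""

    def __init__(self, stream, errors="strict"):
        self.stream = stream
        self.errors = errors
        self.bytebuffer = b""  # bytes read from the stream, not yet decoded

    def read(self, size):
        # Prepend what was left over from the previous call.
        data = self.bytebuffer + self.stream.read(size)
        text, consumed = codecs.utf_8_decode(data, self.errors, False)
        # Whatever could not be decoded yet stays with the reader.
        self.bytebuffer = data[consumed:]
        return text


reader = BufferingReader(io.BytesIO("a\u00f6".encode("utf-8")))
first = reader.read(2)   # reads b"a\xc3", the second byte is incomplete
second = reader.read(2)  # completes the sequence with b"\xb6"
```

The underlying stream needs nothing beyond read(), which is the point
of keeping the leftover bytes in the codec.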

> but only
> reference it. It's better to keep things well
> separated which is why I think we need the .unread()
> interface and eventually a queue interface to support
> the feeding operation.
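
For comparison, the pushback design could look roughly like this in
present-day Python 3. This is only a sketch under assumptions:
PushbackStream, its pushed attribute and read_text() are hypothetical
names, and the only thing taken from the discussion is the idea of an
.unread() method on the stream.

```python
import codecs
import io


class PushbackStream:
    """Hypothetical wrapper that adds .unread() to a binary stream."""

    def __init__(self, stream):
        self.stream = stream
        self.pushed = b""  # bytes handed back by the consumer

    def read(self, size=-1):
        if size < 0:
            data, self.pushed = self.pushed + self.stream.read(), b""
            return data
        # Serve pushed-back bytes first, then fall through to the stream.
        data, self.pushed = self.pushed[:size], self.pushed[size:]
        if len(data) < size:
            data += self.stream.read(size - len(data))
        return data

    def unread(self, data):
        self.pushed = data + self.pushed


def read_text(stream, size, errors="strict"):
    """Decode UTF-8 and push undecoded bytes back onto the stream."""
    data = stream.read(size)
    text, consumed = codecs.utf_8_decode(data, errors, False)
    stream.unread(data[consumed:])
    return text
```

Here the decoding function holds no data between calls; all pending
bytes live on the stream, at the price of requiring a stream type that
supports .unread().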
> 
>>> However, we can always add the .unread() support to the
>>> stream codecs at a later stage, so it's probably ok
>>> to default to the buffer logic for Python 2.4.
>>
>> OK.
>>
>>>> That still leaves the issue
>>>> of the last read operation, which I'm tempted to leave unresolved
>>>> for Python 2.4. No matter what the solution is, it would likely
>>>> require changes to all codecs, which is not good.
>>>
>>> We could have a method on the codec which checks whether
>>> the codec buffer or the stream still has pending data
>>> left. Using this method is an application scope consideration,
>>> not a codec issue.
>>
>> But this means that the normal error handling can't be used
>> for those trailing bytes.
> 
> Right, but then: missing data (which usually causes the trailing
> bytes) is really something for the application to deal with,
> e.g. by requesting more data from the user, another application
> or trying to work around the problem in some way. I don't think
> that the codec error handler can practically cover these
> possibilities.

But in many cases the user might want to use "ignore" or "replace"
error handling.
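
As an illustration of what that would mean for trailing bytes, here is
a small sketch using the incremental decoder interface of present-day
Python 3 (codecs.getincrementaldecoder() and the final argument of
decode()). The helper decode_chunks() is made up for the example; the
last call with final=True tells the decoder that no more data will
arrive, so leftover bytes go through the selected error handler.

```python
import codecs


def decode_chunks(chunks, errors="strict"):
    """Decode UTF-8 chunks; leftover bytes are handled per `errors`."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors=errors)
    parts = [decoder.decode(chunk) for chunk in chunks]
    # Signal end of input so that incomplete trailing bytes are reported
    # to the error handler instead of being silently kept.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


chunks = [b"ab", b"c\xc3"]  # the last byte starts a sequence that never ends
replaced = decode_chunks(chunks, "replace")
ignored = decode_chunks(chunks, "ignore")
```

With "replace" the truncated sequence turns into U+FFFD, with "ignore"
it is dropped, and with "strict" the final call raises
UnicodeDecodeError.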

Bye,
    Walter Dörwald



