[Python-Dev] Decoding incomplete unicode
walter at livinglogic.de
Thu Aug 26 22:10:40 CEST 2004
M.-A. Lemburg wrote:
>>> def decode_stateful(data, state=None):
>>> ... decode and modify state ...
>>> return (decoded_data, length_consumed, state)
>> Another option might be that the decode function changes
>> the state object in place.
> Good idea.
But that's totally up to the implementor.
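To make the two variants concrete, here is a minimal sketch of a stateful decode function for UTF-8 that changes the state object in place. The `DecodeState` class and its `pending` attribute are hypothetical names used only for illustration. The sketch is written for current Python 3 and leans on `codecs.utf_8_decode()` with a `final` flag, which is an undocumented CPython helper, so treat its signature as an assumption.

```python
import codecs


class DecodeState:
    """Hypothetical state object: bytes of a not yet complete sequence."""

    def __init__(self):
        self.pending = b""


def decode_stateful(data, state=None):
    """Decode UTF-8 bytes, keeping an incomplete trailing sequence in state."""
    if state is None:
        state = DecodeState()
    buf = state.pending + data
    # final=False: stop in front of an incomplete trailing sequence
    # instead of raising an error for it.
    text, consumed = codecs.utf_8_decode(buf, "strict", False)
    # Modify the state in place; the caller's object sees the change.
    state.pending = buf[consumed:]
    # All of `data` counts as consumed: it is either decoded or
    # parked in the state object.
    return text, len(data), state
```

Since the state is an opaque object rather than a tuple, calling code has no obvious reason to take it apart and rebuild it.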
>>> where the object type and contents of the state variable
>>> are defined per codec (e.g. could be a tuple, just a single
>>> integer or some other special object).
>> If a tuple is passed and returned this makes it possible
>> from Python code to mangle the state. IMHO this should be
>> avoided if possible.
>>> Otherwise we'll end up having different interface
>>> signatures for all codecs and extending them to accommodate
>>> future enhancements will become infeasible without
>>> introducing yet another set of APIs.
>> We already have slightly different decoding functions:
>> utf_16_ex_decode() takes additional arguments.
> Right - it was a step in the wrong direction. Let's not
> use a different path for the future.
utf_16_ex_decode() serves a purpose: it helps implement
the UTF-16 decoder, which has to switch to UTF-16-BE or
UTF-16-LE according to the BOM, so utf_16_ex_decode()
needs a way to communicate that back to the caller.
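A rough sketch of how a caller could use that extra return value, assuming the current CPython form `codecs.utf_16_ex_decode(data, errors, byteorder, final)`, which returns `(text, consumed, byteorder)` with -1 for little endian and 1 for big endian. The helper name `decode_utf16_chunks` is made up for this example, and a stream without a BOM gets no special treatment here.

```python
import codecs


def decode_utf16_chunks(chunks):
    """Decode UTF-16 byte chunks, picking LE or BE once the BOM is seen."""
    decode = None      # fixed-endianness decoder, chosen after the BOM
    pending = b""      # bytes read but not decoded yet
    parts = []
    for chunk in chunks:
        buf = pending + chunk
        if decode is None:
            # byteorder=0 asks the decoder to look for a BOM.
            text, consumed, byteorder = codecs.utf_16_ex_decode(
                buf, "strict", 0, False)
            if byteorder == -1:
                decode = codecs.utf_16_le_decode
            elif byteorder == 1:
                decode = codecs.utf_16_be_decode
        else:
            text, consumed = decode(buf, "strict", False)
        parts.append(text)
        pending = buf[consumed:]
    return "".join(parts)
```

The third element of the result is the only channel through which the caller learns which byte order was detected, which is why the signature differs from the other decoding functions.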
>>> Let's discuss this some more and implement it for Python 2.5.
>>> For Python 2.4, I think we can get away with what we already
>> > [...]
>> OK, I've updated the patch.
>>> The buffer logic should only be used for streams
>>> that do not support the interface to push back already
>>> read bytes (e.g. .unread()).
>>> From a design perspective, keeping read data inside the
>>> codec is the wrong thing to do, simply because it leaves
>>> the input stream in an undefined state in case of an error
>>> and there's no way to correlate the stream's read position
>>> to the location of the error.
>>> With a pushback method on the stream, all the stream
>>> data will be stored on the stream, not the codec, so
>>> the above would no longer be a problem.
>> On the other hand this requires a special stream. Data
>> already read is part of the codec state, so why not
>> put it into the codec?
> Ideally, the codec should not store data,
I consider the remaining undecoded bytes to be part of
the codec state once they have been read from the stream.
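As an illustration of that view, here is a minimal sketch of a reader that keeps the undecoded tail in its own buffer. `BufferingReader` and `bytebuffer` are names chosen for this example, not necessarily those used in the patch, and `codecs.utf_8_decode()` with a `final` flag is an undocumented CPython helper whose signature is assumed here.

```python
import codecs
import io


class BufferingReader:
    """Sketch: undecoded trailing bytes are stored in the reader itself."""

    def __init__(self, stream):
        self.stream = stream
        self.bytebuffer = b""   # bytes read from the stream, not yet decoded

    def read(self, size=-1):
        data = self.bytebuffer + self.stream.read(size)
        # final=False: an incomplete trailing sequence is left undecoded.
        text, consumed = codecs.utf_8_decode(data, "strict", False)
        self.bytebuffer = data[consumed:]
        return text


reader = BufferingReader(io.BytesIO("a\u20ac".encode("utf-8")))
first = reader.read(2)    # read size cuts the euro sign's bytes in two
second = reader.read(2)   # buffered byte plus the rest of the sequence
```

This works with any stream that has a plain read() method; nothing else is required of it.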
> but only
> reference it. It's better to keep things well
> separated which is why I think we need the .unread()
> interface and eventually a queue interface to support
> the feeding operation.
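For comparison, a sketch of what the pushback variant might look like. No stream with an .unread() method exists in the standard library; `PushbackStream`, `unread()` and `read_text()` are all hypothetical and only show where the leftover bytes would live in that design. As before, `codecs.utf_8_decode()` with a `final` flag is an undocumented CPython helper.

```python
import codecs
import io


class PushbackStream:
    """Hypothetical stream wrapper with an .unread() method."""

    def __init__(self, raw):
        self.raw = raw
        self.pushed = b""   # bytes handed back by the codec

    def read(self, size=-1):
        if size < 0:
            data = self.pushed + self.raw.read()
            self.pushed = b""
            return data
        data = self.pushed[:size]
        self.pushed = self.pushed[size:]
        if len(data) < size:
            data += self.raw.read(size - len(data))
        return data

    def unread(self, data):
        # Pushed-back bytes are returned first by the next read().
        self.pushed = data + self.pushed


def read_text(stream, size):
    """Decode up to `size` bytes; return undecoded bytes to the stream."""
    data = stream.read(size)
    text, consumed = codecs.utf_8_decode(data, "strict", False)
    stream.unread(data[consumed:])
    return text
```

Here the codec holds no data between calls, at the cost of requiring a stream type that supports the extra method.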
>>> However, we can always add the .unread() support to the
>>> stream codecs at a later stage, so it's probably ok
>>> to default to the buffer logic for Python 2.4.
>>>> That still leaves the issue
>>>> of the last read operation, which I'm tempted to leave unresolved
>>>> for Python 2.4. No matter what the solution is, it would likely
>>>> require changes to all codecs, which is not good.
>>> We could have a method on the codec which checks whether
>>> the codec buffer or the stream still has pending data
>>> left. Using this method is an application scope consideration,
>>> not a codec issue.
>> But this means that the normal error handling can't be used
>> for those trailing bytes.
> Right, but then: missing data (which usually causes the trailing
> bytes) is really something for the application to deal with,
> e.g. by requesting more data from the user, another application
> or trying to work around the problem in some way. I don't think
> that the codec error handler can practically cover these
But in many cases the user might want to use "ignore" or "replace"
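A small sketch of what that means in practice, using the incremental decoder API that later Python versions provide (`codecs.getincrementaldecoder()` and the `final` argument of `decode()`); it was not available when this was written. The helper name `decode_all` is made up for the example.

```python
import codecs


def decode_all(chunks, errors="strict"):
    """Decode UTF-8 chunks; trailing bytes go through the error handler."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors=errors)
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True tells the decoder no more data will come, so leftover
    # bytes are handled according to `errors` instead of being buffered.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


truncated = [b"a\xe2", b"\x82"]             # euro sign missing its last byte
ignored = decode_all(truncated, "ignore")   # trailing bytes are dropped
replaced = decode_all(truncated, "replace") # trailing bytes become U+FFFD
```

With "strict" the same input raises UnicodeDecodeError on the final call, so the application can still choose to deal with the missing data itself.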