[Python-Dev] Decoding incomplete unicode

Tue Aug 24 22:15:01 CEST 2004

M.-A. Lemburg wrote:

> Martin v. Löwis wrote:
> 
>> Walter Dörwald wrote:
>>
>>> OK, let's come up with a patch that fixes the incomplete byte
>>> sequences problem and then discuss non-stream APIs.
>>>
>>> So, what should the next step be?
>>
>> I think your first patch should be taken as a basis for that.
> 
> We do need a way to communicate state between the codec
> and Python.
> 
> However, I don't like the way that the patch
> implements this state handling: I think we should use a
> generic "state" object here which is passed to the stateful
> codec and returned together with the standard return values
> on output:
> 
> def decode_stateful(data, state=None):
>     ... decode and modify state ...
>     return (decoded_data, length_consumed, state)

Another option might be that the decode function changes
the state object in place.
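
A minimal sketch of that in-place variant, for illustration only. The
DecoderState class and its "pending" attribute are hypothetical names, and
the sketch assumes a low-level decoder that reports how many bytes it
consumed, which is how codecs.utf_8_decode(data, errors, final) behaves in
current Python:

```python
import codecs


class DecoderState:
    """Hypothetical mutable state object: holds the undecoded tail bytes."""

    def __init__(self):
        self.pending = b""


def decode_stateful(data, state, errors="strict"):
    """Decode UTF-8 and update `state` in place instead of returning it."""
    data = state.pending + data
    # final=False: an incomplete trailing sequence is left unconsumed
    # instead of being reported as an error.
    text, consumed = codecs.utf_8_decode(data, errors, False)
    state.pending = data[consumed:]
    return text, consumed


state = DecoderState()
first, _ = decode_stateful(b"\xe2\x82", state)  # first two bytes of U+20AC
second, _ = decode_stateful(b"\xac", state)     # the remaining byte
```

The caller only holds a reference to the state object; the return value
keeps the usual (decoded_data, length_consumed) shape.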

> where the object type and contents of the state variable
> are defined per codec (e.g. could be a tuple, just a single
> integer or some other special object).

If a tuple is passed and returned, Python code is able to
mangle the state. IMHO this should be avoided if possible.

> Otherwise we'll end up having different interface
> signatures for all codecs, and extending them to accommodate
> future enhancements will become infeasible without
> introducing yet another set of APIs.

We already have slightly different decoding functions:
utf_16_ex_decode() takes additional arguments.

> Let's discuss this some more and implement it for Python 2.5.
> For Python 2.4, I think we can get away with what we already
> have:
> [...]

OK, I've updated the patch.

> [...]
> The buffer logic should only be used for streams
> that do not support the interface to push back already
> read bytes (e.g. .unread()).
> 
> From a design perspective, keeping read data inside the
> codec is the wrong thing to do, simply because it leaves
> the input stream in an undefined state in case of an error
> and there's no way to correlate the stream's read position
> to the location of the error.
> 
> With a pushback method on the stream, all the stream
> data will be stored on the stream, not the codec, so
> the above would no longer be a problem.

On the other hand, this requires a special stream. Data
already read is part of the codec state, so why not
put it into the codec?
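
As a rough sketch of the buffer logic (not the code from the patch): the
BufferingReader class and its "bytebuffer" attribute are made-up names, and
the sketch again assumes a decoder that reports the number of bytes
consumed, as codecs.utf_8_decode does in current Python. The wrapped stream
only needs a plain read() method:

```python
import codecs
import io


class BufferingReader:
    """Hypothetical reader that keeps undecoded bytes inside the codec."""

    def __init__(self, stream, errors="strict"):
        self.stream = stream
        self.errors = errors
        self.bytebuffer = b""  # bytes read from the stream but not yet decoded

    def read(self, size=-1):
        data = self.bytebuffer + self.stream.read(size)
        # final=False: stop before an incomplete trailing sequence.
        text, consumed = codecs.utf_8_decode(data, self.errors, False)
        self.bytebuffer = data[consumed:]
        return text


# "a" followed by U+20AC; the second character is split across two reads.
reader = BufferingReader(io.BytesIO("a\u20ac".encode("utf-8")))
chunks = [reader.read(2), reader.read(2)]
```

No .unread() is needed here, at the price that the stream's read position
runs ahead of what has actually been decoded.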

> However, we can always add the .unread() support to the
> stream codecs at a later stage, so it's probably ok
> to default to the buffer logic for Python 2.4.

OK.

>> That still leaves the issue
>> of the last read operation, which I'm tempted to leave unresolved
>> for Python 2.4. No matter what the solution is, it would likely
>> require changes to all codecs, which is not good.
> 
> We could have a method on the codec which checks whether
> the codec buffer or the stream still has pending data
> left. Using this method is an application scope consideration,
> not a codec issue.

But this means that the normal error handling can't be used
for those trailing bytes.
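
To illustrate what I mean, here is one possible alternative, sketched under
the assumption that the decoder accepts a "final" flag (as
codecs.utf_8_decode does in current Python). The decode_chunks helper is
hypothetical. Once the last read is signalled, leftover bytes go through
the ordinary error handler:

```python
import codecs


def decode_chunks(chunks, errors="strict"):
    """Decode a list of UTF-8 byte chunks, treating the last one as final."""
    pending = b""
    out = []
    for i, chunk in enumerate(chunks):
        final = i == len(chunks) - 1
        data = pending + chunk
        # With final=True, an incomplete trailing sequence is passed to the
        # error handler instead of being silently left unconsumed.
        text, consumed = codecs.utf_8_decode(data, errors, final)
        pending = data[consumed:]
        out.append(text)
    return "".join(out)


# Input ends in the middle of a three-byte sequence.
truncated = [b"a", b"\xe2\x82"]
replaced = decode_chunks(truncated, errors="replace")
```

With errors="strict" the same input raises UnicodeDecodeError, so the
trailing bytes are reported like any other malformed input.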

Bye,
    Walter Dörwald