[Python-Dev] Decoding incomplete unicode

Wed Jul 28 12:07:43 CEST 2004

M.-A. Lemburg wrote:

> Walter Dörwald wrote:
> 
>> Martin v. Löwis wrote:
>>
>>> M.-A. Lemburg wrote:
>>>
>>>> I like the idea, but don't think the implementation is
>>>> the right way to do it. Instead, I'd suggest using a new
>>>> error handling strategy "break" ( = break processing as
>>>> soon as errors are found).
>>>
>>> Can you demonstrate this approach in a patch? I think it
>>> is unimplementable: the codec cannot communicate to the
>>> error callback that it ran out of data.
>>
>> We would need a special attribute in the exception for
>> that, but the problem IMHO is a different one. This makes
>> it impossible to use other error handling schemes than
>> "break" in stateful decoders.
> 
> I don't understand... are you referring to some extra
> attribute for storing arbitrary state ?

The position of the error is not sufficient to determine
whether it is a truncated data error or a real one:
both r"a\xf".decode("unicode-escape") and
r"a\xfx".decode("unicode-escape") raise a UnicodeDecodeError
with exc.end == len(exc.object), i.e. the error is at
the end of the input. In the first case the error will
go away once more data is available; in the second case
it won't.
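
The same ambiguity can be shown with UTF-8, as a minimal sketch
for a current Python 3 interpreter. The helper name error_span is
made up for illustration, and UTF-8 is used instead of
unicode-escape because the positions reported by unicode-escape
may differ between Python versions:

```python
def error_span(data, encoding="utf-8"):
    """Return (start, end, length) of the decode error, or None on success."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start, exc.end, len(exc.object)
    return None

# Truncated: 0xC3 starts a two-byte sequence whose second byte is missing.
truncated = error_span(b"a\xc3")
# Malformed: 0xFF can never appear in UTF-8, no matter what follows.
malformed = error_span(b"a\xff")

# Both errors end exactly at the end of the input, so the position
# alone cannot tell the transient error from the real one.
print(truncated, malformed)
```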

> If so, why would
> adding such an attribute make it impossible to use
> other error handling schemes ?

It doesn't, but it would make it possible for the callback
to distinguish transient errors from real ones.

> The problem with your patch is that you are adding a whole
> new set of decoders to the core which duplicate much of what
> the already existing decoders implement. I don't like that
> duplication and would like to find a way to only have *one*
> implementation per decode operation.

I don't like the duplication either. In fact we might need
decoders that pass state, but do complain about truncated data
at the end of the stream. I think it's possible to find other
solutions. I would prefer stateful decoders implemented in C.
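
A decoder that keeps state between calls, but still complains
about truncated data at the end of the stream, could behave
roughly like the sketch below. It assumes the
codecs.getincrementaldecoder() interface of later Python versions,
which is not what the patch under discussion provides; classify is
a hypothetical helper:

```python
import codecs

def classify(chunks, encoding="utf-8"):
    """Feed chunks to a stateful decoder and report how decoding ended."""
    decoder = codecs.getincrementaldecoder(encoding)()
    out = []
    try:
        for chunk in chunks:
            # final=False: an incomplete trailing sequence is buffered
            # inside the decoder instead of being reported as an error.
            out.append(decoder.decode(chunk, final=False))
        # final=True: anything still buffered is now a real truncation.
        out.append(decoder.decode(b"", final=True))
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason
    return "".join(out)

complete = classify([b"a\xc3", b"\xb6"])   # split sequence, completed later
truncated = classify([b"a\xc3"])           # stream ends inside a sequence
malformed = classify([b"a\xff", b"x"])     # invalid byte, more data won't help
print(repr(complete), truncated, malformed)
```

Only the first call succeeds. The second fails at the end of the
stream and the third fails as soon as the invalid byte is seen.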

But I hope you agree that this is a problem that should be fixed.

> Of course, encoders
> would have to provide the same interfaces for symmetry
> reasons.

There are no encoders that have to keep state, except for
UTF-16.
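
The state a UTF-16 encoder has to carry is whether the byte order
mark has already been written. A minimal sketch, assuming the
codecs.getincrementalencoder() interface of later Python versions:

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()

first = encoder.encode("a")   # BOM followed by the code unit for "a"
second = encoder.encode("b")  # no BOM: the encoder remembers it was written

# The byte order depends on the platform, so accept either BOM.
has_bom = first[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(len(first), len(second), has_bom)
```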

Bye,
    Walter Dörwald