[Python-Dev] Decoding incomplete unicode
Walter Dörwald
walter at livinglogic.de
Tue Aug 24 22:15:01 CEST 2004
M.-A. Lemburg wrote:
> Martin v. Löwis wrote:
>
>> Walter Dörwald wrote:
>>
>>> OK, let's come up with a patch that fixes the incomplete byte
>>> sequences problem and then discuss non-stream APIs.
>>>
>>> So, what should the next step be?
>>
>> I think your first patch should be taken as a basis for that.
>
> We do need a way to communicate state between the codec
> and Python.
>
> However, I don't like the way that the patch
> implements this state handling: I think we should use a
> generic "state" object here which is passed to the stateful
> codec and returned together with the standard return values
> on output:
>
> def decode_stateful(data, state=None):
>     ... decode and modify state ...
>     return (decoded_data, length_consumed, state)
Another option might be that the decode function changes
the state object in place.
> where the object type and contents of the state variable
> are defined per codec (e.g. could be a tuple, just a single
> integer or some other special object).
If a tuple is passed and returned this makes it possible
from Python code to mangle the state. IMHO this should be
avoided if possible.
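
To make the two variants concrete, here is a minimal sketch in present-day Python 3, not the patch itself. It assumes UTF-8 and uses the undecoded trailing bytes as the state; the names decode_inplace and DecoderState are hypothetical, and codecs.utf_8_decode() with final=False is used only as a convenient way to stop before an incomplete sequence.

```python
import codecs

def decode_stateful(data, state=None):
    """Variant 1: state goes in and comes back out (here: pending bytes)."""
    buf = (state or b"") + data
    # final=False: stop before an incomplete trailing byte sequence
    text, used = codecs.utf_8_decode(buf, "strict", False)
    # All of data counts as consumed; the undecoded rest lives in the state
    return text, len(data), buf[used:]

class DecoderState:
    """Variant 2: opaque object (hypothetical) that the decoder changes in place."""
    def __init__(self):
        self._pending = b""

def decode_inplace(data, state):
    buf = state._pending + data
    text, used = codecs.utf_8_decode(buf, "strict", False)
    state._pending = buf[used:]  # mutate the state instead of returning it
    return text, len(data)
```

In the first variant the caller holds a plain bytes value and could alter it between calls; in the second the pending bytes sit behind an object the caller is not expected to touch.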
> Otherwise we'll end up having different interface
> signatures for all codecs and extending them to accommodate
> future enhancements will become infeasible without
> introducing yet another set of APIs.
We already have slightly different decoding functions:
utf_16_ex_decode() takes additional arguments.
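
As an illustration of that difference, a short sketch in present-day Python 3, where codecs.utf_16_ex_decode() is an undocumented helper. The argument order (data, errors, byteorder, final) is taken from how the standard library's own utf_16 codec calls it and should be treated as an assumption for other versions.

```python
import codecs

data = b"\xff\xfeh\x00i\x00"  # little-endian BOM followed by "hi"

# Plain variant: returns (text, bytes_consumed)
plain = codecs.utf_16_le_decode(data[2:], "strict", False)

# Extended variant: takes a byteorder argument (0 = detect from BOM)
# and returns the detected byte order as a third value
text, consumed, byteorder = codecs.utf_16_ex_decode(data, "strict", 0, False)
```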
> Let's discuss this some more and implement it for Python 2.5.
> For Python 2.4, I think we can get away with what we already
> have:
> [...]
OK, I've updated the patch.
> [...]
> The buffer logic should only be used for streams
> that do not support the interface to push back already
> read bytes (e.g. .unread()).
>
> From a design perspective, keeping read data inside the
> codec is the wrong thing to do, simply because it leaves
> the input stream in an undefined state in case of an error
> and there's no way to correlate the stream's read position
> to the location of the error.
>
> With a pushback method on the stream, all the stream
> data will be stored on the stream, not the codec, so
> the above would no longer be a problem.
On the other hand this requires a special stream. Data
already read is part of the codec state, so why not
put it into the codec?
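
A minimal sketch of the buffer logic, assuming UTF-8 and an ordinary byte stream without any pushback method. The class name BufferingReader and its bytebuffer attribute are hypothetical and not taken from the patch.

```python
import codecs
import io

class BufferingReader:
    """Hypothetical reader that keeps undecoded bytes as part of the codec state."""

    def __init__(self, stream, errors="strict"):
        self.stream = stream
        self.errors = errors
        self.bytebuffer = b""  # bytes read from the stream but not yet decoded

    def read(self, size=-1):
        data = self.bytebuffer + self.stream.read(size)
        # final=False: an incomplete trailing sequence is left undecoded
        text, used = codecs.utf_8_decode(data, self.errors, False)
        self.bytebuffer = data[used:]
        return text

reader = BufferingReader(io.BytesIO("\u00f6\u20ac".encode("utf-8")))
chunks = [reader.read(1), reader.read(1), reader.read(2), reader.read(1)]
```

The stream only needs a plain read(); the price is that the stream's read position runs ahead of what has actually been decoded.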
> However, we can always add the .unread() support to the
> stream codecs at a later stage, so it's probably ok
> to default to the buffer logic for Python 2.4.
OK.
>> That still leaves the issue
>> of the last read operation, which I'm tempted to leave unresolved
>> for Python 2.4. No matter what the solution is, it would likely
>> require changes to all codecs, which is not good.
>
> We could have a method on the codec which checks whether
> the codec buffer or the stream still has pending data
> left. Using this method is an application scope consideration,
> not a codec issue.
But this means that the normal error handling can't be used
for those trailing bytes.
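
To illustrate what is lost, a sketch in present-day Python 3 of treating the leftover bytes as a final decode call, so that the usual error handlers apply to them. The helper name finish is hypothetical, and UTF-8 is assumed.

```python
import codecs

def finish(pending, errors="strict"):
    """Hypothetical helper: decode leftover bytes as the last call (final=True)."""
    text, _ = codecs.utf_8_decode(pending, errors, True)
    return text

leftover = b"\xc3"  # first byte of a two-byte sequence; the stream ended here

replaced = finish(leftover, "replace")  # handler substitutes U+FFFD
ignored = finish(leftover, "ignore")    # handler drops the bytes

try:
    finish(leftover, "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

A method that only reports whether data is pending would leave each application to reimplement these three behaviours itself.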
Bye,
Walter Dörwald