[Python-Dev] Decoding incomplete unicode

Thu Aug 19 19:34:33 CEST 2004

M.-A. Lemburg wrote:

> Walter Dörwald wrote:
> 
>> Martin v. Löwis wrote:
>>
>> [...]
>> We already have an efficient way to communicate incompleteness:
>> the decode method returns the number of decoded bytes.
>>
>> The questions remaining are
>>
>> 1) communicate to whom? IMHO the info should only be used
>>    internally by the StreamReader.
> 
> Handling incompleteness should be something for the codec
> to deal with.

Absolutely. This means that decode() should not be called
by the user. (But the implementation of read() (and feed(),
if we have it) calls it.)

> The queue doesn't have to know about it in any
> way. However, the queue should have interfaces allowing the
> codec to tell whether there are more bytes waiting to be
> processed.

This won't work when the byte stream wrapped by the
StreamReader is not a queue. (Or do you want to wrap the
byte stream in a queue? This would be three wrapping layers.)

And the information is not really useful, because it might
change (e.g. when the user puts additional data into the
queue/stream.)

>> 2) When is incompleteness OK? Incompleteness is of course
>>    not OK in the stateless API. For the stateful API,
>>    incompleteness has to be OK even when the input stream
>>    is (temporarily) exhausted, because otherwise a feed mode
>>    wouldn't work anyway. But then incompleteness is always OK,
>>    because the StreamReader can't distinguish a temporarily
>>    exhausted input stream from a permanently exhausted one.
>>    The only fix for this I can think of is the final argument.
> 
> A final argument may be the way to go. But it should be an
> argument for the .read() method (not only the .decode() method)
> since that's the method reading the data from the queue.

Yes. E.g. the low level charmap decode function doesn't
need the final argument, because there is zero state to
be kept between calls.
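
As a rough illustration (a sketch in present-day Python, not code
from the patch): a single-byte codec such as latin-1 consumes every
byte it is given, so the input can be split anywhere, while UTF-8 may
report fewer bytes consumed than it was handed. The sample string and
the split points are arbitrary choices made for this example.

```python
import codecs

data = "D\xf6rwald".encode("latin-1")

# A charmap-style codec maps each byte independently, so decoding the
# whole input or two arbitrary halves gives the same text.
whole, consumed = codecs.latin_1_decode(data, "strict")
head, n1 = codecs.latin_1_decode(data[:2], "strict")
tail, n2 = codecs.latin_1_decode(data[2:], "strict")

# UTF-8 needs two bytes for the second character; cut between them.
utf8 = "D\xf6rwald".encode("utf-8")
# final=False: an incomplete trailing sequence is left unconsumed.
partial, n3 = codecs.utf_8_decode(utf8[:2], "strict", False)
```

Here n1 + n2 equals consumed, whereas n3 is smaller than the two
bytes passed in; the caller has to keep the leftover byte.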

> I'd suggest that we extend the existing encode and decode
> codec APIs to take an extra state argument that holds the
> codec state in whatever format the codec needs (e.g. this
> could be a tuple or a special object):
> 
> encode(data, errors='strict', state=None)
> decode(data, errors='strict', state=None)

We don't need a specification for that. The stateless
API doesn't need an explicit state (the state is just
a bunch of variables at the C level) and for the
stateful API the state can be put into StreamReader
attributes. How this state looks is totally up to
the StreamReader itself (see the UTF-7 reader in the
patch for an example). If the stream reader passes
on this state to a low level decoding function
implemented in C, how this state info looks is again
totally up to the codec.

So I think we don't have to specify anything in this
area.

> In the case of the .read() method, decode() would be
> called. If the returned length_consumed does not match
> the length of the data input, the remaining items would
> have to be placed back onto the queue in non-final mode.
> In final mode an exception would be raised to signal
> the problem.

Yes, in non-final mode the bytes would have to be retained
and in final mode an exception is raised (except when
the error handling callback does something else). But I don't
think we should put a queue between the byte stream and
the StreamReader (at least not in the sense of a queue as
another file like object). The remaining items can be kept
in an attribute of the StreamReader instance; that's what
---
data = self.bytebuffer + newdata
object, decodedbytes = self.decode(data, self.errors)
self.bytebuffer = data[decodedbytes:]
---
does in the patch.

The first line combines the items retained from the
last call with those read from the stream (or passed
to the feed method).

The second line does semi-stateful decoding of those
bytes.

The third line puts the new remaining items back into
the buffer.

The decoding is "semi-stateful", because the info about
the remaining bytes is not stored by decode itself, but
by the caller of decode. feed() is the method that does
fully stateful decoding of byte chunks.
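
A minimal sketch of such a feed() method, assuming present-day Python
and UTF-8 as the codec. The class name FeedDecoder and the way the
final flag is passed through are illustrative assumptions, not the
interface from the patch; only the three buffer-handling lines follow
the snippet above.

```python
import codecs


class FeedDecoder:
    """Hypothetical feed-style decoder; names are illustrative only."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self.bytebuffer = b""

    def feed(self, newdata, final=False):
        # Combine the bytes retained from the last call with the new chunk.
        data = self.bytebuffer + newdata
        # Decode as much as possible. With final=False an incomplete
        # trailing sequence is left undecoded instead of raising.
        text, decodedbytes = codecs.utf_8_decode(data, self.errors, final)
        # Keep whatever was not consumed for the next call.
        self.bytebuffer = data[decodedbytes:]
        return text


decoder = FeedDecoder()
# The two-byte UTF-8 sequence is split across the chunks.
chunks = [b"D\xc3", b"\xb6rwald"]
pieces = [decoder.feed(chunk) for chunk in chunks]
```

Calling feed() with final=True on input that ends in an incomplete
sequence raises UnicodeDecodeError under the "strict" error handler,
which matches the final-mode behaviour discussed above.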

> I think it's PEP time for this new extension. If time
> permits I'll craft an initial version over the weekend.

I'm looking forward to the results.

Bye,
    Walter Dörwald