[Python-Dev] Decoding incomplete unicode

Wed Aug 18 22:12:56 CEST 2004

M.-A. Lemburg wrote:

> Martin v. Löwis wrote:
> 
>> M.-A. Lemburg wrote:
>>
>>> I've thought about this some more. Perhaps I'm still missing
>>> something, but wouldn't it be possible to add a feeding
>>> mode to the existing stream codecs by creating a new queue
>>> data type (much like the queue you have in the test cases of
>>> your patch) and using the stream codecs on these ?
>>
>> Here is the problem. In UTF-8, how does the actual algorithm
>> tell (the application) that the bytes it got on decoding provide
>> for three fully decodable characters, and that 2 bytes are left
>> undecoded, and that those bytes are not inherently ill-formed,
>> but lack a third byte to complete the multi-byte sequence?
> 
> This state can be stored in the stream codec instance,
> e.g. by using a special state object that is stored in
> the instance and passed to the encode/decode APIs of the
> codec or by implementing the stream codec itself in C.

That's exactly what my patch does. The state (the bytes
that have already been read from the input stream, but
couldn't be decoded and have to be used on the next
call to read()) is stored in the bytebuffer attribute
of the StreamReader. Most stateful decoders use this
type of state; the only one I can think of that uses more
than this is the UTF-7 decoder, where the decoder decodes
partial +xxxx- sequences, but then has to keep the current
shift state and the partially consumed bits and bytes.
This decoder could be changed so that it works with only
a byte buffer too, but that would mean that the decoder
doesn't start decoding incomplete +xxxx- sequences, but
retains them in the byte buffer and only decodes them once
the "-" is encountered.
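
To make the byte buffer idea concrete, here is a minimal
sketch. It is not the code from the patch: the class name
ByteBufferDecoder and its feed() method are made up for
illustration, and it assumes present-day Python 3, where
codecs.utf_8_decode(data, errors, final) returns the decoded
text together with the number of bytes it consumed.

```python
import codecs


class ByteBufferDecoder:
    """Hypothetical feed-style UTF-8 decoder (illustration only)."""

    def __init__(self, errors="strict"):
        self.errors = errors
        # Bytes already read but not yet decodable.
        self.bytebuffer = b""

    def feed(self, data, final=False):
        # Prepend whatever was left over from the previous call.
        data = self.bytebuffer + data
        # With final=False a truncated trailing sequence is not an
        # error; it is simply reported as not consumed.
        text, consumed = codecs.utf_8_decode(data, self.errors, final)
        self.bytebuffer = data[consumed:]
        return text


decoder = ByteBufferDecoder()
encoded = "a\u00e4\u20ac".encode("utf-8")  # 1 + 2 + 3 bytes

first = decoder.feed(encoded[:5])   # stops two bytes into the 3-byte char
pending = decoder.bytebuffer        # the two undecoded bytes
rest = decoder.feed(encoded[5:], final=True)
```

After the first call the two leftover bytes sit in bytebuffer;
they are neither dropped nor reported as ill-formed, and the
second call completes the character.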

In fact a trivial implementation of any stateful decoder
could put *everything* it reads into the bytebuffer when
final==False and decode it in one go once read() is called
with final==True.
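
A sketch of that trivial variant, again with made-up names
(BufferEverythingDecoder, feed()) and assuming Python 3; it
only relies on bytes.decode():

```python
class BufferEverythingDecoder:
    """Hypothetical decoder that defers all work until final=True."""

    def __init__(self, encoding, errors="strict"):
        self.encoding = encoding
        self.errors = errors
        self.bytebuffer = b""

    def feed(self, data, final=False):
        # Hoard every byte; nothing is decoded before the final call.
        self.bytebuffer += data
        if not final:
            return ""
        text = self.bytebuffer.decode(self.encoding, self.errors)
        self.bytebuffer = b""
        return text


lazy = BufferEverythingDecoder("utf-8")
early = lazy.feed(b"a\xc3")             # nothing decoded yet
late = lazy.feed(b"\xa4", final=True)   # everything decoded at once
```

This is correct for any encoding, but the caller sees no
characters until the very end of the input.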

But IMHO each decoder should decode as much as possible.

> We do need to extend the API between the stream codec
> and the encode/decode functions, no doubt about that.
> However, this is an extension that is well hidden from
> the user of the codec and won't break code.

Exactly: this shouldn't be anything officially documented,
because what kind of data is passed around depends on the
codec. And when the stream reader is implemented in C there
isn't any API anyway.

>> On top of that, you can implement whatever queuing or streaming
>> APIs you want, but you *need* an efficient way to communicate
>> incompleteness.
> 
> Agreed.

Bye,
    Walter Dörwald