[Python-Dev] Decoding incomplete unicode

Walter Dörwald walter at livinglogic.de
Wed Jul 28 11:38:16 CEST 2004


Hye-Shik Chang wrote:

> On Tue, 27 Jul 2004 22:39:45 +0200, Walter Dörwald
> <walter at livinglogic.de> wrote:
> 
>>Python's unicode machinery currently has problems when decoding
>>incomplete input.
>>
>>When codecs.StreamReader.read() encounters a decoding error it
>>reads more bytes from the input stream and retries decoding.
>>This is broken for two reasons:
>>1) The error might be due to a malformed byte sequence in the input,
>>    a problem that can't be fixed by reading more bytes.
>>2) There may be no more bytes available at this time. Once more
>>    data is available decoding can't continue because bytes from
>>    the input stream have already been read and thrown away.
>>(sio.DecodingInputFilter has the same problems)
> 
> StreamReaders and -Writers from the CJK codecs do not suffer from
> these problems because they have an internal buffer for keeping state
> and the incomplete bytes of a sequence. In fact, the CJK codecs have
> their own implementation of UTF-8 and UTF-16 based on their
> multibytecodec system.  It provides a "working" StreamReader/Writer
> already. :)

Seems you had the same problems with the builtin stream readers! ;)

BTW, how do you solve the problem that incomplete byte sequences
are retained in the middle of a stream, but should generate errors
at the end?
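
To make the question concrete, here is a minimal sketch of the behaviour
I mean. It is not the code from the patch: it assumes an incremental
decoder interface with a "final" flag, as offered by
codecs.getincrementaldecoder() in later Python versions, and
decode_chunks() is a hypothetical helper made up for this illustration.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks (hypothetical helper).

    Incomplete trailing bytes of a chunk are kept by the decoder and
    completed by the next chunk. Only the closing call with final=True
    treats leftover bytes as an error.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    for chunk in chunks:
        # Mid-stream: an incomplete sequence is buffered, not reported.
        parts.append(decoder.decode(chunk))
    # End of stream: anything still buffered is now a decoding error.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


data = "Dörwald".encode("utf-8")

# Split in the middle of the two-byte sequence for "ö".
text = decode_chunks([data[:2], data[2:]])

# A stream that really ends after the first byte of "ö" must fail.
try:
    decode_chunks([data[:2]])
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```

The same incomplete byte is harmless in the first call and an error in
the second; the only difference is whether more input follows before
the stream ends.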

>>I've uploaded a patch that fixes these problems to SF:
>>http://www.python.org/sf/998993
>>
>>The patch implements a few additional features:
>>- read() has an additional argument chars that can be used to
>>   specify the number of characters that should be returned.
>>- readline() is supported on all readers derived from
>>   codecs.StreamReader().
> 
> I have no comment for these, yet.
> 
>>- readline() and readlines() have an additional option
>>   for dropping the u"\n".
> 
> +1
> 
> I wonder whether we need to add an optional argument to writelines()
> that appends a newline character to each line, then.

This would probably be a convenient additional feature, but of course
you could always pass a generator expression to writelines():

    stream.writelines(line+u"\n" for line in lines)
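
A slightly fuller sketch of that round trip, written against the codecs
module as it exists in present-day Python rather than against the patch.
The in-memory io.BytesIO stream and the sample lines are assumptions
chosen for illustration, and "keepends" is the name the current
readlines() uses for the newline-dropping option.

```python
import codecs
import io

lines = ["spam", "eggs"]

# Writing: add the newline to each line with a generator expression,
# so writelines() itself needs no extra argument.
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)
writer.writelines(line + "\n" for line in lines)

# Reading back: keepends=False drops the trailing newline again.
reader = codecs.getreader("utf-8")(io.BytesIO(raw.getvalue()))
roundtrip = reader.readlines(keepends=False)
```

With this, the lines written and the lines read back compare equal
without any manual stripping on the reading side.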

Bye,
    Walter Dörwald
