[Python-Dev] Decoding incomplete unicode
Walter Dörwald
walter at livinglogic.de
Wed Jul 28 11:38:16 CEST 2004
Hye-Shik Chang wrote:
> On Tue, 27 Jul 2004 22:39:45 +0200, Walter Dörwald
> <walter at livinglogic.de> wrote:
>
>>Python's unicode machinery currently has problems when decoding
>>incomplete input.
>>
>>When codecs.StreamReader.read() encounters a decoding error it
>>reads more bytes from the input stream and retries decoding.
>>This is broken for two reasons:
>>1) The error might be due to a malformed byte sequence in the input,
>> a problem that can't be fixed by reading more bytes.
>>2) There may be no more bytes available at this time. Once more
>> data is available decoding can't continue because bytes from
>> the input stream have already been read and thrown away.
>>(sio.DecodingInputFilter has the same problems)
>
> StreamReaders and -Writers from CJK codecs do not suffer from
> these problems because they have an internal buffer for keeping state
> and the incomplete bytes of a sequence. In fact, CJK codecs has its
> own implementation of UTF-8 and UTF-16 based on its multibytecodec
> system. It provides a "working" StreamReader/Writer already. :)
Seems you had the same problems with the builtin stream readers! ;)
BTW, how do you solve the problem that incomplete byte sequences
are retained in the middle of a stream, but should generate errors
at the end?
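
To make the distinction concrete: the decoder needs to be told whether more input may follow. The following is only a minimal sketch, assuming the incremental decoder API that the codecs module gained in later Python versions (codecs.getincrementaldecoder and its final flag). It is not the patch discussed here and not the CJK codecs implementation.

```python
import codecs

# "ä" encoded as UTF-8 is two bytes: 0xC3 0xA4.
data = "\xe4".encode("utf-8")

decoder = codecs.getincrementaldecoder("utf-8")()

# Mid-stream: the lone lead byte is kept in the decoder's buffer
# and no character is returned yet.
first = decoder.decode(data[:1], final=False)

# The continuation byte arrives later and completes the character.
second = decoder.decode(data[1:], final=False)

# At the end of the stream the same lone lead byte is an error.
decoder.reset()
try:
    decoder.decode(data[:1], final=True)
    truncated_is_error = False
except UnicodeDecodeError:
    truncated_is_error = True
```

Here the caller decides what counts as "the end" by passing final=True, so the same incomplete sequence is buffered in one case and rejected in the other.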
>>I've uploaded a patch that fixes these problems to SF:
>>http://www.python.org/sf/998993
>>
>>The patch implements a few additional features:
>>- read() has an additional argument chars that can be used to
>> specify the number of characters that should be returned.
>>- readline() is supported on all readers derived from
>> codecs.StreamReader().
>
> I have no comment for these, yet.
>
>>- readline() and readlines() have an additional option
>> for dropping the u"\n".
>
> +1
>
> I wonder whether we need to add an optional argument to writelines()
> that adds a newline character to each line, then.
This would probably be a convenient additional feature,
but of course you could always pass a generator expression to writelines():
stream.writelines(line+u"\n" for line in lines)
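
A self-contained version of that one-liner might look like the sketch below. It is written for current Python 3. The UTF-8 writer wrapped around an in-memory io.BytesIO and the sample lines are assumptions made for illustration only.

```python
import codecs
import io

lines = ["spam", "eggs"]  # hypothetical sample data, without line ends

raw = io.BytesIO()                       # in-memory stand-in for a file
stream = codecs.getwriter("utf-8")(raw)  # StreamWriter wrapping it

# writelines() adds no line terminators itself, so the generator
# expression appends one to each line on the fly.
stream.writelines(line + "\n" for line in lines)

written = raw.getvalue()
```

No intermediate list is built by the caller, which is the attraction of the generator expression over an extra writelines() argument.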
Bye,
Walter Dörwald