[Python-Dev] Decoding incomplete unicode
Walter Dörwald
walter at livinglogic.de
Thu Aug 19 17:45:26 CEST 2004
Martin v. Löwis wrote:
> Walter Dörwald wrote:
>
>> They will not, because StreamReader.decode() already is a feed
>> style API (but with state amnesia).
>>
>> Any stream decoder that I can think of can be (and most are)
>> implemented by overriding decode().
>
> I consider that an unfortunate implementation artefact. You
> either use the stateless encode/decode that you get from
> codecs.get(encoder/decoder) or you use the file API on
> the streams. You never ever use encode/decode on streams.
That is exactly the problem with the current API.
StreamReader mixes two concepts:
1) The stateful API, which allows decoding a byte input
in chunks, with the state of the decoder kept between
calls.
2) A file API where the chunks to be decoded are read
from a byte stream.
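To make the distinction concrete, here is a minimal sketch of the two concepts kept apart. It is written against present-day Python 3 and assumes the codecs.getincrementaldecoder() interface found in later versions of the standard library, not the API under discussion in this thread. The sample string is only an illustration.

```python
import codecs
import io

data = "Dörwald".encode("utf-8")  # 'ö' takes two bytes in UTF-8

# Concept 1: a stateful decoder fed one byte at a time. The first byte
# of 'ö' on its own yields "", and the character is returned once its
# second byte arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(data[i:i + 1]) for i in range(len(data))]
stateful_result = "".join(pieces) + decoder.decode(b"", final=True)

# Concept 2: the file API, where the reader itself pulls the bytes
# from a byte stream.
reader = codecs.getreader("utf-8")(io.BytesIO(data))
file_result = reader.read()
```

In the first case the caller decides where the chunks come from; in the second the input has to be wrapped in a stream object first.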
> I would have preferred if the default .write implementation
> would have called self._internal_encode, and the Writer
> would *contain* a Codec, rather than inheriting from Codec.
This would separate the two concepts from above.
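A rough sketch of what such a containing writer might look like, in present-day Python 3. The class name ContainingWriter is hypothetical, the method name _internal_encode is taken from the quoted text, and the use of codecs.getincrementalencoder() from later standard-library versions is an assumption made for illustration only.

```python
import codecs
import io


class ContainingWriter:
    """Hypothetical writer that holds an encoder instead of inheriting from Codec."""

    def __init__(self, stream, encoding, errors="strict"):
        self.stream = stream
        # The stateful encoder is a separate object owned by the writer.
        self._encoder = codecs.getincrementalencoder(encoding)(errors)

    def _internal_encode(self, text):
        # Name from the quoted proposal; the body is an assumption.
        return self._encoder.encode(text)

    def write(self, text):
        # The file API only delegates to the contained encoder.
        self.stream.write(self._internal_encode(text))


buffer = io.BytesIO()
writer = ContainingWriter(buffer, "utf-8")
writer.write("Dör")
writer.write("wald")
written = buffer.getvalue()
```

The stream handling and the encoding state live in different objects here, so either could be used without the other.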
> Alas, for (I guess) simplicity, a more direct (and more
> confusing) approach was taken.
>
>> 1) Having feed() as part of the StreamReader API:
>> ---
>> s = u"???".encode("utf-8")
>> r = codecs.getreader("utf-8")()
>> for c in s:
>>     print r.feed(c)
>
>
> Isn't that a totally unrelated issue? Aren't we talking about
> short reads on sockets etc?
We're talking about two problems:
1) The current implementation does not really support the
stateful API, because trailing incomplete byte sequences
lead to errors.
2) The current file API is not really convenient for decoding
when the input is not read from a stream.
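The first problem can be reproduced with a short sketch in present-day Python 3, using plain bytes.decode() as the stateless decoder. The two-byte input and the variable names are illustrative assumptions, standing in for a short read on a socket.

```python
data = "ö".encode("utf-8")  # two bytes: b"\xc3\xb6"
first_chunk, second_chunk = data[:1], data[1:]

# A stateless decode of the first chunk alone fails, because the
# trailing incomplete byte sequence is treated as an error.
try:
    first_chunk.decode("utf-8")
    short_read_failed = False
except UnicodeDecodeError:
    short_read_failed = True

# Decoding only succeeds once the caller has buffered both chunks
# itself and passes the complete sequence.
text = (first_chunk + second_chunk).decode("utf-8")
```

Without decoder state, the buffering of incomplete sequences falls to every caller.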
> I would very much prefer to solve one problem at a time.
Bye,
Walter Dörwald