[Python-Dev] Decoding incomplete unicode
Walter Dörwald
walter at livinglogic.de
Wed Aug 18 22:35:31 CEST 2004
M.-A. Lemburg wrote:
> Walter Dörwald wrote:
>
>>> I've thought about this some more. Perhaps I'm still missing
>>> something, but wouldn't it be possible to add a feeding
>>> mode to the existing stream codecs by creating a new queue
>>> data type (much like the queue you have in the test cases of
>>> your patch) and using the stream codecs on these ?
>>
>>
>> No, because when the decode method encounters an incomplete
>> chunk (and so returns a size that is smaller than the size of the
>> input) read() would have to push the remaining bytes back into
>> the queue. This would be code similar in functionality
>> to the feed() method from the patch, with the difference that
>> the buffer lives in the queue not the StreamReader. So
>> we won't gain any code simplification by going this route.
>
> Maybe not code simplification, but the APIs will be well-
> separated.
They will not, because StreamReader.decode() is already a
feed-style API (but with state amnesia).
Any stream decoder that I can think of can be (and most are)
implemented by overriding decode().
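To illustrate what "state amnesia" means here: decode() reports how many
bytes it consumed but does not remember the rest, so the caller has to
carry the leftover bytes into the next call. Below is a minimal sketch of
that bookkeeping, written for Python 3 so it can be run as is.
FeedingReader and its pending attribute are hypothetical names, not the
code from the patch, and the sketch assumes that codecs.utf_8_decode
accepts a third "final" argument, as it does in current CPython. The
sample string is arbitrary.

```python
import codecs


class FeedingReader:
    """Hypothetical helper: remembers undecoded bytes between feed() calls."""

    def __init__(self, decode, errors="strict"):
        self.decode = decode      # stateless function: (bytes, errors, final) -> (text, consumed)
        self.errors = errors
        self.pending = b""        # bytes that decode() has not consumed yet

    def feed(self, data):
        # Prepend whatever was left over from the previous call.
        data = self.pending + data
        # final=False: an incomplete trailing sequence is not an error.
        text, consumed = self.decode(data, self.errors, False)
        # Keep the unconsumed tail for the next call.
        self.pending = data[consumed:]
        return text


reader = FeedingReader(codecs.utf_8_decode)
encoded = "\u00e9\u20ac".encode("utf-8")   # 2-byte and 3-byte sequences
chunks = [reader.feed(encoded[i:i + 1]) for i in range(len(encoded))]
```

Each call returns an empty string until the last byte of a character
arrives, at which point the complete character comes out.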
> If we require the queue type for feeding mode operation
> we are free to define whatever APIs are needed to communicate
> between the codec and the queue type, e.g. we could define
> a method that pushes a few bytes back onto the queue end
> (much like ungetc() in C).
That would of course be a possibility.
>>> I think such a queue would be generally useful in other
>>> contexts as well, e.g. for implementing fast character based
>>> pipes between threads, non-Unicode feeding parsers, etc.
>>> Using such a type you could potentially add a feeding
>>> mode to stream or file-object API based algorithms very
>>> easily.
>>
>> Yes, so we could put this Queue class into a module with
>> string utilities. Maybe string.py?
>
> Hmm, I think a separate module would be better since we
> could then recode the implementation in C at some point
> (and after the API has settled).
> We'd only need a new name for it, e.g. StreamQueue or
> something.
Sounds reasonable.
>>> We could then have a new class, e.g. FeedReader, which
>>> wraps the above in a nice API, much like StreamReaderWriter
>>> and StreamRecoder.
>>
>> But why should we, when decode() does most of what we need,
>> and the rest has to be implemented in both versions?
>
> To hide the details from the user. It should be possible
> to instantiate one of these StreamQueueReaders (named
> after the queue type) and simply use it in feeding
> mode without having to bother about the details behind
> the implementation.
>
> StreamReaderWriter and StreamRecoder exist for the same
> reason.
Let's compare example uses:
1) Having feed() as part of the StreamReader API:
---
import codecs

s = u"???".encode("utf-8")
r = codecs.getreader("utf-8")()
for c in s:
    print r.feed(c)
---
2) Explicitly using a queue object:
---
import codecs
from whatever import StreamQueue

s = u"???".encode("utf-8")
q = StreamQueue()
r = codecs.getreader("utf-8")(q)
for c in s:
    q.write(c)
    print r.read()
---
3) Using a special wrapper that implicitly creates a queue:
---
import codecs
from whatever import StreamQueueWrapper

s = u"???".encode("utf-8")
r = StreamQueueWrapper(codecs.getreader("utf-8"))
for c in s:
    print r.feed(c)
---
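For reference, the queue that options 2) and 3) rely on could be quite
small. The following is only a sketch under assumptions: the StreamQueue
API was never settled in this thread, so the write()/read() pair below is
a guess at the minimum a StreamReader needs, and the code targets
Python 3 so it can be run as is. The sample string is arbitrary.

```python
import codecs


class StreamQueue:
    """Hypothetical byte queue: write() appends, read() removes from the front."""

    def __init__(self):
        self._buffer = b""

    def write(self, data):
        self._buffer += data

    def read(self, size=-1):
        # A negative size means "everything that is currently queued".
        if size < 0:
            size = len(self._buffer)
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data


q = StreamQueue()
r = codecs.getreader("utf-8")(q)
encoded = "\u00e9\u20ac".encode("utf-8")
results = []
for i in range(len(encoded)):
    q.write(encoded[i:i + 1])   # feed one byte at a time
    results.append(r.read())    # returns only completely decoded characters
```

Note that the user still has to drive two objects (the queue and the
reader) in lockstep, which is the extra ceremony that option 1) avoids.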
I very much prefer option 1).
"If the implementation is hard to explain, it's a bad idea."
Bye,
Walter Dörwald