[Python-Dev] Decoding incomplete unicode

M.-A. Lemburg mal at egenix.com
Thu Aug 19 22:09:09 CEST 2004


Walter Dörwald wrote:
> M.-A. Lemburg wrote:
> 
>> Walter Dörwald wrote:
>>
>>> Let's compare example uses:
>>>
>>> 1) Having feed() as part of the StreamReader API:
>>> ---
>>> s = u"???".encode("utf-8")
>>> r = codecs.getreader("utf-8")()
>>> for c in s:
>>>    print r.feed(c)
>>> ---
>>
>>
>> I consider adding a .feed() method to the stream codec
>> bad design. .feed() is something you do on a stream, not
>> a codec.
> 
> 
> I don't care about the name, we can call it
> stateful_decode_byte_chunk() or whatever. (In fact I'd
> prefer to call it decode(), but that name is already
> taken by another method. Of course we could always
> rename decode() to _internal_decode() like Martin
> suggested.)

It's not the name that doesn't fit; it's the fact
that you are mixing a stream action into a codec, and
I'd rather see the two kept well separated.
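
A minimal sketch of that separation, written in Python 3 syntax rather
than the Python 2 used elsewhere in this thread. ByteQueue is a
hypothetical name standing in for the queue object discussed below; only
codecs.getreader() is an existing API. The queue does the stream work
(buffering bytes), and the codec's StreamReader sits on top of it and
only decodes:

```python
import codecs


class ByteQueue:
    """Hypothetical FIFO byte buffer that looks like a readable stream."""

    def __init__(self):
        self._buffer = b""

    def write(self, data):
        # Stream action: append raw bytes as they arrive.
        self._buffer += data

    def read(self, size=-1):
        # Hand back buffered bytes; an empty result means "nothing yet".
        if size < 0:
            size = len(self._buffer)
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data


queue = ByteQueue()
reader = codecs.getreader("utf-8")(queue)  # codec action: decoding only

decoded = []
for byte in "\u20ac".encode("utf-8"):  # three bytes: E2 82 AC
    queue.write(bytes([byte]))
    decoded.append(reader.read())
```

This relies on the reader keeping an incomplete byte sequence between
calls, so the first two reads return an empty string and the third
returns the complete character.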

>>> 2) Explicitly using a queue object:
>>> ---
>>> from whatever import StreamQueue
>>>
>>> s = u"???".encode("utf-8")
>>> q = StreamQueue()
>>> r = codecs.getreader("utf-8")(q)
>>> for c in s:
>>>    q.write(c)
>>>    print r.read()
>>> ---
>>
>>
>> This is probably how an advanced codec writer would use the APIs
>> to build new stream interfaces.
> 
>>> 3) Using a special wrapper that implicitly creates a queue:
>>> ----
>>> from whatever import StreamQueueWrapper
>>> s = u"???".encode("utf-8")
>>> r = StreamQueueWrapper(codecs.getreader("utf-8"))
>>> for c in s:
>>>    print r.feed(c)
>>> ----
>>
>>
>>
>> This could be turned into something more straightforward,
>> e.g.
>>
>> from codecs import EncodedStream
>>
>> # Load data
>> s = u"???".encode("utf-8")
>>
>> # Write to encoded stream (one byte at a time) and print
>> # the read output
>> q = EncodedStream(input_encoding="utf-8", output_encoding="unicode")
> 
> 
> This is confusing, because there is no encoding named "unicode".
> This should probably read:
> 
> q = EncodedQueue(encoding="utf-8", errors="strict")

Fine.

I was thinking of something similar to EncodedFile()
which also has two separate encodings, one for the file side
of things and one for the Python side.
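
For comparison, a short sketch of how the existing codecs.EncodedFile()
handles its two encodings, in Python 3 syntax. The in-memory buffers and
the sample text are assumptions made for illustration; EncodedStream and
EncodedQueue above remain proposals and are not used here:

```python
import codecs
import io

# File side: Latin-1 bytes in the backing store (an in-memory buffer here).
backing = io.BytesIO()
recoder = codecs.EncodedFile(backing, data_encoding="utf-8",
                             file_encoding="latin-1")

# Python side: the caller hands over UTF-8 encoded bytes.
recoder.write("caf\u00e9".encode("utf-8"))
stored = backing.getvalue()  # what actually landed on the file side

# Reading goes the other way: Latin-1 from the file, UTF-8 to the caller.
source = io.BytesIO(b"caf\xe9")
read_back = codecs.EncodedFile(source, "utf-8", "latin-1").read()
```

The wrapper recodes in both directions, so neither side needs to know
the other side's encoding.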

>> for c in s:
>>    q.write(c)
>>    print q.read()
>>
>> # Make sure we have processed all data:
>> if q.has_pending_data():
>>    raise ValueError, 'data truncated'
> 
> 
> This should be the job of the error callback, the last part should
> probably be:
> 
> for c in s:
>    q.write(c)
>    print q.read()
> print q.read(final=True)

Ok; both methods have their use cases. (You seem to be obsessed
with this final argument ;-)
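
To illustrate what a final flag buys, here is a sketch using the
incremental decoder API that later Python versions provide
(codecs.getincrementaldecoder). It is not the q.read(final=True) call
proposed above, only the same idea: with final=True, leftover bytes are
passed to the error handler instead of being kept for the next call.

```python
import codecs

data = "\u20ac".encode("utf-8")  # three bytes: E2 82 AC

# Complete input: feed one byte at a time, then flush with final=True.
decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
pieces = [decoder.decode(data[i:i + 1]) for i in range(len(data))]
pieces.append(decoder.decode(b"", final=True))  # nothing is pending

# Truncated input: the last byte never arrives.
truncated = codecs.getincrementaldecoder("utf-8")(errors="strict")
truncated.decode(data[:2])
try:
    truncated.decode(b"", final=True)
    error = None
except UnicodeDecodeError as exc:
    # With errors="strict" the pending bytes are reported as an error.
    error = exc
```

An explicit has_pending_data() style check and the final flag both
detect the truncation; the difference is whether the caller or the error
handler decides what happens next.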

>>> I very much prefer option 1).
>>
>>
>> I prefer the above example because it's easy to read and
>> makes things explicit.
>>
>>> "If the implementation is hard to explain, it's a bad idea."
>>
>>
>> The user usually doesn't care about the implementation, only its
>> interfaces.
> 
> 
> Bye,
>    Walter Dörwald
> 
> 

-- 
Marc-Andre Lemburg
eGenix.com


