[Python-Dev] Decoding incomplete unicode

Wed Jul 28 20:31:04 CEST 2004

Walter Dörwald wrote:
> M.-A. Lemburg wrote:
> 
>> Walter Dörwald wrote:
>>
>>> This is the correct thing to do for the stateless decoders:
>>> any incomplete byte sequence at the end of the input is an
>>> error. But then it doesn't make sense to return the number
>>> of bytes decoded for the stateless decoder, because this is
>>> always the size of the input. 
>>
>>
>> The reason why stateless encode and decode APIs return the
>> number of input items consumed is to accommodate error
>> handling situations like these where you want to stop
>> coding and leave the remaining work to another step.
> 
> Which in most cases is the read method.

The read method only happens to use the stateless
encode and decode methods. There's nothing in the design
spec that mandates this, though.
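
To illustrate the return value being discussed, here is a minimal
sketch. It assumes a Python 3 interpreter (which postdates this
thread) and uses UTF-8 as the example encoding; the variable names
are only for illustration.

```python
import codecs

# The stateless decoder returns a (text, consumed) tuple.
decode = codecs.getdecoder("utf-8")

# Complete input: "consumed" equals the length of the input.
text, consumed = decode(b"caf\xc3\xa9")
print(repr(text), consumed)

# Input that stops in the middle of a two-byte sequence: with the
# default "strict" error handling the stateless decoder raises.
error_offset = None
try:
    decode(b"caf\xc3")
except UnicodeDecodeError as exc:
    error_offset = exc.start
print("incomplete sequence starts at byte", error_offset)
```

For complete input the second tuple item carries no extra
information, which is the point raised in the quoted text above.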

>> The C implementation currently doesn't make use of this
>> feature.
>>
>>> For the stateful decoder this
>>> is just some sort of state common to all decoders: the decoder
>>> keeps the incomplete byte sequence to be used in the next call.
>>> But then this state should be internal to the decoder and not
>>> part of the public API. This would make the decode() method
>>> more usable: When you want to implement an XML parser that
>>> supports the xml.sax.xmlreader.IncrementalParser interface,
>>> you have an API mismatch. The parser has to use the stateful
>>> decoding API (i.e. read()), which means the input is in the
>>> form of a byte stream, but this interface expects its input
>>> as byte chunks passed to multiple calls to the feed() method.
>>> If StreamReader.decode() simply returned the decoded unicode
>>> object and kept the remaining undecoded bytes as an internal
>>> state then StreamReader.decode() would be directly usable.
>>
>>
>>
>> Please don't mix "StreamReader" with "decoder". The codecs
>> module returns 4 different objects if you ask it for
>> a codec set: two stateless APIs for encoding and decoding
>> and two factory functions for creating possibly stateful
>> objects which expose a stream interface.
>>
>> Your "stateful decoder" is actually part of a StreamReader
>> implementation and doesn't have anything to do with the
>> stateless decoder.
> 
> I know. I'd just like to have a stateful decoder that
> doesn't use a stream interface. The stream interface
> could be built on top of that without any knowledge
> of the encoding.
> 
> I wonder whether the decode method is part of the public
> API for StreamReader.

It is: StreamReader/Writer are "sub-classes" of the Codec
class.

However, there's nothing stating that .read() or .write()
*must* use these methods to do their work and that's
intentional.
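
A small check of that relationship, as a sketch under the assumption
of a Python 3 interpreter and the standard UTF-8 codec (the empty
io.BytesIO stream is only there because a StreamReader needs some
stream object):

```python
import codecs
import io

# Build a StreamReader for UTF-8 around an empty byte stream.
reader = codecs.getreader("utf-8")(io.BytesIO(b""))

# StreamReader derives from Codec, so decode() is available on it.
is_codec = isinstance(reader, codecs.Codec)

# decode() can be called directly, without going through read().
result = reader.decode(b"abc")
print(is_codec, result)
```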

>> I see two possibilities here:
>>
>> 1. you write a custom StreamReader/Writer implementation
>>    for each of the codecs which takes care of keeping
>>    state and encoding/decoding as much as possible
> 
> 
> But I'd like to reuse at least some of the functionality
> from PyUnicode_DecodeUTF8() etc.
> 
> Would a version of PyUnicode_DecodeUTF8() with an additional
> PyUTF_DecoderState * be OK?

Before you start putting more work into this, let's first
find a good workable approach.
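
To make the idea under discussion concrete, here is a minimal Python
sketch of a decoder that carries undecoded trailing bytes from one
call to the next. The class name FeedDecoder and its feed() method
are hypothetical, not an existing API, and the sketch assumes a
Python 3 interpreter in which codecs.utf_8_decode accepts a "final"
flag as its third argument.

```python
import codecs


class FeedDecoder:
    """Hypothetical stateful UTF-8 decoder without a stream interface."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self._pending = b""  # bytes of an incomplete trailing sequence

    def feed(self, data, final=False):
        # Prepend whatever could not be decoded last time.
        data = self._pending + data
        # With final=False an incomplete trailing sequence is not an
        # error; it is simply left out of the "consumed" count.
        text, consumed = codecs.utf_8_decode(data, self.errors, final)
        self._pending = data[consumed:]
        return text


decoder = FeedDecoder()
# The euro sign is three bytes in UTF-8; feed it one piece at a time.
parts = [decoder.feed(chunk) for chunk in (b"\xe2", b"\x82", b"\xac!")]
# Signal the end of input so leftover bytes would be reported.
parts.append(decoder.feed(b"", final=True))
print(parts)
```

The state here is only the pending byte string; a codec such as UTF-7
would need more than that, which is why the shape of the state object
is still an open question.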

>> 2. you extend the existing stateless codec implementations
>>    to allow communicating state on input and output; the
>>    stateless operation would then be a special case
>>
>>> But this isn't really a "StreamReader" any more, so the best
>>> solution would probably be to have a three level API:
>>> * A stateless decoding function (what codecs.getdecoder
>>>   returns now);
>>> * A stateful "feed reader", which keeps internal state
>>>   (including undecoded byte sequences) and gets passed byte
>>>   chunks (should this feed reader have an error attribute or
>>>   should this be an argument to the feed method?);
>>> * A stateful stream reader that reads its input from a
>>>   byte stream. The functionality for the stream reader could
>>>   be implemented once using the underlying functionality of
>>>   the feed reader (in fact we could implement something similar
>>>   to sio's stacking streams: the stream reader would use
>>>   the feed reader to wrap the byte input stream and add
>>>   only a read() method. The line reading methods (readline(),
>>>   readlines() and __iter__()) could be added by another stream
>>>   filter)
>>
>>
>> Why make things more complicated ?
>>
>> If you absolutely need a feed interface, you can feed
>> your data to a StringIO instance which is then read from
>> by StreamReader.
> 
> 
> This doesn't work, because a StringIO has only one file position:
>  >>> import cStringIO
>  >>> s = cStringIO.StringIO()
>  >>> s.write("x")
>  >>> s.read()
> ''

Ah, you wanted to do both feeding and reading at the same
time ?!

> But something like the Queue class from the tests in the patch
> might work.

Right... I don't think that we need a third approach to
codecs just to implement feed based parsers.
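
A rough sketch of that queue-based approach follows. ByteQueue is a
hypothetical stand-in written for this example, not the Queue class
from the patch, and the code assumes a Python 3 interpreter whose
UTF-8 StreamReader keeps incomplete byte sequences between read()
calls.

```python
import codecs


class ByteQueue:
    """Hypothetical FIFO byte buffer: write() appends, read() drains."""

    def __init__(self):
        self._buffer = b""

    def write(self, data):
        self._buffer += data

    def read(self, size=-1):
        if size < 0:
            # Hand out everything that is currently queued.
            data, self._buffer = self._buffer, b""
        else:
            data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data


queue = ByteQueue()
reader = codecs.getreader("utf-8")(queue)

# Feed a chunk that ends in the middle of a two-byte sequence.
queue.write(b"caf\xc3")
first = reader.read()

# Feed the missing byte; the reader completes the character.
queue.write(b"\xa9")
second = reader.read()
print(repr(first), repr(second))
```

Unlike a StringIO, the queue has separate write and read ends, so
feeding and reading can be interleaved.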

>>>> The error callbacks could, however, raise an exception which
>>>> includes all the needed information, including any state that
>>>> may be needed in order to continue with the coding operation.
>>>
>>>
>>> This makes error callbacks effectively unusable with stateful
>>> decoders.
>>
>>
>> Could you explain ?
> 
> 
> If you have to call the decode function with errors='break',
> you will only get the break error handling and nothing else.

Yes and ... ? What else do you want it to do ?

>>>> We may then need to allow additional keyword arguments on the
>>>> encode/decode functions in order to preset a start state.
>>>
>>>
>>> As those decoding functions are private to the decoder that's
>>> probably OK. I wouldn't want to see additional keyword arguments
>>> on str.decode (which uses the stateless API anyway). BTW, that's
>>> exactly what I did for codecs.utf_7_decode_stateful, but I'm not
>>> really comfortable with the internal state of the UTF-7 decoder
>>> being exposed on the Python level. It would be better to encapsulate
>>> the state in a feed reader implemented in C, so that the state is
>>> inaccessible from the Python level.
>>
>>
>> See above: possibility 1 would be the way to go then.
> 
> I might give this a try.

Again, please wait until we have found a good solution
to this.

-- 
Marc-Andre Lemburg
eGenix.com
