Re: [Python-Dev] Decoding incomplete unicode

July 28, 2004

      M.-A. Lemburg wrote:
...
Walter Dörwald wrote:
...
This is the correct thing to do for the stateless decoders:
any incomplete byte sequence at the end of the input is an
error. But then it doesn't make sense to return the number
of bytes decoded for the stateless decoder, because this is
always the size of the input.
The reason why stateless encode and decode APIs return the
number of input items consumed is to accommodate error
handling situations like these, where you want to stop
coding and leave the remaining work to another step,
which in most cases is the read method.
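
A small illustration of the point under discussion, written for a
current Python 3 rather than the 2004 code base (an assumption): the
function returned by codecs.getdecoder() reports the number of bytes
consumed, and for complete input that is simply the input length. An
incomplete sequence at the end raises instead.

```python
import codecs

# codecs.getdecoder() returns the stateless decoding function.
decode = codecs.getdecoder("utf-8")

data = "Dörwald".encode("utf-8")  # 8 bytes: the "ö" takes two

# Complete input: everything is consumed.
text, consumed = decode(data)

# Input cut in the middle of "ö": the stateless decoder raises, and the
# exception records where the undecodable bytes start.
try:
    decode(data[:2])
    error_start = None
except UnicodeDecodeError as exc:
    error_start = exc.start
```

The exception attributes (object, start, end) are what an outer layer
would need in order to pick up the remaining work.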
...
The C implementation currently doesn't make use of this
feature.
...
For the stateful decoder this
is just some sort of state common to all decoders: the decoder
keeps the incomplete byte sequence to be used in the next call.
But then this state should be internal to the decoder and not
part of the public API. This would make the decode() method
more usable: When you want to implement an XML parser that
supports the xml.sax.xmlreader.IncrementalParser interface,
you have an API mismatch. The parser has to use the stateful
decoding API (i.e. read()), which means the input is in the
form of a byte stream, but this interface expects its input
as byte chunks passed to multiple calls to the feed() method.
If StreamReader.decode() simply returned the decoded unicode
object and kept the remaining undecoded bytes as internal
state, then StreamReader.decode() would be directly usable.
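
For readers on a current Python: the behaviour asked for here (return
only the decoded text, keep the undecoded tail privately) is roughly
what codecs.getincrementaldecoder() offers in later releases. The
sketch below uses that later API, not anything that existed when this
message was written, and the chunk boundaries are made up for
illustration.

```python
import codecs

# Later stdlib API: a decoder object that is fed byte chunks and keeps
# incomplete trailing bytes as internal state.
decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

data = "Dörwald".encode("utf-8")

# Split in the middle of the two-byte "ö", as a feed()-style parser
# interface might deliver it.
first = decoder.decode(data[:2])    # only "D" is complete so far
second = decoder.decode(data[2:])   # the pending byte is completed here
tail = decoder.decode(b"", final=True)  # flush: nothing may be left over
```

No byte count is returned to the caller; the leftover bytes never leave
the decoder object.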
Please don't mix "StreamReader" with "decoder". The codecs
module returns 4 different objects if you ask it for
a codec set: two stateless APIs for encoding and decoding
and two factory functions for creating possibly stateful
objects which expose a stream interface.
Your "stateful decoder" is actually part of a StreamReader
implementation and doesn't have anything to do with the
stateless decoder.
I know. I'd just like to have a stateful decoder that
doesn't use a stream interface. The stream interface
could be built on top of that without any knowledge
of the encoding.

I wonder whether the decode method is part of the public
API for StreamReader.
...
I see two possibilities here:
1. you write a custom StreamReader/Writer implementation
   for each of the codecs which takes care of keeping
   state and encoding/decoding as much as possible
But I'd like to reuse at least some of the functionality
from PyUnicode_DecodeUTF8() etc.

Would a version of PyUnicode_DecodeUTF8() with an additional
PyUTF_DecoderState * be OK?
...
2. you extend the existing stateless codec implementations
   to allow communicating state on input and output; the
   stateless operation would then be a special case
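
Possibility 2 could look roughly like the following in Python terms.
The function name, the state argument and the return shape are all
hypothetical; codecs.utf_8_decode with its final flag is used only as a
convenient stand-in for the C decoder, as found in a current Python 3.

```python
import codecs


def utf8_decode_with_state(data, errors="strict", state=b"", final=True):
    """Hypothetical decoder that takes and returns its state explicitly.

    `state` holds bytes left undecoded by a previous call. The return
    value is (text, new_state).
    """
    buffered = state + data
    # With final=False an incomplete trailing sequence is left alone and
    # reported through the consumed count instead of raising.
    text, used = codecs.utf_8_decode(buffered, errors, final)
    return text, buffered[used:]


# Stateful use: the caller threads the state through the calls.
text1, state = utf8_decode_with_state(b"D\xc3", final=False)
text2, state = utf8_decode_with_state(b"\xb6r", state=state, final=False)

# Stateless use is the special case: empty start state, final input.
whole, leftover = utf8_decode_with_state("Dörwald".encode("utf-8"))
```

Note that the state is plainly visible to the caller here, which is the
exposure objected to further down for the UTF-7 case.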
...
But this isn't really a "StreamReader" any more, so the best
solution would probably be to have a three level API:
* A stateless decoding function (what codecs.getdecoder
  returns now);
* A stateful "feed reader", which keeps internal state
  (including undecoded byte sequences) and gets passed byte
  chunks (should this feed reader have an error attribute or
  should this be an argument to the feed method?);
* A stateful stream reader that reads its input from a
  byte stream. The functionality for the stream reader could
  be implemented once using the underlying functionality of
  the feed reader (in fact we could implement something similar
  to sio's stacking streams: the stream reader would use
  the feed reader to wrap the byte input stream and add
  only a read() method. The line reading methods (readline(),
  readlines() and __iter__()) could be added by another stream
  filter).
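
The layering in the list above can be sketched as follows. The class
names FeedReader and FeedStreamReader are hypothetical, the sketch
assumes a current Python 3, and codecs.utf_8_decode (with its final
flag) stands in for the lowest-level decoding function. The stream
layer contains no encoding-specific code.

```python
import codecs
import io


class FeedReader:
    """Hypothetical level 2: gets byte chunks, keeps undecoded bytes."""

    def __init__(self, decode, errors="strict"):
        # `decode` is a level 1 function returning (text, bytes_consumed).
        self._decode = decode
        self.errors = errors
        self._pending = b""

    def feed(self, chunk, final=False):
        data = self._pending + chunk
        text, consumed = self._decode(data, self.errors, final)
        self._pending = data[consumed:]
        return text


class FeedStreamReader:
    """Hypothetical level 3: wraps a byte stream, adds only read()."""

    def __init__(self, stream, feed_reader):
        self._stream = stream
        self._feed_reader = feed_reader

    def read(self, size=-1):
        chunk = self._stream.read(size)
        # An empty chunk means end of stream: leftovers are now an error.
        return self._feed_reader.feed(chunk, final=not chunk)


stream = io.BytesIO("Dörwald".encode("utf-8"))
reader = FeedStreamReader(stream, FeedReader(codecs.utf_8_decode))

# Two-byte reads split the "ö" across the first and second call.
pieces = [reader.read(2) for _ in range(5)]
```

Here errors is an attribute of the feed reader; passing it to feed()
instead would be the other option raised in the list.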
Why make things more complicated ?
If you absolutely need a feed interface, you can feed
your data to a StringIO instance which is then read from
by StreamReader.
This doesn't work, because a StringIO has only one file position:
...
...
...
>>> import cStringIO
>>> s = cStringIO.StringIO()
>>> s.write("x")
>>> s.read()
''
But something like the Queue class from the tests in the patch
might work.
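
A sketch of that idea for a current Python 3: ByteQueue below is a
hypothetical stand-in written for this illustration, not the Queue
class from the patch. Writes append at one end and reads consume from
the other, so a StreamReader sitting on top sees each chunk as it is
written.

```python
import codecs
import io

# The single file position again, with io.BytesIO: after the write the
# position is at the end, so there is nothing to read.
s = io.BytesIO()
s.write(b"x")
after_write = s.read()


class ByteQueue:
    """Hypothetical FIFO of bytes with independent write and read ends."""

    def __init__(self):
        self._buffer = b""

    def write(self, data):
        self._buffer += data

    def read(self, size=-1):
        if size < 0:
            data, self._buffer = self._buffer, b""
        else:
            data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data


queue = ByteQueue()
reader = codecs.getreader("utf-8")(queue)

results = []
for chunk in (b"D\xc3", b"\xb6r"):  # "ö" split across two chunks
    queue.write(chunk)
    results.append(reader.read())
```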
...
...
...
The error callbacks could, however, raise an exception which
includes all the needed information, including any state that
may be needed in order to continue with the coding operation.
This makes error callbacks effectively unusable with stateful
decoders.
Could you explain ?
If you have to call the decode function with errors='break',
you will only get the break error handling and nothing else.
...
...
...
We may then need to allow additional keyword arguments on the
encode/decode functions in order to preset a start state.
As those decoding functions are private to the decoder that's
probably OK. I wouldn't want to see additional keyword arguments
on str.decode (which uses the stateless API anyway). BTW, that's
exactly what I did for codecs.utf_7_decode_stateful, but I'm not
really comfortable with the internal state of the UTF-7 decoder
being exposed on the Python level. It would be better to encapsulate
the state in a feed reader implemented in C, so that the state is
inaccessible from the Python level.
See above: possibility 1 would be the way to go then.
I might give this a try.

Bye,
    Walter Dörwald