[Python-Dev] Decoding incomplete unicode

Thu Aug 12 15:15:37 CEST 2004

Hi Walter,

I don't have time to comment on this in detail this week; I'll
respond next week.

Overall, I don't like the idea of adding extra APIs that break
the existing codec API. I believe that we can extend stream
codecs to also work in a feed mode without breaking the API.

Walter Dörwald wrote:
> OK, here are my current thoughts on the codec problem:
> 
> The optimal solution (ignoring backwards compatibility)
> would look like this: codecs.lookup() would return the
> following stuff (this could be done by replacing the
> 4 entry tuple with a real object):
> 
> * decode: The stateless decoding function
> * encode: The stateless encoding function
> * chunkdecoder: The stateful chunk decoder
> * chunkencoder: The stateful chunk encoder
> * streamreader: The stateful stream decoder
> * streamwriter: The stateful stream encoder
> 
> The functions and classes look like this:
> 
> 
> Stateless decoder:
> decode(input, errors='strict'):
>     Function that decodes the (str) input object and returns
>     a (unicode) output object. The decoder must decode the
>     complete input without any remaining undecoded bytes.
> 
> Stateless encoder:
> encode(input, errors='strict'):
>     Function that encodes the complete (unicode) input object and
>     returns a (str) output object.
> 
> Stateful chunk decoder:
> chunkdecoder(errors='strict'):
>     A factory function that returns a stateful decoder with the
>     following method:
> 
>     decode(input, final=False):
>         Decodes a chunk of input and returns the decoded unicode
>         object. This method can be called multiple times and
>         the state of the decoder will be kept between calls.
>         This includes trailing incomplete byte sequences
>         that will be retained until the next call to decode().
>         When the argument final is true, this is the last call
>         to decode() and trailing incomplete byte sequences will
>         not be retained, but a UnicodeError will be raised.
> 
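
To make the chunk decoder contract above concrete, here is a minimal sketch, assuming a present-day Python in which codecs.utf_8_decode() accepts a final flag and returns a (text, bytes consumed) pair. The class name ChunkDecoder and its pending attribute are hypothetical and not part of any proposed or existing API.

```python
import codecs


class ChunkDecoder:
    """Hypothetical stateful chunk decoder for UTF-8 (illustration only)."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self.pending = b""  # trailing incomplete bytes kept between calls

    def decode(self, input, final=False):
        data = self.pending + input
        # With final=False the function stops in front of a trailing
        # incomplete sequence; with final=True it raises a UnicodeError.
        text, consumed = codecs.utf_8_decode(data, self.errors, final)
        self.pending = data[consumed:]
        return text


decoder = ChunkDecoder()
# The two-byte sequence for U+00F6 is split across the two chunks.
parts = [decoder.decode(b"D\xc3"), decoder.decode(b"\xb6rwald", final=True)]
```

The first call returns only "D" and holds back the lone lead byte; the second call completes the character. Passing final=True with an incomplete tail raises instead of retaining the bytes.
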
> Stateful chunk encoder:
> chunkencoder(errors='strict'):
>     A factory function that returns a stateful encoder
>     with the following method:
>     encode(input, final=False):
>         Encodes a chunk of input and returns the encoded
>         str object. When final is true this is the last
>         call to encode().
> 
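
One case where an encoder needs state between chunks is a byte order mark that must be written only once. The following is a rough sketch under that assumption, using codecs.utf_16_le_encode() and codecs.BOM_UTF16_LE from present-day Python. ChunkEncoder and its started flag are hypothetical names chosen for illustration.

```python
import codecs


class ChunkEncoder:
    """Hypothetical stateful chunk encoder for UTF-16 (illustration only)."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self.started = False  # becomes True once the BOM has been emitted

    def encode(self, input, final=False):
        # final is accepted to match the proposed signature; this simple
        # encoder has nothing left to flush on the last call.
        data, _ = codecs.utf_16_le_encode(input, self.errors)
        if not self.started:
            self.started = True
            data = codecs.BOM_UTF16_LE + data
        return data


encoder = ChunkEncoder()
output = encoder.encode("a") + encoder.encode("b", final=True)
```

Calling the stateless "utf-16" encoder on each chunk separately would instead prefix every chunk with its own byte order mark.
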
> Stateful stream decoder:
> streamreader(stream, errors='strict'):
>     A factory function that returns a stateful decoder
>     for reading from the byte stream stream, with the
>     following methods:
> 
>     read(size=-1, chars=-1, final=False):
>         Read unicode characters from the stream. When data
>         is read from the stream it should be done in chunks of
>         size bytes. If size == -1 all the remaining data
>         from the stream is read. chars specifies the number
>         of characters to read from the stream. read() may return
>         fewer than chars characters if there's not enough data
>         available in the byte stream. If chars == -1, as many
>         characters are read as are available in the stream.
>         Transient errors are ignored and trailing incomplete
>         byte sequences are retained when final is false. Otherwise
>         a UnicodeError is raised in the case of incomplete byte
>         sequences.
>     readline(size=-1):
>             ...
>     next():
>             ...
>     __iter__():
>             ...
> 
> Stateful stream encoder:
> streamwriter(stream, errors='strict'):
>     A factory function that returns a stateful encoder
>     for writing unicode data to the byte stream stream,
>     with the following methods:
> 
>     write(data, final=False):
>         Encodes the unicode object data and writes it
>         to the stream. If final is true this is the last
>         call to write().
>     writelines(data):
>         ...
> 
> 
> I know that this is quite a departure from the current API, and
> I'm not sure if we can get all of the functionality without
> sacrificing backwards compatibility.
> 
> I don't particularly care about the "bytes consumed" return value
> from the stateless codec. The codec should always have returned only
> the encoded/decoded object, but I guess fixing this would break too
> much code. And users who are only interested in the stateless
> functionality will probably use unicode.encode/str.decode anyway.
> 
> For the stateful API it would be possible to combine the chunk and
> stream decoder/encoder into one class with the following methods
> (for the decoder):
> 
>     __init__(stream, errors='strict'):
>         Like the current StreamReader constructor, but stream may be
>         None, if only the chunk API is used.
>     decode(input, final=False):
>         Like the current StreamReader (i.e. it returns a (unicode, int)
>         tuple). This does not keep the remaining bytes in a buffer;
>         that is the job of the caller.
>     feed(input, final=False):
>         Decodes input and returns a decoded unicode object. This method
>         calls decode() internally and manages the byte buffer.
>     read(size=-1, chars=-1, final=False):
>     readline(size=-1):
>     next():
>     __iter__():
>         See above.
> 
> As before, implementers of decoders only need to implement decode().
> 
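
The split between decode() and feed() in the combined class could look roughly like the sketch below. It assumes a present-day Python where codecs.utf_16_le_decode() takes a final flag; the class name FeedDecoder and the bytebuffer attribute are hypothetical, and the read()/readline() methods are left out.

```python
import codecs


class FeedDecoder:
    """Hypothetical combined decoder; only the feed part is sketched."""

    def __init__(self, stream=None, errors="strict"):
        self.stream = stream    # may be None if only feed() is used
        self.errors = errors
        self.bytebuffer = b""   # undecoded bytes, managed by feed() only

    def decode(self, input, final=False):
        # The only method a codec implementer would override. It keeps no
        # buffer and returns a (text, bytes_consumed) tuple.
        return codecs.utf_16_le_decode(input, self.errors, final)

    def feed(self, input, final=False):
        # Prepend leftovers, decode, and remember what was not consumed.
        data = self.bytebuffer + input
        text, consumed = self.decode(data, final)
        self.bytebuffer = data[consumed:]
        return text


reader = FeedDecoder()
first = reader.feed(b"a")                       # half of a code unit
second = reader.feed(b"\x00b\x00", final=True)  # the rest of the input
```

Here the first call returns an empty string because only one byte of a two-byte code unit has arrived; the buffering lives entirely in feed(), so decode() itself stays stateless.
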
> To be able to support the final argument the decoding functions
> in _codecsmodule.c could get an additional argument. With this
> they could be used for the stateless codecs too and we can reduce
> the number of functions again.
> 
> Unfortunately adding the final argument breaks all of the current
> codecs, but dropping the final argument requires one of two
> changes:
> 1) When the input stream is exhausted, the bytes read are parsed
>    as if final=True. That's the way the CJK codecs currently
>    handle it, but unfortunately this doesn't work with the feed
>    decoder.
> 2) Simply ignore any remaining undecoded bytes at the end of the
>    stream.
> 
> If we really have to drop the final argument, I'd prefer 2).
> 
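
The difference between the two fallbacks can be seen on a truncated input. This sketch emulates both using the final flag of codecs.utf_8_decode() in present-day Python, which is an assumption made only for the demonstration; the variable names are illustrative.

```python
import codecs

# "abc" followed by the first two bytes of a three-byte UTF-8 sequence.
truncated = b"abc\xe2\x82"

# Option 1: treat whatever has been read as the complete input.
try:
    codecs.utf_8_decode(truncated, "strict", True)
    option1 = "ok"
except UnicodeDecodeError:
    option1 = "error"

# Option 2: decode the complete part and ignore the undecoded tail.
text, consumed = codecs.utf_8_decode(truncated, "strict", False)
ignored = truncated[consumed:]
```

Option 1 fails on a chunk that merely ends mid-character, which is why it does not suit a feed decoder, while option 2 quietly loses the two trailing bytes.
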
> I've uploaded a second version of the patch. It implements
> the final argument, adds the feed() method to StreamReader and
> again merges the duplicate decoding functions in the codecs
> module. Note that the patch isn't really finished (the final
> argument isn't completely supported in the encoders and the
> CJK and escape codecs are unchanged), but it should be sufficient
> as a base for discussion.
> 
> Bye,
>    Walter Dörwald

-- 
Marc-Andre Lemburg
eGenix.com