[Python-Dev] Decoding incomplete unicode
M.-A. Lemburg
mal at egenix.com
Thu Aug 12 15:15:37 CEST 2004
Hi Walter,
I don't have time to comment on this in detail this week; I'll
respond next week.
Overall, I don't like the idea of adding extra APIs that
break the existing codec API. I believe that we can
extend stream codecs to also work in a feed mode without
breaking the API.
Walter Dörwald wrote:
> OK, here are my current thoughts on the codec problem:
>
> The optimal solution (ignoring backwards compatibility)
> would look like this: codecs.lookup() would return the
> following stuff (this could be done by replacing the
> 4 entry tuple with a real object):
>
> * decode: The stateless decoding function
> * encode: The stateless encoding function
> * chunkdecoder: The stateful chunk decoder
> * chunkencoder: The stateful chunk encoder
> * streamreader: The stateful stream decoder
> * streamwriter: The stateful stream encoder
>
> The functions and classes look like this:
>
>
> Stateless decoder:
> decode(input, errors='strict'):
> Function that decodes the (str) input object and returns
> a (unicode) output object. The decoder must decode the
> complete input without any remaining undecoded bytes.
>
> Stateless encoder:
> encode(input, errors='strict'):
> Function that encodes the complete (unicode) input object and
> returns a (str) output object.
>
> Stateful chunk decoder:
> chunkdecoder(errors='strict'):
> A factory function that returns a stateful decoder with the
> following method:
>
> decode(input, final=False):
> Decodes a chunk of input and returns the decoded unicode
> object. This method can be called multiple times and
> the state of the decoder will be kept between calls.
> This includes trailing incomplete byte sequences
> that will be retained until the next call to decode().
> When the argument final is true, this is the last call
> to decode() and trailing incomplete byte sequences will
> not be retained, but a UnicodeError will be raised.
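
For illustration only: the incremental decoders in the codecs module of later Python releases follow the same calling convention described above (a factory taking errors, and a decode(input, final=False) method). The sketch below uses that API under Python 3, where the input is bytes and the result is str. It is not the patch discussed in this thread, and UTF-8 is just an assumed example encoding.

```python
import codecs

# Factory lookup; the returned class is instantiated with the error mode.
decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

euro = "\u20ac".encode("utf-8")           # three bytes: e2 82 ac
first = decoder.decode(euro[:2])          # incomplete sequence is retained
second = decoder.decode(euro[2:], final=True)  # completed by the last byte

# A fresh decoder told that the truncated input is final must raise.
try:
    codecs.getincrementaldecoder("utf-8")("strict").decode(euro[:2], final=True)
    raised = False
except UnicodeError:
    raised = True
```

The first call returns an empty string because the two bytes are held back until the sequence is complete.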
>
> Stateful chunk encoder:
> chunkencoder(errors='strict'):
> A factory function that returns a stateful encoder
> with the following method:
> encode(input, final=False):
> Encodes a chunk of input and returns the encoded
> str object. When final is true this is the last
> call to encode().
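
An encoder needs state whenever its output depends on what was already written. One assumed example is UTF-16, where the byte order mark should appear only once. The sketch below again borrows the incremental encoder API of later Python releases (Python 3 here) purely to illustrate the described calling convention.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")(errors="strict")

first = encoder.encode("a")               # BOM plus two bytes for "a"
second = encoder.encode("b", final=True)  # no second BOM on the last chunk

has_bom_first = first.startswith(codecs.BOM_UTF16)
has_bom_second = second.startswith(codecs.BOM_UTF16)
```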
>
> Stateful stream decoder:
> streamreader(stream, errors='strict'):
> A factory function that returns a stateful decoder
> for reading from the byte stream stream, with the
> following methods:
>
> read(size=-1, chars=-1, final=False):
> Read unicode characters from the stream. When data
> is read from the stream it should be done in chunks of
> size bytes. If size == -1 all the remaining data
> from the stream is read. chars specifies the number
> of characters to read from the stream. read() may return
> fewer than chars characters if there's not enough data
> available in the byte stream. If chars == -1 as many
> characters are read as are available in the stream.
> Transient errors are ignored and trailing incomplete
> byte sequences are retained when final is false. Otherwise
> a UnicodeError is raised in the case of incomplete byte
> sequences.
> readline(size=-1):
> ...
> next():
> ...
> __iter__():
> ...
>
> Stateful stream encoder:
> streamwriter(stream, errors='strict'):
> A factory function that returns a stateful encoder
> for writing unicode data to the byte stream stream,
> with the following methods:
>
> write(data, final=False):
> Encodes the unicode object data and writes it
> to the stream. If final is true this is the last
> call to write().
> writelines(data):
> ...
>
>
> I know that this is quite a departure from the current API, and
> I'm not sure if we can get all of the functionality without
> sacrificing backwards compatibility.
>
> I don't particularly care about the "bytes consumed" return value
> from the stateless codec. The codec should always have returned only
> the encoded/decoded object, but I guess fixing this would break too
> much code. And users who are only interested in the stateless
> functionality will probably use unicode.encode/str.decode anyway.
>
> For the stateful API it would be possible to combine the chunk and
> stream decoder/encoder into one class with the following methods
> (for the decoder):
>
> __init__(stream, errors='strict'):
> Like the current StreamReader constructor, but stream may be
> None if only the chunk API is used.
> decode(input, final=False):
> Like the current StreamReader (i.e. it returns a (unicode, int)
> tuple.) This does not keep the remaining bytes in a buffer.
> This is the job of the caller.
> feed(input, final=False):
> Decodes input and returns a decoded unicode object. This method
> calls decode() internally and manages the byte buffer.
> read(size=-1, chars=-1, final=False):
> readline(size=-1):
> next():
> __iter__():
> See above.
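
A minimal sketch of how such a combined class might fit together, assuming UTF-8 and Python 3. The class name FeedReader and the attribute bytebuffer are hypothetical, and the chars argument, readline(), next() and __iter__() are left out for brevity. The stateless step uses codecs.utf_8_decode, which accepts a final flag and returns a (text, bytes consumed) pair.

```python
import codecs
import io

class FeedReader:
    """Hypothetical combined chunk/stream decoder for UTF-8 (illustration only)."""

    def __init__(self, stream=None, errors="strict"):
        self.stream = stream      # may be None if only feed() is used
        self.errors = errors
        self.bytebuffer = b""     # undecoded bytes kept between feed() calls

    def decode(self, input, final=False):
        # Stateless step: returns (text, bytes_consumed) and keeps no buffer.
        return codecs.utf_8_decode(input, self.errors, final)

    def feed(self, input, final=False):
        # Stateful step: prepend leftovers, decode, keep the new leftovers.
        data = self.bytebuffer + input
        text, consumed = self.decode(data, final)
        self.bytebuffer = data[consumed:]
        return text

    def read(self, size=-1, final=False):
        # Pull bytes from the stream and push them through feed().
        return self.feed(self.stream.read(size), final)

payload = "h\u00e9llo".encode("utf-8")    # b'h\xc3\xa9llo'

# Stream use: the first read splits the two-byte sequence for U+00E9.
reader = FeedReader(io.BytesIO(payload))
part1 = reader.read(2)
part2 = reader.read(final=True)

# Chunk use without a stream: feed one byte at a time.
feeder = FeedReader()
pieces = [feeder.feed(payload[i:i + 1]) for i in range(len(payload))]
```

Only decode() is codec-specific here; feed() and read() are generic buffer management built on top of it.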
>
> As before implementers of decoders only need to implement decode().
>
> To be able to support the final argument the decoding functions
> in _codecsmodule.c could get an additional argument. With this
> they could be used for the stateless codecs too and we can reduce
> the number of functions again.
>
> Unfortunately adding the final argument breaks all of the current
> codecs, but dropping the final argument requires one of two
> changes:
> 1) When the input stream is exhausted, the bytes read are parsed
> as if final=True. That's the way the CJK codecs currently
> handle it, but unfortunately this doesn't work with the feed
> decoder.
> 2) Simply ignore any remaining undecoded bytes at the end of the
> stream.
>
> If we really have to drop the final argument, I'd prefer 2).
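
The difference between the two options can be sketched on a truncated UTF-8 sequence. This assumes Python 3 and codecs.utf_8_decode, whose third argument plays the role of final; the variable names are made up for the example.

```python
import codecs

truncated = "\u20ac".encode("utf-8")[:2]  # e2 82, last byte missing

# Option 1: at end of input, parse the remaining bytes as if final=True.
try:
    codecs.utf_8_decode(truncated, "strict", True)
    option1 = "decoded"
except UnicodeDecodeError:
    option1 = "error"

# Option 2: decode non-finally and silently drop whatever was not consumed.
text, consumed = codecs.utf_8_decode(truncated, "strict", False)
dropped = truncated[consumed:]
```

Option 1 relies on knowing when the input is exhausted, which a stream reader can detect but a caller-driven feed decoder cannot, short of being told explicitly.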
>
> I've uploaded a second version of the patch. It implements
> the final argument, adds the feed() method to StreamReader and
> again merges the duplicate decoding functions in the codecs
> module. Note that the patch isn't really finished (the final
> argument isn't completely supported in the encoders and the
> CJK and escape codecs are unchanged), but it should be sufficient
> as a base for discussion.
>
> Bye,
> Walter Dörwald
--
Marc-Andre Lemburg
eGenix.com