[Python-Dev] Decoding incomplete unicode
Walter Dörwald
walter at livinglogic.de
Tue Aug 10 21:24:20 CEST 2004
OK, here are my current thoughts on the codec problem:
The optimal solution (ignoring backwards compatibility)
would look like this: codecs.lookup() would return the
following stuff (this could be done by replacing the
4 entry tuple with a real object):
* decode: The stateless decoding function
* encode: The stateless encoding function
* chunkdecoder: The stateful chunk decoder
* chunkencoder: The stateful chunk encoder
* streamreader: The stateful stream decoder
* streamwriter: The stateful stream encoder
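A minimal sketch of what such a "real object" could look like, written for current Python 3 (so bytes/str instead of str/unicode). CodecEntry and lookup_sketch are hypothetical names, not part of the proposal; the chunk coder slots are filled from the incremental coder factories that codecs.lookup() exposes in current Python, used here only as stand-ins.

```python
import codecs
from typing import Callable, NamedTuple


class CodecEntry(NamedTuple):
    """Hypothetical stand-in for the object lookup() might return."""
    decode: Callable        # stateless decoding function
    encode: Callable        # stateless encoding function
    chunkdecoder: Callable  # factory for stateful chunk decoders
    chunkencoder: Callable  # factory for stateful chunk encoders
    streamreader: Callable  # factory for stateful stream decoders
    streamwriter: Callable  # factory for stateful stream encoders


def lookup_sketch(name):
    """Build a CodecEntry from what codecs.lookup() provides today."""
    info = codecs.lookup(name)
    return CodecEntry(
        decode=info.decode,
        encode=info.encode,
        chunkdecoder=info.incrementaldecoder,  # assumption: closest analogue
        chunkencoder=info.incrementalencoder,  # assumption: closest analogue
        streamreader=info.streamreader,
        streamwriter=info.streamwriter,
    )


entry = lookup_sketch("utf-8")
text, consumed = entry.decode(b"abc")
```

Attribute access (entry.chunkdecoder) replaces positional unpacking of the 4-tuple, which is what makes it possible to add entries later.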
The functions and classes look like this:
Stateless decoder:
    decode(input, errors='strict'):
        Function that decodes the (str) input object and returns
        a (unicode) output object. The decoder must decode the
        complete input without any remaining undecoded bytes.
Stateless encoder:
    encode(input, errors='strict'):
        Function that encodes the complete (unicode) input object and
        returns a (str) output object.
Stateful chunk decoder:
    chunkdecoder(errors='strict'):
        A factory function that returns a stateful decoder with the
        following method:
        decode(input, final=False):
            Decodes a chunk of input and returns the decoded unicode
            object. This method can be called multiple times and
            the state of the decoder will be kept between calls.
            This includes trailing incomplete byte sequences
            that will be retained until the next call to decode().
            When the argument final is true, this is the last call
            to decode() and trailing incomplete byte sequences will
            not be retained; a UnicodeError is raised instead.
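The chunk decoder described above could be sketched roughly as follows for UTF-8 only. This is Python 3 code (bytes in, str out), and the class name ChunkDecoder and the attribute pending are made up for illustration. It relies on codecs.utf_8_decode() accepting a final flag, as it does in current Python.

```python
import codecs


class ChunkDecoder:
    """Hypothetical UTF-8 chunk decoder that keeps incomplete input."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self.pending = b""  # trailing bytes not yet decodable

    def decode(self, input, final=False):
        data = self.pending + input
        # With final=False an incomplete trailing sequence is simply
        # not consumed; with final=True it raises UnicodeDecodeError.
        text, consumed = codecs.utf_8_decode(data, self.errors, final)
        self.pending = data[consumed:]
        return text


def chunkdecoder(errors="strict"):
    """Factory function, as in the outline above."""
    return ChunkDecoder(errors)


decoder = chunkdecoder()
first = decoder.decode(b"\xc3")    # first half of a two-byte sequence
second = decoder.decode(b"\xa4")   # second half completes the character
```

Calling decoder.decode(b"\xe2\x82", final=True) afterwards raises a UnicodeDecodeError (a subclass of UnicodeError), because the three-byte sequence is cut short.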
Stateful chunk encoder:
    chunkencoder(errors='strict'):
        A factory function that returns a stateful encoder
        with the following method:
        encode(input, final=False):
            Encodes a chunk of input and returns the encoded
            str object. When final is true this is the last
            call to encode().
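One case where an encoder needs state is a byte order mark that must appear only once. The following is a sketch under that assumption, for little-endian UTF-16 in Python 3; ChunkEncoder and bom_written are hypothetical names, and final is accepted but unused because this particular encoding has nothing to flush.

```python
import codecs


class ChunkEncoder:
    """Hypothetical UTF-16-LE chunk encoder that emits one BOM."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self.bom_written = False  # state kept between encode() calls

    def encode(self, input, final=False):
        data, _ = codecs.utf_16_le_encode(input, self.errors)
        if not self.bom_written:
            # Only the very first chunk is prefixed with the BOM.
            data = codecs.BOM_UTF16_LE + data
            self.bom_written = True
        return data


def chunkencoder(errors="strict"):
    """Factory function, as in the outline above."""
    return ChunkEncoder(errors)


encoder = chunkencoder()
chunk1 = encoder.encode("a")
chunk2 = encoder.encode("b", final=True)
```

A stateless encode() called twice would write the BOM twice, which is the kind of problem the stateful variant avoids.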
Stateful stream decoder:
    streamreader(stream, errors='strict'):
        A factory function that returns a stateful decoder
        for reading from the byte stream stream, with the
        following methods:
        read(size=-1, chars=-1, final=False):
            Read unicode characters from the stream. When data
            is read from the stream it should be done in chunks of
            size bytes. If size == -1 all the remaining data
            from the stream is read. chars specifies the number
            of characters to read from the stream. read() may return
            fewer than chars characters if there's not enough data
            available in the byte stream. If chars == -1 as many
            characters are read as are available in the stream.
            Transient errors are ignored and trailing incomplete
            byte sequences are retained when final is false. Otherwise
            a UnicodeError is raised in the case of incomplete byte
            sequences.
        readline(size=-1):
            ...
        next():
            ...
        __iter__():
            ...
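A rough sketch of how read() with size, chars and final might behave, for UTF-8 in Python 3. StreamReaderSketch, bytebuffer and charbuffer are hypothetical names. One assumption goes beyond the description above: final is only passed on to the decoder once the stream reports no more data, so that a character split across two size-byte chunks is not mistaken for a truncated one.

```python
import codecs
import io


class StreamReaderSketch:
    """Hypothetical UTF-8 stream decoder with read(size, chars, final)."""

    def __init__(self, stream, errors="strict"):
        self.stream = stream
        self.errors = errors
        self.bytebuffer = b""  # undecoded trailing bytes
        self.charbuffer = ""   # decoded but not yet returned characters

    def read(self, size=-1, chars=-1, final=False):
        while chars < 0 or len(self.charbuffer) < chars:
            newdata = self.stream.read() if size < 0 else self.stream.read(size)
            exhausted = size < 0 or not newdata
            data = self.bytebuffer + newdata
            # Assumption: apply final only when no more bytes can arrive.
            text, consumed = codecs.utf_8_decode(
                data, self.errors, final and exhausted)
            self.bytebuffer = data[consumed:]
            self.charbuffer += text
            if exhausted:
                break
        if chars < 0:
            result, self.charbuffer = self.charbuffer, ""
        else:
            result = self.charbuffer[:chars]
            self.charbuffer = self.charbuffer[chars:]
        return result


payload = "a\u00e4\u20ac".encode("utf-8")  # 1 + 2 + 3 bytes
reader = StreamReaderSketch(io.BytesIO(payload))
head = reader.read(size=2, chars=2)  # chunks of 2 bytes split characters
tail = reader.read(size=2)
```

With a truncated stream such as io.BytesIO(b"a\xc3"), read() returns only the complete character, while read(final=True) raises a UnicodeDecodeError.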
Stateful stream encoder:
    streamwriter(stream, errors='strict'):
        A factory function that returns a stateful encoder
        for writing unicode data to the byte stream stream,
        with the following methods:
        write(data, final=False):
            Encodes the unicode object data and writes it
            to the stream. If final is true this is the last
            call to write().
        writelines(data):
            ...
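A stream encoder can be layered on top of a chunk encoder. The sketch below does that in Python 3, with StreamWriterSketch as a hypothetical name and the incremental encoder for "utf-8-sig" from current Python standing in for the proposed chunkencoder. Flushing the stream on final=True is an assumption of this sketch, not something the outline specifies.

```python
import codecs
import io


class StreamWriterSketch:
    """Hypothetical stream encoder built on a stateful chunk encoder."""

    def __init__(self, stream, errors="strict"):
        self.stream = stream
        # Stand-in for chunkencoder(errors): writes the BOM only once.
        self.encoder = codecs.getincrementalencoder("utf-8-sig")(errors)

    def write(self, data, final=False):
        self.stream.write(self.encoder.encode(data, final))
        if final:
            self.stream.flush()  # assumption: last call flushes

    def writelines(self, data):
        for line in data:
            self.write(line)


buffer = io.BytesIO()
writer = StreamWriterSketch(buffer)
writer.write("a")
writer.writelines(["b", "c"])
writer.write("", final=True)
written = buffer.getvalue()
```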
I know that this is quite a departure from the current API, and
I'm not sure if we can get all of the functionality without
sacrificing backwards compatibility.
I don't particularly care about the "bytes consumed" return value
from the stateless codec. The codec should always have returned only
the encoded/decoded object, but I guess fixing this would break too
much code. And users who are only interested in the stateless
functionality will probably use unicode.encode/str.decode anyway.
For the stateful API it would be possible to combine the chunk and
stream decoder/encoder into one class with the following methods
(for the decoder):
    __init__(stream, errors='strict'):
        Like the current StreamReader constructor, but stream may be
        None, if only the chunk API is used.
    decode(input, final=False):
        Like the current StreamReader.decode() (i.e. it returns a
        (unicode, int) tuple). This does not keep the remaining bytes
        in a buffer. This is the job of the caller.
    feed(input, final=False):
        Decodes input and returns a decoded unicode object. This method
        calls decode() internally and manages the byte buffer.
    read(size=-1, chars=-1, final=False):
    readline(size=-1):
    next():
    __iter__():
        See above.
As before, implementers of decoders only need to implement decode().
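The split between decode() and feed() could be sketched like this in Python 3. FeedDecoderBase, UTF16LEDecoder and bytebuffer are hypothetical names; the read methods are left out. The base class owns the byte buffer, so a concrete codec supplies only decode(), here by delegating to codecs.utf_16_le_decode(), which takes a final flag in current Python.

```python
import codecs


class FeedDecoderBase:
    """Hypothetical combined decoder; stream may be None for chunk use."""

    def __init__(self, stream=None, errors="strict"):
        self.stream = stream
        self.errors = errors
        self.bytebuffer = b""

    def decode(self, input, final=False):
        """Return (text, bytes_consumed); keeps no state of its own."""
        raise NotImplementedError

    def feed(self, input, final=False):
        """Decode input, keeping unconsumed bytes for the next call."""
        data = self.bytebuffer + input
        text, consumed = self.decode(data, final)
        self.bytebuffer = data[consumed:]
        return text


class UTF16LEDecoder(FeedDecoderBase):
    """A codec implementer only has to write decode()."""

    def decode(self, input, final=False):
        return codecs.utf_16_le_decode(input, self.errors, final)


decoder = UTF16LEDecoder()
part1 = decoder.feed(b"a")          # half of a 2-byte code unit
part2 = decoder.feed(b"\x00b\x00")  # completes "a" and adds "b"
```

Calling decode() directly shows the difference: decoder.decode(b"a") returns ("", 0) and leaves it to the caller to keep the unconsumed byte.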
To be able to support the final argument the decoding functions
in _codecsmodule.c could get an additional argument. With this
they could be used for the stateless codecs too and we can reduce
the number of functions again.
Unfortunately adding the final argument breaks all of the current
codecs, but dropping the final argument requires one of two
changes:
1) When the input stream is exhausted, the bytes read are parsed
   as if final=True. That's the way the CJK codecs currently
   handle it, but unfortunately this doesn't work with the feed
   decoder.
2) Simply ignore any remaining undecoded bytes at the end of the
   stream.
If we really have to drop the final argument, I'd prefer 2).
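The practical difference between the two options shows up on truncated input. A small Python 3 illustration, using codecs.utf_8_decode() from current Python, whose third argument plays the role of final; the variable names are made up for the example.

```python
import codecs

# 'a' followed by the first byte of a two-byte UTF-8 sequence.
data = b"a\xc3"

# Option 1: at end of stream, decode as if final=True.
# The truncated sequence is reported as an error.
try:
    codecs.utf_8_decode(data, "strict", True)
    option1 = "no error"
except UnicodeDecodeError:
    option1 = "error"

# Option 2: decode without final and ignore what was not consumed.
# The truncated sequence is silently lost.
text, consumed = codecs.utf_8_decode(data, "strict", False)
dropped = data[consumed:]
```

Option 1 needs to know that the stream is exhausted, which a feed-style decoder without a stream cannot tell by itself.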
I've uploaded a second version of the patch. It implements
the final argument, adds the feed() method to StreamReader and
again merges the duplicate decoding functions in the codecs
module. Note that the patch isn't really finished (the final
argument isn't completely supported in the encoders and the
CJK and escape codecs are unchanged), but it should be sufficient
as a base for discussion.
Bye,
Walter Dörwald