Re: [Python-Dev] Decoding incomplete unicode

27 Aug 2004


      M.-A. Lemburg wrote:
...
...
...
[...]
def decode_stateful(data, state=None):
    ... decode and modify state ...
    return (decoded_data, length_consumed, state)
Another option might be that the decode function changes
the state object in place.
Good idea.
But that's totally up to the implementor.
...
...
...
where the object type and contents of the state variable
are defined per codec (e.g. could be a tuple, just a single
integer or some other special object).
If a tuple is passed and returned, Python code can mangle
the state. IMHO this should be avoided if possible.
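
A minimal sketch of the decode_stateful() signature discussed above, assuming UTF-8 and a hypothetical DecoderState class as the opaque state object (neither the class nor the choice to report len(data) as length_consumed comes from the thread):

```python
import codecs


class DecoderState:
    """Hypothetical opaque state object: holds bytes not yet decoded."""

    def __init__(self):
        self.pending = b""


def decode_stateful(data, state=None):
    """Decode UTF-8 bytes, carrying incomplete sequences over in `state`."""
    if state is None:
        state = DecoderState()
    buffered = state.pending + data
    # final=False: an incomplete trailing sequence is left unconsumed
    # instead of raising an error.
    decoded, consumed = codecs.utf_8_decode(buffered, "strict", False)
    state.pending = buffered[consumed:]  # the state object is changed in place
    # Assumption: all of `data` counts as consumed, because the undecoded
    # tail now lives in the state object.
    return decoded, len(data), state


text1, n1, st = decode_stateful(b"gr\xc3")     # b"\xc3" is the first byte of "ü"
text2, n2, st = decode_stateful(b"\xbcn", st)  # the second byte arrives
```

Because the caller only ever passes the state object back in, it has no reason to look inside it, which is the point of preferring an object over a tuple.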
...
...
Otherwise we'll end up having different interface
signatures for all codecs and extending them to accommodate
future enhancements will become infeasible without
introducing yet another set of APIs.
We already have slightly different decoding functions:
utf_16_ex_decode() takes additional arguments.
Right - it was a step in the wrong direction. Let's not
use a different path for the future.
utf_16_ex_decode() serves a purpose: it helps implement
the UTF-16 decoder, which has to switch to UTF-16-BE or
UTF-16-LE according to the BOM, so utf_16_ex_decode()
needs a way to communicate that back to the caller.
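
For illustration, this is roughly how the extra return value can be used. The sketch calls codecs.utf_16_ex_decode() as it exists in current CPython; the variable follow_up is only an illustrative name:

```python
import codecs

data = b"\xff\xfeh\x00i\x00"  # little-endian BOM followed by "hi"

# byteorder=0 asks the function to detect the byte order from the BOM.
text, consumed, byteorder = codecs.utf_16_ex_decode(data, "strict", 0, False)

# The third item reports what was detected: -1 for little-endian,
# >= 1 for big-endian, 0 if no BOM has been seen yet.
if byteorder == -1:
    follow_up = codecs.utf_16_le_decode
elif byteorder >= 1:
    follow_up = codecs.utf_16_be_decode
else:
    follow_up = None
```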
...
...
...
Let's discuss this some more and implement it for Python 2.5.
For Python 2.4, I think we can get away with what we already
have:
...
[...]
OK, I've updated the patch.
...
[...]
The buffer logic should only be used for streams
that do not support the interface to push back already
read bytes (e.g. .unread()).
From a design perspective, keeping read data inside the
codec is the wrong thing to do, simply because it leaves
the input stream in an undefined state in case of an error
and there's no way to correlate the stream's read position
to the location of the error.
With a pushback method on the stream, all the stream
data will be stored on the stream, not the codec, so
the above would no longer be a problem.
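
A rough sketch of what such a pushback stream could look like. PushbackReader and its unread() method are hypothetical names used only to illustrate the idea, not an existing API:

```python
import io


class PushbackReader:
    """Hypothetical wrapper adding .unread() to a binary stream."""

    def __init__(self, stream):
        self._stream = stream
        self._pushed = b""

    def read(self, size=-1):
        if size < 0:
            data, self._pushed = self._pushed + self._stream.read(), b""
            return data
        # Serve pushed-back bytes first, then fall through to the stream.
        data, self._pushed = self._pushed[:size], self._pushed[size:]
        if len(data) < size:
            data += self._stream.read(size - len(data))
        return data

    def unread(self, data):
        # Returned bytes are served again by the next read().
        self._pushed = data + self._pushed


reader = PushbackReader(io.BytesIO(b"gr\xc3\xbcn"))
chunk = reader.read(3)    # b"gr\xc3": ends in the middle of a character
reader.unread(chunk[2:])  # the codec hands the incomplete byte back
rest = reader.read()      # the stream yields that byte again
```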
On the other hand this requires a special stream. Data
already read is part of the codec state, so why not
put it into the codec?
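
The buffer approach might look like the following sketch, assuming UTF-8. BufferingReader and its bytebuffer attribute are illustrative names, not the code from the patch:

```python
import codecs
import io


class BufferingReader:
    """Hypothetical reader that keeps undecoded bytes as codec state."""

    def __init__(self, stream, errors="strict"):
        self.stream = stream
        self.errors = errors
        self.bytebuffer = b""

    def read(self, size):
        data = self.bytebuffer + self.stream.read(size)
        # final=False leaves an incomplete trailing sequence unconsumed.
        text, consumed = codecs.utf_8_decode(data, self.errors, False)
        self.bytebuffer = data[consumed:]
        return text


reader = BufferingReader(io.BytesIO(b"gr\xc3\xbcn"))
parts = [reader.read(3), reader.read(3)]
```

Any object with a plain read() method works as the underlying stream here, at the cost that the stream's read position runs ahead of what has actually been decoded.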
Ideally, the codec should not store data,
I consider the remaining undecoded bytes to be part of
the codec state once they have been read from the stream.
...
but only
reference it. It's better to keep things well
separated which is why I think we need the .unread()
interface and eventually a queue interface to support
the feeding operation.
...
...
However, we can always add the .unread() support to the
stream codecs at a later stage, so it's probably ok
to default to the buffer logic for Python 2.4.
OK.
...
...
That still leaves the issue
of the last read operation, which I'm tempted to leave unresolved
for Python 2.4. No matter what the solution is, it would likely
require changes to all codecs, which is not good.
We could have a method on the codec which checks whether
the codec buffer or the stream still has pending data
left. Using this method is an application scope consideration,
not a codec issue.
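
As an illustration of such a check, the incremental decoders in current Python (an API that is not part of this proposal) expose their buffered bytes through getstate(); has_pending is just an illustrative variable name:

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
text = decoder.decode(b"gr\xc3")  # the trailing byte is incomplete

# For buffered decoders, getstate() returns (buffered_bytes, extra_state).
pending, _ = decoder.getstate()
has_pending = bool(pending)
```

The application can then decide what to do about the pending bytes, for example request more data or report a truncated input.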
But this means that the normal error handling can't be used
for those trailing bytes.
Right, but then: missing data (which usually causes the trailing
bytes) is really something for the application to deal with,
e.g. by requesting more data from the user, another application
or trying to work around the problem in some way. I don't think
that the codec error handler can practically cover these
possibilities.
But in many cases the user might want to use "ignore" or "replace"
error handling.
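
A small sketch of what that could look like, using the incremental decoder interface of current Python as a stand-in (decode_truncated is a hypothetical helper; the final flag signals the last read):

```python
import codecs


def decode_truncated(errors):
    decoder = codecs.getincrementaldecoder("utf-8")(errors=errors)
    text = decoder.decode(b"gr\xc3")         # incomplete byte is buffered
    text += decoder.decode(b"", final=True)  # end of input: apply the handler
    return text


ignored = decode_truncated("ignore")    # trailing byte is dropped
replaced = decode_truncated("replace")  # trailing byte becomes U+FFFD

try:
    decode_truncated("strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```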

Bye,
    Walter Dörwald