[I18n-sig] Proposal: Extended error handling for unicode.encode

M.-A. Lemburg mal@lemburg.com
Thu, 21 Dec 2000 19:48:26 +0100

"Martin v. Loewis" wrote:
> > The problem with this is that the error handler will usually
> > have to have access to the internal data structure of the codec
> > to be able to process the error, e.g. <char> in your example
> > could be a single character, a UTF-16 sequence, etc.
> Please note that in his encoding, char is a Unicode string
> (specifically, character), so it can't be a UTF-16 sequence.
> What *encoder* that you know needs to have internal state?

The codec design is much more general and is kept symmetric for
obvious reasons. In his case, char would be a Unicode string, but the
input to an encoder could just as well be an image, a sound or some
other abstract form of data storage. It is not unlikely that these
encoders will need to keep state.

Even for Unicode you will need to keep state in the encoder,
e.g. when writing an encoder based on the Unicode compression
algorithm (the output stream contains markers to switch pages).
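
To make the point about state concrete, here is a minimal sketch of a
stateful encoder, written for present-day Python 3. It is a toy and not
the real Unicode compression algorithm: the class name PagedEncoder,
the 128-character page size and the 0x01 marker byte are all made up
for illustration. What it shows is that the encoder has to remember the
current page between calls.

```python
class PagedEncoder:
    """Toy stateful encoder (hypothetical; NOT the real Unicode compression).

    Each character is written as one byte: its offset within a
    128-character page, with the high bit set. Whenever the page changes,
    a marker byte followed by a two-byte page number is written first.
    """

    SWITCH = 0x01  # made-up "switch page" marker

    def __init__(self):
        self.page = None  # state carried across encode() calls

    def encode(self, text):
        out = bytearray()
        for char in text:
            page, offset = divmod(ord(char), 128)
            if page != self.page:
                # Emit the marker and remember the new page.
                out += bytes([self.SWITCH]) + page.to_bytes(2, "big")
                self.page = page
            out.append(0x80 | offset)
        return bytes(out)


enc = PagedEncoder()
first = enc.encode("ab")       # starts with a page marker
second = enc.encode("c")       # same page as before: no marker needed
third = enc.encode("\u03b1")   # Greek alpha lives on another page
```

An error handler that wanted to substitute different characters here
would need to know (and possibly change) enc.page, which is the kind
of internal data an external callback does not see.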

> Anyway, if you think that state should be accessible to the error
> handling function, it won't be hard to pass state to the callback.
> E.g. you could pass the string being encoded, the current position,
> and optionally a Codec instance (many codecs would pass None, as they
> don't keep any state).

Hmm, I don't think this is generally useful. Using the codec
instances directly would be the right way to go, IMHO. I don't
want to overload .encode() or unicode() with too much functionality.
Writing your own helper functions which then apply all the necessary
magic is simple and doesn't warrant changing APIs in the core.
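
As an illustration of such a helper, here is a minimal sketch written
for present-day Python 3 rather than the Python 2.0 of this thread. The
names encode_with_fallback and xml_charref are hypothetical. The helper
encodes one character at a time and asks the fallback for a replacement
whenever the codec raises, so it only suits codecs that keep no state
between characters.

```python
def encode_with_fallback(text, encoding, fallback):
    """Encode text; use fallback(char) for each unencodable character.

    Hypothetical helper, not part of any Python API.
    """
    chunks = []
    for char in text:
        try:
            chunks.append(char.encode(encoding))
        except UnicodeEncodeError:
            # The fallback returns a str that must itself be encodable.
            chunks.append(fallback(char).encode(encoding))
    return b"".join(chunks)


def xml_charref(char):
    """Return the XML numeric character reference for one character."""
    return "&#%d;" % ord(char)


result = encode_with_fallback("caf\u00e9 \u20ac5", "ascii", xml_charref)
```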

Since the error handling is extensible by adding new options
such as 'callback', the existing codecs could be extended to
provide this functionality as well. We'd only need a way to
pass the callback to the codecs, e.g. by using a keyword
argument on the constructor or by subclassing the codec and
providing a new method for the error handling in question.
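
The keyword-argument variant could look roughly like the sketch below
(present-day Python 3). Everything specific in it is an assumption made
for illustration: the class name CallbackAsciiCodec, the constructor
keyword callback, the callback signature (text, pos) and the toy
ASCII-only encoding. Only codecs.Codec and the 'callback' option name
come from existing Python or from this discussion.

```python
import codecs


class CallbackAsciiCodec(codecs.Codec):
    """Toy ASCII encoder taking an error callback in its constructor."""

    def __init__(self, callback=None):
        # callback(text, pos) -> replacement str (hypothetical signature)
        self.callback = callback

    def encode(self, input, errors="strict"):
        out = []
        for pos, char in enumerate(input):
            if ord(char) < 128:
                out.append(char)
            elif errors == "callback" and self.callback is not None:
                out.append(self.callback(input, pos))
            else:
                raise UnicodeEncodeError(
                    "ascii", input, pos, pos + 1,
                    "ordinal not in range(128)")
        # Returns (encoded bytes, number of input characters consumed).
        return "".join(out).encode("ascii"), len(input)


codec = CallbackAsciiCodec(callback=lambda text, pos: "?")
data, consumed = codec.encode("na\u00efve", errors="callback")
```

With errors left at 'strict' the same instance raises as usual, so
existing callers would not see a change in behaviour.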

> > The codec in general knows better what to do in case of an error
> In the demonstrated use case, it doesn't know. It should create an XML
> character entity, but doesn't know anything about XML character
> entities.

I meant that it knows better about the current state and
parameters of the encoding and input it is working on. The ideal
error handling scheme would call a method on the codec which
you could then override to provide your own handling, e.g.
XML entity encoding.
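
A minimal sketch of that idea in present-day Python 3 follows. The
class names and the hook name handle_unencodable are hypothetical and
the ASCII-only codec is a toy; the point is only that the hook is a
method on the codec, so an override has access to whatever state the
codec keeps.

```python
class AsciiCodec:
    """Toy ASCII encoder with an overridable error hook (hypothetical)."""

    def handle_unencodable(self, text, pos):
        """Called for each unencodable character; the default is strict."""
        raise UnicodeEncodeError(
            "ascii", text, pos, pos + 1, "ordinal not in range(128)")

    def encode(self, text):
        out = []
        for pos, char in enumerate(text):
            if ord(char) < 128:
                out.append(char)
            else:
                # The hook runs inside the codec and can use self.
                out.append(self.handle_unencodable(text, pos))
        return "".join(out).encode("ascii"), len(text)


class XMLAsciiCodec(AsciiCodec):
    """Subclass emitting XML numeric character references."""

    def handle_unencodable(self, text, pos):
        return "&#%d;" % ord(text[pos])


data, consumed = XMLAsciiCodec().encode("\u00fcber 5\u20ac")
```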

> > Since your main problem is locating the character causing the
> > error, one possibility would be to extend the error instance
> > to reference the position of the error as error instance
> > attribute, e.g. unierror.position.
> That would work as well, but it would require to re-encode everything
> up to that position. The callback solution is more general.

Sure, but the more general solution needs to be well designed.
The above trick only adds additional information to the error
instance -- this is easy to implement and doesn't break anything.
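
For illustration, this is roughly how a caller could use such position
information. The unierror.position attribute above is only a proposal;
the sketch instead relies on the start and end attributes that
present-day Python 3 provides on UnicodeEncodeError. The function name
encode_xml_ascii is hypothetical.

```python
def encode_xml_ascii(text):
    """Encode to ASCII, replacing unencodable runs with XML char refs."""
    chunks = []
    pos = 0
    while pos < len(text):
        try:
            chunks.append(text[pos:].encode("ascii"))
            break
        except UnicodeEncodeError as exc:
            # exc.start/exc.end are relative to the string that was
            # passed to encode(), i.e. to text[pos:].
            start, end = pos + exc.start, pos + exc.end
            # Re-encode the part that was fine up to the error.
            chunks.append(text[pos:start].encode("ascii"))
            refs = "".join("&#%d;" % ord(c) for c in text[start:end])
            chunks.append(refs.encode("ascii"))
            pos = end
    return b"".join(chunks)


result = encode_xml_ascii("a\u00e9\u00e8b")
```

Note that the good prefix is encoded twice on every error (once in the
failed attempt, once again afterwards), which is the re-encoding cost
mentioned in the quoted objection.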

Note: simply changing the errors parameter to a PyObject doesn't
work, since all C APIs expect a simple const char *.

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/