[I18n-sig] Proposal: Extended error handling for unicode.encode

M.-A. Lemburg mal@lemburg.com
Sat, 23 Dec 2000 13:27:31 +0100

"Martin v. Loewis" wrote:
> > Hmm, I don't think this is generally useful. Using the codec
> > instances directly would be the right way to go, IMHO. I don't
> > want to overload .encode() or unicode() with too much functionality.
> > Writing your own function helpers which then apply all the necessary
> > magic is simple and doesn't warrant changing APIs in the core.
> Ok, then I have a challenge for you. Write a codec family that emits
> XML character entities on encoding errors for any of the standard
> Python codecs. If it's really simple, then I'd *really* appreciate
> concrete, working code. I really mean that - I doubt that this is
> simple. If a problem arises doing it for all of the encodings, just
> pick one. If that is still asking too much, outline a solution;
> preferably one that is as efficient as would be the solution involving
> the callback.

Martin, I have a feeling that we both want to achieve the same
thing. The only difference is that you want to add it fast and
without reflecting on the APIs and needed changes, while I
prefer to first draw up a design and then make a decision based
on that design. The latter needs more time and some tossing around
of ideas. Your approach is one of the possible ways to do this.

Please let's not fight over this, but instead discuss a general
design for error handlers. The design will have to ensure (at least)
these things:

* backward compatibility
* fast implementation
* reuse of existing codecs
* extensibility
* fits in with the existing C APIs (or extends these)
* provides ways to set an error handler at C level as both
  C function and Python callable object

About the "function helpers": see below.

> > Since the error handling is extensible by adding new options such as
> > 'callback', the existing codecs could be extended to provide this
> > functionality as well. We'd only need a way to pass the callback to
> > the codecs in some way, e.g. by using a keyword argument on the
> > constructor or by subclassing it and providing a new method for the
> > error handling in question.
> That solution is quite similar to the callback approach, so we could
> probably choose either. I'm not entirely sure what the usage scenario
> is. Did you think that users, instead of writing
>   u.encode("koi8-r",errors=xmlcharentities)
> would write
>   I,forgot,which,parameter = codecs.lookup("koi8-r")
>   encode = I()
>   encode.install_error_cb(xmlcharentities)
>   encode.encode(u,errors="callback")
> or did you have a more convenient API in mind?

This is what I was referring to with the "function helpers"
above. An alternative would probably be to add another
optional argument to the .encode() method and the unicode()
built-in function, e.g.

  u.encode('koi8-r', 'callback', myerrorhandler)

  unicode(data, 'koi8-r', 'callback', myerrorhandler)

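To make the "function helper" idea concrete, here is a minimal sketch
in Python. It is not a proposed core API: the names
encode_with_callback and xml_char_entity are made up for illustration.
It assumes a stateless codec, since it encodes one character at a
time, and that the replacement text is itself encodable in the target
encoding.

```python
def xml_char_entity(ch):
    """Hypothetical error handler: return an XML character reference."""
    return "&#%d;" % ord(ch)


def encode_with_callback(text, encoding, handler):
    """Encode text, calling handler(ch) for each unencodable character.

    Illustrative helper only. It encodes character by character, so it
    is only suitable for stateless codecs and is much slower than a
    handler invoked from inside the codec would be.
    """
    chunks = []
    for ch in text:
        try:
            chunks.append(ch.encode(encoding))
        except UnicodeError:
            # Assumption: the replacement string is encodable in the
            # target encoding; if not, the error propagates.
            chunks.append(handler(ch).encode(encoding))
    return b"".join(chunks)


result = encode_with_callback(u"caf\u00e9 \u0434", "koi8-r", xml_char_entity)
print(result)
```

A helper like this reuses the existing codecs unchanged, but it pays
for one codec call per character, which is the efficiency concern
raised above.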
> [...]
> > Note: simply changing the error parameter to a PyObject doesn't
> > work, since all C APIs expect a simple const char*.
> Sure. Looking from the Python core side of the things, it's a large
> change. Looking from the users' point of view, it's a small one.

Right, and that's why we have to be careful about the design.

Cheers and Merry Christmas,
Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/