[I18n-sig] Proposal: Extended error handling for unicode.encode

M.-A. Lemburg mal@lemburg.com
Fri, 05 Jan 2001 09:40:52 +0100

"Martin v. Loewis" wrote:
> > How would such a scheme allow passing back control information
> > such as: continue with the next character in the stream or
> > break with an exception ?
> If it wanted to break with an exception, it would raise one. So the
> function really has two acceptable results: an exception, and a Unicode
> object. Since most Python functions are allowed to raise exceptions,
> that went without saying.

Sure, exceptions are not much of a problem, but how would the
callback tell the encoder/decoder to e.g. skip forward 2 bytes or perhaps
backward 10 bytes ? What if the callback had to scan the
stream from the beginning to find out where to continue or look
ahead a few hundred bytes to find the next valid encodable sequence ?
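
To make the question concrete, here is a minimal sketch of one way a
callback could pass back positioning information. It uses the
codecs.register_error() interface that later Python versions provide,
where the handler returns a (replacement, resume position) pair. The
handler name 'skip-two' and its skipping policy are made up for
illustration and are not part of the proposal under discussion.

```python
import codecs

def skip_two(exc):
    """Hypothetical handler: emit '?' and skip one extra character."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.object is the complete input; exc.start and exc.end delimit
    # the run that could not be encoded. The second item of the return
    # value tells the encoder where to resume.
    resume_at = min(exc.end + 1, len(exc.object))
    return ("?", resume_at)

codecs.register_error("skip-two", skip_two)

# The euro sign fails, '?' is written, and the following 'b' is skipped.
result = "a\u20acbc".encode("ascii", "skip-two")
```

Because the handler sees the whole input via exc.object, it can also
look ahead or scan from the beginning before deciding where to resume.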

Again, you should keep in mind that the scheme has to work
for all encoding/decoding work, not only conversion from and
to Unicode.

> > Which is what I'm talking about all along: the codecs know best
> > what to do, so better extend them than try to fiddle in some
> > information using a callback.
> If that means to touch the source of all codecs, then that would be an
> unacceptable solution. Doing it in a generic way would be ok - except
> that I still can't see *how* this could possibly work.

If we were to provide a callback as an optional method to
StreamReaders/Writers, the task could be done either statically
by subclassing the existing codec StreamReaders/Writers or
dynamically by asking the codec registry to return the StreamReader/
Writer classes.
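
A rough sketch of the dynamic route, written against the codecs module
as found in current Python: the registry is asked for the StreamWriter
class, which is then subclassed to add an error callback method. The
method name handle_error and the class name CallbackWriter are
assumptions made for this example only.

```python
import codecs
import io

# Ask the codec registry for the StreamWriter class of a codec.
AsciiWriter = codecs.getwriter("ascii")

class CallbackWriter(AsciiWriter):
    """Hypothetical subclass adding an optional error callback method."""

    def handle_error(self, exc):
        # Assumed hook: return replacement text for the failing slice.
        bad = exc.object[exc.start:exc.end]
        return "".join("&#%d;" % ord(ch) for ch in bad)

    def encode(self, input, errors="strict"):
        out = []
        pos = 0
        while pos < len(input):
            try:
                data, _ = super().encode(input[pos:], "strict")
                out.append(data)
                break
            except UnicodeEncodeError as exc:
                # Encode the good prefix, then the callback's replacement,
                # and continue after the failing run.
                good, _ = super().encode(input[pos:pos + exc.start], "strict")
                out.append(good)
                rep, _ = super().encode(self.handle_error(exc), "strict")
                out.append(rep)
                pos += exc.end
        return b"".join(out), len(input)

buf = io.BytesIO()
writer = CallbackWriter(buf)
writer.write("caf\u00e9!")
output = buf.getvalue()
```

The static route would look the same, except that the base class is
imported from the codec module directly instead of being looked up.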

But since there aren't all that many codec implementations
around (only the few in unicodeobject.c), the proposed generic
solution of adding new error treatment strings would go a long
way.

> > I did propose a solution which would satisfy your needs: simply
> > add a new error treatment 'xml-escape' to the builtin codecs
> > which then does the needed XML escaping. XML is general enough
> > to warrant such a step and the required changes are minor.
> Sorry, I missed that. That would also solve the problem at hand. Since
> nobody has come up with a different use case for a more general
> solution, that might be the solution which we can reasonably implement
> for 2.1.
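
For illustration, this is roughly what such an error treatment does.
The name 'xml-escape' is the one used in this thread; the example below
uses 'xmlcharrefreplace', the name under which later Python releases
ship an equivalent treatment.

```python
text = "price: 10 \u20ac"

# Unencodable characters become XML numeric character references.
escaped = text.encode("ascii", "xmlcharrefreplace")
```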

> > Another candidate for a new error treatment would be
> > 'unicode-escape' which then replaces the character in question with
> > '\uXXXX'.
> +0. While that falls into the same category, I haven't seen anybody
> saying "I need such a feature".

This would be handy for the case where you don't want to have
exceptions raised, but still require some form of retaining the
original data.
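
A small sketch of that use case. The treatment is called
'unicode-escape' in this thread; the example uses 'backslashreplace',
which is the name later Python releases use for a comparable treatment
(it writes \xXX for characters below U+0100 and \uXXXX above).

```python
text = "na\u00efve \u20ac"

# No exception is raised; the failing characters are kept as escapes.
escaped = text.encode("ascii", "backslashreplace")

# The original data can be recovered, provided the input contained no
# literal backslashes of its own.
restored = escaped.decode("unicode_escape")
```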

> > For the general case, I'd rather add new PyUnicode_EncodeEx()
> > and PyUnicode_DecodeEx() APIs which then take a Python
> > context object as extra argument. The error treatment string
> > would then define how to use this context object, e.g. 'callback'
> > could be made to apply processing similar to what Walter
> > suggested.
> What other acceptable values for the string would you foresee?

Another option would be 'copy' which tries to simply copy input
to output in case this is reasonably possible given the encoding
(e.g. Unicode -> 8-bit encoding would copy all 8-bit Unicode chars as
is in case no mapping is defined). An option 'raise' could also
be valuable in conjunction with an exception context object to have
the codec raise customized exceptions. Provided the context
object points to another encoder/decoder, an option 'fallback'
could be used to tell the codec to pass the failing input data
to the alternate encoder/decoder in order to have it converted.
Etc. etc. 
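
A Python-level sketch of how the error string could select the use of
the context object. encode_ex() is a hypothetical stand-in for the
proposed PyUnicode_EncodeEx(); the per-character loop is a
simplification that only suits stateless encodings without a BOM, and
the exact meaning of each option is an assumption drawn from the
descriptions above.

```python
def encode_ex(text, encoding, errors="strict", context=None):
    """Hypothetical stand-in for the proposed PyUnicode_EncodeEx()."""
    out = []
    for ch in text:
        try:
            out.append(ch.encode(encoding))
        except UnicodeEncodeError as exc:
            if errors == "copy":
                # Copy 8-bit Unicode chars as is; anything else still fails.
                if ord(ch) > 0xFF:
                    raise
                out.append(bytes([ord(ch)]))
            elif errors == "raise":
                # context is an exception class to raise instead.
                raise context("cannot encode %r" % ch) from exc
            elif errors == "fallback":
                # context is an alternate encoder for the failing input.
                out.append(context(ch))
            elif errors == "callback":
                # context is a callable receiving the error details.
                out.append(context(exc))
            else:
                raise
    return b"".join(out)

class MyEncodeError(Exception):
    """Example of a customized exception for the 'raise' option."""

copied = encode_ex("a\u00e9", "ascii", "copy")
fallback = encode_ex("a\u20ac", "ascii", "fallback",
                     lambda s: s.encode("utf-8"))
try:
    encode_ex("a\u20ac", "ascii", "raise", MyEncodeError)
    raised = False
except MyEncodeError:
    raised = True
```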

There are many things one could do with the error string.

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/