[I18n-sig] Proposal: Extended error handling for unicode.encode

M.-A. Lemburg mal@lemburg.com
Wed, 20 Dec 2000 19:52:37 +0100

Walter Doerwald wrote:
> Problem:
> Most character encodings do not support the full range of
> Unicode characters. For these cases many high level protocols
> support a way of escaping a Unicode character (e.g. Python
> itself supports the \x, \u and \U convention, XML supports
> character references via &#xxxx; etc.). The problem with the
> current implementation of unicode.encode is that for determining
> which characters are unencodable by a certain encoding, every
> single character has to be tried, because encode does not
> provide any information about the location of the error(s), so
>    us = u"xxx"
>    s = us.encode("encoding", errors="strict")
> has to be replaced by:
>    us = u"xxx"
>    v = []
>    for c in us:
>         try:
>            v.append(c.encode("encoding", "strict"))
>         except UnicodeError:
>            v.append("&#" + str(ord(c)) + ";")
>    s = "".join(v)
> This slows down encoding dramatically as now the loop through
> the string is done in Python code and no longer in C code.
> Solution:
> One simple and extensible solution would be to be able to
> pass an error handler function as the error argument for encode.
> This error handler function is passed every unencodable character
> and might either raise an exception itself, or return a unicode
> string that will be encoded instead of the unencodable character.
> (Note that this requires that the encoding be able to encode
> what is returned from the handler.)
> Example:
>    us = unicode("a\xe4o\xf6u\xfc", "latin1")
>    def xmlEscape(char):
>       return u"&#" + unicode(ord(char)) + u";"
>    print us.encode("us-ascii", xmlEscape)
> will result in
>    a&#228;o&#246;u&#252;
> With this scheme it would even be possible to reimplement the
> old error handling with the new one:
> def strict(char):
>         raise UnicodeError("can't encode %r" % char)
> def ignore(char):
>         return u""
> def replace(char):
>         return u"\uFFFD"
> Does this make sense?
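
To make the proposed semantics concrete, here is a minimal sketch in modern Python that emulates the handler scheme in pure Python. The helper name encode_with_handler is hypothetical and not part of any library; it also assumes a stateless, ASCII-compatible target encoding, and it still loops per character, so it only illustrates the behaviour, not the speed-up the proposal is after.

```python
def encode_with_handler(text, encoding, handler):
    """Hypothetical helper: emulate passing an error handler to encode()."""
    out = []
    for ch in text:
        try:
            out.append(ch.encode(encoding, "strict"))
        except UnicodeEncodeError:
            # The handler returns a replacement string, which must itself
            # be encodable in the target encoding (as the proposal notes).
            out.append(handler(ch).encode(encoding, "strict"))
    return b"".join(out)


def xml_escape(ch):
    # Replace the character with a decimal XML character reference.
    return "&#%d;" % ord(ch)


def ignore(ch):
    # Drop the unencodable character.
    return ""


escaped = encode_with_handler("a\xe4o\xf6u\xfc", "us-ascii", xml_escape)
dropped = encode_with_handler("a\xe4o\xf6u\xfc", "us-ascii", ignore)
```

A handler that raises UnicodeError instead of returning a string would reproduce the "strict" behaviour in the same way.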

The problem with this is that the error handler will usually
need access to the internal data structures of the codec to be
able to process the error, e.g. <char> in your example could be
a single character, a UTF-16 sequence, etc. The codec generally
knows better what to do in case of an error. That's why error
handling is selected by a simple string argument: the codec can
then decide what to do depending on the value of this argument
(and even call back to some error handler that it implements as
a method).
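
As a rough illustration of that design, the following sketch shows a toy encoder that picks one of its own methods based on the errors string. The class ToyAsciiEncoder and its handle_* methods are invented for this example and do not correspond to any real codec; the sketch is written in modern Python.

```python
class ToyAsciiEncoder:
    """Hypothetical codec: dispatches on the `errors` string to its own methods."""

    def encode(self, text, errors="strict"):
        handler = getattr(self, "handle_" + errors, None)
        if handler is None:
            raise ValueError("unknown error handling: %r" % errors)
        out = bytearray()
        for i, ch in enumerate(text):
            if ord(ch) < 128:
                out.append(ord(ch))
            else:
                # The codec knows its own state (here: text and index i)
                # and hands it to the handler method it implements.
                out.extend(handler(text, i))
        return bytes(out)

    def handle_strict(self, text, i):
        raise UnicodeEncodeError("ascii", text, i, i + 1,
                                 "ordinal not in range(128)")

    def handle_ignore(self, text, i):
        return b""

    def handle_replace(self, text, i):
        return b"?"


encoder = ToyAsciiEncoder()
replaced = encoder.encode("a\xe4b", "replace")
ignored = encoder.encode("a\xe4b", "ignore")
```

Because the handlers are methods of the codec, each one can see whatever internal state the codec keeps, which an external callback taking only a character could not.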

Since your main problem is locating the character causing the
error, one possibility would be to extend the error instance
with an attribute that holds the position of the error,
e.g. unierror.position.
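
A sketch of how such a position attribute could be used follows. It is an assumption relative to this post: it relies on the start and end attributes that UnicodeEncodeError carries in modern Python rather than on a position attribute, and the function name encode_with_charrefs is made up for the example. It also assumes a stateless, ASCII-compatible target encoding.

```python
def encode_with_charrefs(text, encoding):
    """Encode text; only the characters the codec rejects are escaped."""
    chunks = []
    pos = 0
    while pos < len(text):
        try:
            # Let the codec encode as much as it can in one call.
            chunks.append(text[pos:].encode(encoding, "strict"))
            break
        except UnicodeEncodeError as exc:
            # exc.start and exc.end are offsets into the slice text[pos:].
            start = pos + exc.start
            end = pos + exc.end
            chunks.append(text[pos:start].encode(encoding, "strict"))
            for ch in text[start:end]:
                chunks.append(("&#%d;" % ord(ch)).encode(encoding))
            pos = end
    return b"".join(chunks)


result = encode_with_charrefs("a\xe4o\xf6u\xfc", "us-ascii")
```

The Python-level loop now runs once per error rather than once per character, so the encodable stretches are still handled by the codec itself.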

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/