[I18n-sig] Proposal: Extended error handling for unicode.encode

Walter Doerwald walter@livinglogic.de
Wed, 20 Dec 2000 15:06:25 +0100

Most character encodings do not support the full range of 
Unicode characters. For these cases many high-level protocols 
support a way of escaping a Unicode character (e.g. Python 
itself supports the \x, \u and \U conventions, XML supports
character references via &#xxxx; etc.). The problem with the 
current implementation of unicode.encode is that for determining
which characters are unencodable by a certain encoding, every 
single character has to be tried, because encode does not 
provide any information about the location of the error(s), so

   us = u"xxx"
   s = us.encode("encoding", errors="strict")

has to be replaced by:

   us = u"xxx"
   v = []
   for c in us:
      try:
         v.append(c.encode("encoding", "strict"))
      except UnicodeError:
         # fall back to an XML character reference
         v.append("&#" + str(ord(c)) + ";")
   s = "".join(v)

This slows down encoding dramatically as now the loop through 
the string is done in Python code and no longer in C code.

One simple and extensible solution would be to be able to
pass an error handler function as the error argument for encode.
This error handler function is passed every unencodable character
and might either raise an exception itself, or return a unicode
string that will be encoded instead of the unencodable character.
(Note that this requires that the encoding be able to encode
whatever the handler returns.)


   us = unicode("a\xe4o\xf6u\xfc", "latin1")

   def xmlEscape(char):
      return u"&#" + unicode(ord(char)) + u";"
   print us.encode("us-ascii", xmlEscape)

will result in

   a&#228;o&#246;u&#252;

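To make the intended semantics concrete, here is a minimal pure-Python
sketch that emulates the proposed call. The helper name
encode_with_handler is hypothetical and not part of any existing API. It
is written for a Python in which str is the Unicode type and encode
returns bytes, and it encodes one character at a time, so it is only
meaningful for stateless encodings. It illustrates behaviour only; the
point of the proposal is to have this loop inside the C codec.

```python
def encode_with_handler(text, encoding, handler):
    """Emulate the proposed encode(encoding, handler) in pure Python.

    Hypothetical helper: encodes character by character and asks
    `handler` for a replacement string whenever a character fails.
    """
    chunks = []
    for char in text:
        try:
            chunks.append(char.encode(encoding, "strict"))
        except UnicodeError:
            # The handler may raise itself, or return a replacement string.
            replacement = handler(char)
            # The replacement must be encodable too; "strict" makes a
            # handler that violates this fail loudly.
            chunks.append(replacement.encode(encoding, "strict"))
    return b"".join(chunks)


def xml_escape(char):
    # Decimal XML character reference, e.g. "&#228;" for U+00E4.
    return "&#%d;" % ord(char)


result = encode_with_handler("a\xe4o\xf6u\xfc", "us-ascii", xml_escape)
print(result)  # b'a&#228;o&#246;u&#252;'
```

Encodable characters pass through unchanged, and only the three umlauts
reach the handler.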
With this scheme it would even be possible to reimplement the
old error handling with the new one:

   def strict(char):
      raise UnicodeError("can't encode %r" % char)

   def ignore(char):
      return u""

   def replace(char):
      return u"\uFFFD"
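As a rough check of that claim, the following sketch plugs the three
handlers into a pure-Python stand-in for the proposed behaviour. The
helper encode_with_handler is hypothetical, the code assumes a Python in
which str is the Unicode type and encode returns bytes, and replace
returns "?" instead of U+FFFD here, because the replacement has to be
encodable in the target encoding and ASCII cannot represent U+FFFD.

```python
def encode_with_handler(text, encoding, handler):
    # Hypothetical pure-Python stand-in for the proposed C-level behaviour.
    out = []
    for char in text:
        try:
            out.append(char.encode(encoding, "strict"))
        except UnicodeError:
            # Whatever the handler returns must itself be encodable.
            out.append(handler(char).encode(encoding, "strict"))
    return b"".join(out)


def strict(char):
    raise UnicodeError("can't encode %r" % char)


def ignore(char):
    return ""


def replace(char):
    # Assumption: "?" rather than U+FFFD, so the result fits into ASCII.
    return "?"


text = "a\xe4b"
ignored = encode_with_handler(text, "ascii", ignore)
replaced = encode_with_handler(text, "ascii", replace)
try:
    encode_with_handler(text, "ascii", strict)
    raised = False
except UnicodeError:
    raised = True

print(ignored, replaced, raised)
```

For this input the handler-based results agree with the built-in
"ignore" and "replace" error modes, and strict raises as before.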

Does this make sense?

   Walter Dörwald

Walter Dörwald · LivingLogic AG · Bayreuth, Germany ·