[I18n-sig] Proposal: Extended error handling for unicode.encode
Walter Doerwald
walter@livinglogic.de
Wed, 20 Dec 2000 15:06:25 +0100
Problem:
Most character encodings do not support the full range of
Unicode characters. For these cases many high-level protocols
provide a way of escaping a Unicode character (e.g. Python
itself supports the \x, \u and \U conventions, XML supports
character references via &#xxxx;, etc.). The problem with the
current implementation of unicode.encode is that, to determine
which characters are unencodable in a given encoding, every
single character has to be tried separately, because encode does
not provide any information about the location of the error(s). So
    us = u"xxx"
    s = us.encode("encoding", errors="strict")
has to be replaced by:
    us = u"xxx"
    v = []  # collect the encoded pieces in a list
    for c in us:
        try:
            v.append(c.encode("encoding", "strict"))
        except UnicodeError:
            # fall back to an XML character reference
            v.append("&#" + str(ord(c)) + ";")
    s = "".join(v)
This slows down encoding dramatically as now the loop through
the string is done in Python code and no longer in C code.
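
The cost difference can be checked with a rough measurement. The
sketch below is written for Python 3 (an assumption; the code above
targets the Python 2 of the time) and uses only timeit from the
standard library. The names encode_whole and encode_per_char are made
up for illustration, and no particular timings are claimed.

```python
import timeit

TEXT = "abc" * 10000  # pure-ASCII sample so that both variants succeed


def encode_whole(text):
    # One call: the loop over the characters runs inside the codec.
    return text.encode("ascii", "strict")


def encode_per_char(text):
    # One encode call per character: the loop runs in Python code.
    return b"".join(c.encode("ascii", "strict") for c in text)


if __name__ == "__main__":
    for func in (encode_whole, encode_per_char):
        seconds = timeit.timeit(lambda: func(TEXT), number=20)
        print(func.__name__, round(seconds, 4), "s")
```

Both functions return the same bytes; only the place where the loop
runs differs.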
Solution:
One simple and extensible solution would be to allow an error
handler function to be passed as the errors argument to encode.
This handler is called with every unencodable character
and may either raise an exception itself or return a unicode
string that will be encoded in place of the unencodable character.
(Note that this requires that the encoding *must* be able to encode
whatever the handler returns.)
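
The intended semantics can be emulated in pure Python. This is only a
sketch, written for Python 3: encode_with_handler and python_escape
are hypothetical names, not an existing API, and the helper uses
exactly the slow per-character loop that the proposal wants to move
into the codec. It also assumes a stateless encoding such as ASCII or
Latin-1, since each character is encoded on its own.

```python
def encode_with_handler(text, encoding, handler):
    """Hypothetical emulation of the proposed encode(encoding, handler)."""
    pieces = []
    for char in text:
        try:
            pieces.append(char.encode(encoding, "strict"))
        except UnicodeEncodeError:
            # Ask the handler for a replacement string; it may also raise.
            replacement = handler(char)
            # The replacement itself must be encodable by the encoding.
            pieces.append(replacement.encode(encoding, "strict"))
    return b"".join(pieces)


def python_escape(char):
    # Python-style escape: \uXXXX, or \UXXXXXXXX above the BMP.
    code = ord(char)
    if code > 0xFFFF:
        return "\\U%08x" % code
    return "\\u%04x" % code


result = encode_with_handler("a\xe4", "ascii", python_escape)
print(result)
```

If the handler returns something the encoding cannot represent, the
second encode call raises, which mirrors the constraint stated in the
note above.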
Example:
    us = unicode("aäoöuü", "latin1")
    def xmlEscape(char):
        return u"&#" + unicode(ord(char), "ascii") + u";"
    print us.encode("us-ascii", xmlEscape)
will result in
    a&#228;o&#246;u&#252;
With this scheme it would even be possible to reimplement the
old error handling with the new one:
    def strict(char):
        raise UnicodeError("can't encode %r" % char)
    def ignore(char):
        return u""
    def replace(char):
        return u"\uFFFD"
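
As a rough check that such handlers behave like the existing error
strings, the sketch below (Python 3) runs them through a hypothetical
stand-in for the proposed call. encode_with_handler is an assumption
made for illustration, and replace returns "?" here rather than
U+FFFD so that the replacement is itself encodable in ASCII.

```python
def encode_with_handler(text, encoding, handler):
    # Hypothetical stand-in for the proposed encode(encoding, handler).
    out = []
    for char in text:
        try:
            out.append(char.encode(encoding))
        except UnicodeEncodeError:
            out.append(handler(char).encode(encoding))
    return b"".join(out)


def strict(char):
    raise UnicodeError("can't encode %r" % char)


def ignore(char):
    return ""


def replace(char):
    # "?" instead of U+FFFD so that the replacement is encodable in ASCII.
    return "?"


text = "a\xe4o\xf6u\xfc"
for handler in (ignore, replace):
    # Compare with the built-in error string of the same name.
    builtin = text.encode("ascii", handler.__name__)
    assert encode_with_handler(text, "ascii", handler) == builtin

try:
    encode_with_handler(text, "ascii", strict)
except UnicodeError as exc:
    print("strict raised:", exc)
```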
Does this make sense?
Bye,
Walter Dörwald
--
Walter Dörwald · LivingLogic AG · Bayreuth, Germany ·
www.livinglogic.de