[I18n-sig] Error handling (was: Re: validity of lone surrogates)

Walter Dörwald walter@livinglogic.de
Mon, 02 Jul 2001 13:40:52 +0200


 > > How would this work together with the proposed encode error handling
 > > callback feature (see patch #432401)? Does this patch have any chance of
 > > getting into Python (when it's finished)?
 >
 > I don't know.  The patch looks awfully big, and the motivation seems
 > thin, so I don't have high hopes.  I doubt that I would use it myself,
 > and I fear that it would be pretty slow if called frequently.

Here are a few speed comparisons:
---
import time

s = u"a"*20000000
t1 = time.time()
s.encode("ascii")
t2 = time.time()
print t2-t1
---
The result with Python 2.1 is:
0.65726006031

With the patch the time is:
0.895708084106
(This is probably due to the memory reallocation tests, which could
be avoided for most encoders)

And a test script with an error handler:
---
import time

s = u"aä"*1000000
t1 = time.time()
s.encode("ascii", lambda enc,uni,pos: u"&#%d;" % ord(uni[pos]))
t2 = time.time()
print t2-t1
---
37.0272110701

There is a version of this error handler implemented in C, so
replacing
s.encode("ascii", lambda enc,uni,pos: u"&#%d;" % ord(uni[pos]))
with
s.encode("ascii", codecs.xmlcharrefreplace_unicodeencode_errors)
gives a result of
4.77566099167

The equivalent Python code:
---
import time

s = u"aä"*1000000
t1 = time.time()
v = []
for c in s:
    try:
        v.append(c.encode("ascii"))
    except UnicodeError:
        v.append("&#%d;" % ord(c))
"".join(v)
t2 = time.time()
print t2-t1
---
345.193374991

(Note that this is not really equivalent, because it doesn't work with
stateful encoders, e.g. UTF-16 would generate a BOM for every character.)
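
The BOM problem is easy to see on a short string. The following is only
a minimal sketch, written for a current Python 3 interpreter rather than
the 2.1 used for the timings, and the variable names are purely
illustrative. It compares encoding a string in one call with encoding it
character by character:

```python
import codecs

text = "ab"

# One call: the UTF-16 encoder writes a single BOM at the start.
whole = text.encode("utf-16")

# Character by character: every call starts a fresh encoder,
# so every character gets its own BOM.
piecewise = b"".join(c.encode("utf-16") for c in text)

bom = codecs.BOM_UTF16  # BOM in the platform's native byte order
print(whole.count(bom), len(whole))
print(piecewise.count(bom), len(piecewise))
```

The piecewise result carries one BOM per character, so it is longer than
the single-call result and is not the same byte string.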

 > An alternative way to get what you want would be to write your own
 > codec.

This would have to be more like a meta codec, because this feature 
should be available for every character encoding.

 > Also, some standard codecs might be subclassable in a way that
 > makes it easy to get the desired functionality through subclassing
 > rather than through changing lots of C level APIs.

The patch changes the API in two places:

1. "PyObject *error" is used instead of "const char *error", because 
error may be a callable object instead of a string. It would be 
possible to keep the error argument as "const char *error": define an 
error handling registry where error handling functions can be registered 
by name:
codec.registerError("xmlreplace",
    lambda enc,uni,pos: "&#%d;" % ord(uni[pos]))
and then the following call can be made:
	u"äöü".encode("ascii", "xmlreplace")
As soon as the first error is encountered, the encoder uses its builtin 
error handling method if it recognizes the name ("strict", "replace" or 
"ignore") or looks up the error handling function in the registry if it 
doesn't. In this way the speed for the backwards compatible features is 
the same as before and "const char *error" can be kept as the parameter 
to all encoding functions. For speed common error handling names could 
even be implemented in the encoder itself.
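
To make the lookup order concrete, here is a rough sketch of the registry
idea in pure Python (current Python 3 syntax), assuming a toy ASCII-only
encoder. The names register_error, _error_registry and encode_ascii are
made up for illustration and are not the patch's API; the real lookup
would live in the C encoders.

```python
# Hypothetical names throughout: this is an illustration, not the patch.
_error_registry = {}

def register_error(name, handler):
    """Register handler(enc, uni, pos) -> replacement text under name."""
    _error_registry[name] = handler

def encode_ascii(uni, errors="strict"):
    """Toy ASCII encoder; the registry is consulted only at the first error."""
    out = []
    handler = None
    for pos, ch in enumerate(uni):
        if ord(ch) < 128:
            out.append(ch)
            continue
        # Builtin names are handled inline, with no registry lookup.
        if errors == "strict":
            raise UnicodeError("ordinal not in range(128) at position %d" % pos)
        if errors == "ignore":
            continue
        if errors == "replace":
            out.append("?")
            continue
        # Any other name: look it up once, when the first error occurs.
        if handler is None:
            try:
                handler = _error_registry[errors]
            except KeyError:
                raise ValueError("unknown error handling: %r" % errors)
        out.append(handler("ascii", uni, pos))
    # Assumes the replacement text itself is plain ASCII.
    return "".join(out).encode("ascii")

register_error("xmlreplace", lambda enc, uni, pos: "&#%d;" % ord(uni[pos]))
result = encode_ascii("a\xe4\xf6\xfc", "xmlreplace")
print(result)
```

In this sketch, strings without errors never touch the registry, and a
name that is neither builtin nor registered raises an error instead of
being silently accepted.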

2. The arguments "Py_UNICODE *str, int size" to the encoder functions 
have been replaced with "PyObject *unicode". This was done because the 
original string is passed to the callback handler: that is just an 
INCREF when the string is already available as "PyObject *unicode", 
whereas a new string object has to be created from str/size (though only 
once, at the first error). So it's possible to change this back to 
the original.

With this it would be possible to implement the functionality without 
changing the API and without any loss of speed for already existing 
functionality. Old third party encoders will continue to work for the 
old error options and would simply raise an "unknown error handling" 
exception for the new ones.

Should I try this approach? Does it have a better chance of getting into
Python?

Bye,
	Walter Dörwald