[Python-Dev] PEP 293, Codec Error Handling Callbacks

Walter Dörwald walter@livinglogic.de
Mon, 12 Aug 2002 20:39:25 +0200


Martin v. Loewis wrote:

> Walter Dörwald <walter@livinglogic.de> writes:
>> > - charmap_encoding_error, which insists on implementing known error
>> >   handling algorithms inline,
>>This is done for performance reasons.
> Is that really worth it? Such errors are rare, and when they occur,
> they usually cause an exception as the result of the "strict" error
> handling.

Of course it's irrelevant how fast the exception is raised, but it could
be important for the handlers that really do a replacement.

> I'd strongly encourage you to avoid duplication of code, and use
> Python wherever possible.
>>The PyCodec_XMLCharRefReplaceErrors functionality is
>>independent of the rest, so moving this to Python
>>won't reduce complexity that much. And it will
>>slow down "xmlcharrefreplace" handling for those
>>codecs that don't implement it inline.
> Sure it will. But how much does that matter in the overall context of
> generating HTML/XML?

See the attached test script. It encodes 100 copies of the German
text on http://www.gutenberg2000.de/grimm/maerchen/tapfere.htm

Output is as follows:
1790000 chars, 2.330% unenc
ignore: 0.022 (factor=1.000)
xmlcharrefreplace: 0.044 (factor=1.962)
xml2: 0.267 (factor=12.003)
xml3: 0.723 (factor=32.506)
workaround: 5.151 (factor=231.702)
i.e. a 1.7MB string with 2.3% unencodable characters was encoded.

Using the inline xmlcharrefreplace instead of ignore is
half as fast. Using a callback (xml2) instead of the inline
implementation is a factor of 12 slower than ignore.
Using the Python implementation of the callback (xml3) is a
factor of 32 slower and using the pre-PEP workaround
is a factor of 231 slower.
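
To make the callback variant concrete, here is a minimal sketch. It is
only an illustration, written in current Python 3 syntax rather than the
Python 2 of the attached script, and the handler name "sketch-xmlcharref"
is made up for this example. It shows that a registered Python callback
produces the same bytes as the inline "xmlcharrefreplace", so the timings
above differ only in how the replacement is produced.

```python
import codecs

def xmlcharref_callback(exc):
    # Only encoding errors are handled; anything else raises TypeError.
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("don't know how to handle %r" % exc)
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#%d;" % ord(c) for c in bad)
    # Return the replacement text and the position where encoding resumes.
    return (replacement, exc.end)

# "sketch-xmlcharref" is a hypothetical name chosen for this example.
codecs.register_error("sketch-xmlcharref", xmlcharref_callback)

text = "h\xe4\xdflich"
via_callback = text.encode("ascii", "sketch-xmlcharref")
via_inline = text.encode("ascii", "xmlcharrefreplace")
```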

Replacing every unencodable character with u"\u4242" and
using "iso-8859-15" gives:
ignore: 0.351 (factor=1.000)
xmlcharrefreplace: 0.390 (factor=1.113)
xml2: 0.653 (factor=1.862)
xml3: 1.137 (factor=3.244)
workaround: 12.310 (factor=35.117)

 > [...]
>>The exception attributes can then be members of the C struct and the
>>accessor functions can be simple macros.
> Again, I sense premature optimization.

No, it's more like anticipating change.

>>1. For each error handler two Python function objects are created:
>>One in the registry and a different one in the codecs module. This
>>means that e.g.
>>codecs.lookup_error("replace") != codecs.replace_errors
> Why would this be a problem?

It's just unintuitive.

>>We can fix that by making the name of the Python function object
>>globally visible or by changing the codecs init function to do a
>>lookup and use the result or simply by removing codecs.replace_errors
> I recommend to fix this by implementing the registry in Python.

Even simpler would be to move the initialization of the module
variables from Modules/_codecsmodule.c to Lib/codecs.py. There is
no need for them to be available in _codecs. All that is required
for this change is to add

    strict_errors = lookup_error("strict")
    ignore_errors = lookup_error("ignore")
    replace_errors = lookup_error("replace")
    xmlcharrefreplace_errors = lookup_error("xmlcharrefreplace")
    backslashreplace_errors = lookup_error("backslashreplace")

to codecs.py.
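
The effect of that change can be checked with a short sketch (current
Python 3 syntax; the handler name "sketch-noop" is invented for
illustration). A lookup returns the very object that was registered, so
a module variable initialised through lookup_error() is the same object
as the registry entry.

```python
import codecs

def noop_errors(exc):
    # Hypothetical handler: drop the offending characters, like "ignore".
    return ("", exc.end)

codecs.register_error("sketch-noop", noop_errors)

# The registry hands back the same function object that was registered.
same_custom = codecs.lookup_error("sketch-noop") is noop_errors

# A variable initialised via lookup_error() is identical to the
# registry entry by construction.
replace_errors = codecs.lookup_error("replace")
same_builtin = replace_errors is codecs.lookup_error("replace")
```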

The registry should be available via two simple C APIs, just
like the encoding registry.

>>4. Assigning to an attribute of an exception object does not
>>change the appropriate entry in the args attribute. Is this
>>worth changing?
> No. Exception objects should be treated as immutable (even if they
> aren't).

The codecs in the PEP *do* modify attributes of the exception.

> If somebody complains, we can fix it; until then, it suffices
> if this is documented.

It can't really be fixed for codecs implemented in Python. For codecs
that use the C functions we could add the functionality that e.g.
PyUnicodeEncodeError_SetReason(exc) sets exc.reason and exc.args[3],
but AFAICT it can't be done easily for Python where attribute assignment
directly goes to the instance dict.

If those exception classes were new style classes it would be simple, 
because the attributes would be properties and args would probably
be generated lazily.
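
A rough sketch of the property idea, in current Python 3 syntax:
SketchEncodeError is a made-up class, not the PEP's UnicodeEncodeError,
and the position of reason inside args is an assumption of this example.
It takes the reverse route from lazy generation of args: the attribute
is derived from args, so assigning to it cannot leave args stale.

```python
class SketchEncodeError(ValueError):
    # Assumed layout for this sketch: (encoding, object, start, end, reason).
    _REASON = 4

    def __init__(self, encoding, obj, start, end, reason):
        super().__init__(encoding, obj, start, end, reason)

    @property
    def reason(self):
        # Read straight from args, so there is a single source of truth.
        return self.args[self._REASON]

    @reason.setter
    def reason(self, value):
        # Rebuild args with the new reason instead of storing a copy.
        args = list(self.args)
        args[self._REASON] = value
        self.args = tuple(args)

exc = SketchEncodeError("ascii", "\xe4", 0, 1, "ordinal not in range(128)")
exc.reason = "unencodable character"
```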

>>5. UTF-7 decoding does not yet take full advantage of the machinery:
>>When an unterminated shift sequence is encountered (e.g. "+xxx")
>>the faulty byte sequence has already been emitted.
> It would be ok if it works as good as it did in 2.2. UTF-7 is rarely
> used; if it is used, it is machine-generated, so there shouldn't be
> any errors.

It does:
 >>> "+xxx".decode("utf-7", "replace")

although the result should probably have been u'\ufffd'.

    Walter Dörwald


import codecs, time

def xml3(exc):
	# Pure-Python version of the xmlcharrefreplace callback.
	if isinstance(exc, UnicodeEncodeError):
		return (u"".join([ u"&#%d;" % ord(c) for c in exc.object[exc.start:exc.end]]), exc.end)
	else:
		raise TypeError("don't know how to handle %r" % exc)

count = 0

def check(exc):
	global count
	count += exc.end-exc.start
	return (u"", exc.end)

codecs.register_error("xmlcheck", check)
codecs.register_error("xml2", codecs.xmlcharrefreplace_errors)
codecs.register_error("xml3", xml3)

l = 100
s = unicode(open("tapferschneider.txt").read(), "latin-1")
s *= l

s.encode("ascii", "xmlcheck")

print "%d chars, %.03f%% unenc" % (len(s), 100.*(float(count)/len(s)))

handlers = ["ignore", "xmlcharrefreplace", "xml2", "xml3"]
times = [0]*(len(handlers)+1)
res = [0]*(len(handlers)+1)
for (i, h) in enumerate(handlers):
	t1 = time.time()
	res[i] = s.encode("ascii", h)
	t2 = time.time()
	times[i] = t2-t1
	print "%s: %.03f (factor=%.03f)" % (handlers[i], times[i], times[i]/times[0])

i = len(handlers)
t1 = time.time()
v = []
# Pre-PEP workaround: encode one character at a time.
for c in s:
	try:
		v.append(c.encode("ascii"))
	except UnicodeError:
		v.append("&#%d;" % ord(c))
res[i] = "".join(v)
t2 = time.time()
times[i] = t2-t1
print "workaround: %.03f (factor=%.03f)" % (times[i], times[i]/times[0])