[Python-Dev] PEP 293, Codec Error Handling Callbacks

Walter Dörwald walter@livinglogic.de
Tue, 13 Aug 2002 13:31:15 +0200

Martin v. Loewis wrote:

> Walter Dörwald <walter@livinglogic.de> writes:
>>Output is as follows:
>>1790000 chars, 2.330% unenc
>>ignore: 0.022 (factor=1.000)
>>xmlcharrefreplace: 0.044 (factor=1.962)
>>xml2: 0.267 (factor=12.003)
>>xml3: 0.723 (factor=32.506)
>>workaround: 5.151 (factor=231.702)
>>i.e. a 1.7MB string with 2.3% unencodable characters was
> Those numbers are impressive. Can you please add
> def xml4(exc):
>   if isinstance(exc, UnicodeEncodeError):
>     if exc.end-exc.start == 1:
>       return u"&#"+str(ord(exc.object[exc.start]))+u";"
>     else:
>       r = []
>       for c in exc.object[exc.start:exc.end]:
>         r.extend([u"&#", str(ord(c)), u";"])
>       return u"".join(r)
>   else:
>     raise TypeError("don't know how to handle %r" % exc)
> and report how that performs (assuming I made no error)?

You must return a tuple (replacement, new input position);
otherwise the code is correct. I tried it and two new versions:

def xml5(exc):
     if isinstance(exc, UnicodeEncodeError):
         return (u"&#%d;" % ord(exc.object[exc.start]), exc.start+1)
     else:
         raise TypeError("don't know how to handle %r" % exc)

def xml6(exc):
     if isinstance(exc, UnicodeEncodeError):
         return (u"&#" + str(ord(exc.object[exc.start])) + u";", exc.start+1)
     else:
         raise TypeError("don't know how to handle %r" % exc)

Here are the results:

1790000 chars, 2.330% unenc
ignore: 0.022 (factor=1.000)
xmlcharrefreplace: 0.042 (factor=1.935)
xml2: 0.264 (factor=12.084)
xml3: 0.733 (factor=33.529)
xml4: 0.504 (factor=23.057)
xml5: 0.474 (factor=21.649)
xml6: 0.481 (factor=22.010)
workaround: 5.138 (factor=234.862)
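
The handler contract mentioned above (return a tuple of replacement text and the position at which encoding should resume) can be sketched in present-day Python 3 roughly as follows. This is only an illustration, not the benchmark code from this thread: the handler name xmlcharref_sketch and the sample string are made up, and it assumes the codecs.register_error() API as it exists in current Python.

```python
import codecs


def xmlcharref_sketch(exc):
    # Hypothetical handler name, used only for this illustration.
    if isinstance(exc, UnicodeEncodeError):
        # Replace the whole unencodable run exc.object[exc.start:exc.end].
        bad = exc.object[exc.start:exc.end]
        replacement = "".join("&#%d;" % ord(c) for c in bad)
        # Return (replacement, new input position), not just the replacement.
        return (replacement, exc.end)
    raise TypeError("don't know how to handle %r" % exc)


# Make the handler available under a (made-up) error handling name.
codecs.register_error("xmlcharref_sketch", xmlcharref_sketch)

sample = "a\u20acb\xe4"
encoded = sample.encode("ascii", "xmlcharref_sketch")
builtin = sample.encode("ascii", "xmlcharrefreplace")
```

On this sample the sketch and the built-in xmlcharrefreplace handler should produce the same bytes; a timing comparison would additionally need a much larger input, like the 1.7MB string used above.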

>>Using a callback instead of the inline implementation is a factor of
>>12 slower than ignore.
> For the purpose of comparing C and Python, this isn't relevant, is it?
> Only the C version of xmlcharrefreplace and a Python version should be
> compared.

I was just too lazy to code this. ;)

Python is a factor of 2.7 slower than the C callback
(or 1.9 for your version).
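
As a rough check, the two factors can be recomputed from the second results table. This assumes that xml2 is the C callback and xml3 the Python callback, which the table itself does not state; the numbers are copied from the table and vary slightly between runs.

```python
# Timings in seconds, copied from the second results table in this message.
timings = {"xml2": 0.264, "xml3": 0.733, "xml4": 0.504}

# Assumption: xml2 is the C callback and xml3 the Python callback.
python_vs_c = timings["xml3"] / timings["xml2"]

# xml4 is the handler proposed in the quoted message.
xml4_vs_c = timings["xml4"] / timings["xml2"]

print("xml3 / xml2 = %.2f" % python_vs_c)
print("xml4 / xml2 = %.2f" % xml4_vs_c)
```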

>>It can't really be fixed for codecs implemented in Python. For codecs
>>that use the C functions we could add the functionality that e.g.
>>PyUnicodeEncodeError_SetReason(exc) sets exc.reason and exc.args[3],
>>but AFAICT it can't be done easily for Python where attribute assignment
>>directly goes to the instance dict.
> You could add methods into the class set_reason etc, which error
> handler authors would have to use.
> Again, these methods could be added through Python code, so no C code
> would be necessary to implement them.
> You could even implement a setattr method in Python - although you'd
> have to search this from C while initializing the class.

To me this sounds much more complicated than the current C functions,
especially for use from C, which is what most codecs will probably do.
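
For illustration, the set_reason idea from the quoted text might look roughly like this in Python. The class SketchEncodeError, its argument layout and the method name are hypothetical stand-ins chosen to mirror the exc.args[3] mentioned above; this is not the real UnicodeEncodeError and not code from the patch.

```python
class SketchEncodeError(Exception):
    """Hypothetical stand-in for illustration, not the real UnicodeEncodeError."""

    _REASON_INDEX = 3  # assumed position of the reason inside args

    def __init__(self, obj, start, end, reason):
        super().__init__(obj, start, end, reason)
        self.object = obj
        self.start = start
        self.end = end
        self.reason = reason

    def set_reason(self, reason):
        # Update the attribute and the matching slot in args together.
        i = self._REASON_INDEX
        self.reason = reason
        self.args = self.args[:i] + (reason,) + self.args[i + 1:]


exc = SketchEncodeError("a\xe4b", 1, 2, "ordinal not in range(128)")

# Plain attribute assignment goes to the instance dict; args is untouched.
exc.reason = "plain assignment"
stale = exc.args[3]

# The method keeps both views in sync.
exc.set_reason("updated via method")
```

The sketch shows the cost of this approach: every handler author has to remember to call the method instead of assigning the attribute.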

    Walter Dörwald