[Python-Dev] PEP 293, Codec Error Handling Callbacks
Walter Dörwald
walter@livinglogic.de
Tue, 13 Aug 2002 13:31:15 +0200
Martin v. Loewis wrote:
> Walter Dörwald <walter@livinglogic.de> writes:
>
>
>>Output is as follows:
>>1790000 chars, 2.330% unenc
>>ignore: 0.022 (factor=1.000)
>>xmlcharrefreplace: 0.044 (factor=1.962)
>>xml2: 0.267 (factor=12.003)
>>xml3: 0.723 (factor=32.506)
>>workaround: 5.151 (factor=231.702)
>>i.e. a 1.7MB string with 2.3% unencodable characters was
>>encoded.
>
>
> Those numbers are impressive. Can you please add
>
> def xml4(exc):
>     if isinstance(exc, UnicodeEncodeError):
>         if exc.end-exc.start == 1:
>             return u"&#"+str(ord(exc.object[exc.start]))+u";"
>         else:
>             r = []
>             for c in exc.object[exc.start:exc.end]:
>                 r.extend([u"&#", str(ord(c)), u";"])
>             return u"".join(r)
>     else:
>         raise TypeError("don't know how to handle %r" % exc)
>
> and report how that performs (assuming I made no error)?
You must return a tuple (replacement, new input position);
otherwise the code is correct. I tried it and two new
versions:
def xml5(exc):
    if isinstance(exc, UnicodeEncodeError):
        return (u"&#%d;" % ord(exc.object[exc.start]), exc.start+1)
    else:
        raise TypeError("don't know how to handle %r" % exc)

def xml6(exc):
    if isinstance(exc, UnicodeEncodeError):
        return (u"&#" + str(ord(exc.object[exc.start])) + u";",
                exc.start+1)
    else:
        raise TypeError("don't know how to handle %r" % exc)
Here are the results:
1790000 chars, 2.330% unenc
ignore: 0.022 (factor=1.000)
xmlcharrefreplace: 0.042 (factor=1.935)
xml2: 0.264 (factor=12.084)
xml3: 0.733 (factor=33.529)
xml4: 0.504 (factor=23.057)
xml5: 0.474 (factor=21.649)
xml6: 0.481 (factor=22.010)
workaround: 5.138 (factor=234.862)
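For readers who want to try the tuple-returning convention described above, here is a minimal sketch. It assumes a current Python 3 interpreter, where the callback API ended up as codecs.register_error; the handler name "demo.xmlcharref" and the sample string are made up for illustration and are not part of the benchmark.

```python
import codecs


def xml_charref(exc):
    """Replace the unencodable span with numeric character references."""
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        replacement = "".join("&#%d;" % ord(c) for c in bad)
        # The handler returns (replacement, position where encoding resumes).
        return (replacement, exc.end)
    raise TypeError("don't know how to handle %r" % exc)


# "demo.xmlcharref" is a hypothetical name chosen for this sketch.
codecs.register_error("demo.xmlcharref", xml_charref)

text = "a\u20acb\xe4\xf6c"
encoded = text.encode("ascii", "demo.xmlcharref")
print(encoded)
```

The result should match what the built-in "xmlcharrefreplace" handler produces for the same input.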
>>Using a callback instead of the inline implementation is a factor of
>>12 slower than ignore.
>
>
> For the purpose of comparing C and Python, this isn't relevant, is it?
> Only the C version of xmlcharrefreplace and a Python version should be
> compared.
I was just too lazy to code this. ;)
The Python callback (xml3) is a factor of 2.7 slower than the
C callback (xml2), or 1.9 for your version (xml4).
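The factor column is simply each timing divided by the timing for "ignore". A rough sketch of such a comparison follows, assuming Python 3 and the standard timeit module. The test string is synthetic (about 2% unencodable characters), the handler name "demo.py_charref" is hypothetical, and this is not the script that produced the numbers above, so absolute timings will differ.

```python
import codecs
import timeit


def py_charref(exc):
    """Python-level handler: one character reference per call."""
    if isinstance(exc, UnicodeEncodeError):
        return ("&#%d;" % ord(exc.object[exc.start]), exc.start + 1)
    raise TypeError("don't know how to handle %r" % exc)


# Hypothetical handler name for this sketch.
codecs.register_error("demo.py_charref", py_charref)

# Synthetic input: 1 unencodable character in every 50 (2%).
text = ("x" * 49 + "\xe4") * 2000


def measure(errors, repeat=3, number=5):
    """Best-of-`repeat` wall time for encoding `text` with the given handler."""
    return min(timeit.repeat(lambda: text.encode("ascii", errors),
                             number=number, repeat=repeat))


timings = {name: measure(name)
           for name in ("ignore", "xmlcharrefreplace", "demo.py_charref")}
baseline = timings["ignore"]
for name, seconds in timings.items():
    print("%s: %.3f (factor=%.3f)" % (name, seconds, seconds / baseline))
```

Comparing "xmlcharrefreplace" against "demo.py_charref" isolates the cost of a Python-level callback relative to the built-in handler.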
>>It can't really be fixed for codecs implemented in Python. For codecs
>>that use the C functions we could add the functionality that e.g.
>>PyUnicodeEncodeError_SetReason(exc) sets exc.reason and exc.args[3],
>>but AFAICT it can't be done easily for Python where attribute assignment
>>directly goes to the instance dict.
>
>
> You could add methods into the class set_reason etc, which error
> handler authors would have to use.
>
> Again, these methods could be added through Python code, so no C code
> would be necessary to implement them.
>
> You could even implement a setattr method in Python - although you'd
> have to search this from C while initializing the class.
To me this sounds much more complicated than the current C functions,
especially when using them from C, as most codecs probably will.
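To make the alternative under discussion concrete, here is a rough sketch of a setter method written in Python that keeps an attribute and args in sync. DemoEncodeError and set_reason are hypothetical stand-ins invented for this illustration. They are not the real UnicodeEncodeError or any API from the patch, and the position of the reason inside args is an assumption of the sketch.

```python
class DemoEncodeError(Exception):
    """Hypothetical stand-in for an encode error with synced attributes."""

    def __init__(self, encoding, obj, start, end, reason):
        Exception.__init__(self, encoding, obj, start, end, reason)
        self.encoding = encoding
        self.object = obj
        self.start = start
        self.end = end
        self.reason = reason

    def set_reason(self, reason):
        # Update the attribute and the matching slot in args together.
        # Assumption of this sketch: reason is the last element of args.
        self.reason = reason
        self.args = self.args[:-1] + (reason,)


exc = DemoEncodeError("ascii", "a\xe4", 1, 2, "ordinal not in range(128)")
exc.set_reason("custom reason")
print(exc.reason, exc.args)
```

Handler authors would have to call set_reason rather than assign exc.reason directly, which is the usage burden the reply above objects to.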
Bye,
Walter Dörwald