[Python-Dev] Unicode charmap decoders slow

Walter Dörwald walter at livinglogic.de
Fri Oct 14 18:26:37 CEST 2005


Martin v. Löwis wrote:

> Tony Nelson wrote:
> 
>> I have written my fastcharmap decoder and encoder.  It's not meant to be
>> better than the patch and other changes to come in a future version of
>> Python, but it does work now with the current codecs.
> 
> It's an interesting solution.

I like the fact that encoding doesn't need a special data structure.

>> To use, hook each codec to be sped up:
>>
>>     import fastcharmap
>>     help(fastcharmap)
>>     fastcharmap.hook('name_of_codec')
>>     u = unicode('some text', 'name_of_codec')
>>     s = u.encode('name_of_codec')
>>
>> No codecs were rewritten.  It took me a while to learn enough to do this
>> (Pyrex, more Python, some Python C API), and there were some surprises.
>> Hooking in is grosser than I would have liked.  I've only used it on
>> Python 2.3 on FC3.
> 
> Indeed, and I would claim that you did not completely achieve your "no 
> changes necessary" goal: you still have to install the hooks explicitly.
> I also think overriding codecs.charmap_{encode,decode} is really ugly.
> 
> Even if this could be simplified if you would modify the existing
> codecs, I still don't think supporting changes to the encoding dict
> is worthwhile. People will probably want to update the codecs in-place,
> but I don't think we need to make a guarantee that such an approach
> works independent of the Python version. People would be much better off
> writing their own codecs if they think the distributed ones are
> incorrect.

Exactly. If you need another codec, write your own instead of patching an 
existing one on the fly!
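
To illustrate what "write your own" could look like, here is a minimal sketch. It is not code from this thread: it assumes a modern Python 3 (where codecs.charmap_build() is available), and the codec name "my_latin1_euro" and its table (Latin-1 with byte 0xA4 remapped to the euro sign) are made up for the example.

```python
import codecs

# Hypothetical decoding table: Latin-1, except byte 0xA4 maps to U+20AC.
_decoding_table = "".join(
    "\u20ac" if i == 0xA4 else chr(i) for i in range(256)
)
# Build the encoding map from the decoding table instead of maintaining it by hand.
_encoding_table = codecs.charmap_build(_decoding_table)


class Codec(codecs.Codec):
    def encode(self, input, errors="strict"):
        return codecs.charmap_encode(input, errors, _encoding_table)

    def decode(self, input, errors="strict"):
        return codecs.charmap_decode(input, errors, _decoding_table)


def _search(name):
    # The codec registry calls this with a normalized, lowercased name.
    if name != "my_latin1_euro":
        return None
    codec = Codec()
    return codecs.CodecInfo(
        name="my_latin1_euro",
        encode=codec.encode,
        decode=codec.decode,
    )


codecs.register(_search)

encoded = "\u20ac".encode("my_latin1_euro")
decoded = b"\xa4".decode("my_latin1_euro")
```

The stock codecs stay untouched; the new mapping lives under its own name and does not depend on how the distributed codecs store their tables.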

Of course we can't accept Pyrex code in the Python core, so it would be 
great to rewrite the encoder as a patch to PyUnicode_EncodeCharmap(). 
This version must be able to cope with encoding tables that are random 
strings without crashing.

We've already taken care of decoding. What we still need is a new 
gencodec.py and regenerated codecs.

Bye,
    Walter Dörwald

