[Python-Dev] Unicode charmap decoders slow

Tony Nelson tonynelson at georgeanelson.com
Wed Oct 5 20:03:03 CEST 2005


At 8:36 AM +0200 10/5/05, Martin v. Löwis wrote:
>Tony Nelson wrote:
 ...
>> Encoding can be made fast using a simple hash table with external chaining.
>> There are max 256 codepoints to encode, and they will normally be well
>> distributed in their lower 8 bits.  Hash on the low 8 bits (just mask), and
>> chain to an area with 256 entries.  Modest storage, normally short chains,
>> therefore fast encoding.
>
>This is what is currently done: a hash map with 256 keys. You are
>complaining about the performance of that algorithm. The issue of
>external chaining is likely irrelevant: there likely are no collisions,
>even though Python uses open addressing.

I think I'm complaining about the implementation, though on decode, not encode.

In any case, there are likely to be collisions in my scheme.  Over the next
few days I will try to implement it myself, but first I will need to learn
Pyrex, some of the Python C API, and more about Python internals.
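For concreteness, here is a minimal Python sketch of the encoding scheme
described above: a fixed 256-bucket table keyed on the low 8 bits of the
codepoint (a mask, no arithmetic), with external chaining for collisions.
The class and method names are illustrative only, not from unicodeobject.c
or any real codec:

```python
class CharmapEncodeTable:
    """Map at most 256 Unicode codepoints to single bytes.

    Buckets are indexed by the low 8 bits of the codepoint; each bucket
    is a short chain of (codepoint, byte) pairs.
    """

    def __init__(self, decoding_map):
        # decoding_map: byte value -> codepoint, as in a charmap codec;
        # invert it to build the encoding table.
        self.buckets = [[] for _ in range(256)]
        for byte, codepoint in decoding_map.items():
            # Hash is just a mask on the low 8 bits.
            self.buckets[codepoint & 0xFF].append((codepoint, byte))

    def encode_char(self, ch):
        # Walk the (normally very short) chain for this bucket.
        for codepoint, byte in self.buckets[ord(ch) & 0xFF]:
            if codepoint == ord(ch):
                return byte
        raise UnicodeEncodeError(
            'charmap', ch, 0, 1, 'character maps to <undefined>')
```

Since only 256 codepoints exist and their low bytes are normally well
distributed, most buckets hold zero or one entry, so lookup is close to a
single array index plus one comparison.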


>>>...I suggest instead just /caching/ the translation in C arrays stored
>>>with the codec object.  The cache would be invalidated on any write to the
>>>codec's mapping dictionary, and rebuilt the next time anything was
>>>translated.  This would maintain the present semantics, work with current
>>>codecs, and still provide the desired speed improvement.
>
>That is not implementable. You cannot catch writes to the dictionary.

I should have been clearer.  I am thinking of putting a proxy object in
the codec's 'encoding_map' and 'decoding_map' slots, one that forwards all
the dictionary operations.  The proxy would delete the cache on any call
that changes the dictionary's contents.  There are existing proxy classes
and dictproxy (I don't know how it's implemented yet), so it seems doable,
at least as far as I've gotten.
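A rough Python sketch of that proxy idea, assuming the cached C-style array
lives on some owner object as a plain attribute; the class and attribute
names here are hypothetical, not the actual CPython machinery:

```python
class CachingMapProxy:
    """Forward dict operations; drop the owner's cache on any mutation."""

    # Method names that can change the dictionary's contents.
    _mutators = {'__setitem__', '__delitem__', 'update', 'pop',
                 'popitem', 'clear', 'setdefault'}

    def __init__(self, mapping, owner):
        self._mapping = mapping
        self._owner = owner        # object holding the cached lookup array

    def __getattr__(self, name):
        attr = getattr(self._mapping, name)
        if name in self._mutators:
            def wrapper(*args, **kwargs):
                self._owner.cache = None   # invalidate; rebuilt lazily
                return attr(*args, **kwargs)
            return wrapper
        return attr

    # Operator syntax bypasses __getattr__, so the item protocol is
    # spelled out explicitly.
    def __getitem__(self, key):
        return self._mapping[key]

    def __setitem__(self, key, value):
        self._owner.cache = None
        self._mapping[key] = value

    def __len__(self):
        return len(self._mapping)
```

Reads go straight through; any mutating call clears the cache, and the
next translation would rebuild it from the mapping.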


>> Note that this caching is done by new code added to the existing C
>> functions (which, if I have it right, are in unicodeobject.c).  No
>> architectural changes are made; no existing codecs need to be changed;
>> everything will just work
>
>Please try to implement it. You will find that you cannot. I don't
>see how regenerating/editing the codecs could be avoided.

Will do!
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>
