[Python-Dev] Unicode charmap decoders slow
"Martin v. Löwis"
martin at v.loewis.de
Wed Oct 5 08:36:58 CEST 2005
Tony Nelson wrote:
>>For decoding it should be sufficient to use a unicode string of
>>length 256. u"\ufffd" could be used for "maps to undefined". Or the
>>string might be shorter and byte values greater than the length of
>>the string are treated as "maps to undefined" too.
>
> With Unicode using more than 64K codepoints now, it might be more forward
> looking to use a table of 256 32-bit values, with no need for tricky
> values.
You might be missing the point. \ufffd is REPLACEMENT CHARACTER,
which would indicate that the byte with that index is really unused
in that encoding.
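A minimal sketch of the table idea being discussed (the ASCII-only mapping here is hypothetical, chosen just for illustration): a 256-character string where u"\ufffd" (REPLACEMENT CHARACTER) marks byte values that are undefined in the encoding, so decoding is a plain index into the table.

```python
# Hypothetical decoding table: bytes 0-127 map to themselves,
# everything else is "maps to undefined" (REPLACEMENT CHARACTER).
table = ''.join(chr(i) if i < 128 else '\ufffd' for i in range(256))

def charmap_decode(data, table):
    # Decode by direct indexing -- one table lookup per input byte.
    chars = []
    for byte in data:
        ch = table[byte]
        if ch == '\ufffd':
            raise UnicodeDecodeError("charmap", data, 0, 1,
                                     "byte maps to <undefined>")
        chars.append(ch)
    return ''.join(chars)
```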
> Encoding can be made fast using a simple hash table with external chaining.
> There are max 256 codepoints to encode, and they will normally be well
> distributed in their lower 8 bits. Hash on the low 8 bits (just mask), and
> chain to an area with 256 entries. Modest storage, normally short chains,
> therefore fast encoding.
This is what is currently done: a hash map with 256 keys. You are
complaining about the performance of that algorithm. External
chaining is probably irrelevant: with only 256 keys there are likely
no collisions, even though Python uses open addressing.
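In Python terms, the current encoding path amounts to one dict lookup per character. A rough sketch, with a hypothetical two-entry mapping (CPython's real charmap codecs build the full mapping in the generated codec module):

```python
# Hypothetical encoding map: Unicode ordinal -> byte value.
encoding_map = {0x0041: 0x41, 0x20AC: 0x80}

def charmap_encode(text, mapping):
    # One hash lookup per character, as in the current implementation.
    out = bytearray()
    for ch in text:
        b = mapping.get(ord(ch))
        if b is None:
            raise UnicodeEncodeError("charmap", text, 0, 1,
                                     "character maps to <undefined>")
        out.append(b)
    return bytes(out)
```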
>>...I suggest instead just /caching/ the translation in C arrays stored
>>with the codec object. The cache would be invalidated on any write to the
>>codec's mapping dictionary, and rebuilt the next time anything was
>>translated. This would maintain the present semantics, work with current
>>codecs, and still provide the desired speed improvement.
That is not implementable. You cannot catch writes to the dictionary.
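The problem can be illustrated in a few lines (hypothetical names; the point is that a plain dict offers no mutation hook, so a snapshot cache goes stale silently):

```python
# A codec's decoding_map is a plain dict; nothing fires when it is
# mutated, so a cached copy cannot know it has been invalidated.
mapping = {0x41: 0x0041}
cache = dict(mapping)       # snapshot used as the "fast" table
mapping[0x42] = 0x0042      # silent write -- no notification possible
assert 0x42 not in cache    # the cache is now stale, undetected
```

(Subclassing dict to override __setitem__ does not help either, since C code reaches the dict through PyDict_SetItem, bypassing any Python-level override.)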
> Note that this caching is done by new code added to the existing C
> functions (which, if I have it right, are in unicodeobject.c). No
> architectural changes are made; no existing codecs need to be changed;
> everything will just work.
Please try to implement it. You will find that you cannot. I don't
see how regenerating/editing the codecs could be avoided.
Regards,
Martin