[Python-Dev] Unicode charmap decoders slow
"Martin v. Löwis"
martin at v.loewis.de
Wed Oct 5 08:36:58 CEST 2005
Tony Nelson wrote:
>>For decoding it should be sufficient to use a unicode string of
>>length 256. u"\ufffd" could be used for "maps to undefined". Or the
>>string might be shorter and byte values greater than the length of
>>the string are treated as "maps to undefined" too.
>
> With Unicode using more than 64K codepoints now, it might be more forward
> looking to use a table of 256 32-bit values, with no need for tricky
> values.
You might be missing the point. \ufffd is REPLACEMENT CHARACTER,
which would indicate that the byte with that index is really unused
in that encoding.
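A minimal sketch of the table idea being discussed (the ASCII-only mapping here is hypothetical, chosen just for illustration): a 256-character string where u"\ufffd" (REPLACEMENT CHARACTER) marks byte values that are undefined in the encoding, so decoding is a plain index into the table.

```python
# Hypothetical decoding table: bytes 0-127 map to themselves,
# everything else is "maps to undefined" (REPLACEMENT CHARACTER).
table = ''.join(chr(i) if i < 128 else '\ufffd' for i in range(256))

def charmap_decode(data, table):
    # Decode by direct indexing -- one table lookup per input byte.
    chars = []
    for byte in data:
        ch = table[byte]
        if ch == '\ufffd':
            raise UnicodeDecodeError("charmap", data, 0, 1,
                                     "byte maps to <undefined>")
        chars.append(ch)
    return ''.join(chars)
```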
> Encoding can be made fast using a simple hash table with external chaining.
> There are max 256 codepoints to encode, and they will normally be well
> distributed in their lower 8 bits. Hash on the low 8 bits (just mask), and
> chain to an area with 256 entries. Modest storage, normally short chains,
> therefore fast encoding.
This is what is currently done: a hash map with 256 keys. You are
complaining about the performance of that algorithm. External
chaining is probably irrelevant: with only 256 keys there are likely
no collisions, even though Python uses open addressing.
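In Python terms, the current encoding path amounts to one dict lookup per character. A rough sketch, with a hypothetical two-entry mapping (CPython's real charmap codecs build the full mapping in the generated codec module):

```python
# Hypothetical encoding map: Unicode ordinal -> byte value.
encoding_map = {0x0041: 0x41, 0x20AC: 0x80}

def charmap_encode(text, mapping):
    # One hash lookup per character, as in the current implementation.
    out = bytearray()
    for ch in text:
        b = mapping.get(ord(ch))
        if b is None:
            raise UnicodeEncodeError("charmap", text, 0, 1,
                                     "character maps to <undefined>")
        out.append(b)
    return bytes(out)
```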
>>...I suggest instead just /caching/ the translation in C arrays stored
>>with the codec object. The cache would be invalidated on any write to the
>>codec's mapping dictionary, and rebuilt the next time anything was
>>translated. This would maintain the present semantics, work with current
>>codecs, and still provide the desired speed improvement.
That is not implementable. You cannot catch writes to the dictionary.
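The problem can be illustrated in a few lines (hypothetical names; the point is that a plain dict offers no mutation hook, so a snapshot cache goes stale silently):

```python
# A codec's decoding_map is a plain dict; nothing fires when it is
# mutated, so a cached copy cannot know it has been invalidated.
mapping = {0x41: 0x0041}
cache = dict(mapping)       # snapshot used as the "fast" table
mapping[0x42] = 0x0042      # silent write -- no notification possible
assert 0x42 not in cache    # the cache is now stale, undetected
```

(Subclassing dict to override __setitem__ does not help either, since C code reaches the dict through PyDict_SetItem, bypassing any Python-level override.)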
> Note that this caching is done by new code added to the existing C
> functions (which, if I have it right, are in unicodeobject.c). No
> architectural changes are made; no existing codecs need to be changed;
> everything will just work.
Please try to implement it. You will find that you cannot. I don't
see how regenerating/editing the codecs could be avoided.
Regards,
Martin