[Python-Dev] Unicode charmap decoders slow

Sun Oct 16 18:53:24 CEST 2005

Tony Nelson wrote:
> Umm, 0 (NUL) is a valid output character in most of the 8-bit character
> sets.  It could be handled by having a separate "exceptions" string of the
> unicode code points that actually map to the exception char.

Yes. But only U+0000 should normally map to 0. It could be special-cased
altogether.

> As you are concerned about pathological cases for hashing (that would make
> the hash chains long), it is worth noting that in such cases this data
> structure could take 64K bytes.  Of course, no such case occurs in standard
> encodings, and 64K is affordable as long is it is rare.

I'm not concerned with long hash chains, I dislike having collisions in 
the first place (even if they are only for two code points). Having to
deal with collisions makes the code more complicated, and less predictable.

It's primarily a matter of taste: avoid hashtables if you can :-)

Regards,
Martin