[Python-Dev] Unicode charmap decoders slow

Tony Nelson tonynelson at georgeanelson.com
Sun Oct 16 18:33:25 CEST 2005


At 11:56 AM +0200 10/16/05, Martin v. Löwis wrote:
>Tony Nelson wrote:
>> BTW, Martin, if you care to, would you explain to me how a Trie would be
>> used for charmap encoding?  I know a couple of approaches, but I don't know
>> how to do it fast.  (I've never actually had the occasion to use a Trie.)
>
>I currently envision a three-level trie, with 5, 4, and 7 bits. You take
>the Unicode character (only characters below U+FFFF are supported), and
>use the uppermost 5 bits as an index into an array. There you find the
>base of a second array, to which you add the next 4 bits, which gives you
>an index into a third array, where you add the last 7 bits. This gives
>you the mapped character, or 0 if it is unmappable.
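
Just to check my understanding, a worked example (my own arithmetic, not
from your post):  for character = U+0151,

    character >> 11         == 0     /* index into level0  */
    (character >> 7) & 0xF  == 2     /* offset into level1 */
    character & 0x7F        == 0x51  /* offset into level2 */

so the lookup reads level0[0], then level1[16*level0[0] + 2], then
level2[128*(that value) + 0x51].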

Umm, 0 (NUL) is a valid output character in most of the 8-bit character
sets.  It could be handled by having a separate "exceptions" string
listing the Unicode code points that really do map to 0.  Usually
"exceptions" would be a string of length 1 (just the one character that
maps to NUL).  Suggested changes below.


>struct encoding_map{
>   unsigned char level0[32];
>   unsigned char *level1;
>   unsigned char *level2;

    Py_UNICODE *exceptions;

>};
>
>struct encoding_map *table;
>Py_UNICODE character;
>int level1 = table->level0[character>>11];
>if(level1==0xFF)raise unmapped;
>int level2 = table->level1[16*level1 + ((character>>7) & 0xF)];
>if(level2==0xFF)raise unmapped;
>int mapped = table->level2[128*level2 + (character & 0x7F)];

change:

>if(mapped==0)raise unmapped;

to:

if(mapped==0) {
    /* 0 doubles as the "unmapped" marker; scan the short,
       NUL-terminated exceptions string to see whether this
       character genuinely maps to NUL. */
    Py_UNICODE *ep;
    for(ep=table->exceptions; *ep; ep++)
        if(*ep==character)
            break;
    if(!*ep)raise unmapped;  /* not in exceptions: really unmapped */
}
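
Putting the two pieces together, the whole lookup might look like the
sketch below.  It is only a sketch -- I've replaced "raise unmapped" with
a -1 return, supplied a stand-in Py_UNICODE typedef, and left out the
code that builds the table:

typedef unsigned short Py_UNICODE;      /* stand-in; assumes UCS-2 */

struct encoding_map {
    unsigned char level0[32];
    unsigned char *level1;
    unsigned char *level2;
    Py_UNICODE *exceptions;     /* NUL-terminated, usually length 1 */
};

/* Return the mapped byte (0..255), or -1 if character is unmappable. */
static int
charmap_encode_char(const struct encoding_map *table, Py_UNICODE character)
{
    int level1, level2, mapped;
    const Py_UNICODE *ep;

    level1 = table->level0[character >> 11];
    if (level1 == 0xFF)
        return -1;                              /* unmapped */
    level2 = table->level1[16*level1 + ((character >> 7) & 0xF)];
    if (level2 == 0xFF)
        return -1;                              /* unmapped */
    mapped = table->level2[128*level2 + (character & 0x7F)];
    if (mapped == 0) {
        /* 0 doubles as the "unmapped" marker; consult the
           exceptions string for genuine NUL mappings. */
        for (ep = table->exceptions; *ep; ep++)
            if (*ep == character)
                return 0;                       /* real NUL mapping */
        return -1;                              /* unmapped */
    }
    return mapped;
}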


>Over a hashtable, this has the advantage of not having to deal with
>collisions. Instead, it guarantees a lookup in constant time.

OK, I see the benefit.  Your code does about the same amount of work as a
hash table lookup -- a similar number of instructions, indirections, and
branches -- but it normally uses less of the data cache and has a fixed
running time.  It may use one more branch, but its branches are easily
predicted.  Thank you for explaining it.


>It is also quite space-efficient: all tables use bytes as indices.
>As each level0 entry covers 2048 characters, most character maps
>will only use one or two level1 blocks, meaning 16 or 32 bytes
>for level1. The number of level2 blocks required depends on
>the number of 128-character rows which the encoding spans;
>for most encodings, three or four such blocks will be sufficient
>(with ASCII typically spanning one such block), causing the
>entire memory consumption for many encodings to be less than
>600 bytes.
 ...
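
Checking the arithmetic:  with the 32-byte level0, two level1 blocks, and
four level2 blocks, that is 32 + 2*16 + 4*128 = 576 bytes -- under 600,
as you say.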

As you are concerned about pathological cases for hashing (cases that
would make the hash chains long), it is worth noting that in such cases
this data structure could take 64K bytes.  Of course, no such case occurs
in the standard encodings, and 64K is affordable as long as it is rare.
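
(Figuring roughly, on my own assumptions:  with byte-wide indices the
ceiling is 32 + 32*16 + 255*128, about 33K; an encoding scattered over
all 512 possible 128-character rows would need 16-bit level1 entries,
and 512*128 = 64K for level2 alone.)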
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

