[Python-Dev] Unicode charmap decoders slow

"Martin v. Löwis" martin at v.loewis.de
Sun Oct 16 11:56:01 CEST 2005


Tony Nelson wrote:
> BTW, Martin, if you care to, would you explain to me how a Trie would be
> used for charmap encoding?  I know a couple of approaches, but I don't know
> how to do it fast.  (I've never actually had the occasion to use a Trie.)

I currently envision a three-level trie, with 5, 4, and 7 bits. You take
the Unicode character (only characters up to U+FFFF are supported), and
use the uppermost 5 bits as an index into an array. There you find the
base of a second array, to which you add the next 4 bits, which gives
you an index into a third array, to which you add the last 7 bits. This
gives you the mapped character, or 0 if it is unmappable.

#include <Python.h>            /* for Py_UNICODE */

struct encoding_map {
   unsigned char level0[32];   /* indexed by the top 5 bits */
   unsigned char *level1;      /* blocks of 16 entries, next 4 bits */
   unsigned char *level2;      /* blocks of 128 entries, low 7 bits */
};

/* Return the mapped byte for `character`, or -1 if it is unmappable.
   (A real implementation must special-case U+0000, which maps to
   byte 0 but would otherwise be reported as unmappable here.) */
static int
encoding_map_lookup(const struct encoding_map *table, Py_UNICODE character)
{
   int level1 = table->level0[character >> 11];
   if (level1 == 0xFF)
      return -1;               /* no level1 block covers this range */
   int level2 = table->level1[16*level1 + ((character >> 7) & 0xF)];
   if (level2 == 0xFF)
      return -1;               /* no level2 block covers this range */
   int mapped = table->level2[128*level2 + (character & 0x7F)];
   if (mapped == 0)
      return -1;               /* 0 marks an unmappable character */
   return mapped;
}
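
For illustration, an encode loop built on that lookup might look like
the sketch below (charmap_encode and its error convention are my own
hypothetical names, not the actual codec API):

/* Encode `len` characters into `out`; return 0, or -1 on the first
   unmappable character (the caller would raise UnicodeEncodeError). */
static int
charmap_encode(const struct encoding_map *table,
               const Py_UNICODE *in, size_t len, unsigned char *out)
{
   size_t i;
   for (i = 0; i < len; i++) {
      int b = encoding_map_lookup(table, in[i]);
      if (b < 0)
         return -1;            /* caller reports the unmappable character */
      out[i] = (unsigned char)b;
   }
   return 0;
}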

Compared to a hashtable, this has the advantage of not having to deal
with collisions. Instead, it guarantees a lookup in constant time:
exactly three array accesses.

It is also quite space-efficient: all tables use bytes as indices.
As each level0 entry covers 2048 characters, most character maps
will use only one or two level1 blocks, meaning 16 or 32 bytes
for level1. The number of level2 blocks required depends on
the number of 128-character rows which the encoding spans;
for most encodings, three or four such blocks will be sufficient
(with ASCII typically spanning one such block), so the entire
memory consumption for many encodings stays below 600 bytes
(e.g. 32 + 2*16 + 4*128 = 576 bytes).

It would be possible to remove one level of indirection (gaining
some more speed) at the expense of additional memory: for
example, an 8-8 trie would use 256 bytes for level 0, and
then 256 bytes for each Unicode row in which the encoding has
characters, likely resulting in about 1 KiB for a typical encoding.
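
A minimal sketch of such an 8-8 layout, following the same
conventions as the code above (the names are hypothetical):

struct encoding_map_88 {
   unsigned char level0[256];  /* block index per 256-char row, 0xFF if unused */
   unsigned char *level1;      /* blocks of 256 mapped bytes */
};

static int
encoding_map_88_lookup(const struct encoding_map_88 *table,
                       Py_UNICODE character)
{
   int block = table->level0[character >> 8];
   if (block == 0xFF)
      return -1;               /* no characters mapped in this row */
   int mapped = table->level1[256*block + (character & 0xFF)];
   return mapped == 0 ? -1 : mapped;
}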

Hye-Shik Chang's version of the fast codec uses such an 8-8
trie, but conserves space by observing that most 256-character
rows are only sparsely used by encodings (e.g. if you have
only ASCII, you use only 128 characters from row 0). He
therefore strips the unused cells from each row, by also
recording the first and last mapped character per row. This
recovers some space, at the expense of some slow-down again.
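
Again only as a hedged sketch (I have not looked at his exact
layout; the names are hypothetical), the per-row bounds check
might look like this:

struct sparse_row {
   unsigned char first;        /* lowest mapped low byte in this row */
   unsigned char last;         /* highest mapped low byte in this row */
   unsigned char *cells;       /* last - first + 1 entries */
};

static int
sparse_row_lookup(const struct sparse_row *row, unsigned char low)
{
   if (low < row->first || low > row->last)
      return -1;               /* outside the stored span */
   int mapped = row->cells[low - row->first];
   return mapped == 0 ? -1 : mapped;
}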

Regards,
Martin

