[Python-Dev] 2.1 alpha: what about the unicode name database?

Finn Bock bckfnn@worldonline.dk
Sun, 14 Jan 2001 22:20:51 GMT


[/F]

>here's the description:

Thanks.

>From: "Fredrik Lundh" <effbot@telia.com>
>Date: Sun, 16 Jul 2000 20:40:46 +0200
>
>/.../
>
>    The unicodenames database consists of two parts: a name
>    database which maps character codes to names, and a code
>    database, mapping names to codes.
>
>* The Name Database (getname)
>
>    First, the 10538 text strings are split into 42193 words,
>    and combined into a 4949-word lexicon (a 29k array).

I only added a word to the lexicon if it was used more than once and if
its length was greater than that of its lexicon index. I ended up with
1385 entries in the lexicon (a 7k array).

>    Each word is given a unique index number (common words get
>    lower numbers), and there's a "lexicon offset" table mapping
>    from numbers to words (10k).

My lexicon offset table is 3k and I also use 4k on a perfect hash of the
words.
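As a rough sketch, building such a frequency-ordered lexicon might look
like this (the sample names are made up; the real database has ~10k
entries):

```python
from collections import Counter

# A tiny hypothetical sample of Unicode character names.
names = [
    "LATIN SMALL LETTER A",
    "LATIN SMALL LETTER B",
    "LATIN CAPITAL LETTER A",
    "GREEK SMALL LETTER ALPHA",
]

# Split every name into words and count how often each word occurs.
words = [w for name in names for w in name.split()]
counts = Counter(words)

# Common words get lower index numbers; ties broken alphabetically.
lexicon = sorted(counts, key=lambda w: (-counts[w], w))
word_index = {w: i for i, w in enumerate(lexicon)}
```

With this ordering, the most frequent word ("LETTER" in the sample)
receives index 0, so it encodes in the fewest bits.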

>    To get back to the original text strings, I use a "phrase
>    book".  For each original string, the phrase book stores a a
>    list of word numbers.  Numbers 0-127 are stored in one byte,
>    higher numbers (less common words) use two bytes.  At this
>    time, about 65% of the words can be represented by a single
>    byte.  The result is a 56k array.

Because not all words are looked up in the lexicon, I used the values
0-38 for the letters and digits, 39-250 for one-byte lexicon indexes,
and 251-255 combined with the following byte to form a two-byte index.
This also results in a 57k array.

So far it is only minor variations.
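The one-byte/two-byte phrase-book encoding described above can be
sketched like this (a toy scheme using a flag bit, not the actual
CPython or Jython byte layout):

```python
def encode_phrase(word_numbers):
    """Encode a list of word indices: values below 128 take one byte,
    larger values take two bytes with the high bit of the first set."""
    out = bytearray()
    for n in word_numbers:
        if n < 128:
            out.append(n)
        else:
            out.append(0x80 | (n >> 8))  # high byte, flag bit set
            out.append(n & 0xFF)         # low byte
    return bytes(out)

def decode_phrase(data):
    """Invert encode_phrase: walk the bytes, widening when the flag
    bit is set."""
    nums = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 128:
            nums.append(b)
            i += 1
        else:
            nums.append(((b & 0x7F) << 8) | data[i + 1])
            i += 2
    return nums
```

Since the common words get the low index numbers, most phrase-book
entries end up as single bytes, which is where the compression comes
from.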

>    The final data structure is an offset table, which maps code
>    points to phrase book offsets.  Instead of using one big
>    table, I split each code point into a "page number" and a
>    "line number" on that page.
>
>      offset = line[ (page[code>>SHIFT]<<SHIFT) + (code&MASK) ]
>
>    Since the unicode space is sparsely populated, it's possible
>    to split the code so that lots of pages get no contents.  I
>    use a brute force search to find the optimal SHIFT value.
>
>    In the current database, the page table has 1024 entries
>    (SHIFT is 6), and there are 199 unique pages in the line
>    table.  The total size of the offset table is 26k.
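The two-level table can be sketched like this (a toy reimplementation
of the scheme, not the actual generator code; identical pages are
shared, which is what makes the sparse layout pay off):

```python
SHIFT = 6
MASK = (1 << SHIFT) - 1

def build_tables(offsets, max_code):
    """Build the page/line tables from a sparse {code: offset} dict.
    Identical pages (typically all-zero ones) are stored only once."""
    npages = (max_code >> SHIFT) + 1
    seen = {}       # page contents -> index of that page in `line`
    page = []       # maps page number to a page index
    line = []       # concatenated unique pages
    for p in range(npages):
        contents = tuple(offsets.get((p << SHIFT) + i, 0)
                         for i in range(1 << SHIFT))
        if contents not in seen:
            seen[contents] = len(line) >> SHIFT
            line.extend(contents)
        page.append(seen[contents])
    return page, line

def lookup(page, line, code):
    # The formula from the description above.
    return line[(page[code >> SHIFT] << SHIFT) + (code & MASK)]
```

Sparse data means most pages are all zeros and collapse into a single
shared page, so the line table stays small.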
>
>* The code database (getcode)
>
>    For the code table, I use a straight-forward hash table to store
>    name to code mappings.  It's basically the same implementation
>    as in Python's dictionary type, but a different hash algorithm.
>    The table lookup loop simply uses the name database to check
>    for hits.
>
>    In the current database, the hash table is 32k.
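A minimal sketch of that kind of hash table (open addressing with
linear probing; the real code uses the compressed name database to
verify hits rather than storing the names, and a different probe
sequence):

```python
def make_table(pairs, size=64):
    """Open-addressing hash table mapping names to code points.
    `size` must be a power of two and larger than len(pairs)."""
    table = [None] * size
    for name, code in pairs:
        i = hash(name) & (size - 1)
        while table[i] is not None:
            i = (i + 1) & (size - 1)   # linear probe on collision
        table[i] = (name, code)
    return table

def getcode(table, name):
    """Probe until the name matches or an empty slot is hit."""
    size = len(table)
    i = hash(name) & (size - 1)
    while table[i] is not None:
        if table[i][0] == name:        # check the candidate for a hit
            return table[i][1]
        i = (i + 1) & (size - 1)
    raise KeyError(name)
```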

I chose to split a unicode name into words when looking one up as well.
Each word is hashed to a lexicon index and a "phrase book string" is
created. The sorted phrase book is then searched with a binary search
among 858 entries that can be addressed directly, followed by a
sequential search among 12 entries. The phrase book search index is 8k,
and a table that maps phrase book indexes to codepoints is another 20k.
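The search side of that scheme can be sketched with a binary search
over a sorted table (plain name strings stand in for the encoded
phrase-book strings here, and the sample entries are invented):

```python
import bisect

# Hypothetical (name, code) pairs; the real table holds encoded
# phrase-book strings rather than raw names.
entries = sorted([
    ("GREEK SMALL LETTER ALPHA", 0x03B1),
    ("LATIN CAPITAL LETTER A", 0x0041),
    ("LATIN SMALL LETTER A", 0x0061),
])
keys = [name for name, code in entries]

def getcode(name):
    """Binary-search the sorted keys for an exact match."""
    i = bisect.bisect_left(keys, name)
    if i < len(keys) and keys[i] == name:
        return entries[i][1]
    raise KeyError(name)
```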

The searching I do makes jython slower than the direct calculation you
do. I'll take another look at this after jython 2.0 to see if I can
improve performance with your page/line number scheme and a total
hashing of all the unicode names.

regards,
finn