[Python-Dev] 2.1 alpha: what about the unicode name database?
Finn Bock
bckfnn@worldonline.dk
Sun, 14 Jan 2001 22:20:51 GMT
[/F]
>here's the description:
Thanks.
>From: "Fredrik Lundh" <effbot@telia.com>
>Date: Sun, 16 Jul 2000 20:40:46 +0200
>
>/.../
>
> The unicodenames database consists of two parts: a name
> database which maps character codes to names, and a code
> database, mapping names to codes.
>
>* The Name Database (getname)
>
> First, the 10538 text strings are split into 42193 words,
> and combined into a 4949-word lexicon (a 29k array).
I only added a word to the lexicon if it was used more than once and if
the length was larger then the lexicon index. I ended up with 1385
entries in the lexicon. (a 7k array)
> Each word is given a unique index number (common words get
> lower numbers), and there's a "lexicon offset" table mapping
> from numbers to words (10k).
My lexicon offset table is 3k and I also use 4k on a perfect hash of the
words.
> To get back to the original text strings, I use a "phrase
> book". For each original string, the phrase book stores a a
> list of word numbers. Numbers 0-127 are stored in one byte,
> higher numbers (less common words) use two bytes. At this
> time, about 65% of the words can be represented by a single
> byte. The result is a 56k array.
Because not all words are looked up in the lexicon, I used the values
0-38 for the letters and number, 39-250 are used for one byte lexicon
index, and 251-255 are combined with following byte to form a two byte.
This also result in a 57k array
So far it is only minor variations.
> The final data structure is an offset table, which maps code
> points to phrase book offsets. Instead of using one big
> table, I split each code point into a "page number" and a
> "line number" on that page.
>
> offset = line[ (page[code>>SHIFT]<<SHIFT) + (code&MASK) ]
>
> Since the unicode space is sparsely populated, it's possible
> to split the code so that lots of pages gets no contents. I
> use a brute force search to find the optimal SHIFT value.
>
> In the current database, the page table has 1024 entries
> (SHIFT is 6), and there are 199 unique pages in the line
> table. The total size of the offset table is 26k.
>
>* The code database (getcode)
>
> For the code table, I use a straight-forward hash table to store
> name to code mappings. It's basically the same implementation
> as in Python's dictionary type, but a different hash algorithm.
> The table lookup loop simply uses the name database to check
> for hits.
>
> In the current database, the hash table is 32k.
I chose to split a unicode name into words even when looking up a
unicode name. Each word is hashed to a lexicon index and a "phrase book
string" is created. The sorted phrase book is then search with a binary
search among 858 entries that can be address directly followed by a
sequential search among 12 entries. The phrase book search index is 8k
and a table that maps phrase book indexes to codepoints is another 20k.
The searching I do makes jython slower then the direct calculation you
do. I'll take another look at this after jython 2.0 to see if I can
improve performance with your page/line number scheme and a total
hashing of all the unicode names.
regards,
finn