[Python-Dev] Re: Unicode Database Compression

M.-A. Lemburg mal@lemburg.com
Wed, 22 Mar 2000 11:11:25 +0100

Christian Tismer wrote:
> Hi,
> I have spent the last four days on compressing the
> Unicode database.

Cool :-)

> With little decoding effort, I can bring the data down to 25kb.
> This would still be very fast, since codes are randomly
> accessible, although there are some simple shifts and masks.
> With a bit more effort, this can be squeezed down to 15kb
> by some more aggressive techniques like common prefix
> elimination. Speed would be *slightly* worse, since a small
> loop (average 8 cycles) is performed to obtain a character
> from a packed nybble.
> This is just all the data which is in Marc's unicodedatabase.c
> file. I checked efficiency by creating a delimited file like
> the original database text file with only these columns and
> ran PkZip over it. The result was 40kb. This says that I found
> a lot of correlations which automatic compressors cannot see.

Not bad ;-)

> Now, before generating the final C code, I'd like to ask some
> questions:
> What is more desirable: Low compression and blinding speed?
> Or high compression and less speed, since we always want to
> unpack a whole code page?

I'd say high speed and less compression. The reason is that
the Asian codecs will need fast access to the database. Given
the size of their mapping tables, a few more kB won't hurt,
I guess.
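
To illustrate the kind of randomly accessible layout under discussion, here is a minimal sketch of a generic two-level table with a shift-and-mask lookup. It is not Christian's actual encoding, which this thread does not spell out. The block size and the names `build_tables` and `lookup` are illustrative assumptions, and the data comes from a current Python 3 `unicodedata` module rather than from unicodedatabase.c.

```python
import unicodedata

SHIFT = 7                    # assumed block size: 2**7 = 128 code points
BLOCK = 1 << SHIFT
MASK = BLOCK - 1


def build_tables(values):
    """Split a flat per-code-point list into an index table plus unique blocks."""
    index, blocks, seen = [], [], {}
    for start in range(0, len(values), BLOCK):
        block = tuple(values[start:start + BLOCK])
        if block not in seen:          # identical blocks are stored only once
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks


def lookup(index, blocks, code):
    """Random access: one shift, one mask, two table reads."""
    return blocks[index[code >> SHIFT]][code & MASK]


# Example column (hypothetical choice): general category of every BMP code point.
values = [unicodedata.category(chr(c)) for c in range(0x10000)]
index, blocks = build_tables(values)
```

Large uniform ranges (unassigned areas, CJK ideographs, private use) collapse into a few shared blocks, which is where the space saving comes from. Every lookup still costs the same small, fixed amount of work, which matches the "high speed, less compression" preference.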

> Then, what about the other database columns?
> There are a couple of extra attributes which I find coded
> as switch statements elsewhere. Should I try to pack these
> codes into my squeezy database, too?

You basically only need to provide the APIs (and columns)
defined in the unicodedata Python API, so the character
description column, for example, is not needed.

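
For reference, the per-character columns that the unicodedata module exposes can be sketched as below. This is an illustration only, assuming a current Python 3 interpreter; `properties` is a hypothetical helper that is not part of the module.

```python
import unicodedata


def properties(ch):
    """Collect the per-character columns exposed by the unicodedata module."""
    return {
        "category": unicodedata.category(ch),
        "bidirectional": unicodedata.bidirectional(ch),
        "combining": unicodedata.combining(ch),
        "mirrored": unicodedata.mirrored(ch),
        "decomposition": unicodedata.decomposition(ch),
        # The numeric lookups raise ValueError without a default,
        # so None is passed as the fallback value.
        "decimal": unicodedata.decimal(ch, None),
        "digit": unicodedata.digit(ch, None),
        "numeric": unicodedata.numeric(ch, None),
    }
```

Calling `properties("A")` gives category `"Lu"` and no numeric values, while `properties("7")` reports a decimal value of 7. These are the columns a compressed database would have to serve.
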
> And last: There are also two quite elaborate columns with
> textual descriptions of the codes (the uppercase blah version
> of character x). Do we want these at all? And if so, should
> I try to compress them as well? Should these perhaps go
> into a different source file as a dynamic module, since they
> will not be used so often?

I guess you are talking about the "Unicode 1.0 Name"
and the "10646 comment field" -- see above, there's no
need to include these descriptions in the database...

--
Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/