[Python-Dev] Unicode Database Compression

Christian Tismer tismer@tismer.com
Tue, 21 Mar 2000 21:13:38 +0100


Hi,

I have spent the last four days on compressing the
Unicode database.

With little decoding effort, I can bring the data down to 25kb.
This would still be very fast, since codes are randomly
accessible, although there are some simple shifts and masks.

With a bit more effort, this can be squeezed down to 15kb
by some more aggressive techniques like common prefix
elimination. Speed would be *slightly* worse, since a small
loop (average 8 cycles) is performed to obtain a character
from a packed nybble.

This is just all the data which is in Marc's unicodedatabase.c
file. I checked efficiency by creating a delimited file like
the original database text file with only these columns and
ran PkZip over it. The result was 40kb. This says that I found
a lot of correlations which automatic compressors cannot see.

Now, before generating the final C code, I'd like to ask some
questions:

What is more desirable: Low compression and blinding speed?
Or high compression and less speed, since we always want to
unpack a whole code page?

Then, what about the other database columns?
There are a couple of extra atrributes which I find coded
as switch statements elsewhere. Should I try to pack these
codes into my squeezy database, too?

And last: There are also two quite elaborated columns with
textual descriptions of the codes (the uppercase blah version
of character x). Do we want these at all? And if so, should
I try to compress them as well? Should these perhaps go
into a different source file as a dynamic module, since they
will not be used so often?

waiting for directives - ly y'rs - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     we're tired of banana software - shipped green, ripens at home