[Python-Dev] Unicode Database Compression
Tue, 21 Mar 2000 21:13:38 +0100
I have spent the last four days on compressing the Unicode database.
With little decoding effort, I can bring the data down to 25kb.
This would still be very fast, since codes are randomly
accessible, although there are some simple shifts and masks.
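
To illustrate what "randomly accessible with simple shifts and masks" can look like, here is a minimal sketch of a two-level page table. It is not the layout described in this post, which is not specified; the page size, the names build_tables and lookup, and the toy data are all assumptions made for illustration.

```python
# Hypothetical two-level table, NOT the layout from the post.
PAGE_BITS = 8                       # assumed page size: 256 code points
PAGE_MASK = (1 << PAGE_BITS) - 1


def build_tables(values):
    """Split a flat per-code-point list into an index plus unique pages.

    Assumes len(values) is a multiple of the page size.
    """
    page_size = 1 << PAGE_BITS
    index, pages, seen = [], [], {}
    for start in range(0, len(values), page_size):
        page = tuple(values[start:start + page_size])
        if page not in seen:        # identical pages are stored only once
            seen[page] = len(pages)
            pages.append(page)
        index.append(seen[page])
    return index, pages


def lookup(index, pages, code):
    """Random access: one shift, one mask, two table reads."""
    return pages[index[code >> PAGE_BITS]][code & PAGE_MASK]


# Toy data: a made-up category number per code point (not real Unicode data).
toy = [0] * 1024
toy[65:91] = [1] * 26               # pretend 'A'..'Z' are category 1
toy[97:123] = [2] * 26              # pretend 'a'..'z' are category 2
index, pages = build_tables(toy)
```

In this toy case the three all-zero pages collapse into one, so only two pages are stored, yet any code is still reachable without decompressing anything else.
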
With a bit more effort, this can be squeezed down to 15kb
by some more aggressive techniques like common prefix
elimination. Speed would be *slightly* worse, since a small
loop (average 8 cycles) is performed to obtain a character
from a packed nybble.
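
The packed format is not spelled out here, so the following is only a sketch of plain nybble packing: two 4-bit codes per byte, fetched with a shift and a mask. The function names and the high-nybble-first order are assumptions, and common prefix elimination is not shown.

```python
def pack_nybbles(values):
    """Pack 4-bit values (0..15) two per byte, high nybble first."""
    values = list(values)
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("each value must fit in 4 bits")
    if len(values) % 2:
        values.append(0)            # pad to an even count
    out = bytearray()
    for hi, lo in zip(values[0::2], values[1::2]):
        out.append((hi << 4) | lo)
    return bytes(out)


def get_nybble(packed, i):
    """Fetch the i-th 4-bit value from the packed bytes."""
    byte = packed[i >> 1]           # two values per byte
    return (byte >> 4) if i % 2 == 0 else (byte & 0x0F)


codes = [3, 15, 0, 7, 9]            # made-up sample codes
packed = pack_nybbles(codes)
```

Five codes fit into three bytes here; a scheme that also removes common prefixes would need a small decoding loop on top of this, which is where the extra cycles mentioned above would come from.
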
This is just all the data which is in Marc's unicodedatabase.c
file. I checked efficiency by creating a delimited file like
the original database text file with only these columns and
ran PkZip over it. The result was 40kb. This says that I found
a lot of correlations which automatic compressors cannot see.
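
A baseline check of this kind can be repeated roughly as follows. This is a sketch under stated assumptions: zlib's deflate stands in for PkZip, the semicolon delimiter is a guess, and the rows are synthetic placeholders rather than real database columns.

```python
import zlib


def deflate_size(rows, delimiter=";"):
    """Return (raw size, deflate-compressed size) of a delimited text dump."""
    text = "\n".join(delimiter.join(row) for row in rows).encode("ascii")
    return len(text), len(zlib.compress(text, 9))


# Synthetic rows (hypothetical values): code, category, combining class, bidi.
rows = [("%04X" % cp, "Lu" if cp % 2 else "Ll", "0", "L")
        for cp in range(0x0400, 0x0500)]
raw, compressed = deflate_size(rows)
```

Comparing the compressed size against the size of a hand-built table gives a rough idea of how much structure a general-purpose compressor misses.
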
Now, before generating the final C code, I'd like to ask some questions:
What is more desirable: Low compression and blinding speed?
Or high compression and less speed, since we always want to
unpack a whole code page?
Then, what about the other database columns?
There are a couple of extra attributes which I find coded
as switch statements elsewhere. Should I try to pack these
codes into my squeezy database, too?
And last: There are also two quite elaborate columns with
textual descriptions of the codes (the uppercase blah version
of character x). Do we want these at all? And if so, should
I try to compress them as well? Should these perhaps go
into a different source file as a dynamic module, since they
will not be used so often?
waiting for directives - ly y'rs - chris
Christian Tismer :^) <mailto:firstname.lastname@example.org>
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaunstr. 26 : *Starship* http://starship.python.net
14163 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
we're tired of banana software - shipped green, ripens at home