[Python-Dev] How about braindead Unicode "compression"?

M.-A. Lemburg mal@lemburg.com
Sun, 24 Sep 2000 23:20:06 +0200

Tim Peters wrote:
> unicodedatabase.c has 64K lines of the form:
> /* U+009a */ { 13, 0, 15, 0, 0 },
> Each struct getting initialized there takes 8 bytes on most machines (4
> unsigned chars + a char*).
> However, there are only 3,567 unique structs (54,919 of them are all 0's!).

That's because there are only around 11k definitions in the
Unicode database -- most of the rest is divided into private,
user-defined and high/low surrogate reserved ranges.

> So a braindead-easy mechanical "compression" scheme would simply be to
> create one vector with the 3,567 unique structs, and replace the 64K record
> constructors with 2-byte indices into that vector.  Data size goes down from
>     64K * 8 bytes = 512K bytes
> to
>     3567 * 8 bytes + 64K * 2 bytes ~= 156K bytes
> at once; the source-code transformation is easy to do via a Python program;
> the compiler warnings on my platform (due to unicodedatabase.c's sheer size)
> can go away; and one indirection is added to access (which remains utterly
> uniform).
> Previous objections to compression were, as far as I could tell, based on
> fear of elaborate schemes that rendered the code unreadable and the access
> code excruciating.  But if we can get more than a factor of 3 with little
> work and one new uniform indirection, do people still object?
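
For readers who want to see the idea concretely, here is a minimal Python
sketch of the scheme quoted above: collect the unique records into one
vector and replace every per-code-point record with an index into it.
The names compress, lookup and table_bytes, the use of tuples with None
for the char* field, and the two "hypothetical" entries are assumptions
made for illustration; this is not the code that ended up in CPython.

```python
# Sketch of the proposed "compression": deduplicate records and index them.
# The record layout and the names below are illustrative assumptions,
# not the contents of unicodedatabase.c.

def compress(records):
    """Return (unique, index) such that unique[index[i]] == records[i]."""
    unique = []      # each distinct record, in order of first appearance
    position = {}    # record -> its slot in `unique`
    index = []       # one small integer per code point
    for rec in records:
        if rec not in position:
            position[rec] = len(unique)
            unique.append(rec)
        index.append(position[rec])
    return unique, index


def lookup(unique, index, code_point):
    """Access stays uniform: one extra indirection through `index`."""
    return unique[index[code_point]]


def table_bytes(n_unique, n_entries=0x10000, record_size=8, index_size=2):
    """Size estimate using the figures from the post."""
    return n_unique * record_size + n_entries * index_size


# Toy data: 64K records, almost all of them zero, as in the real table.
ZERO = (0, 0, 0, 0, None)
records = [ZERO] * 0x10000
records[0x009A] = (13, 0, 15, 0, None)   # the entry quoted in the post
records[0x0041] = (1, 0, 1, 0, None)     # hypothetical entry
records[0x0042] = (1, 0, 1, 0, None)     # hypothetical duplicate of it

unique, index = compress(records)
```

With the 3,567 unique records mentioned in the post, table_bytes(3567)
gives 159,608 bytes, which is the "about 156K" figure, against
0x10000 * 8 = 524,288 bytes for the flat table.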

Oh, there was no fear about making the code unreadable...
Christian and Fredrik were both working on smart schemes.
My only objection to these was the missing documentation
and generation tools -- vast tables of completely
random-looking byte data are unreadable ;-)
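
To illustrate what a documented generation tool could look like, here is
a small Python sketch that writes a unique-record vector and an index
vector as commented C source, so the generated tables stay readable.
Everything in it is an assumption for the sake of the example: the
function names, the C identifiers record_t, _Records and _Index, and the
use of None for a NULL char* are made up and do not come from the actual
CPython sources.

```python
# Sketch of a generator that writes the two tables as commented C source.
# All C identifiers (record_t, _Records, _Index) are made up for this example.

def c_value(field):
    """Render one field; None stands for a NULL char* here."""
    return "0" if field is None else str(field)


def emit_tables(unique, index, per_line=8):
    """Return C source text for the record vector and the index vector."""
    lines = ["/* Generated file, do not edit by hand. */", ""]
    lines.append("static const record_t _Records[%d] = {" % len(unique))
    for slot, rec in enumerate(unique):
        fields = ", ".join(c_value(f) for f in rec)
        # Comment each record with its slot number.
        lines.append("    /* %4d */ { %s }," % (slot, fields))
    lines.append("};")
    lines.append("")
    lines.append("static const unsigned short _Index[%d] = {" % len(index))
    for start in range(0, len(index), per_line):
        chunk = index[start:start + per_line]
        row = ", ".join(str(i) for i in chunk)
        # Comment each row with the first code point it covers.
        lines.append("    /* U+%04x */ %s," % (start, row))
    lines.append("};")
    return "\n".join(lines) + "\n"


# Tiny demonstration: 16 code points, two distinct records.
unique = [(0, 0, 0, 0, None), (13, 0, 15, 0, None)]
index = [0] * 16
index[10] = 1
source = emit_tables(unique, index)
```

Printing source shows each record labelled with its slot and each index
row labelled with its starting code point, which is the kind of
annotation that keeps a generated table from looking like random bytes.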

> If nobody objects by the end of today, I intend to do it.

+1 from here.

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/