[Python-Dev] Unicode patches checked in

Christian Tismer tismer@tismer.com
Wed, 15 Mar 2000 17:22:42 +0100


Fredrik Lundh wrote:
> 
> CT:
> > How do I build a dist that doesn't need to change a lot of
> > stuff in the user's installation?
> 
> somewhere in this thread, Guido wrote:
> 
> > BTW, I added a tag "pre-unicode" to the CVS tree to the revisions
> > before the Unicode changes were made.
> 
> maybe you could base SLP on that one?

I have no idea how this works. Would this mean that I cannot
get patches which come after the Unicode changes?

Meanwhile, I've looked into the sources. It is easy for me
to get rid of the problem by supplying my own unicodedata.c,
where I replace every function with a stub that raises a
"not implemented" exception.

Furthermore, I wondered about the data format. Is the unicode
database used in your re package as well? Otherwise, I see
only references from unicodedata.c, and that means the data
structure can be massively enhanced.
At the moment, that baby is 64k entries long, with four bytes
and an optional string.
This is a big waste. Almost all of the strings consist of one of
a few distinct <xxx> prefixes, followed by a list of small hex
words. All of this is stored as text, which probably accounts
for 80 percent of the space.

The only function that uses the "decomposition" field (namely
the string) is unicodedata_decomposition. It does nothing
more than to wrap it into a PyObject.
We can do a little better here. I guess I can bring it down
to a third of this space without much effort, just by using
- binary encoding for the <xxx> tags as enumeration
- binary encoding of the hexed entries
- omission of the spaces
Instead of 64k structures, which contain pointers anyway,
I can use a 64k pointer array with offsets into one packed
table.
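
To make the idea concrete, here is a rough sketch of the packing step,
written in Python only for brevity (the real table would be generated as
C data). The record layout is an assumption of mine, not something taken
from unicodedata.c: one byte for the tag, one byte for the number of
entries, then each entry as a 16-bit big-endian word. The TAGS list is a
hypothetical, partial enumeration in arbitrary order, and
pack_decompositions is a made-up helper name.

```python
import struct

# Hypothetical, partial tag enumeration; index 0 means "no <xxx> tag".
TAGS = ["", "<compat>", "<font>", "<noBreak>", "<super>", "<sub>"]

def pack_decompositions(decomp):
    """Pack {code point: 'optional <tag> plus hex words'} into one table.

    Returns (offsets, packed): offsets has 64k slots, one per code point,
    and offset 0 means "this character has no decomposition".
    """
    offsets = [0] * 0x10000
    packed = bytearray(b"\x00")      # dummy byte, so real offsets are never 0
    for code, text in sorted(decomp.items()):
        fields = text.split()        # splitting drops the spaces
        tag = 0
        if fields and fields[0].startswith("<"):
            tag = TAGS.index(fields.pop(0))     # tag stored as enum index
        values = [int(f, 16) for f in fields]   # hex text -> binary words
        offsets[code] = len(packed)
        packed += struct.pack(">BB%dH" % len(values), tag, len(values), *values)
    return offsets, bytes(packed)

offsets, packed = pack_decompositions({
    0x00A8: "<compat> 0020 0308",
    0x00C0: "0041 0300",
})
```

With this layout, the 18-character string "<compat> 0020 0308" shrinks to
6 bytes. The offset array here is a plain Python list; in C it would be
the 64k array of offsets described above.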

The unicodedata access functions would change *slightly*,
just building some hex strings and so on. I guess this
is not a time critical section?
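
For illustration, the access side could look roughly like the sketch
below, again in Python rather than C. It assumes a hypothetical record
layout (tag byte, count byte, then 16-bit big-endian words), a
hypothetical partial TAGS enumeration, and an offset array in which 0
means "no decomposition". None of these names or layouts come from the
current unicodedata.c.

```python
import struct

# Hypothetical, partial tag enumeration; index 0 means "no <xxx> tag".
TAGS = ["", "<compat>", "<font>", "<noBreak>", "<super>", "<sub>"]

def decomposition(offsets, packed, code):
    """Rebuild the textual decomposition string from the packed record."""
    off = offsets[code]
    if off == 0:                     # offset 0 is reserved for "none"
        return ""
    tag, count = struct.unpack_from(">BB", packed, off)
    values = struct.unpack_from(">%dH" % count, packed, off + 2)
    parts = [TAGS[tag]] if tag else []
    parts += ["%04X" % v for v in values]   # build the hex strings again
    return " ".join(parts)

# A tiny hand-built table: one dummy byte, then the record for U+00A8.
packed = bytes([0, 1, 2, 0x00, 0x20, 0x03, 0x08])
offsets = [0] * 0x10000
offsets[0x00A8] = 1

print(decomposition(offsets, packed, 0x00A8))
```

The extra work per call is one small unpack and a few string formats,
which should only matter if this lookup is used in a tight loop.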

Should I try this evening? :-)

cheers - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     we're tired of banana software - shipped green, ripens at home