[Python-Dev] Unicode character names

M.-A. Lemburg mal@lemburg.com
Fri, 24 Mar 2000 14:41:27 +0100


Christian Tismer wrote:
> 
> "M.-A. Lemburg" wrote:
> >
> > "Andrew M. Kuchling" wrote:
> > >
> > > Paul Prescod writes:
> > > >The new \N escape interpolates named characters within strings. For
> > > >example, "Hi! \N{WHITE SMILING FACE}" evaluates to a string with a
> > > >unicode smiley face at the end.
> > >
> > > Cute idea, and it certainly means you can avoid looking up Unicode
> > > numbers.  (You can look up names instead. :) )  Note that this means the
> > > Unicode database is no longer optional if this is done; it has to be
> > > around at code-parsing time.  Python could import it automatically, as
> > > exceptions.py is imported.  Christian's work on compressing
> > > unicodedatabase.c is therefore really important.  (Is Perl5.6 actually
> > > dragging around the Unicode database in the binary, or is it read out
> > > of some external file or data structure?)
> >
> > Sorry to disappoint you guys, but the Unicode name and comments
> > are *not* included in the unicodedatabase.c file Christian
> > is currently working on. The reason is simple: it would add
> > huge amounts of string data to the file. So this is a no-no
> > for the core distribution...
> 
> This is not settled, still an open question.

Well, ok, depends on how much you can squeeze out of the
text columns ;-) I still think that it's better to leave
these gimmicks out of the core and put them into some
add-on, though.

> What I have for non-textual data:
> 25 kb with dumb compression
> 15 kb with enhanced compression

Looks good :-) With these sizes I think we could even integrate
the unicodedatabase.c + API into the core interpreter and
only have the unicodedata module to access the database
from within Python.
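
To make the split concrete, here is a minimal sketch of what access
"from within Python" looks like. It assumes a present-day Python 3
interpreter and its standard unicodedata module, not the API that was
under discussion in this thread. The category lookup stands in for the
non-textual columns; name() and lookup() need the text columns.

```python
import unicodedata

# The character Paul's example refers to, written with the \N escape.
smiley = "\N{WHITE SMILING FACE}"

# Non-textual data: the code point and its general category.
code_point = ord(smiley)
category = unicodedata.category(smiley)

# Textual data: the character name, in both directions.
name = unicodedata.name(smiley)
looked_up = unicodedata.lookup("WHITE SMILING FACE")

print(hex(code_point), category, name, looked_up == smiley)
```

Only the last two calls depend on the name table, which is why the
names can live in a separate, optional module.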
 
> What amounts of data am I talking about?
> - The whole unicode database text file has size
>   632 kb.
> - With PkZip this goes down to
>   96 kb.
> 
> Now, I produced another text file with just the currently
> used data in it, and the numbers look like this:
> - the stripped unicode text file has size
>   216 kb.
> - PkZip melts this down to
>   40 kb.
> 
> Please compare that to my results above: I can do at least
> twice as well. I hope I can compete for the text sections
> as well (since this is something zip is *good* at),
> but just let me try.
> Let's target 60 kb for the whole crap, and I'd be very pleased.
>
> Then, there is still the question where to put the data.
> Having one file in the dll and another externally would
> be an option. I could also imagine using a binary external
> file all the time, with maximum possible compression.
> By loading this structure, this would be partially expanded
> to make it fast.
> An advantage is that the compressed Unicode database
> could become a stand-alone product. The size is in fact
> so crazy small, that I'd like to make this available
> to any other language.

You could take the unicodedatabase.c file (+ header file)
and use it everywhere... I don't think it needs to contain
any Python specific code. The API names would have to follow
the Python naming schemes though.
 
> > Still, the above is easily possible by inventing a new
> > encoding, say unicode-with-smileys, which then reads in
> > a file containing the Unicode names and applies the necessary
> > magic to decode/encode data as Paul described above.
> 
> That sounds reasonable. Compression makes sense as well here,
> since the expanded stuff makes quite an amount of kb, compared
> to what it is "worth", compared to, say, the Python dll.
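
A rough sketch of such an add-on encoding, written against the codecs
registry of a present-day Python 3. The codec name
"unicode_with_smileys" and the helper functions are hypothetical, and
the names come from the standard unicodedata module instead of a
separately loaded file, so this only illustrates the idea:

```python
import codecs
import re
import unicodedata

# Matches \N{CHARACTER NAME} in the decoded ASCII text.
_NAMED = re.compile(r"\\N\{([^}]+)\}")

def decode_named(data, errors="strict"):
    """Decode ASCII bytes, expanding \\N{...} into the named character."""
    text = bytes(data).decode("ascii", errors)
    expanded = _NAMED.sub(lambda m: unicodedata.lookup(m.group(1)), text)
    return expanded, len(data)

def encode_named(text, errors="strict"):
    """Encode to ASCII bytes, writing non-ASCII characters as \\N{...}."""
    out = "".join(
        ch if ord(ch) < 128 else "\\N{%s}" % unicodedata.name(ch)
        for ch in text
    )
    return out.encode("ascii"), len(text)

def _search(name):
    # Hypothetical codec name; any other name is left to other codecs.
    if name == "unicode_with_smileys":
        return codecs.CodecInfo(encode_named, decode_named, name=name)
    return None

codecs.register(_search)

decoded = b"Hi! \\N{WHITE SMILING FACE}".decode("unicode_with_smileys")
encoded = decoded.encode("unicode_with_smileys")
```

Unknown names raise KeyError on decoding and unnamed characters raise
ValueError on encoding; a real codec would want to honour the errors
argument there.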

With 25kB for the non-text columns, I'd suggest simply
adding the file to the core. Text columns could then
go into a separate module.
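
For anyone who wants to get a feeling for how well the text columns
compress, here is a small sketch. It assumes a present-day Python 3,
takes the names from the standard unicodedata module and uses zlib as
a stand-in for PkZip, so the resulting sizes will not match the
figures quoted above. The name_table() helper and its code point range
are made up for the example.

```python
import unicodedata
import zlib

def name_table(start=0x0000, stop=0x3000):
    """Build a 'CODE;NAME' text table for all named characters in a range."""
    lines = []
    for cp in range(start, stop):
        name = unicodedata.name(chr(cp), "")  # "" for unnamed code points
        if name:
            lines.append("%04X;%s" % (cp, name))
    return "\n".join(lines).encode("ascii")

raw = name_table()
packed = zlib.compress(raw, 9)  # maximum compression level

print("raw bytes:   ", len(raw))
print("packed bytes:", len(packed))
print("ratio:        %.2f" % (len(packed) / len(raw)))
```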

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/