[Python-Dev] Re: Unicode character names

Bill Tutt billtut@microsoft.com
Thu, 23 Mar 2000 18:46:06 -0800


MAL wrote:

>"Andrew M. Kuchling" wrote:
>> 
>> Paul Prescod writes:
>>>The new \N escape interpolates named characters within strings. For
>>>example, "Hi! \N{WHITE SMILING FACE}" evaluates to a string with a
>>>unicode smiley face at the end.
>> 
>> Cute idea, and it certainly means you can avoid looking up Unicode
>> numbers.  (You can look up names instead. :) )  Note that this means the
>> Unicode database is no longer optional if this is done; it has to be
>> around at code-parsing time.  Python could import it automatically, as
>> exceptions.py is imported.  Christian's work on compressing
>> unicodedatabase.c is therefore really important.  (Is Perl5.6 actually
>> dragging around the Unicode database in the binary, or is it read out
>> of some external file or data structure?)
>
> Sorry to disappoint you guys, but the Unicode name and comments
> are *not* included in the unicodedatabase.c file Christian
> is currently working on. The reason is simple: it would add
> huge amounts of string data to the file. So this is a no-no
> for the core distribution...
>

OK, now you're just being silly. It's possible to put the character names in
a separate structure so that they don't automatically get paged in with the
normal Unicode character property data. If you never use it, it won't get
paged in; it's that simple.

Looking up the Unicode code value from the Unicode character name smells
like a good time to use gperf to generate a perfect hash function for the
character names, especially for the Unicode 3.0 character namespace. Then you
can just store the hash key -> Unicode character mapping, and hardly ever need
to page in the actual full character name string itself.
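
A rough sketch of the idea, in Python rather than gperf-generated C. It is
not gperf's algorithm: it just brute-forces a salt for a salted FNV-1a hash
until a small sample of names maps to distinct slots. The names SAMPLE,
fnv1a, build_table and lookup are made up for illustration, and the five
entries stand in for the full name list.

```python
# Hypothetical sample; a real table would be generated from the Unicode data.
SAMPLE = {
    "WHITE SMILING FACE": 0x263A,
    "BLACK HEART SUIT": 0x2665,
    "GREEK SMALL LETTER ALPHA": 0x03B1,
    "LATIN SMALL LETTER A": 0x0061,
    "EURO SIGN": 0x20AC,
}


def fnv1a(name, salt):
    """32-bit FNV-1a over the ASCII name, with the salt mixed into the basis."""
    h = (0x811C9DC5 ^ salt) & 0xFFFFFFFF
    for byte in name.encode("ascii"):
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
    return h


def build_table(names_to_code):
    """Search for a salt that gives every name its own slot (no collisions)."""
    size = len(names_to_code)  # one slot per name: a minimal perfect hash
    for salt in range(1, 1_000_000):
        slots = [None] * size
        for name, code in names_to_code.items():
            i = fnv1a(name, salt) % size
            if slots[i] is not None:
                break  # collision, try the next salt
            slots[i] = code
        else:
            return salt, slots
    raise ValueError("no collision-free salt found")


def lookup(name, salt, slots, names_by_code):
    """Hash the name to a slot, then confirm with a single string compare."""
    code = slots[fnv1a(name, salt) % len(slots)]
    # Only the one candidate name is touched; this rejects unknown names.
    if names_by_code.get(code) != name:
        raise KeyError(name)
    return chr(code)


SALT, SLOTS = build_table(SAMPLE)
NAMES_BY_CODE = {code: name for name, code in SAMPLE.items()}
```

With this, lookup("WHITE SMILING FACE", SALT, SLOTS, NAMES_BY_CODE) returns
the smiley character. The hot path only touches the slot array; the name
strings are read once per lookup, to confirm the single candidate.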

I haven't looked at what the comment field contains, so I have no idea how
useful that info is.

*waits while gperf crunches through the ~10,550 Unicode characters where
this would be useful*

Bill