[Python-ideas] Extend unicodedata with a name search

Sat Oct 4 08:21:36 CEST 2014

On Sat, Oct 04, 2014 at 12:17:58PM +0900, Stephen J. Turnbull wrote:
> M.-A. Lemburg writes:
>  > On 03.10.2014 23:10, Philipp A. wrote:
> 
>  > > Unfortunately, unicodedata is very limited.
> 
> Phillip, do you really mean *very* limited?  If so, I wonder what else
> you think is missing besides "fuzzy" name lookup.  The UCD is defined
> by the standard, and AFAICS access to all properties is provided.

Hmmm. There are a lot of properties in Unicode, and I'm pretty sure that 
unicodedata does not give access to *all* of them. Here's a line 
from UnicodeData.txt:

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

04BF;CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER;Ll;0;L;;;;;N;CYRILLIC SMALL LETTER IE HOOK OGONEK;;04BE;;04BE

There are 15 semicolon-separated fields. The zeroth is the code point; 
the others are described here:

http://www.unicode.org/Public/5.1.0/ucd/UCD.html#UnicodeData.txt
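
To make the field layout concrete, here is a minimal sketch that splits the 
line above into a namedtuple. The helper parse_unicodedata_line and the 
field names are my own labels for illustration (loosely following the UCD 
documentation); nothing like this exists in unicodedata.

```python
from collections import namedtuple

# Field names are this sketch's own choice, not an official API.
UnicodeDataRecord = namedtuple("UnicodeDataRecord", [
    "code_point", "name", "general_category", "combining_class",
    "bidi_class", "decomposition", "decimal_value", "digit_value",
    "numeric_value", "bidi_mirrored", "unicode_1_name", "iso_comment",
    "simple_uppercase", "simple_lowercase", "simple_titlecase",
])


def parse_unicodedata_line(line):
    """Split one UnicodeData.txt line into its 15 string fields."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError("expected 15 fields, got %d" % len(fields))
    return UnicodeDataRecord(*fields)


# The sample line quoted above, as a single string.
LINE = ("04BF;CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER;Ll;0;L;;;;;N;"
        "CYRILLIC SMALL LETTER IE HOOK OGONEK;;04BE;;04BE")

record = parse_unicodedata_line(LINE)
print(record.code_point, record.unicode_1_name)
```

Empty fields come back as empty strings, so record.iso_comment and 
record.simple_lowercase are both "" for this line.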

I don't believe that there is any way to get access to all 14 (excluding 
the code point itself) fields. E.g. how do I find out the "Unicode_1_Name"?
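
For comparison, here is a rough sketch of what can be reached for U+04BF 
through unicodedata and str. The dictionary keys are just labels I made up 
for the corresponding UnicodeData.txt fields. As far as I can tell, no 
function in the module returns the Unicode_1_Name or ISO comment fields.

```python
import unicodedata

ch = "\u04bf"  # CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER

# Keys are informal labels for UnicodeData.txt fields, not an official API.
available = {
    "name": unicodedata.name(ch),
    "general_category": unicodedata.category(ch),
    "combining_class": unicodedata.combining(ch),
    "bidi_class": unicodedata.bidirectional(ch),
    "decomposition": unicodedata.decomposition(ch),
    # numeric() raises ValueError for non-numeric characters unless a
    # default is supplied.
    "numeric_value": unicodedata.numeric(ch, None),
    "bidi_mirrored": unicodedata.mirrored(ch),
    # Case mappings come from str methods rather than unicodedata. For
    # this character the result matches the simple uppercase field.
    "uppercase": ch.upper(),
}

for label, value in available.items():
    print(label, repr(value))
```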

And UnicodeData.txt is only one of many files in the Unicode Character 
Database. See the UCD.html link above.

>  > > But the name database is only queryable using full names! I want
>  > > to do unicodedata.search('clock') and get a list of dozens of glyphs
>  
>  > You should be able to code this as a PyPI package. I don't think
>  > it's a use case that warrants making the unicodedata module more
>  > complex.
> 
> I think it's unfortunate that unicodedata is limited in this
> particular way, since the database is in C, and as you point out
> hardly extensible.  For example, as a native English speaker who
> enjoys wordplay I was able to guess which euphemism is the source of
> the name of U+1F4A9 without looking it up, but I doubt a non-native
> would be able to.  A builtin ability to do fuzzy searches
> ("unicodenames.startswith('PILE OF')") would be useful.

I would love it if unicodedata exposed the full UnicodeData.txt database 
in some efficient format. That would allow people to scratch their own 
itch without having to duplicate the UnicodeData.txt database.

Failing that, the two features I miss the most are:

(1) fuzzy_lookup(glob):
    Return an iterator which yields (ordinal, name) for
    each Unicode code point whose name matches the glob.

    Names beginning with a substring:
        fuzzy_lookup("SPAM*")

    Names ending with a substring:
        fuzzy_lookup("*SPAM")

    Names containing a substring:
        fuzzy_lookup("*SPAM*")

(2) get_data(ordinal_or_character):
    Return a namedtuple with 15 fields, taken directly from
    the UnicodeData.txt database.

The first function solves the very common problem of "I kind of know 
what the character is called, but not exactly"; the second would allow 
people to code their own arbitrary lookups.
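
A rough pure-Python sketch of the first function, to show the intended 
behaviour. fuzzy_lookup is the proposed name, not something unicodedata 
provides; the glob matching here is borrowed from the stdlib fnmatch 
module, and upper-casing the glob is my own assumption. It does a linear 
scan over every code point, so it is slow compared to what a C 
implementation over the real name database could do.

```python
import fnmatch
import sys
import unicodedata


def fuzzy_lookup(glob):
    """Yield (ordinal, name) for each named code point matching the glob."""
    # Character names are upper case, so normalise the pattern (assumption).
    pattern = glob.upper()
    for ordinal in range(sys.maxunicode + 1):
        # Unnamed code points (unassigned, surrogates, most controls)
        # return the default instead of raising ValueError.
        name = unicodedata.name(chr(ordinal), None)
        if name is not None and fnmatch.fnmatchcase(name, pattern):
            yield ordinal, name


for ordinal, name in fuzzy_lookup("PILE OF*"):
    print("U+%04X %s" % (ordinal, name))
```

The same function covers the other two cases with "*SPAM" and "*SPAM*", 
e.g. fuzzy_lookup("*clock*") for the search that started this thread.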

-- 
Steven