[Python-ideas] Extend unicodedata with a name search
Steven D'Aprano
steve at pearwood.info
Sat Oct 4 08:21:36 CEST 2014
On Sat, Oct 04, 2014 at 12:17:58PM +0900, Stephen J. Turnbull wrote:
> M.-A. Lemburg writes:
> > On 03.10.2014 23:10, Philipp A. wrote:
>
> > > Unfortunately, unicodedata is very limited.
>
> Philipp, do you really mean *very* limited? If so, I wonder what else
> you think is missing besides "fuzzy" name lookup. The UCD is defined
> by the standard, and AFAICS access to all properties is provided.
Hmmm. There are a lot of properties in Unicode, and I'm pretty sure that
unicodedata does not give access to *all* of them. Here's a line
from UnicodeData.txt:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
04BF;CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER;Ll;0;L;;;;;N;CYRILLIC SMALL LETTER IE HOOK OGONEK;;04BE;;04BE
There are 15 semicolon-separated fields. The zeroth is the code point,
the others are described here:
http://www.unicode.org/Public/5.1.0/ucd/UCD.html#UnicodeData.txt
I don't believe that unicodedata gives any way to get at all 14 of the
remaining fields. E.g. how do I find out the "Unicode_1_Name"?
And UnicodeData.txt is only one of many Unicode databases. See the
UCD.html link above.
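To make the gap concrete, here is a minimal sketch that splits the sample line above into its 15 fields and compares them with what unicodedata returns for the same character. The record type and its snake_case field names are my own labels, loosely based on the property names in the UCD documentation; they are not part of any existing API.

```python
import unicodedata
from collections import namedtuple

# Hypothetical record type; field names are illustrative labels only.
UnicodeDataRecord = namedtuple("UnicodeDataRecord", [
    "code_point", "name", "general_category", "canonical_combining_class",
    "bidi_class", "decomposition", "decimal_value", "digit_value",
    "numeric_value", "bidi_mirrored", "unicode_1_name", "iso_comment",
    "simple_uppercase", "simple_lowercase", "simple_titlecase",
])

# The sample line from UnicodeData.txt quoted above.
LINE = ("04BF;CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER;Ll;0;L;;;;;N;"
        "CYRILLIC SMALL LETTER IE HOOK OGONEK;;04BE;;04BE")

record = UnicodeDataRecord(*LINE.split(";"))
ch = chr(int(record.code_point, 16))

# Fields that unicodedata does expose for this character.
exposed = {
    "name": unicodedata.name(ch),
    "general_category": unicodedata.category(ch),
    "canonical_combining_class": unicodedata.combining(ch),
    "bidi_class": unicodedata.bidirectional(ch),
    "decomposition": unicodedata.decomposition(ch),
    "bidi_mirrored": unicodedata.mirrored(ch),
}
for field, value in exposed.items():
    print(field, repr(value), "| file:", repr(getattr(record, field)))

# A field I know of no unicodedata function for.
print("unicode_1_name (file only):", record.unicode_1_name)
```

Only the parsed line can answer the Unicode_1_Name question here; the module calls cover the name, category, combining class, bidi class, decomposition and mirrored flag.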
> > > But the name database is only queryable using full names! I want
> > > to do unicodedata.search('clock') and get a list of dozens of glyphs
>
> > You should be able to code this as a PyPI package. I don't think
> > it's a use case that warrants making the unicodedata module more
> > complex.
>
> I think it's unfortunate that unicodedata is limited in this
> particular way, since the database is in C, and as you point out
> hardly extensible. For example, as a native English speaker who
> enjoys wordplay I was able to guess which euphemism is the source of
> the name of U+1F4A9 without looking it up, but I doubt a non-native
> would be able to. A builtin ability to do fuzzy searches
> ("unicodenames.startswith('PILE OF')") would be useful.
I would love it if unicodedata exposed the full UnicodeData.txt database
in some efficient format. That would allow people to scratch their own
itch without having to duplicate the UnicodeData.txt database.
Failing that, the two features I miss the most are:
(1) fuzzy_lookup(glob):
    Return an iterator which yields (ordinal, name) for
    each Unicode code point whose name matches the glob.

    Names beginning with a substring:
        fuzzy_lookup("SPAM*")
    Names ending with a substring:
        fuzzy_lookup("*SPAM")
    Names containing a substring:
        fuzzy_lookup("*SPAM*")

(2) get_data(ordinal_or_character):
    Return a namedtuple with 15 fields, taken directly from
    the UnicodeData.txt database.
The first function solves the very common problem of "I kind of know
what the character is called, but not exactly", the second would allow
people to code their own arbitrary lookups.
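For what it's worth, the first function can be roughly approximated today in pure Python. The sketch below is only an illustration of the idea, not a proposed implementation: it assumes shell-style glob matching via the fnmatch module, upper-cases the pattern because character names are upper case, and does a brute-force scan of every code point, so it is slow compared to what a C-level search over the name table could be.

```python
import sys
import unicodedata
from fnmatch import fnmatchcase


def fuzzy_lookup(glob):
    """Yield (ordinal, name) for each code point whose name matches glob.

    Illustrative sketch only: a linear scan over all code points.
    """
    pattern = glob.upper()
    for ordinal in range(sys.maxunicode + 1):
        # The second argument is returned for code points without a name.
        name = unicodedata.name(chr(ordinal), "")
        if name and fnmatchcase(name, pattern):
            yield ordinal, name


if __name__ == "__main__":
    for ordinal, name in fuzzy_lookup("PILE OF*"):
        print("U+%04X %s" % (ordinal, name))
```

The same call with a pattern like "*CLOCK*" covers the "list of glyphs with clock in the name" use case from earlier in the thread.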
--
Steven