
On Sat, Oct 04, 2014 at 12:17:58PM +0900, Stephen J. Turnbull wrote:
M.-A. Lemburg writes:
On 03.10.2014 23:10, Philipp A. wrote:
Unfortunately, unicodedata is very limited.
Philipp, do you really mean *very* limited? If so, I wonder what else you think is missing besides "fuzzy" name lookup. The UCD is defined by the standard, and AFAICS access to all properties is provided.
Hmmm. There are a lot of properties in Unicode, and I'm pretty sure that unicodedata does not give access to *all* of them. Here's a line from UnicodeData.txt:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
04BF;CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER;Ll;0;L;;;;;N;CYRILLIC SMALL LETTER IE HOOK OGONEK;;04BE;;04BE
There are 15 semicolon-separated fields. The zeroth is the code point; the others are described here:
http://www.unicode.org/Public/5.1.0/ucd/UCD.html#UnicodeData.txt
I don't believe that there is any way to get access to all 14 fields (excluding the code point itself). E.g. how do I find out the "Unicode_1_Name"?
And UnicodeData.txt is only one of many Unicode databases. See the UCD.html link above.
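To make the gap concrete, here is a minimal sketch that splits the sample line quoted above into its 15 fields. The field names follow the UCD documentation, but the lowercase spellings, the UnicodeRecord type and the parse_line helper are my own illustrative choices, not anything unicodedata provides. The old name in field 10 is only reachable by parsing the file yourself.

```python
import unicodedata
from collections import namedtuple

# Hypothetical record type; names adapted from the UCD field descriptions.
FIELDS = (
    "code_point", "name", "general_category", "canonical_combining_class",
    "bidi_class", "decomposition", "decimal", "digit", "numeric",
    "bidi_mirrored", "unicode_1_name", "iso_comment",
    "simple_uppercase_mapping", "simple_lowercase_mapping",
    "simple_titlecase_mapping",
)
UnicodeRecord = namedtuple("UnicodeRecord", FIELDS)

# The sample line from UnicodeData.txt quoted above.
LINE = (
    "04BF;CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER;Ll;0;L;;;;;N;"
    "CYRILLIC SMALL LETTER IE HOOK OGONEK;;04BE;;04BE"
)


def parse_line(line):
    """Split one UnicodeData.txt line into a 15-field record (all strings)."""
    return UnicodeRecord(*line.rstrip("\n").split(";"))


record = parse_line(LINE)
ch = chr(int(record.code_point, 16))

# Some fields have a direct unicodedata equivalent...
print(unicodedata.name(ch) == record.name)
print(unicodedata.category(ch) == record.general_category)

# ...but the old name does not, as far as I can tell.
print(record.unicode_1_name)  # CYRILLIC SMALL LETTER IE HOOK OGONEK
```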
But the name database is only queryable using full names! I want to do unicodedata.search('clock') and get a list of dozens of glyphs.
You should be able to code this as a PyPI package. I don't think it's a use case that warrants making the unicodedata module more complex.
I think it's unfortunate that unicodedata is limited in this particular way, since the database is in C, and as you point out hardly extensible. For example, as a native English speaker who enjoys wordplay I was able to guess which euphemism is the source of the name of U+1F4A9 without looking it up, but I doubt a non-native would be able to. A builtin ability to do fuzzy searches ("unicodenames.startswith('PILE OF')") would be useful.
I would love it if unicodedata exposed the full UnicodeData.txt database in some efficient format. That would allow people to scratch their own itch without having to duplicate the UnicodeData.txt database.
Failing that, the two features I miss the most are:
(1) fuzzy_lookup(glob): Return an iterator which yields (ordinal, name) for each Unicode code point whose name matches the glob.
Names beginning with a substring: fuzzy_lookup("SPAM*")
Names ending with a substring: fuzzy_lookup("*SPAM")
Names containing a substring: fuzzy_lookup("*SPAM*")
(2) get_data(ordinal_or_character): Return a namedtuple with 15 fields, taken directly from the UnicodeData.txt database.
The first function solves the very common problem of "I kind of know what the character is called, but not exactly"; the second would allow people to code their own arbitrary lookups.
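For what it's worth, the first function can be approximated today in pure Python. This is only a rough sketch of the proposed fuzzy_lookup (a hypothetical name, not an existing unicodedata function): it scans every code point with unicodedata.name() and filters with fnmatch-style globs. It only sees what unicodedata.name() returns, so control characters, aliases and old names are invisible to it, and a full scan is slow compared with what a C-level search could do.

```python
import fnmatch
import re
import sys
import unicodedata


def fuzzy_lookup(glob):
    """Yield (ordinal, name) for each named code point matching the glob.

    Sketch only: brute-force scan over all code points, using the
    shell-style wildcards understood by the fnmatch module.
    """
    # Character names are upper case, so normalise the pattern once.
    match = re.compile(fnmatch.translate(glob.upper())).match
    for ordinal in range(sys.maxunicode + 1):
        # Unnamed code points (controls, surrogates, unassigned) give "".
        name = unicodedata.name(chr(ordinal), "")
        if name and match(name):
            yield ordinal, name


if __name__ == "__main__":
    for ordinal, name in fuzzy_lookup("PILE OF*"):
        print("U+%04X %s" % (ordinal, name))
```

With this, fuzzy_lookup("*CLOCK*") covers the search('clock') use case from earlier in the thread, at the cost of one pass over the whole code space per query.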