Extend unicodedata with a name search
I noticed that the excellent Perl utility unum <http://www.fourmilab.ch/webtools/unum/> uses an obsolete Unicode database. Since I'm a Pythonista, I recalled hearing about the stdlib unicodedata module, which I wanted to use either to rewrite unum or to extend its database.

Unfortunately, unicodedata is very limited. Partly rightfully so, since you can convert codepoints and chars with chr() and ord(), and str.upper() and friends are Unicode-aware. But the name database is only queryable using full names! I want to do unicodedata.search('clock') and get a list of dozens of glyphs with names like CLOCKWISE RIGHTWARDS AND LEFTWARDS OPEN CIRCLE ARROWS and CLOCK FACE THREE-THIRTY.

Maybe this should spit out a list of (name, char) tuples? Or a {name: char} dict? What do you think?
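(For concreteness, here is a minimal sketch of the kind of lookup being proposed — the name `search`, the linear scan, and the (name, char) return shape are all hypothetical, i.e. exactly what this thread is debating:)

```python
import sys
import unicodedata

def search(substring):
    """Hypothetical unicodedata.search(): yield (name, char) pairs
    whose Unicode name contains `substring`."""
    substring = substring.upper()
    for i in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(i), '')  # '' for unnamed code points
        if substring in name:
            yield name, chr(i)
```

search('clock') would then yield dozens of pairs, CLOCK FACE THREE-THIRTY among them.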
On 03.10.2014 23:10, Philipp A. wrote:

But the name database is only queryable using full names! I want to do unicodedata.search('clock') and get a list of dozens of glyphs with names like CLOCKWISE RIGHTWARDS AND LEFTWARDS OPEN CIRCLE ARROWS and CLOCK FACE THREE-THIRTY.
You should be able to code this as a PyPI package. I don't think it's a use case that warrants making the unicodedata module more complex.

-- Marc-Andre Lemburg
M.-A. Lemburg writes:
On 03.10.2014 23:10, Philipp A. wrote:
Unfortunately, unicodedata is very limited.
Philipp, do you really mean *very* limited? If so, I wonder what else you think is missing besides "fuzzy" name lookup. The UCD is defined by the standard, and AFAICS access to all properties is provided.
But the name database is only queryable using full names! I want to do unicodedata.search('clock') and get a list of dozens of glyphs
You should be able to code this as a PyPI package. I don't think it's a use case that warrants making the unicodedata module more complex.
I think it's unfortunate that unicodedata is limited in this particular way, since the database is in C, and as you point out hardly extensible. For example, as a native English speaker who enjoys wordplay I was able to guess which euphemism is the source of the name of U+1F4A9 without looking it up, but I doubt a non-native would be able to. A builtin ability to do fuzzy searches ("unicodenames.startswith('PILE OF')") would be useful.

OTOH, a little thought convinced me that I don't know the TOOWTDI for fuzzy search here:

- regexp: database will be a huge string or similar
- startswith, endswith, contains: probably sufficient, but I suppose one would like at least conjunction and disjunction operations:

      unicodematch.contains('GREEK', 'SMALL', 'ALPHA', op='and')
      unicodematch.startswith('PIECE OF', 'PILE OF', op='or')

  (OK, that's pretty horrible, but it gives an idea.)
- something else?
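(To make the horribleness concrete, a rough pure-Python sketch of that hypothetical `unicodematch` interface — neither the module nor its functions exist anywhere:)

```python
import sys
import unicodedata

def _named_chars():
    # Every code point that has a name, as (char, name) pairs.
    for i in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(i), '')
        if name:
            yield chr(i), name

def contains(*terms, op='and'):
    # Hypothetical unicodematch.contains().
    combine = all if op == 'and' else any
    return [(c, n) for c, n in _named_chars()
            if combine(t in n for t in terms)]

def startswith(*prefixes, op='or'):
    # Hypothetical unicodematch.startswith().
    combine = all if op == 'and' else any
    return [(c, n) for c, n in _named_chars()
            if combine(n.startswith(p) for p in prefixes)]
```

contains('GREEK', 'SMALL', 'ALPHA', op='and') would return, among others, ('α', 'GREEK SMALL LETTER ALPHA').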
On Sat, Oct 4, 2014 at 1:17 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
- startswith, endswith, contains: probably sufficient, but I suppose one would like at least conjunction and disjunction operations:

      unicodematch.contains('GREEK', 'SMALL', 'ALPHA', op='and')
      unicodematch.startswith('PIECE OF', 'PILE OF', op='or')

  (OK, that's pretty horrible, but it gives an idea.)
There's an easier way, though it would take a bit of setup work. Start by building up an actual list in RAM of [unicodedata.name(chr(i)) for i in range(sys.maxunicode+1)] and then do regular string operations. I'm fairly sure most Python programmers can figure out how to search a list of strings according to whatever rules they like - maybe using contains/startswith/endswith, or maybe regexps, or whatever. ChrisA
On Sat, Oct 04, 2014 at 03:50:33PM +1000, Chris Angelico wrote:
On Sat, Oct 4, 2014 at 1:17 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
- startswith, endswith, contains: probably sufficient, but I suppose one would like at least conjunction and disjunction operations:

      unicodematch.contains('GREEK', 'SMALL', 'ALPHA', op='and')
      unicodematch.startswith('PIECE OF', 'PILE OF', op='or')

  (OK, that's pretty horrible, but it gives an idea.)
There's an easier way, though it would take a bit of setup work. Start by building up an actual list in RAM of [unicodedata.name(chr(i)) for i in range(sys.maxunicode+1)] and then do regular string operations. I'm fairly sure most Python programmers can figure out how to search a list of strings according to whatever rules they like - maybe using contains/startswith/endswith, or maybe regexps, or whatever.
```python
py> x = [unicodedata.name(chr(i)) for i in range(sys.maxunicode+1)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
ValueError: no such name
```

There are 1114112 such code points, and most of them are unused. Some of the used ones don't have names:

```python
py> unicodedata.name('\0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
```

But even once you deal with those complications, you'll end up duplicating information which (I presume) Python already has, and still end up needing to do a linear search in slow Python code looking for what you want. I think there are probably better solutions. Or at least, I hope there are better solutions :-)

-- Steven
Chris Angelico writes:
Start by building up an actual list in RAM of [unicodedata.name(chr(i)) for i in range(sys.maxunicode+1)] and then do regular string operations. I'm fairly sure most Python programmers can figure out how to search a list of strings according to whatever rules they like - maybe using contains/startswith/endswith, or maybe regexps, or whatever.
OK. Times are quite imprecise, but after importing re, sys, unicodedata:

```python
>>> names = [unicodedata.name(chr(i)) for i in range(sys.maxunicode+1)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
ValueError: no such name
```

oops, although you didn't actually claim that would work. :-) (BTW, chr(0) has no name. At least it was instantaneous. :-) Then

```python
>>> names = []
>>> for i in range(sys.maxunicode+1):
...     try:
...         names.append(unicodedata.name(chr(i)))
...     except ValueError:
...         pass
...
```

takes between 1 and 2 seconds, while

```python
>>> names.index("PILE OF POO")
61721
>>> "PILE OF POO" in names
True
```

is instantaneous. Note: 61721 is *much* smaller than 0x1F4A9, because the unnamed code points were skipped. And now

```python
>>> pops = [name for name in names if re.match("^P\\S* O.* P", name)]
>>> pops
['PILE OF POO']
```

takes just noticeable time (250ms, maybe?). This on a 4-year-old 2.7GHz i7 MacBook Pro running "Mavericks". Plenty good for my use cases.
On Sat, Oct 4, 2014 at 4:47 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
```python
>>> names = [unicodedata.name(chr(i)) for i in range(sys.maxunicode+1)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
ValueError: no such name
```
oops, although you didn't actually claim that would work. :-) (BTW, chr(0) has no name. At least it was instantaneous. :-)
Oops, forgot about that. Yet another case where the absence of PEP 463 forces the function to have an additional argument:

```python
names = [unicodedata.name(chr(i), '') for i in range(sys.maxunicode+1)]
```

Now it works. Sorry for the omission, this is what happens when code is typed straight into the email without testing :)
Then
```python
>>> names = []
>>> for i in range(sys.maxunicode+1):
...     try:
...         names.append(unicodedata.name(chr(i)))
...     except ValueError:
...         pass
...
```
I would recommend appending a shim in the ValueError branch, to allow the indexing to be correct. Which would look something like this:

```python
# PEP 463 except-expression syntax -- not valid Python:
names = [unicodedata.name(chr(i)) except ValueError: ''
         for i in range(sys.maxunicode+1)]
```

Or, since name() does indeed have a 'default' parameter, the code from above. :)
takes between 1 and 2 seconds, while
names.index("PILE OF POO") 61721 "PILE OF POO" in names True
is instantaneous. Note: 61721 is *much* smaller than 0x1F4A9.
names.index("PILE OF POO") 128169 hex(_).upper() '0X1F4A9'
And still instantaneous. Of course, a prefix search is a bit slower:
```python
>>> [i for i, s in enumerate(names) if s.startswith("PILE")]
[128169]
```
Takes about 1s on my aging Windows laptop, where the building of the list takes about 4s, so it should be quicker on your system. The big downside, I guess, is the RAM usage.
```python
>>> sys.getsizeof(names)
4892352
>>> sum(sys.getsizeof(n) for n in names)
30698194
```
That's ~32MB of stuff stored, just to allow these lookups. ChrisA
On Sat, Oct 04, 2014 at 05:13:18PM +1000, Chris Angelico wrote: [...]
The big downside, I guess, is the RAM usage.
```python
>>> sys.getsizeof(names)
4892352
>>> sum(sys.getsizeof(n) for n in names)
30698194
```
That's ~32MB of stuff stored, just to allow these lookups.
And presumably it is already stored, to support \N{} and unicodedata.lookup(). For reference, UnicodeData.txt is a 1.4MB text file. -- Steven
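(For reference, the exact-name direction that unicodedata does support today:)

```python
import unicodedata

unicodedata.lookup('PILE OF POO')   # '\U0001F4A9'
unicodedata.name('\U0001F4A9')      # 'PILE OF POO'
"\N{PILE OF POO}"                   # same character, resolved at compile time
```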
You're right, all of this works. Iterating over all of Unicode simply looked too big a task for me, so I didn't consider it, but apparently it works well enough.

Yet one puzzle piece is missing: blocks. Python has no built-in information about Unicode blocks (which are basically range()s with associated names). An API involving blocks would need a way to enumerate them, to get the range for a name, and the name for a char/codepoint.
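(Blocks live in a separate UCD file, Blocks.txt <http://www.unicode.org/Public/UNIDATA/Blocks.txt>, which CPython doesn't ship; here is a sketch of those three operations, assuming you have downloaded a copy of that file yourself:)

```python
def parse_blocks(path='Blocks.txt'):
    """Parse lines like '0000..007F; Basic Latin' into (lo, hi, name)."""
    blocks = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.split('#', 1)[0].strip()   # drop comments
            if not line:
                continue
            span, name = line.split(';')
            lo, hi = span.split('..')
            blocks.append((int(lo, 16), int(hi, 16), name.strip()))
    return blocks

def block_range(blocks, name):
    """The range() of code points for a block name."""
    for lo, hi, n in blocks:
        if n == name:
            return range(lo, hi + 1)
    raise KeyError(name)

def block_of(blocks, char):
    """The block name for a char, or None in unassigned gaps."""
    cp = ord(char)
    for lo, hi, name in blocks:
        if lo <= cp <= hi:
            return name
```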
Oh, and name aliases seem to be partially supported already: unicodedata.lookup('BEL') works, but there's no way to do the reverse operation. So I suggest introducing:

1. everything from https://github.com/nagisa/unicodeblocks
2. unicodedata.names(chr) → a list of the primary name and all aliases, possibly empty (therefore no default)
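(A sketch of that hypothetical names(), with the alias table read from the UCD's NameAliases.txt — the same data CPython's own alias support for lookup() is generated from:)

```python
import unicodedata

def load_aliases(path='NameAliases.txt'):
    """Parse lines like '0007;BEL;abbreviation' into {char: [alias, ...]}."""
    aliases = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.split('#', 1)[0].strip()
            if not line:
                continue
            cp, alias, _kind = line.split(';')
            aliases.setdefault(chr(int(cp, 16)), []).append(alias)
    return aliases

def names(char, aliases):
    """Hypothetical unicodedata.names(): primary name plus all aliases."""
    result = []
    primary = unicodedata.name(char, '')
    if primary:
        result.append(primary)
    return result + aliases.get(char, [])
```

names('\x07', load_aliases()) should then give something like ['ALERT', 'BEL'] — no primary name, two aliases.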
On Sat, Oct 04, 2014 at 12:17:58PM +0900, Stephen J. Turnbull wrote:
M.-A. Lemburg writes:
On 03.10.2014 23:10, Philipp A. wrote:
Unfortunately, unicodedata is very limited.
Phillip, do you really mean *very* limited? If so, I wonder what else you think is missing besides "fuzzy" name lookup. The UCD is defined by the standard, and AFAICS access to all properties is provided.
Hmmm. There's a lot of properties in Unicode, and I'm pretty sure that unicodedata does not give access to *all* of them. Here's a line from UnicodeData.txt: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

```
04BF;CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER;Ll;0;L;;;;;N;CYRILLIC SMALL LETTER IE HOOK OGONEK;;04BE;;04BE
```

There are 15 semi-colon separated fields. The zeroth is the code point, the others are described here: http://www.unicode.org/Public/5.1.0/ucd/UCD.html#UnicodeData.txt

I don't believe that there is any way to get access to all 14 (excluding the code point itself) fields. E.g. how do I find out the "Unicode_1_Name"? And UnicodeData.txt is only one of many Unicode databases. See the UCD.html link above.
But the name database is only queryable using full names! I want to do unicodedata.search('clock') and get a list of dozens of glyphs
You should be able to code this as a PyPI package. I don't think it's a use case that warrants making the unicodedata module more complex.
I think it's unfortunate that unicodedata is limited in this particular way, since the database is in C, and as you point out hardly extensible. For example, as a native English speaker who enjoys wordplay I was able to guess which euphemism is the source of the name of U+1F4A9 without looking it up, but I doubt a non-native would be able to. A builtin ability to do fuzzy searches ("unicodenames.startswith('PILE OF')") would be useful.
I would love it if unicodedata exposed the full UnicodeData.txt database in some efficient format. That would allow people to scratch their own itch without having to duplicate the UnicodeData.txt database. Failing that, the two features I miss the most are:

(1) fuzzy_lookup(glob): Return an iterator which yields (ordinal, name) for each Unicode code point which matches the glob:

- names beginning with a substring: fuzzy_lookup("SPAM*")
- names ending with a substring: fuzzy_lookup("*SPAM")
- names containing a substring: fuzzy_lookup("SPAM")

(2) get_data(ordinal_or_character): Return a namedtuple with 15 fields, taken directly from the UnicodeData.txt database.

The first function solves the very common problem of "I kind of know what the character is called, but not exactly", the second would allow people to code their own arbitrary lookups.

-- Steven
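(A pure-Python approximation of that proposed fuzzy_lookup — the function itself is hypothetical; fnmatch supplies the glob matching:)

```python
import sys
import unicodedata
from fnmatch import fnmatchcase

def fuzzy_lookup(glob):
    """Hypothetical: yield (ordinal, name) for names matching `glob`.

    A bare substring (no '*' at either end) matches anywhere in the name.
    """
    if not glob.startswith('*') and not glob.endswith('*'):
        glob = '*' + glob + '*'
    for i in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(i), '')
        if name and fnmatchcase(name, glob):
            yield i, name
```

For example, list(fuzzy_lookup('PILE OF*')) gives [(128169, 'PILE OF POO')].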
On Oct 4, 2014, at 8:21, Steven D'Aprano <steve@pearwood.info> wrote:
1) fuzzy_lookup(glob): Return iterator which yields (ordinal, name) for each unicode code point which matches the glob.
Names beginning with a substring: fuzzy_lookup("SPAM*")
Names ending with a substring: fuzzy_lookup("*SPAM")
Names containing a substring: fuzzy_lookup("SPAM")
Surely that last one is "*SPAM*", right? Otherwise this is a weird sort of glob, where * doesn't match anything on this end but instead constrains the opposite end or something.

At any rate, why would you expect glob here? There's really nothing else in Python that uses glob patterns except for glob/fnmatch, which are explicitly matching equivalent OS services. It doesn't seem any more natural to think of the database as a directory of files than as a file of text or a database of key-value pairs, so why not a regex, or a SQL LIKE pattern, or something else?
On Sat, Oct 04, 2014 at 11:28:52AM +0200, Andrew Barnert wrote:
On Oct 4, 2014, at 8:21, Steven D'Aprano <steve@pearwood.info> wrote:
1) fuzzy_lookup(glob): Return iterator which yields (ordinal, name) for each unicode code point which matches the glob.
Names beginning with a substring: fuzzy_lookup("SPAM*")
Names ending with a substring: fuzzy_lookup("*SPAM")
Names containing a substring: fuzzy_lookup("SPAM")
Surely that last one is "*SPAM*", right?
It's a fuzzy lookup, not an exact lookup, so by default it matches the substring anywhere in the string. (If you want an exact name lookup, unicodedata already supports that.) You could write "*SPAM*" of course, but the stars would be redundant. I'm not trying to match the full range of shell globs, I'm just suggesting the minimum set of features I want. The only metacharacter I can see a practical use for is *. If you can think of uses for other metacharacters, feel free to propose them.
Otherwise this is a weird sort of glob where * doesn't match anything on this end, it instead constrains the opposite end or something.
I don't quite understand what you are trying to say here.
At any rate, why would you expect glob here? There's really nothing else in Python that uses glob patterns except for glob/fnmatch, which are explicitly matching equivalent OS services. It doesn't seem any more natural to think of the database as a directory of files than as a file of text or a database of key values, so why not a regex, or a SQL like pattern, or something else?
Because globs are simpler than regexes, and easier to use. They support the most common (or at least what I think will be the most common) use-cases: matching something that contains, ends with or starts with a substring.

(Globbing may be most well-known from shells, but there is nothing about glob syntax that is limited to matching file names. It's a string matching language, which the shell happens to use to match file names.)

I don't see a use for supporting the full range of regexes. As far as I am concerned, globbing is complicated enough for what I need, and full support for arbitrary regexes is YAGNI.

-- Steven
On 4 October 2014 11:26, Steven D'Aprano <steve@pearwood.info> wrote:
I don't see a use for supporting the full range of regexes. As far as I am concerned, globbing is complicated enough for what I need, and full support for arbitrary regexes is YAGNI.
I don't know how unicodedata is implemented, but would it be practical to simply expose a function that iterates over every name in the database? Then you could simply do

```python
(name for name in unicodedata.names() if name.startswith(prefix))
```

Paul.
On Oct 4, 2014, at 16:23, Paul Moore <p.f.moore@gmail.com> wrote:
On 4 October 2014 11:26, Steven D'Aprano <steve@pearwood.info> wrote:
I don't see a use for supporting the full range of regexes. As far as I am concerned, globbing is complicated enough for what I need, and full support for arbitrary regexes is YAGNI.
I don't know how unicodedata is implemented, but would it be practical to simply expose a function that iterates over every name in the database? Then you could simply do
```python
(name for name in unicodedata.names() if name.startswith(prefix))
```
IIRC, the perl UCD CPAN package and the ruby unicodedata gem expose the name to code and code to name mappings as hashes. Doing the equivalent in Python would allow you to do anything you want (including exactly that same line of code).
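(For the record, the Python equivalent of those two hashes is only a few lines — at the cost, per the numbers above, of a few seconds and some tens of MB at build time:)

```python
import sys
import unicodedata

name_to_char = {}
char_to_name = {}
for i in range(sys.maxunicode + 1):
    name = unicodedata.name(chr(i), '')
    if name:
        name_to_char[name] = chr(i)
        char_to_name[chr(i)] = name
```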
Serhiy Storchaka writes:

I don't know how unicodedata is implemented, but would it be practical to simply expose a function that iterates over every name in the database? Then you could simply do

```python
(name for name in unicodedata.names() if name.startswith(prefix))
```

```python
# the same, using only what unicodedata already provides today:
(unicodedata.name(c, '') for c in map(chr, range(sys.maxunicode))
 if unicodedata.name(c, '').startswith(prefix))
```
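(A nit: that expression evaluates name() twice per code point, and range(sys.maxunicode) stops one short of U+10FFFF — harmless, since that last code point is unnamed, but a nested generator tidies both up:)

```python
import sys
import unicodedata

prefix = 'PILE OF'  # whatever you're searching for
matches = (name
           for name in (unicodedata.name(chr(i), '')
                        for i in range(sys.maxunicode + 1))
           if name.startswith(prefix))
```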
participants (8)

- Andrew Barnert
- Chris Angelico
- M.-A. Lemburg
- Paul Moore
- Philipp A.
- Serhiy Storchaka
- Stephen J. Turnbull
- Steven D'Aprano