oh, and name aliases are only supported in one direction – unicodedata.lookup('BEL') works, but there’s no way to do the reverse (get the aliases of a character).

so i suggest introducing:

1. everything from https://github.com/nagisa/unicodeblocks
2. unicodedata.names(chr) → list of primary name and all aliases, possibly empty (therefore no default)
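
a rough sketch of how point 2 could behave, just to make the idea concrete. the names() function and the ALIASES table are hypothetical (nothing like them exists in unicodedata today), and the table is a tiny hand-written excerpt; a real implementation would take its data from the UCD's NameAliases.txt:

```python
import unicodedata

# Hypothetical, hand-written alias table (assumption for illustration only).
# A real implementation would load the full data from NameAliases.txt.
ALIASES = {
    "\x00": ["NULL", "NUL"],
    "\x07": ["ALERT", "BEL"],
}


def names(char):
    """Return the primary name (if any) followed by all known aliases.

    The result may be empty, so no 'default' argument is needed.
    """
    result = []
    primary = unicodedata.name(char, None)  # None when there is no primary name
    if primary is not None:
        result.append(primary)
    result.extend(ALIASES.get(char, []))
    return result
```

with this, names('a') gives a one-element list, names('\x07') gives only the aliases (control characters have no primary name), and a noncharacter such as '\uffff' gives an empty list.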

2014-10-10 12:05 GMT+02:00 Philipp A. <flying-sheep@web.de>:
you’re right, all of this works.

iterating over all of unicode simply looked too big a task for me, so i didn’t consider it, but apparently it works well enough.

yet one puzzle piece is missing: blocks.

python has no built-in information about unicode blocks (which are basically range()s with associated names).

an API involving blocks would need a way to enumerate them, to get the range for a name, and the name for a char/codepoint.
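
a minimal sketch of those three operations, assuming a small hand-copied excerpt of the UCD's Blocks.txt. the names BLOCKS, block_names, block_range and block_of are made up for illustration and are not an existing API:

```python
from bisect import bisect_right

# Hand-copied excerpt of Blocks.txt as (first, last, name), sorted by 'first'.
# Assumption for illustration: a real table would cover every block.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
]
_STARTS = [first for first, _, _ in BLOCKS]


def block_names():
    """Enumerate the names of all known blocks."""
    return [name for _, _, name in BLOCKS]


def block_range(name):
    """Return the range() of code points belonging to the named block."""
    for first, last, block_name in BLOCKS:
        if block_name == name:
            return range(first, last + 1)
    raise KeyError(name)


def block_of(char):
    """Return the block name for a character, or None if not in the table."""
    cp = ord(char)
    i = bisect_right(_STARTS, cp) - 1  # last block starting at or before cp
    if i >= 0 and cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None
```

since the blocks are sorted and don't overlap, the char → name lookup is a binary search. code points that fall into a gap of this excerpt simply return None.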

2014-10-04 9:13 GMT+02:00 Chris Angelico <rosuav@gmail.com>:
On Sat, Oct 4, 2014 at 4:47 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
>>>> names = [unicodedata.name(chr(i)) for i in range(sys.maxunicode+1)]
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "<stdin>", line 1, in <listcomp>
> ValueError: no such name
>
> oops, although you didn't actually claim that would work. :-)  (BTW,
> chr(0) has no name.  At least it was instantaneous. :-)

Oops, forgot about that. Yet another case where the absence of PEP 463
forces the function to have an additional argument:

names = [unicodedata.name(chr(i), '') for i in range(sys.maxunicode+1)]

Now it works. Sorry for the omission, this is what happens when code
is typed straight into the email without testing :)

> Then
>
>>>> for i in range(sys.maxunicode+1):
> ...  try:
> ...   names.append(unicodedata.name(chr(i)))
> ...  except ValueError:
> ...   pass
> ...

I would recommend appending a shim in the ValueError branch, to allow
the indexing to be correct. Which, in the syntax proposed by PEP 463
(not valid Python as it stands), would look something like this:

names = [unicodedata.name(chr(i)) except ValueError: '' for i in
range(sys.maxunicode+1)]

Or, since name() does indeed have a 'default' parameter, the code from above. :)

> takes between 1 and 2 seconds, while
>
>>>> names.index("PILE OF POO")
> 61721
>>>> "PILE OF POO" in names
> True
>
> is instantaneous.  Note: 61721 is *much* smaller than 0x1F4A9.

>>> names.index("PILE OF POO")
128169
>>> hex(_).upper()
'0X1F4A9'

And still instantaneous. Of course, a prefix search is a bit slower:

>>> [i for i,s in enumerate(names) if s.startswith("PILE")]
[128169]

Takes about 1s on my aging Windows laptop, where the building of the
list takes about 4s, so it should be quicker on your system.
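
If prefix searches are common, one possible way to speed them up is to keep a second, sorted index and bisect into it. This is only a sketch: the names index and with_prefix are made up for illustration, and the extra index costs additional memory on top of the list above.

```python
import bisect
import sys
import unicodedata

# Sorted (name, code point) pairs for every character that has a name.
index = []
for cp in range(sys.maxunicode + 1):
    name = unicodedata.name(chr(cp), "")
    if name:
        index.append((name, cp))
index.sort()


def with_prefix(prefix):
    """Return the code points whose names start with prefix, in name order."""
    # (prefix,) sorts before any (name, cp) whose name starts with prefix,
    # so bisect_left lands on the first candidate.
    i = bisect.bisect_left(index, (prefix,))
    result = []
    while i < len(index) and index[i][0].startswith(prefix):
        result.append(index[i][1])
        i += 1
    return result
```

Building the index is still a full pass over all code points, but each with_prefix() call afterwards only touches the matching entries. Note that the results come back ordered by name rather than by code point.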

The big downside, I guess, is the RAM usage.

>>> sys.getsizeof(names)
4892352
>>> sum(sys.getsizeof(n) for n in names)
30698194

That's ~34MB of stuff stored (35,590,546 bytes for the list plus its
strings), just to allow these lookups.
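
If the memory matters more than the speed, one alternative is to not store the names at all and scan lazily on each query. A minimal sketch (iter_named is a made-up helper, not part of unicodedata):

```python
import sys
import unicodedata


def iter_named(predicate):
    """Yield (code point, name) for each named character matching predicate."""
    for cp in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(cp), "")  # '' for characters without a name
        if name and predicate(name):
            yield cp, name


hits = list(iter_named(lambda n: n.startswith("PILE OF POO")))
```

This trades RAM for time: nothing is kept between calls, but every query pays for a full pass over the code point range.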

ChrisA
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/