[Python-ideas] Extend unicodedata with a name search

Fri Oct 10 14:43:15 CEST 2014

oh, and name aliases could be supported too – unicodedata.lookup('BEL') works,
but there’s no way to do the reverse operation.

so i suggest to introduce:

1. everything from https://github.com/nagisa/unicodeblocks
2. unicodedata.names(chr) → list of primary name and all aliases, possibly
empty (therefore no default)
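
A rough sketch of what point 2 could look like, under some assumptions:
names() here is a plain function, not an existing unicodedata API, and the
alias table is a tiny hand-copied excerpt in the format of the UCD file
NameAliases.txt (code point;alias;type). A real version would need the
full table.

```python
import unicodedata

# Excerpt in the format of NameAliases.txt; only two code points are covered.
NAME_ALIASES_EXCERPT = """\
0000;NULL;control
0000;NUL;abbreviation
0007;ALERT;control
0007;BEL;abbreviation
"""


def parse_aliases(text):
    """Map code point -> list of aliases, in file order."""
    aliases = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, alias, _kind = line.split(';')
        aliases.setdefault(int(code, 16), []).append(alias)
    return aliases


ALIASES = parse_aliases(NAME_ALIASES_EXCERPT)


def names(ch):
    """Hypothetical names(): primary name (if any) followed by all aliases."""
    result = []
    primary = unicodedata.name(ch, None)
    if primary is not None:
        result.append(primary)
    result.extend(ALIASES.get(ord(ch), []))
    return result
```

With this excerpt, names('\x07') gives the two aliases and no primary name,
and a character without any name at all gives an empty list, which is why
no default argument is needed.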

2014-10-10 12:05 GMT+02:00 Philipp A. <flying-sheep at web.de>:

> you’re right, all of this works.
>
> iterating over all of unicode simply looked too big a task for me, so i
> didn’t consider it, but apparently it works well enough.
>
> yet one puzzle piece is missing: blocks.
>
> python has no built-in information about unicode blocks (which are
> basically range()s with associated names).
>
> an API involving blocks would need a way to enumerate them, to get the
> range for a name, and the name for a char/codepoint.
>
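To make the three block operations concrete, here is a minimal sketch. The
function names (blocks, block_range, block_of) are made up for illustration
and are not taken from any existing module, and the table is only a small
excerpt of the ranges listed in the UCD file Blocks.txt.

```python
import bisect

# Small excerpt of Blocks.txt as (first, last, name), sorted by first code point.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x1F300, 0x1F5FF, "Miscellaneous Symbols and Pictographs"),
]
_STARTS = [first for first, _last, _name in BLOCKS]


def blocks():
    """Enumerate the known block names."""
    return [name for _first, _last, name in BLOCKS]


def block_range(name):
    """Return the range() of code points for a block name."""
    for first, last, block_name in BLOCKS:
        if block_name == name:
            return range(first, last + 1)
    raise KeyError(name)


def block_of(ch, default=None):
    """Return the block name for a character, or default if none matches."""
    cp = ord(ch)
    # Find the last block that starts at or before cp, then check its end.
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0:
        _first, last, name = BLOCKS[i]
        if cp <= last:
            return name
    return default
```

Because blocks never overlap, a sorted list plus bisect is enough for the
char-to-block direction; no per-character table is needed.
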
> 2014-10-04 9:13 GMT+02:00 Chris Angelico <rosuav at gmail.com>:
>
>> On Sat, Oct 4, 2014 at 4:47 PM, Stephen J. Turnbull <stephen at xemacs.org>
>> wrote:
>> >>>> names = [unicodedata.name(chr(i)) for i in range(sys.maxunicode+1)]
>> > Traceback (most recent call last):
>> >   File "<stdin>", line 1, in <module>
>> >   File "<stdin>", line 1, in <listcomp>
>> > ValueError: no such name
>> >
>> > oops, although you didn't actually claim that would work. :-)  (BTW,
>> > chr(0) has no name.  At least it was instantaneous. :-)
>>
>> Oops, forgot about that. Yet another case where the absence of PEP 463
>> forces the function to have an additional argument:
>>
>> names = [unicodedata.name(chr(i), '') for i in range(sys.maxunicode+1)]
>>
>> Now it works. Sorry for the omission, this is what happens when code
>> is typed straight into the email without testing :)
>>
>> > Then
>> >
>> >>>> for i in range(sys.maxunicode+1):
>> > ...  try:
>> > ...   names.append(unicodedata.name(chr(i)))
>> > ...  except ValueError:
>> > ...   pass
>> > ...
>>
>> I would recommend appending a shim in the ValueError branch, to allow
>> the indexing to be correct. With the except-expression syntax proposed
>> in PEP 463 (not valid Python), that would look something like this:
>>
>> names = [unicodedata.name(chr(i)) except ValueError: '' for i in
>> range(sys.maxunicode+1)]
>>
>> Or, since name() does indeed have a 'default' parameter, the code from
>> above. :)
>>
>> > takes between 1 and 2 seconds, while
>> >
>> >>>> names.index("PILE OF POO")
>> > 61721
>> >>>> "PILE OF POO" in names
>> > True
>> >
>> > is instantaneous.  Note: 61721 is *much* smaller than 0x1F4A9.
>>
>> >>> names.index("PILE OF POO")
>> 128169
>> >>> hex(_).upper()
>> '0X1F4A9'
>>
>> And still instantaneous. Of course, a prefix search is a bit slower:
>>
>> >>> [i for i,s in enumerate(names) if s.startswith("PILE")]
>> [128169]
>>
>> Takes about 1s on my aging Windows laptop, where the building of the
>> list takes about 4s, so it should be quicker on your system.
>>
>> The big downside, I guess, is the RAM usage.
>>
>> >>> sys.getsizeof(names)
>> 4892352
>> >>> sum(sys.getsizeof(n) for n in names)
>> 30698194
>>
>> That's ~34MB of stuff stored, just to allow these lookups.
>>
>> ChrisA
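
If the RAM usage is the main concern, one alternative is to not keep the
list at all and scan lazily. This is only a sketch: search_names is a
made-up helper, and it pays the full scan cost on every call instead of
paying the memory cost once.

```python
import sys
import unicodedata


def search_names(prefix):
    """Yield (code point, name) for each character whose name starts with prefix."""
    for cp in range(sys.maxunicode + 1):
        # The '' default covers characters without a name, as in the quoted code.
        name = unicodedata.name(chr(cp), '')
        if name and name.startswith(prefix):
            yield cp, name
```

For example, next(search_names("PILE OF POO")) stops as soon as the first
match is found, and list(search_names("PILE")) corresponds to the prefix
search quoted above.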