[Python-ideas] Extend unicodedata with a name search

Philipp A. flying-sheep at web.de
Fri Oct 10 12:05:20 CEST 2014


you’re right, all of this works.

iterating over all of unicode simply looked too big a task to me, so i
didn’t consider it, but apparently it works well enough.
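
for illustration, a minimal sketch of such a search that walks all
codepoints lazily, so no list of every name has to be kept in memory.
`search_names` is a hypothetical helper name, not something unicodedata
provides; only `unicodedata.name()` with its default argument is real API:

```python
import sys
import unicodedata


def search_names(fragment):
    """Yield (codepoint, name) for each character whose name contains fragment."""
    fragment = fragment.upper()  # character names are upper case
    for cp in range(sys.maxunicode + 1):
        # The second argument is returned for codepoints without a name.
        name = unicodedata.name(chr(cp), '')
        if name and fragment in name:
            yield cp, name
```

calling `dict(search_names("pile of poo"))` should map 0x1F4A9 to its
name. every call rescans the whole range, so this trades repeated scan
time for the memory a prebuilt list would need.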

yet one puzzle piece is missing: blocks.

python has no built-in information about unicode blocks (which are
basically range()s with associated names).

an API involving blocks would need a way to enumerate them, to get the
range for a name, and the name for a char/codepoint.
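
a rough sketch of what those three operations could look like. the
function names (`blocks`, `block_range`, `block_of`) are made up for
illustration, and the table is only a tiny hand-copied excerpt; a real
implementation would presumably load the full Blocks.txt from the
Unicode Character Database:

```python
import bisect

# Hypothetical excerpt of the block list: (first, last, name), sorted by first.
_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x1F300, 0x1F5FF, "Miscellaneous Symbols and Pictographs"),
    (0x1F600, 0x1F64F, "Emoticons"),
]
_STARTS = [start for start, _end, _name in _BLOCKS]


def blocks():
    """Enumerate the known block names in codepoint order."""
    return [name for _start, _end, name in _BLOCKS]


def block_range(name):
    """Return the range() of codepoints covered by the named block."""
    for start, end, block_name in _BLOCKS:
        if block_name == name:
            return range(start, end + 1)
    raise KeyError(name)


def block_of(char):
    """Return the block name for a character or codepoint, or None."""
    cp = char if isinstance(char, int) else ord(char)
    # Find the last block that starts at or before cp, then check its end.
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= _BLOCKS[i][1]:
        return _BLOCKS[i][2]
    return None
```

with this excerpt, `block_of('\U0001F4A9')` gives "Miscellaneous Symbols
and Pictographs", and `block_range("Cyrillic")` gives
`range(0x400, 0x500)`. codepoints outside the listed blocks return None.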

2014-10-04 9:13 GMT+02:00 Chris Angelico <rosuav at gmail.com>:

> On Sat, Oct 4, 2014 at 4:47 PM, Stephen J. Turnbull <stephen at xemacs.org>
> wrote:
> > >>> names = [unicodedata.name(chr(i)) for i in range(sys.maxunicode+1)]
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> >   File "<stdin>", line 1, in <listcomp>
> > ValueError: no such name
> >
> > oops, although you didn't actually claim that would work. :-)  (BTW,
> > chr(0) has no name.  At least it was instantaneous. :-)
>
> Oops, forgot about that. Yet another case where the absence of PEP 463
> forces the function to have an additional argument:
>
> names = [unicodedata.name(chr(i), '') for i in range(sys.maxunicode+1)]
>
> Now it works. Sorry for the omission, this is what happens when code
> is typed straight into the email without testing :)
>
> > Then
> >
> > >>> for i in range(sys.maxunicode+1):
> > ...  try:
> > ...   names.append(unicodedata.name(chr(i)))
> > ...  except ValueError:
> > ...   pass
> > ...
>
> I would recommend appending a shim in the ValueError branch, to allow
> the indexing to be correct. Which would look something like this:
>
> names = [unicodedata.name(chr(i)) except ValueError: '' for i in
> range(sys.maxunicode+1)]
>
> Or, since name() does indeed have a 'default' parameter, the code from
> above. :)
>
> > takes between 1 and 2 seconds, while
> >
> > >>> names.index("PILE OF POO")
> > 61721
> > >>> "PILE OF POO" in names
> > True
> >
> > is instantaneous.  Note: 61721 is *much* smaller than 0x1F4A9.
>
> >>> names.index("PILE OF POO")
> 128169
> >>> hex(_).upper()
> '0X1F4A9'
>
> And still instantaneous. Of course, a prefix search is a bit slower:
>
> >>> [i for i,s in enumerate(names) if s.startswith("PILE")]
> [128169]
>
> Takes about 1s on my aging Windows laptop, where the building of the
> list takes about 4s, so it should be quicker on your system.
>
> The big downside, I guess, is the RAM usage.
>
> >>> sys.getsizeof(names)
> 4892352
> >>> sum(sys.getsizeof(n) for n in names)
> 30698194
>
> That's roughly 34 MiB (35,590,546 bytes) of stuff stored, just to allow
> these lookups.
>
> ChrisA