Extend unicodedata with a name/pattern/regex search for character entity references?
Ned Batchelder
ned at nedbatchelder.com
Sun Sep 4 19:40:40 EDT 2016
On Saturday, September 3, 2016 at 7:55:48 AM UTC-4, Veek. M wrote:
> https://mail.python.org/pipermail//python-ideas/2014-October/029630.htm
>
> Wanted to know if the above link idea, had been implemented and if
> there's a module that accepts a pattern like 'cap' and give you all the
> instances of unicode 'CAP' characters.
> ⋂ \bigcap
> ⊓ \sqcap
> ∩ \cap
> ♑ \capricornus
> ⪸ \succapprox
> ⪷ \precapprox
>
> (above's from tex)
>
> I found two useful modules in this regard: unicode_tex, unicodedata
> but unicodedata is a builtin which does not do globs, regexs - so it's
> kind of limiting in nature.
>
> Would be nice if you could search html/xml character entity references
> as well.
The unicodedata module has all the information you need for searching
Unicode character names. While it doesn't provide regex or globs, it's
all in-memory, so it's not bad for just iterating over the characters
and finding what you need.
But 'CAP' also appears in 'CAPITAL', so a plain substring test gives more than 1800 matches:
>>> import unicodedata
>>> for c in range(32, 0x110000):
...     try:
...         name = unicodedata.name(chr(c))
...     except ValueError:
...         # code point has no name (unassigned, control, surrogate, ...)
...         continue
...     if 'CAP' in name:
...         print(c, name)
...
65 LATIN CAPITAL LETTER A
66 LATIN CAPITAL LETTER B
..
.. many other lines, mostly with CAPITAL in them ..
..
917593 TAG LATIN CAPITAL LETTER Y
917594 TAG LATIN CAPITAL LETTER Z
>>>
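Since the request was for regex matching, the same iteration could be wrapped in a small helper built on the re module. This is only a sketch: search_names and its start/stop parameters are names made up here, not part of unicodedata, and the exact matches depend on the Unicode version your Python ships with.

```python
import re
import unicodedata

def search_names(pattern, start=32, stop=0x110000):
    """Yield (code point, name) for each named character whose name matches pattern.

    Hypothetical helper, not part of unicodedata.
    """
    regex = re.compile(pattern)
    for cp in range(start, stop):
        try:
            name = unicodedata.name(chr(cp))
        except ValueError:
            continue  # this code point has no name
        if regex.search(name):
            yield cp, name

# 'CAP' not followed by 'ITAL' skips the CAPITAL LETTER names.
for cp, name in search_names(r'CAP(?!ITAL)'):
    print(cp, name)
```

A negative lookahead is just one way to filter; a word-boundary pattern such as r'\bCAP\b' would narrow things further, to names that contain CAP as a whole word.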
These were the matching character names that don't contain "CAPITAL":
8419 COMBINING ENCLOSING KEYCAP
8851 SQUARE CAP
9232 SYMBOL FOR DATA LINK ESCAPE
9243 SYMBOL FOR ESCAPE
9809 CAPRICORN
11839 CAPITULUM
41657 YI SYLLABLE CAP
52290 HANGUL SYLLABLE CAP
66003 PHAISTOS DISC SIGN CAPTIVE
119050 MUSICAL SYMBOL DA CAPO
127750 CITYSCAPE AT DUSK
127891 GRADUATION CAP
127956 SNOW CAPPED MOUNTAIN
127961 CITYSCAPE
128287 KEYCAP TEN
128846 ALCHEMICAL SYMBOL FOR CAPUT MORTUUM
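For the HTML character entity references, the standard library's html.entities module has an html5 dictionary mapping entity names to their characters, so a similar brute-force search should work there too. A rough sketch, where search_entities is a made-up name and the case-insensitive substring match is my own assumption about what would be useful (entity names themselves are case-sensitive):

```python
from html.entities import html5

def search_entities(fragment):
    """Return sorted (entity name, characters) pairs whose name contains fragment.

    Hypothetical helper. Only names ending in ';' are kept, since html5 also
    lists some legacy names without the trailing semicolon.
    """
    fragment = fragment.lower()
    return sorted(
        (name, chars)
        for name, chars in html5.items()
        if name.endswith(';') and fragment in name.lower()
    )

for name, chars in search_entities('cap'):
    # Show code points rather than the characters themselves, so the
    # output doesn't depend on the terminal's encoding.
    points = ' '.join('U+%04X' % ord(ch) for ch in chars)
    print('&' + name, points)
```

This covers entities like &cap;, &bigcap; and &sqcap;, which line up with several of the TeX names in the original question.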
--Ned.