[Tutor] regex: matching unicode
Steven D'Aprano
steve at pearwood.info
Sun Dec 23 05:12:47 CET 2012
On 23/12/12 07:53, Albert-Jan Roskam wrote:
> Hi,
>
> Is the code below the only/shortest way to match unicode characters?
No. You could install a more Unicode-aware regex engine, and use it instead
of Python's re module, where Unicode support is at best only partial.
Try this one:
http://pypi.python.org/pypi/regex
and report any issues to the author.
> I would like to match whatever is defined as a character in the unicode
>reference database. So letters in the broadest sense of the word,
Well, not really: actually, letters in the sense of the Unicode reference
database :-)
In the above regex module, I think you could write:
'\p{Alphabetic}'
or
'\p{L}'
but don't quote me on this.
>but not digits, underscore or whitespace. Until just now, I was convinced
>that the re.UNICODE flag generalized the [a-z] class to all unicode letters,
>and that the absence of re.U was an implicit 're.ASCII'. Apparently that
>mental model was *wrong*.
> But [^\W\s\d_]+ is kind of hard to read/write.
Of course it is. It's a regex.
> import re
> s = unichr(956) # mu sign
> m = re.match(ur"[^\W\s\d_]+", s, re.I | re.U)
Unfortunately that matches too much: in Python 2.7, it matches 340 non-letter
characters. Run this to see them:
import re
import unicodedata
MAXNUM = 0x10000 # one more than maximum unichr in Python "narrow builds"
regex = re.compile("[^\W\s\d_]+", re.I | re.U)
LETTERS = 'L|Ll|Lm|Lo|Lt|Lu'.split('|')
failures = []
kinds = set()
for c in map(unichr, range(MAXNUM)):
    if bool(re.match(regex, c)) != (unicodedata.category(c) in LETTERS):
        failures.append(c)
        kinds.add(unicodedata.category(c))
print kinds, len(failures)
The failures are all numbers with category Nl or No ("letterlike numeric
character" and "numeric character of other type"). You can see them with:
for c in failures:
    print c, unicodedata.category(c), unicodedata.name(c)
I won't show the full output, but a small sample includes:
² No SUPERSCRIPT TWO
¼ No VULGAR FRACTION ONE QUARTER
৴ No BENGALI CURRENCY NUMERATOR ONE
፹ No ETHIOPIC NUMBER EIGHTY
ᛮ Nl RUNIC ARLAUG SYMBOL
Ⅲ Nl ROMAN NUMERAL THREE
so you will probably have to post-process your matching results to exclude
these false-positives. Or just accept them.
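One way to do that post-processing, as a rough sketch only: it assumes Python 3 (where str patterns are Unicode-aware without re.U), and the names WORDISH, is_letter and letter_runs are made up for illustration. It keeps the same regex and then splits each match wherever a character outside the letter categories slipped through.

```python
import re
import unicodedata

# Same character class as in the thread; Python 3 str patterns are
# Unicode-aware by default, so no re.U flag is needed here.
WORDISH = re.compile(r"[^\W\s\d_]+")

def is_letter(ch):
    """True if ch is in a Unicode letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")

def letter_runs(text):
    """Return runs of letters, dropping non-letters the regex lets through."""
    runs = []
    for m in WORDISH.finditer(text):
        current = []
        for ch in m.group():
            if is_letter(ch):
                current.append(ch)
            elif current:
                # A non-letter (e.g. category Nl or No) ends the current run.
                runs.append("".join(current))
                current = []
        if current:
            runs.append("".join(current))
    return runs
```

Whether a false positive should split a run or merely be deleted from it depends on what you are matching; splitting is just the choice made in this sketch.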
If you have a "wide build", or Python 3.3, you can extend the test to the
full Unicode range of 0x110000. When I do that, I find 684 false matches,
all in category Nl and No.
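For reference, here is roughly how the same check might look on Python 3, where chr() covers the whole range and sys.maxunicode is 0x10FFFF. Treat it as a sketch, not a tested port of the script above: the exact count depends on the Unicode database shipped with your Python, so I make no claim about the numbers it prints.

```python
import re
import sys
import unicodedata

# Python 3 str patterns are Unicode-aware by default, so re.U is omitted.
pattern = re.compile(r"[^\W\s\d_]+")

failures = []
kinds = set()
for codepoint in range(sys.maxunicode + 1):
    c = chr(codepoint)
    category = unicodedata.category(c)
    # A mismatch is a character the regex accepts that is not a letter,
    # or a letter that the regex rejects.
    if bool(pattern.match(c)) != category.startswith("L"):
        failures.append(c)
        kinds.add(category)

print(sorted(kinds), len(failures))
```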
--
Steven