[Tutor] regex: matching unicode

Sun Dec 23 05:12:47 CET 2012

On 23/12/12 07:53, Albert-Jan Roskam wrote:
> Hi,
>
> Is the code below the only/shortest way to match unicode characters?

No. You could install a more Unicode-aware regex engine, and use it instead
of Python's re module, where Unicode support is at best only partial.

Try this one:

http://pypi.python.org/pypi/regex

and report any issues to the author.

> I would like to match whatever is defined as a character in the unicode
> reference database. So letters in the broadest sense of the word,

Well, not really: you want letters in the sense of the Unicode reference
database :-)

In the above regex module, I think you could write:

r'\p{Alphabetic}'

or

r'\p{L}'

but don't quote me on this.
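
For what it's worth, here is a minimal sketch of how that might look. It is
written for Python 3 and assumes the third-party regex module from the link
above is installed; the helper name leading_letters is made up for
illustration, and the unicodedata fallback is only there so the snippet still
runs without the module. Note that the two patterns are not the same set: as
far as I know the Alphabetic property also covers letter-like numbers such as
Roman numerals, so \p{L} is probably the closer fit for "letters only".

```python
import unicodedata

try:
    import regex  # third-party module: pip install regex
except ImportError:
    regex = None


def leading_letters(text):
    """Return the run of Unicode letters at the start of text, or ''."""
    if regex is not None:
        # \p{L} matches any character whose general category starts with L.
        match = regex.match(r"\p{L}+", text)
        return match.group(0) if match else ""
    # Fallback (an assumption for this sketch): check categories by hand.
    count = 0
    for ch in text:
        if not unicodedata.category(ch).startswith("L"):
            break
        count += 1
    return text[:count]
```

Called as leading_letters(u"\u03bcm\u00b2"), this should return the mu and
the "m" but stop before the superscript two.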

> but not digits, underscore or whitespace. Until just now, I was convinced
> that the re.UNICODE flag generalized the [a-z] class to all unicode letters,
> and that the absence of re.U was an implicit 're.ASCII'. Apparently that
> mental model was *wrong*.
> But [^\W\s\d_]+ is kind of hard to read/write.

Of course it is. It's a regex.

> import re
> s = unichr(956)  # mu sign
> m = re.match(ur"[^\W\s\d_]+", s, re.I | re.U)

Unfortunately that matches too much: in Python 2.7, it matches 340 non-letter
characters. Run this to see them:

import re
import unicodedata

MAXNUM = 0x10000   # one more than the maximum unichr in Python "narrow builds"
regex = re.compile(ur"[^\W\s\d_]+", re.I | re.U)
# The general categories that count as letters.
LETTERS = 'Ll|Lm|Lo|Lt|Lu'.split('|')
failures = []
kinds = set()
for c in map(unichr, range(MAXNUM)):
    # Record every character where the regex and the database disagree.
    if bool(regex.match(c)) != (unicodedata.category(c) in LETTERS):
        failures.append(c)
        kinds.add(unicodedata.category(c))

print kinds, len(failures)

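The failures are all numbers with category Nl or No ("Number, Letter" and
"Number, Other" in Unicode's terms). As far as I can tell, that is because
\w with re.U accepts anything the database treats as alphanumeric, while \d
only removes the decimal digits (category Nd). You can see them with: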

for c in failures:
    print c, unicodedata.category(c), unicodedata.name(c)

I won't show the full output, but a small sample includes:

² No SUPERSCRIPT TWO
¼ No VULGAR FRACTION ONE QUARTER
৴ No BENGALI CURRENCY NUMERATOR ONE
፹ No ETHIOPIC NUMBER EIGHTY
ᛮ Nl RUNIC ARLAUG SYMBOL
Ⅲ Nl ROMAN NUMERAL THREE

so you will probably have to post-process your matching results to exclude
these false positives. Or just accept them.
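
One way that post-processing might look, as a rough sketch rather than a
recommendation: let the regex find candidate runs, then split each run
wherever unicodedata says a character is not in a letter category. This is
written for Python 3, where str patterns are Unicode-aware without re.U; the
names CANDIDATE and letter_runs are made up for this example.

```python
import re
import unicodedata

# The pattern from the question, used only to find candidate runs.
CANDIDATE = re.compile(r"[^\W\s\d_]+")


def letter_runs(text):
    """Return the maximal runs of category-L characters found in text."""
    runs = []
    for match in CANDIDATE.finditer(text):
        current = []
        for ch in match.group(0):
            if unicodedata.category(ch).startswith("L"):
                current.append(ch)
            elif current:
                # A false positive (e.g. Nl or No) ends the current run.
                runs.append("".join(current))
                current = []
        if current:
            runs.append("".join(current))
    return runs
```

With this, a string such as u"x\u00b2y" should come back as two separate
runs, "x" and "y", with the superscript two dropped.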

If you have a "wide build", or Python 3.3, you can extend the test to the
full Unicode range by setting MAXNUM to 0x110000. When I do that, I find 684
false matches, all in categories Nl and No.
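
For Python 3.3 or later, a port of the test might look roughly like the
sketch below. The function name survey and its limit parameter are my own
invention; re.U is the default for str patterns there, and I have left out
re.I on the assumption that it makes no difference to this character class.

```python
import re
import sys
import unicodedata

# str patterns are Unicode-aware by default in Python 3.
PATTERN = re.compile(r"[^\W\s\d_]+")


def survey(limit=sys.maxunicode + 1):
    """Return (failures, kinds) for all code points below limit.

    failures lists the characters where the regex and the Unicode database
    disagree about being a letter; kinds is the set of their categories.
    """
    failures = []
    kinds = set()
    for c in map(chr, range(limit)):
        category = unicodedata.category(c)
        if bool(PATTERN.match(c)) != category.startswith("L"):
            failures.append(c)
            kinds.add(category)
    return failures, kinds
```

Calling survey() with no argument covers the whole range, and something like
print(sorted(kinds), len(failures)) shows the summary. The count depends on
the Unicode version bundled with your interpreter, so don't expect it to
match my number exactly.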

-- 
Steven