[Tutor] regex: matching unicode

Mon Dec 24 05:16:29 CET 2012

On Sat, Dec 22, 2012 at 11:12 PM, Steven D'Aprano <steve at pearwood.info> wrote:
>
> No. You could install a more Unicode-aware regex engine, and use it instead
> of Python's re module, where Unicode support is at best only partial.
>
> Try this one:
>
> http://pypi.python.org/pypi/regex

Looking over the old docs, I count 4 regex implementations up to 2.0:

    regexp
    regex (0.9.5)
    re / pcre (1.5)
    re / sre (2.0)

It would be nice to see Matthew Barnett's regex module added as an
option in 3.4, just as sre was added in 1.6 before taking the place of
pcre in 2.0.

> The failures are all numbers with category Nl or No ("letterlike
> numeric character" and "numeric character of other type").

The pattern basically matches any word character that's not a decimal
digit or an underscore (the \s is redundant AFAIK, since whitespace is
never a word character). Any character that's numeric but not decimal
therefore also matches. For example, the following are all numeric:

    \N{SUPERSCRIPT ONE}: category "No", digit, not decimal
    \N{ROMAN NUMERAL ONE}: category "Nl", not digit, not decimal
    \u4e00 (1, CJK): category "Lo", not digit, not decimal
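
A quick way to check those properties in Python 3 is sketched below. The
pattern [^\W\d_] is my assumption of an equivalent to the one under
discussion (with the redundant \s dropped), and describe() is just a
helper name made up for this example:

```python
import re
import unicodedata

# The three numeric-but-not-decimal characters listed above.
CHARS = ["\N{SUPERSCRIPT ONE}", "\N{ROMAN NUMERAL ONE}", "\u4e00"]

# Assumed stand-in for the pattern in this thread: any word character
# that is neither a decimal digit nor an underscore.
PATTERN = re.compile(r"[^\W\d_]")

def describe(ch):
    """Return (category, isdigit, isdecimal, isnumeric, pattern matches)."""
    return (
        unicodedata.category(ch),
        ch.isdigit(),
        ch.isdecimal(),
        ch.isnumeric(),
        PATTERN.fullmatch(ch) is not None,
    )

for ch in CHARS:
    print("U+%04X" % ord(ch), describe(ch))
```

All three come back numeric, none is decimal, and the pattern matches
each of them.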

Regarding the last of these, if the pattern shouldn't match numeric
characters in a broad sense, then it should be OK to exclude CJK
numeric ideograms in category "Lo", but that's like excluding the word
"one".
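
If you do want the broad exclusion, one possible sketch (not the only
way, and assuming Python 3 strings) is to match word characters with re
and then drop anything str.isnumeric() reports as numeric. The function
name here is hypothetical:

```python
import re

# Any word character except the underscore.
WORD_CHAR = re.compile(r"[^\W_]")

def non_numeric_word_chars(text):
    """Return word characters in text that are not numeric in any sense."""
    # str.isnumeric() is true for decimals, digits such as superscripts,
    # Roman numerals, and CJK numeric ideograms such as U+4E00.
    return [ch for ch in WORD_CHAR.findall(text) if not ch.isnumeric()]

print(non_numeric_word_chars("a1\u00b9\u2160\u4e00_\u6587"))
```

This keeps "a" and U+6587 but throws away U+4E00 along with the other
numeric characters.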