[Tutor] regex: matching unicode

Tue Dec 25 05:22:31 CET 2012

On Mon, Dec 24, 2012 at 2:51 AM, Albert-Jan Roskam <fomcl at yahoo.com> wrote:
>
> First, check if the first character is a (unicode) letter

You can use unicode.isalpha, with a caveat. On a narrow build isalpha
fails for supplementary planes. That's about 50% of all alphabetic
characters, +/- depending on the version of Unicode. But it's mostly
the less common CJK characters (over 90% this), dead languages (e.g.
Linear B, Cuneiform, Egyptian Hieroglyphs), and mathematical script.

Instead, you could check if index 0 is category 'Cs' (surrogate). If
so, check the category of the slice [:2].

> Having unicode versions of the classes \d, \w, etc (let's call them
> \ud, \uw) would be cool.

(?u) enables re.U in case you're looking to keep the flag setting in
the pattern itself.

\d and \w are defined for Unicode. It's just the available categories
are insufficient. Matthew Barnett's regex module implements level 1
(and much of level 2) of UTS #18: Unicode Regular Expressions. See
RL1.2 and Annex C:

http://unicode.org/reports/tr18/

> def isUnicodeChar(c):
>   assert len(c) == 1
>   c = c.decode("utf-8") if isinstance(c, str) else c
>   return 'L' in unicodedata.category(c)

For UTF-8, len(c) == 1 only for ASCII codes; otherwise character codes
have a leading byte and up to 3 continuation bytes. Also, on a narrow
build len(c) is 2 for codes in the supplementary planes.

Also keep in mind canonical composition/decomposition equivalence and
compatibility when thinking about 'characters' in terms of comparison,
sorting, dictionary keys, sets, etc. You might want to first normalize
a string to canonical composed form (NFC). Python 3.x uses NFKC for
identifiers. For example:

    >>> d = {}
    >>> exec("e\u0301 = 1", d)
    >>> d["e\u0301"]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: 'é'

    >>> d["\xe9"]
    1
    >>> "\xe9"
    'é'

http://en.wikipedia.org/wiki/Unicode_equivalence