Unicode classes of characters in Python's re, like in Perl?

Martin v. Löwis loewis at informatik.hu-berlin.de
Tue Jul 30 06:27:53 EDT 2002


Roman Suzi <rnd at onego.ru> writes:

> Reading XML Schema docs I found that there are some useful extensions
> to regular expressions like an ability to specify class of characters.
> For example,
> 
> [\p{Lu}]
> 
> will match any uppercase letters.
> 
> Is the feature planned in Python re too?

Python currently supports Unicode character classes by explicitly
enumerating all characters, e.g.

import re
r = re.compile(u"[\u0400-\u04FF]")

In addition, the predefined categories \d, \s, and \w are extended to
Unicode if the UNICODE flag is given:

- \d (digit): Character has a 'digit value' property; covers all of Nd
              and most of No
- \s (space): bidirectional type WS, B, S, or category Zs
- \w (word):  alpha (Ll, Lu, Lt, Lo, or Lm), 
              decimal (has 'decimal value' property),
              digit,
              numeric (has 'numeric value' property),
              or '_'
- (linebreak, currently not supported in sre_parse): 
              Category Zl, or type B
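
The effect of the Unicode-aware categories can be checked directly. The
following is only a sketch, and it assumes Python 3 syntax rather than
the Python 2.2 discussed here: in Python 3, str patterns are
Unicode-aware by default and re.ASCII switches that off, which is
roughly comparable to omitting the UNICODE flag. The exact membership of
\d and \w differs between Python versions, so the two sample characters
were chosen to be uncontroversial.

```python
import re

arabic_three = "\u0663"   # ARABIC-INDIC DIGIT THREE, category Nd
cyrillic_zhe = "\u0436"   # CYRILLIC SMALL LETTER ZHE, category Ll

# Unicode-aware matching (the default for str patterns in Python 3).
unicode_digit = re.match(r"\d", arabic_three) is not None
unicode_word = re.match(r"\w", cyrillic_zhe) is not None

# ASCII-only matching, comparable to omitting the UNICODE flag.
ascii_digit = re.match(r"\d", arabic_three, re.ASCII) is not None
ascii_word = re.match(r"\w", cyrillic_zhe, re.ASCII) is not None

print(unicode_digit, unicode_word, ascii_digit, ascii_word)
```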

There has been talk about supporting the POSIX regular expression
categories (alnum, cntrl, lower, space, alpha, digit, print, upper,
blank, graph, punct, xdigit, plus any categories defined by LC_CTYPE);
this is not implemented yet.

So far, nobody has proposed supporting Unicode categories in SRE. You
can easily implement this yourself using unicodedata.category, e.g.

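import unicodedata, sys

def gencategory(cat):
    # Return a character class such as u"[A-Z...]" for one Unicode category.
    start = end = None
    result = [u"["]
    for i in range(sys.maxunicode+1):
        c = unichr(i)
        if unicodedata.category(c) == cat:
            if start is None:
                start = end = c
            else:
                end = c
        elif start is not None:
            # XXX: special-case ] and -
            if start == end:
                result.append(start)
            else:
                result.append(start + u"-" + end)
            start = None
    if start is not None:
        # Flush a run that reaches sys.maxunicode.
        if start == end:
            result.append(start)
        else:
            result.append(start + u"-" + end)
    result.append(u"]")
    return u"".join(result)

print repr(gencategory("Lu"))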
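
For readers on a current interpreter, the same idea can be sketched in
Python 3 syntax (an assumption; the code above targets Python 2.2). This
variant uses re.escape so that characters such as ']' and '-' are safe
inside the class, and it flushes a run that reaches sys.maxunicode. The
helper name flush is only illustrative. Note that a category with no
members would yield "[]", which is not a valid pattern.

```python
import re
import sys
import unicodedata

def gencategory(cat):
    """Return a regex character class matching every code point in `cat`."""
    def flush(start, end, out):
        # re.escape protects ']', '\\', '^' and '-' inside the class.
        if start == end:
            out.append(re.escape(chr(start)))
        else:
            out.append(re.escape(chr(start)) + "-" + re.escape(chr(end)))

    result = ["["]
    start = end = None
    for i in range(sys.maxunicode + 1):
        if unicodedata.category(chr(i)) == cat:
            if start is None:
                start = i
            end = i
        elif start is not None:
            flush(start, end, result)
            start = None
    if start is not None:
        # The last run extends to sys.maxunicode.
        flush(start, end, result)
    result.append("]")
    return "".join(result)

upper = re.compile(gencategory("Lu"))
print(upper.findall("Hello World, \u0416\u0436"))
```

The result reflects whatever Unicode version the running interpreter
ships in unicodedata, which is exactly the limitation discussed next.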

It turns out that those categories are useless for XML, since the XML
character classes (in XML 1.0) were defined using a different Unicode
version (XML uses the Unicode 2.0 database). The same appears to be the
case for XML Schema: it uses the Unicode 3.1 database, whereas
Python 2.2 has the Unicode 3.0 database.

So to implement XML Schema, you probably have to parse the specific
version of the Unicode database yourself, and construct the re class
from that.
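
A rough sketch of that approach, again in Python 3 syntax (an
assumption). It assumes the usual UnicodeData.txt layout: fields
separated by semicolons, the code point in hex first, the general
category third, and large blocks given as a pair of "<..., First>" and
"<..., Last>" lines. It also assumes the lines are sorted by code point.
SAMPLE holds only a few lines for illustration, and the function name
class_from_unicodedata is made up; in practice you would read the file
for the Unicode version the specification names.

```python
import re

# A few lines in UnicodeData.txt format, for illustration only.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0416;CYRILLIC CAPITAL LETTER ZHE;Lu;0;L;;;;;N;;;;0436;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FA5;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""

def class_from_unicodedata(lines, cat):
    """Build a regex character class for `cat` from UnicodeData.txt lines."""
    ranges = []            # list of [first, last] code point pairs
    range_start = None     # set while inside a First/Last pair
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 3:
            continue
        code, name, category = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = code
            continue
        first = code
        if name.endswith(", Last>") and range_start is not None:
            first, range_start = range_start, None
        if category != cat:
            continue
        if ranges and ranges[-1][1] + 1 == first:
            ranges[-1][1] = code      # extend the previous run
        else:
            ranges.append([first, code])
    if not ranges:
        raise ValueError("no characters in category %r" % cat)
    parts = []
    for first, last in ranges:
        lo, hi = re.escape(chr(first)), re.escape(chr(last))
        parts.append(lo if first == last else lo + "-" + hi)
    return "[" + "".join(parts) + "]"

lu_class = class_from_unicodedata(SAMPLE.splitlines(), "Lu")
lo_class = class_from_unicodedata(SAMPLE.splitlines(), "Lo")
print(repr(lu_class), repr(lo_class))
```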

Regards,
Martin


