Regex for unicode letter characters

schickb schickb at gmail.com
Sun Jan 11 02:56:14 CET 2009


I need a regex that will match strings containing only Unicode letter
characters (no digits or other numeric characters, and no _ character).
I was surprised to find that the 're' module does not already include a
special character class for this (Python 2.6). Or did I miss something?

It seems like this would be a very common need. Is the following the
only way to generate the character class (based on an old post by
Martin v. Löwis)?

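import sys
import unicodedata

def letters():
    """Build a regex character class covering every Unicode letter (category L*)."""
    start = end = None
    result = []
    for index in xrange(sys.maxunicode + 1):
        c = unichr(index)
        if unicodedata.category(c)[0] == 'L':
            # Start a new run of letters or extend the current one.
            if start is None:
                start = end = c
            else:
                end = c
        elif start is not None:
            # A non-letter closes the current run: emit "x" or "x-y".
            if start == end:
                result.append(start)
            else:
                result.append(start + u"-" + end)
            start = None
    # Flush a run that reaches the very last code point.
    if start is not None:
        result.append(start if start == end else start + u"-" + end)
    return u'[' + u''.join(result) + u']'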

Seems rather cumbersome.
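
To make the intended use concrete, here is a minimal sketch of how such a
generated class could be compiled into an anchored "letters only" pattern.
The names letter_class, LETTERS_ONLY and is_all_letters are illustrative
and not part of the 're' module; the small shim at the top is an assumption
added only so the sketch also runs on Python 3.

```python
import re
import sys
import unicodedata

try:
    unichr, xrange                 # Python 2 built-ins
except NameError:
    unichr, xrange = chr, range    # Python 3 equivalents


def letter_class():
    """Return a character class (u'[...]') covering Unicode category L*."""
    ranges = []
    start = end = None
    for index in xrange(sys.maxunicode + 1):
        c = unichr(index)
        if unicodedata.category(c).startswith('L'):
            if start is None:
                start = c
            end = c
        elif start is not None:
            # A non-letter closes the current run of letters.
            ranges.append(start if start == end else start + u'-' + end)
            start = None
    if start is not None:
        # Flush a run that reaches the last code point.
        ranges.append(start if start == end else start + u'-' + end)
    return u'[' + u''.join(ranges) + u']'


# Hypothetical helper names. The class is built once at import time,
# since scanning every code point is not cheap.
# \Z anchors at the true end of the string, so a trailing newline fails.
LETTERS_ONLY = re.compile(u'^' + letter_class() + u'+\\Z')


def is_all_letters(text):
    """True if text is non-empty and consists only of Unicode letters."""
    return LETTERS_ONLY.match(text) is not None
```

With this, is_all_letters(u'L\u00f6wis') is true, while u'abc_def' and
u'abc1' are rejected because '_' and '1' are not in any letter category.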

-Brad


