regexps with unicode-aware characterclasses?

Tue Aug 30 09:54:26 EDT 2005

Hi all,

in a python re pattern, how do I match all unicode uppercase characters 
(in a unicode string/in a utf-8 string)?

I know that there is string.uppercase/.lowercase which are 
'locale-aware', but I don't think there is a "all locales" locale.

I know that there is a re.U switch that makes \w match all unicode word 
characters, but there are no subclasses of that ([[:upper:]] or 
preferably \u).
Or is there a module/extension to get that?

There is the module unicodedata, but it has no unicodedata.uppercase 
that would correspond to string.uppercase.

<wishful thinking>

   re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))

or::

   re.compile('(?u)[[:upper:]]')

or::

   re.compile('(?u)\u')

for the latter two, to work on utf-8 strings, would I have to set the 
defaultencoding to utf-8?

</wishful thinking>