regexps with unicode-aware characterclasses?
Stefan Rank
stefan.rank at ofai.at
Tue Aug 30 09:54:26 EDT 2005
Hi all,
in a python re pattern, how do I match all unicode uppercase characters
(in a unicode string/in a utf-8 string)?
I know that there is string.uppercase/.lowercase which are
'locale-aware', but I don't think there is a "all locales" locale.
I know that there is a re.U switch that makes \w match all unicode word
characters, but there are no subclasses of that ([[:upper:]] or
preferably \u).
Or is there a module/extension to get that?
There is the module unicodedata, but it has no unicodedata.uppercase
that would correspond to string.uppercase.
<wishful thinking>
re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))
or::
re.compile('(?u)[[:upper:]]')
or::
re.compile('(?u)\u')
for the latter two, to work on utf-8 strings, would I have to set the
defaultencoding to utf-8?
</wishful thinking>
More information about the Python-list
mailing list