[Tutor] Unicode and regexes
Kent Johnson
kent37 at tds.net
Sat Mar 11 14:50:15 CET 2006
Michael Broe wrote:
> Does Python support the Unicode-flavored class-specifications in
> regular expressions, e.g. \p{L} ? It doesn't work in the following
> code, any ideas?
From http://www.unicode.org/unicode/reports/tr18/ I see that \p{L} is
intended to select Unicode letters, and it is part of a large number of
selectors based on Unicode character properties.
Python's re module doesn't support this syntax. It has limited support
for Unicode character properties as an extension of the \d, \D, \s, \S,
\w and \W sequences. For example, with
numbers = re.compile(r'\d', re.UNICODE)
numbers will match any Unicode digit.
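Here is a small sketch of that behaviour. The sample string and the
variable names other than numbers are made up for illustration, and it
assumes Python 3, where str patterns are already Unicode-aware and
re.UNICODE is only spelled out to match the line above:

```python
import re

# re.UNICODE is redundant for str patterns on Python 3; kept to mirror the text.
numbers = re.compile(r'\d', re.UNICODE)

# Illustrative sample: ASCII '1', ARABIC-INDIC DIGIT THREE (U+0663),
# DEVANAGARI DIGIT ONE (U+0967), plus a letter that should be skipped.
sample = 'a1\u0663\u0967'

found = numbers.findall(sample)
print(found)
```

All three digits should be found and the letter skipped, since \d is not
limited to 0-9 when Unicode matching is in effect.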
You can combine and subtract the built-in categories to get more
possibilities. This thread shows how to construct a regex that finds
just Unicode letters:
http://groups.google.com/group/comp.lang.python/browse_frm/thread/6ef6736581fecaeb/a49326cb48c408ee?q=unicode+character+class&rnum=1#a49326cb48c408ee
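One common way to do the subtraction, which may or may not be exactly
what the linked thread uses, is a negated class: take everything that is
a word character (\w) and exclude digits and the underscore. A minimal
sketch, assuming Python 3 and a made-up sample string:

```python
import re

# [^\W\d_] reads as "not a non-word character, not a digit, not an underscore",
# which leaves (approximately) the letters.
letters = re.compile(r'[^\W\d_]', re.UNICODE)

# Illustrative sample: 'a', '1', '_', e-acute, space, sharp s, '-', capital omega.
sample = 'a1_\u00e9 \u00df-\u03a9'

found = letters.findall(sample)
print(found)
```

Treat this as an approximation of \p{L} rather than an exact equivalent:
\w may also accept some numeric characters that \d does not, and those
would slip through.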
Python does have built-in support for the Unicode character database in
the unicodedata module, so for example you can look up the general
category of a character with unicodedata.category(). You can roll your
own solution on top of this data. This thread shows how to build your
own regex category directly from a property in unicodedata:
http://groups.google.com/group/comp.lang.python/browse_frm/thread/fdfdec9a0649c540/471331f518fa680f?q=unicode+character+class&rnum=2#471331f518fa680f
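As a rough sketch of the roll-your-own idea (not necessarily the code
from the linked thread), you can scan every code point, keep those whose
general category starts with a given prefix, and turn the result into a
regex character class. The helper name category_class is hypothetical,
and the code assumes Python 3:

```python
import re
import sys
import unicodedata


def category_class(prefix):
    """Build a regex character class for all code points whose
    general category starts with prefix (e.g. 'L' for letters).

    Hypothetical helper; assumes at least one code point matches.
    """
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith(prefix):
            if start is None:
                start = cp          # open a new run of matching code points
        elif start is not None:
            ranges.append((start, cp - 1))  # close the current run
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))

    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append('\\U%08x' % lo)
        else:
            parts.append('\\U%08x-\\U%08x' % (lo, hi))
    return '[' + ''.join(parts) + ']'


# Roughly what \p{L}+ would mean if the re module supported it.
letters = re.compile(category_class('L') + '+')

# Illustrative sample: two accented words, a number, and VULGAR FRACTION ONE HALF.
words = letters.findall('na\u00efve caf\u00e9 42 \u00bd')
print(words)
```

Scanning the whole code point range takes a moment, so build the class
once and reuse the compiled pattern.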
Kent