[Tutor] Unicode and regexes

Sat Mar 11 14:50:15 CET 2006

Michael Broe wrote:
> Does Python support the Unicode-flavored class-specifications in  
> regular expressions, e.g. \p{L} ? It doesn't work in the following  
> code, any ideas?

From http://www.unicode.org/unicode/reports/tr18/ I see that \p{L} is 
intended to select Unicode letters, and that it is one of a large number 
of selectors based on Unicode character properties.

Python's re module doesn't support this syntax. It has limited support 
for Unicode character properties as an extension of the \d, \D, \s, \S, 
\w and \W sequences. For example, with
   numbers = re.compile(r'\d', re.UNICODE)

numbers will match any Unicode digit.
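
To make that concrete, here is a small sketch of the same pattern run 
against a few non-ASCII digits. The sample string and the names sample 
and found are my own, chosen only for illustration. Note that in 
Python 3 the re.UNICODE behaviour is already the default for str 
patterns, so the flag is kept here only to mirror the line above.

```python
import re

# \d with re.UNICODE matches any character the Unicode database
# classifies as a decimal digit, not just ASCII 0-9.
numbers = re.compile(r'\d', re.UNICODE)

# ASCII "1", ARABIC-INDIC DIGIT THREE, DEVANAGARI DIGIT ONE
sample = "a1 \u0663 \u0967"

found = numbers.findall(sample)
print(found)
```

All three digits should come back, while the letter "a" and the spaces 
are skipped.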

You can combine and subtract the built-in categories to get more 
possibilities. This thread shows how to construct a regex that finds 
just Unicode letters:
http://groups.google.com/group/comp.lang.python/browse_frm/thread/6ef6736581fecaeb/a49326cb48c408ee?q=unicode+character+class&rnum=1#a49326cb48c408ee
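
One common way to do this kind of subtraction (I am not claiming it is 
exactly what the linked thread does) is a negated class: start from \w 
and remove digits and the underscore. The sample text and the names 
letters and words below are made up for the example. As far as I know 
this is only an approximation of \p{L}, since \w can also admit a few 
numeric characters that \d does not cover.

```python
import re

# [^\W\d_] reads as: "not a non-word character, not a digit, not an
# underscore", i.e. a word character that is neither digit nor "_".
letters = re.compile(r'[^\W\d_]+', re.UNICODE)

# "naive" with a diaeresis, "cafe" with an acute accent, a number,
# and a Cyrillic word joined to "x" by an underscore.
text = "na\u00efve caf\u00e9 42 \u0416\u0443\u043a_x"

words = letters.findall(text)
print(words)
```

The number and the underscore act as separators, so only the runs of 
letters are returned.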

Python does have built-in support for the Unicode character database in 
the unicodedata module, so for example you can look up the general 
category of a character. You can roll your own solution on top of this 
data. This thread shows how to build your own regex category directly 
from a property in unicodedata:
http://groups.google.com/group/comp.lang.python/browse_frm/thread/fdfdec9a0649c540/471331f518fa680f?q=unicode+character+class&rnum=2#471331f518fa680f
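
A rough sketch of the roll-your-own approach, assuming Python 3: walk 
every code point, keep those whose unicodedata.category() starts with a 
given prefix, and collapse them into ranges inside a character class. 
The helper name char_class_for is hypothetical (it is not part of any 
library), and this is not necessarily how the linked thread does it.

```python
import re
import sys
import unicodedata


def char_class_for(prefix):
    """Return a regex character class matching every code point whose
    general category starts with `prefix` (e.g. 'L' or 'Lu').

    Hypothetical helper written for this example.
    """
    ranges = []
    start = prev = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith(prefix):
            if start is None:
                start = cp          # open a new run of matching code points
            prev = cp
        elif start is not None:
            ranges.append((start, prev))   # close the current run
            start = None
    if start is not None:
        ranges.append((start, prev))
    if not ranges:
        raise ValueError("no characters in category %r" % prefix)

    # Emit \UXXXXXXXX escapes so the pattern stays pure ASCII.
    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append('\\U%08x' % lo)
        else:
            parts.append('\\U%08x-\\U%08x' % (lo, hi))
    return '[' + ''.join(parts) + ']'


# 'Lu' is the general category for uppercase letters.
upper = re.compile(char_class_for('Lu') + '+')
caps = upper.findall("Hello \u00c9cole \u0416\u0443\u043a")
print(caps)
```

Scanning the whole code point range takes a moment, so you would build 
the class once and reuse the compiled pattern. Passing 'L' instead of 
'Lu' gives a class for all letters, which is the closest equivalent to 
\p{L} you can get this way.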

Kent