[Tutor] Unicode and regexes
Michael Broe
mbroe at columbus.rr.com
Fri Mar 10 23:00:00 CET 2006
Does Python support Unicode property classes in regular expressions,
e.g. \p{L}? It doesn't work in the following code. Any ideas?
-----
#! /usr/local/bin/python
""" usage: ./uni_read.py file
"""
import codecs
import re
import sys  # needed for sys.argv below
text = codecs.open(sys.argv[1], mode='r', encoding='utf-8').read()
unicode_property_pattern = re.compile(r"\p{L}")
dot_pattern = re.compile(".")
letters = unicode_property_pattern.findall(text)
characters = dot_pattern.findall(text)
print 'var letters =', letters
print 'var characters =', characters
-----
The input file, encoded in UTF-8, is:
abc <followed by a space, then alpha, beta, gamma>
The output is:
var letters = []
var characters = [u'a', u'b', u'c', u' ', u'\u03b1', u'\u03b2',
u'\u03b3']
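
For comparison, here is a minimal sketch of two possible workarounds, assuming the standard-library re module does not recognise \p{L} (which would explain the empty list above). The sample string is hard-coded in place of the input file, and the names letter_pattern and letters_by_category are made up for illustration. The first approach is only an approximation of \p{L}: it takes \w under re.UNICODE and excludes digits and the underscore, so a few non-letter alphanumerics may still slip through. The second filters on the general category reported by unicodedata.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata

# Same content as the input file described above: abc, a space, alpha, beta, gamma.
text = u"abc \u03b1\u03b2\u03b3"

# Approximation of \p{L}: anything that is a word character (\w under
# re.UNICODE) but is neither a decimal digit nor an underscore.
letter_pattern = re.compile(r"[^\W\d_]", re.UNICODE)
letters = letter_pattern.findall(text)

# Alternative without a regex: keep characters whose Unicode general
# category starts with "L" (Lu, Ll, Lt, Lm, Lo).
letters_by_category = [ch for ch in text
                       if unicodedata.category(ch).startswith("L")]

# Both approaches agree on this sample.
print(letters == letters_by_category)
```

On this sample both lists hold the three Latin and three Greek letters and leave out the space. If exact property matching matters, the category filter is the closer of the two to what \p{L} means.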