Regular expressions and Unicode
skip at pobox.com
Thu Oct 2 16:12:34 EDT 2008
Jeffrey> However, when I apply it to this Unicode string, I get only the
Jeffrey> first 3 letters of the surname:
Jeffrey> name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'
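Maybe you need to decode the UTF-8 bytes into a unicode object first, so the regex sees characters rather than individual bytes?
name = unicode('Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k', "utf-8")
Yup, that works: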
>>> name = unicode('Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k', "utf-8")
>>> name
u'Anton\xedn Dvo\u0159\xe1k'
>>> surname = r'(?u).+ (\w+)'
>>> import re
>>> surname_re = re.compile(surname)
>>> m = surname_re.search(name)
>>> m.groups()
(u'Dvo\u0159\xe1k',)
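For anyone reading this on Python 3, here is a rough sketch of the same idea. It is not part of the session above: the variable names `raw` and `bytes_surname` are mine, and it assumes the data arrives as UTF-8 bytes. It also shows the symptom Jeffrey described, since a bytes pattern only treats ASCII letters, digits and underscore as `\w`.

```python
import re

# The name as raw UTF-8 bytes (assumption: this is how the data arrives).
raw = b'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

# Matching on bytes: \w is ASCII-only here, so the match stops at the
# first non-ASCII byte of the surname.
bytes_surname = re.search(rb'.+ (\w+)', raw).group(1)

# Decode first so the regex works on characters instead of bytes.
name = raw.decode("utf-8")

# In Python 3, str patterns are Unicode-aware by default, so (?u) is not needed.
surname_re = re.compile(r'.+ (\w+)')
surname = surname_re.search(name).group(1)

print(bytes_surname)
print(surname)
```

The only real change from the Python 2 version is that `unicode(s, "utf-8")` becomes `s.decode("utf-8")` on a bytes object.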