Regular expressions and Unicode
__peter__ at web.de
Thu Oct 2 22:11:05 CEST 2008
Jeffrey Barish wrote:
> I have a regular expression that I use to extract the surname:
> surname = r'(?u).+ (\w+)'
> However, when I apply it to this Unicode string, I get only the first 3
> letters of the surname:
> name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'
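That's a byte string (UTF-8 encoded), not a unicode string, so the regex
sees individual bytes rather than characters. You can either change the
literal to a unicode literal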
name = u'Anton\xedn Dvo\u0159\xe1k'
or decode it with the proper encoding
name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'
name = name.decode("utf-8")
> surname_re = re.compile(surname)
> m = surname_re.search(name)
> I suppose that there is an encoding problem, but I don't understand
> Unicode well enough to know what to do to digest properly the Unicode
> characters in the surname.
>>> import re
>>> name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'
>>> re.compile(r"(?u).+ (\w+)").search(name.decode("utf-8")).groups()
>>> print _
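
For anyone on Python 3, here is a minimal sketch of the same fix. It assumes the name arrives as UTF-8 bytes, as in the original post. The helper name extract_surname is hypothetical, chosen only for illustration. In Python 3, \w is already Unicode-aware for str patterns, so the (?u) flag is left out.

```python
import re

# The UTF-8 encoded bytes from the original post, written as a bytes literal.
raw = b'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

# Decode to text first; matching on the raw bytes would see individual
# bytes instead of characters.
name = raw.decode("utf-8")

# Same pattern as in the post. On a str pattern, \w matches Unicode letters.
surname_re = re.compile(r".+ (\w+)")


def extract_surname(text):
    """Hypothetical helper: return the word captured by the pattern, or None."""
    m = surname_re.search(text)
    return m.group(1) if m else None


surname = extract_surname(name)
```

After this runs, surname holds the full 'Dvo\u0159\xe1k' (6 characters) rather than a truncated prefix. Note that the decoded name is 14 characters long while the raw byte string is 17 bytes, which is the mismatch behind the original problem.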