Regular expressions and Unicode

Thu Oct 2 16:11:05 EDT 2008

Jeffrey Barish wrote:

> I have a regular expression that I use to extract the surname:
> 
> surname = r'(?u).+ (\w+)'
> 
> However, when I apply it to this Unicode string, I get only the first 3
> letters of the surname:
> 
> name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

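That's a byte string, so the regular expression sees the individual UTF-8
bytes rather than characters, and \w+ stops partway through the two-byte
sequence for the "ř". You can either change the literal to a Unicode one

name = u'Anton\xedn Dvo\u0159\xe1k'

or decode the byte string with the proper encoding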

name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'
name = name.decode("utf-8")
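
As a rough illustration of the difference between the two forms, the sketch
below compares the byte string with its decoded counterpart. The variable
names (raw, literal, decoded, byte_len, char_len) are made up for this
example, and the b'' prefix assumes Python 2.6 or later.

```python
# Illustrative names only; none of them come from the original post.
raw = b'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'   # UTF-8 encoded bytes
literal = u'Anton\xedn Dvo\u0159\xe1k'         # the same name as a Unicode literal

# Decoding the bytes as UTF-8 yields the Unicode string.
decoded = raw.decode("utf-8")
same_text = (decoded == literal)

# Each of the three accented letters occupies two bytes in UTF-8,
# so the byte string is three items longer than the decoded text.
byte_len = len(raw)       # 17 bytes
char_len = len(decoded)   # 14 characters
```

Both routes end up with the same Unicode object; which one is more convenient
probably depends on where the name comes from (a literal in the source or data
read from a file or socket).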

> surname_re = re.compile(surname)
> m = surname_re.search(name)
> m.groups()
> ('Dvo\xc5',)
> 
> I suppose that there is an encoding problem, but I don't understand
> Unicode well enough to know what to do to digest properly the Unicode
> characters in the surname.

>>> name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'
>>> re.compile(r"(?u).+ (\w+)").search(name.decode("utf-8")).groups()
(u'Dvo\u0159\xe1k',)
>>> print _[0]
Dvořák
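
If the names can arrive either as bytes or as already-decoded text, the
decoding step could be folded into a small helper. This is only a sketch:
extract_surname, SURNAME_RE, and the encoding parameter are hypothetical
names, and UTF-8 is assumed as the default encoding.

```python
import re

# Same pattern as in the question; re.UNICODE is the flag form of (?u).
SURNAME_RE = re.compile(r".+ (\w+)", re.UNICODE)


def extract_surname(name, encoding="utf-8"):
    """Return the last word of name, or None if the pattern does not match.

    Hypothetical helper: byte strings are decoded first (UTF-8 assumed
    unless another encoding is given), so \\w sees whole characters.
    """
    if isinstance(name, bytes):
        name = name.decode(encoding)
    m = SURNAME_RE.search(name)
    return m.group(1) if m else None


surname = extract_surname(b'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k')
single_word = extract_surname(u'Madonna')   # no space, so no match
```

Here surname is the full Unicode surname, and single_word is None because the
pattern requires at least one space before the last word.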

Peter