Regular expressions and Unicode

skip at pobox.com
Thu Oct 2 16:12:34 EDT 2008


    Jeffrey> However, when I apply it to this Unicode string, I get only the
    Jeffrey> first 3 letters of the surname:

    Jeffrey> name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

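As written, name is a plain byte string holding UTF-8 encoded data, not a
unicode object.  Maybe decode it first?

    name = unicode('Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k', "utf-8")

Yup, that works: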

    >>> name = unicode('Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k', "utf-8")
    >>> name
    u'Anton\xedn Dvo\u0159\xe1k'
    >>> surname = r'(?u).+ (\w+)'
    >>> import re
    >>> surname_re = re.compile(surname)
    >>> m = surname_re.search(name)
    >>> m.groups()
    (u'Dvo\u0159\xe1k',)
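
For comparison, here is a rough sketch of the same idea on Python 3, where
str is already Unicode and the (?u) flag is no longer needed for str
patterns.  It assumes the data arrives as UTF-8 bytes; the names raw,
truncated and full are only illustrative and not from the original post.

```python
import re

# The UTF-8 encoded bytes from the example above.
raw = b'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

# Matching the undecoded bytes: in a bytes pattern \w is ASCII-only,
# so the match stops at the first non-ASCII byte of the surname.
truncated = re.search(rb'.+ (\w+)', raw).group(1)

# Decode to text first; in a str pattern \w matches Unicode letters.
name = raw.decode("utf-8")
full = re.search(r'.+ (\w+)', name).group(1)

print(truncated)
print(full)
```

The point in both versions is the same: decode to text before applying the
regular expression, so that \w is tested against characters rather than
against the individual bytes of the encoding.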


