unicode regex example: trouble
Peter Otten
__peter__ at web.de
Fri May 21 04:45:26 EDT 2004
marek wrote:
> I'm trying this example to print a MatchObject reference. It fails (prints
> None). Does anybody know where I went wrong?
>
> # -*- coding: cp1251 -*-
>
> import re
>
> # pattern in Ukrainian ('привіт')
> p = '\377\376?\004@\0048\0042\004V\004B\004'
>
> # data (pattern is in the middle of the string)
> d = '\377\376t\000e\000s\000t\000?\004@\0048\0042\004V\004B\004t\000t\000'
>
> re_test = re.compile(p, re.UNICODE)
>
> print re_test.search(d, re.UNICODE)
What you have here are plain 8-bit byte strings, not unicode objects:
>>> print p, d
ÿþ?@82VB ÿþtest?@82VBtt
The leading \377\376 looks like a UTF-16 byte order mark, so I guess the
encoding is utf-16, therefore:
>>> du = d.decode("utf-16")
>>> pu = p.decode("utf-16")
>>> r = re.compile(pu)
>>> m = r.search(du)
>>> m
<_sre.SRE_Match object at 0x40392090>
>>> print m.group(0).encode("utf-16")
ÿþ?@82VB
Works as expected :-)
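For anyone trying this on Python 3, here is a minimal sketch of the same session. It assumes the two byte strings from the quoted post are written as bytes literals; the variable names mirror the session above. The raw-bytes search at the end is an added illustration, not part of the original exchange. One plausible reason the undecoded search finds nothing is that the '?' byte (0x3F) in the pattern is read as a regex quantifier rather than as half of a UTF-16 character.

```python
import codecs
import re

# The byte strings from the quoted post, written as Python 3 bytes literals.
p = b'\377\376?\004@\0048\0042\004V\004B\004'
d = b'\377\376t\000e\000s\000t\000?\004@\0048\0042\004V\004B\004t\000t\000'

# Both start with the UTF-16 little-endian byte order mark (FF FE).
has_bom = p.startswith(codecs.BOM_UTF16_LE) and d.startswith(codecs.BOM_UTF16_LE)

# Decoding consumes the BOM and yields ordinary text strings.
pu = p.decode("utf-16")
du = d.decode("utf-16")

# Searching the decoded text finds the pattern after the four letters of "test".
m = re.compile(pu).search(du)
span = m.span()
print(span)

# Illustration (assumption: not from the original thread): searching the raw
# bytes fails, because '?' and other bytes are interpreted as regex syntax.
raw_match = re.search(p, d)
print(raw_match)
```

If the pattern text can contain regex metacharacters after decoding, passing it through re.escape() before compiling is the usual precaution.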
Here's what the docs say about the unicode flag:
UNICODE
    Make \w, \W, \b, and \B dependent on the Unicode character properties
    database. New in version 2.0.
You may or may not need that when you refine your regexp.
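To see what the flag changes in practice, here is a small sketch written for Python 3, which is an assumption on my part since the thread itself uses Python 2. In Python 3, str patterns are Unicode-aware by default, so the contrast is shown with re.ASCII, which restricts \w to ASCII letters, digits and underscore. On Python 2 the roles are reversed: \w on a unicode object only matches Cyrillic letters when re.UNICODE is given. The sample text is made up for the illustration.

```python
import re

# Hypothetical sample text mixing ASCII and Cyrillic words.
text = "test привіт"

# Python 3 default for str patterns: \w follows the Unicode character database.
unicode_words = re.findall(r"\w+", text)            # ['test', 'привіт']

# With re.ASCII, \w only matches [a-zA-Z0-9_], so the Cyrillic word is skipped.
ascii_words = re.findall(r"\w+", text, re.ASCII)    # ['test']
```

For a literal pattern like the one in the original post, none of \w, \W, \b or \B appears, so the flag makes no difference there.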
Peter