unicode regex example: trouble

Fri May 21 04:45:26 EDT 2004

marek wrote:

> I am trying this example, which should print a MatchObject reference. It
> fails (prints None). Does anybody know where I went wrong?
> 
> # -*- coding: cp1251 -*-
> 
> import re
> 
> # pattern in Ukrainian ('привіт')
> p =                     '\377\376?\004@\0048\0042\004V\004B\004'
> 
> # data (pattern is in the middle of the string)
> d = '\377\376t\000e\000s\000t\000?\004@\0048\0042\004V\004B\004t\000t\000'
> 
> re_test = re.compile(p, re.UNICODE)
> 
> print re_test.search(d, re.UNICODE)
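
One detail in that last line, separate from the encoding question: the search() method of a compiled pattern takes a start position as its second argument, not flags. re.UNICODE is the integer 32, so the call above starts searching at offset 32, which is past the end of the 26-byte string d. A minimal sketch of the effect, written for Python 3 and using made-up sample data (pattern, text, found and skipped are illustrative names, not from the original post):

```python
import re

# Flags belong in re.compile(); a compiled pattern's search() has the
# signature search(string[, pos[, endpos]]).
pattern = re.compile("b+", re.UNICODE)
text = "aaabbbccc"

# Normal call: the whole string is searched.
found = pattern.search(text)

# re.UNICODE is an integer flag with value 32, so this is search(text, 32).
# The start position lies beyond the end of the 9-character string,
# so nothing can match.
skipped = pattern.search(text, re.UNICODE)

print(found.group(0))
print(skipped)
```

Dropping the second argument from the search() call removes this particular cause of the None result.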

What you have here are plain 8-bit byte strings, not unicode objects:

>>> print p, d
ÿþ?@82VB ÿþtest?@82VBtt

The leading \377\376 looks like a UTF-16 byte order mark, so I guess the
encoding is utf-16. Therefore:

>>> du = d.decode("utf-16")
>>> pu = p.decode("utf-16")
>>> r = re.compile(pu)
>>> m = r.search(du)
>>> m
<_sre.SRE_Match object at 0x40392090>
>>> print m.group(0).encode("utf-16")
ÿþ?@82VB

Works as expected :-)
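
For readers on Python 3, the same idea can be sketched as follows. This is an adaptation, not the session above: it assumes the data really is UTF-16 with a byte order mark, and it writes the original strings as bytes literals, since Python 3 separates bytes from str.

```python
import re

# The byte strings from the original post, as Python 3 bytes literals.
p = b'\377\376?\004@\0048\0042\004V\004B\004'
d = b'\377\376t\000e\000s\000t\000?\004@\0048\0042\004V\004B\004t\000t\000'

# Decoding consumes the byte order mark and yields ordinary text strings.
pu = p.decode("utf-16")
du = d.decode("utf-16")

# Compile and search on the decoded text, not on the raw bytes.
r = re.compile(pu)
m = r.search(du)

print(m.group(0), m.span())
```

Decoding both strings first also keeps the byte order mark out of the pattern, so the pattern can match in the middle of the data.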

Here's what the docs say about the unicode flag:

UNICODE
 Make \w, \W, \b, and \B dependent on the Unicode character properties
 database. New in version 2.0.

You may or may not need that when you refine your regexp.
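
To illustrate what the flag changes, here is a small sketch for Python 3, where text patterns use Unicode character properties by default and re.ASCII restores the narrower ASCII-only behaviour. The sample string and variable names are made up for the illustration.

```python
import re

# Hypothetical sample text mixing ASCII and Cyrillic letters.
text = "test привіт"

# Default for str patterns in Python 3: \w follows the Unicode database,
# so Cyrillic letters count as word characters.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w only matches [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)
print(ascii_words)
```

In Python 2 the situation is reversed: \w is ASCII-only unless re.UNICODE is given, which is why the flag matters once the pattern uses \w or \b on non-ASCII text.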

Peter