[Tutor] the regex boundary about chinese word

Fri May 4 12:21:13 CEST 2012

goog cheng wrote:

> Hi, I've got this problem:
> 
> #!python
> # -*- coding: utf-8 -*-
> import re
> 
> p = re.compile(ur'\bc123\b')
> print '**',p.search('no class c123 at all').group()
> 
> p = re.compile(ur'\b\u7a0b\u6770\b')
> print ur'\u7a0b\u6770'
> print '****',p.search(' 程杰 abc'.decode('utf8'))
> 
> Why can't the \b boundary match the word '程杰'?

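You need to pass the re.UNICODE flag. Without it, \w covers only 
[a-zA-Z0-9_], so \b finds no boundary next to a Chinese character and the 
first search below returns None: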

>>> re.compile(ur"\b程杰\b").search(u" 程杰 abc")
>>> re.compile(ur"\b程杰\b", re.UNICODE).search(u" 程杰 abc")
<_sre.SRE_Match object at 0x7f0beb325f38>
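
Applied to the script from the question, the fix might look like the sketch 
below. It is only an illustration: the helper name find_word is made up, and 
the pattern is spelled u"\\b" rather than ur"\b" so that the same file also 
runs on Python 3, where the ur prefix no longer exists.

```python
# -*- coding: utf-8 -*-
import re

def find_word(word, text):
    """Search for `word` as a whole word in `text`; return a match or None.

    Hypothetical helper, not part of the re module. `word` is inserted
    into the pattern as-is, so it must not contain regex metacharacters.
    """
    # re.UNICODE makes \w (and therefore \b) follow the Unicode database.
    pattern = re.compile(u"\\b" + word + u"\\b", re.UNICODE)
    return pattern.search(text)

match = find_word(u"\u7a0b\u6770", u" \u7a0b\u6770 abc")
found = match.group() if match else None
```

With the flag in place, found holds the two-character word instead of None.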

See http://docs.python.org/library/re.html

"""
Note that formally, \b is defined as the boundary between a \w and a \W 
character (or vice versa), or between \w and the beginning/end of the string

...

\w
When the LOCALE and UNICODE flags are not specified, matches any 
alphanumeric character and the underscore; this is equivalent to the set 
[a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever 
characters are defined as alphanumeric for the current locale. If UNICODE is 
set, this will match the characters [0-9_] plus whatever is classified as 
alphanumeric in the Unicode character properties database.
"""
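
Since \b is defined in terms of \w, one way to see the difference is to look 
at what \w itself matches under each setting. The following is a small sketch, 
not taken from the docs. It assumes that on Python 3 the re.ASCII flag is the 
counterpart of Python 2's default (ASCII-only) behaviour; the getattr call 
falls back to 0 (no flags) on Python 2, where re.ASCII does not exist.

```python
import re

text = u" \u7a0b\u6770 abc"

# re.ASCII on Python 3; plain 0 on Python 2, whose default is ASCII-only.
ascii_only = getattr(re, "ASCII", 0)

# ASCII-only \w: the Chinese characters are not word characters.
ascii_words = re.findall(u"\\w+", text, ascii_only)

# Unicode-aware \w: the Chinese characters count as alphanumeric.
unicode_words = re.findall(u"\\w+", text, re.UNICODE)

# Consequently \b only sees a boundary before the first character
# when the Unicode-aware definition of \w is in effect.
ascii_boundary = re.search(u"\\b\u7a0b", text, ascii_only)
unicode_boundary = re.search(u"\\b\u7a0b", text, re.UNICODE)
```

Here ascii_words contains only u"abc" and ascii_boundary is None, while 
unicode_words also contains the Chinese word and unicode_boundary is a match.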