[Tutor] the regex boundary about chinese word
Peter Otten
__peter__ at web.de
Fri May 4 12:21:13 CEST 2012
goog cheng wrote:
> Hi, I have this problem:
>
> #!python
> # -*- coding: utf-8 -*-
> import re
>
> p = re.compile(ur'\bc123\b')
> print '**',p.search('no class c123 at all').group()
>
> p = re.compile(ur'\b\u7a0b\u6770\b')
> print ur'\u7a0b\u6770'
> print '****',p.search(' 程杰 abc'.decode('utf8'))
>
> Why can't the \b boundary match the word '程杰'?
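You need to provide the UNICODE flag. Without it the search returns None, so
the interpreter echoes nothing; with it you get a match object:
>>> re.compile(ur"\b程杰\b").search(u" 程杰 abc")
>>> re.compile(ur"\b程杰\b", re.UNICODE).search(u" 程杰 abc")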
<_sre.SRE_Match object at 0x7f0beb325f38>
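A minimal sketch of the same fix, assuming you would rather spell the two
characters as \u escapes so the result does not depend on the source file's
encoding. The names WORD, TEXT and pattern are only illustrative. The pattern
is built from a u"..." literal instead of ur"..." so that the snippet also runs
on Python 3, where str patterns are Unicode-aware by default and re.UNICODE is
redundant:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

# Illustrative names; the two characters are the ones from the question.
WORD = u"\u7a0b\u6770"
TEXT = u" " + WORD + u" abc"

# u"\\b" is a literal backslash followed by "b", i.e. the regex word boundary.
# re.UNICODE makes \w (and therefore \b) use the Unicode character database.
pattern = re.compile(u"\\b" + WORD + u"\\b", re.UNICODE)

match = pattern.search(TEXT)
found = match.group() if match else None
span = match.span() if match else None

print(found == WORD, span)
```

Checking the match against None before calling .group() also avoids the
AttributeError you would get if the search had failed.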
See http://docs.python.org/library/re.html
"""
Note that formally, \b is defined as the boundary between a \w and a \W
character (or vice versa), or between \w and the beginning/end of the string
...
\w
When the LOCALE and UNICODE flags are not specified, matches any
alphanumeric character and the underscore; this is equivalent to the set
[a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever
characters are defined as alphanumeric for the current locale. If UNICODE is
set, this will match the characters [0-9_] plus whatever is classified as
alphanumeric in the Unicode character properties database.
"""
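To see the \w definition quoted from the docs at work, here is a small sketch
(the sample string and the variable names are made up for illustration). It
compares the explicit ASCII set [a-zA-Z0-9_], which is what \w means when no
flag is given in Python 2, with \w under re.UNICODE:

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical sample text mixing ASCII words and the two Chinese characters.
text = u"c123 \u7a0b\u6770 abc"

# The ASCII-only set the docs give for \w without LOCALE or UNICODE.
ascii_words = re.findall(u"[a-zA-Z0-9_]+", text)

# \w with re.UNICODE also accepts characters the Unicode database
# classifies as alphanumeric, which includes the two Chinese characters.
unicode_words = re.findall(u"\\w+", text, re.UNICODE)

print(len(ascii_words), len(unicode_words))
```

With the ASCII set the two Chinese characters count as \W, just like the
spaces around them, so there is no \w/\W transition for \b to match. Under
re.UNICODE they count as \w and the boundaries appear.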