Confused by slash/escape in regexp

MRAB python at mrabarnett.plus.com
Sun Apr 11 19:27:11 EDT 2010


andrew cooke wrote:
> Is the third case here surprising to anyone else?  It doesn't make
> sense to me...
> 
> Python 2.6.2 (r262:71600, Oct 24 2009, 03:15:21)
> [GCC 4.4.1 [gcc-4_4-branch revision 150839]] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> from re import compile
>>>> p1 = compile('a\x62c')

'a\x62c' is a string literal which is the same as 'abc', so re.compile
receives the characters:

     abc

as the regex, which matches the string:

     abc
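
A minimal sketch of this first case, assuming Python 3 (the quoted session is Python 2.6, but the string-literal rule is the same); the variable names are only illustrative. It checks that the \x62 escape is resolved by Python itself before re.compile sees the string:

```python
import re

# Python's string-literal parser turns \x62 into 'b' before re is involved.
literal = 'a\x62c'
same_as_abc = (literal == 'abc')
length = len(literal)  # three characters: a, b, c

# re.compile therefore receives the plain three-character regex abc.
p1 = re.compile(literal)
matched = p1.match('abc') is not None
```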

>>>> p1.match('abc')
> <_sre.SRE_Match object at 0x7f4e8f93d578>
>>>> p2 = compile('a\\x62c')

'a\\x62c' is a string literal which represents the characters:

     a\x62c

so re.compile receives these characters as the regex.

The re module has its own set of escape sequences, most of
which are the same as Python's string escape sequences. The re module
treats \x62 like the string escape, i.e. it represents the character 'b',
so this regex is the same as:

     abc
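
As a rough illustration of the second case (again assuming Python 3, with illustrative variable names), the sketch below shows that the string passed to re.compile has six characters, and that it is the re module, not the string literal, that turns \x62 into 'b':

```python
import re

# The doubled backslash yields one real backslash, so the regex text
# is the six characters: a \ x 6 2 c
pattern = 'a\\x62c'
same_as_raw = (pattern == r'a\x62c')
length = len(pattern)

p2 = re.compile(pattern)

# re interprets \x62 as 'b', so the regex matches the string abc ...
matches_abc = p2.match('abc') is not None

# ... and does not match the six-character text of the pattern itself.
matches_own_text = p2.match(pattern) is not None
```

Writing the pattern as a raw string, r'a\x62c', gives the same six characters and is usually easier to read.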

>>>> p2.match('abc')
> <_sre.SRE_Match object at 0x7f4e8f93d920>
>>>> p3 = compile('a\\\x62c')

'a\\\x62c' is a string literal which is the same as 'a\\bc', so
re.compile receives the characters:

     a\bc

as the regex.

The re module treats the \b in a regex as representing a word boundary,
unless it's in a character set, e.g. [\b], where it represents a
backspace character.

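The regex will try to match a word boundary sandwiched between two
letters. A word boundary only occurs between a word character and a
non-word character (or at the start or end of the string), and both
'a' and 'c' are word characters, so this can never happen.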
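
A hedged sketch of the third case, assuming Python 3; the test strings and variable names are only illustrative. It confirms what re.compile receives, shows that nothing matches, and then shows two variations: \b inside a character set, and one possible way (re.escape) to match a literal backslash if that was the intent:

```python
import re

# One escaped backslash followed by \x62 ('b'): the regex text is a\bc
pattern = 'a\\\x62c'
same_as_raw = (pattern == r'a\bc')

p3 = re.compile(pattern)

# 'a' and 'c' are both word characters, so no boundary can sit between them.
results = [p3.match(s) for s in ('abc', 'a\\bc', 'ac')]

# \b does match where a word character meets a non-word character.
boundary = re.search(r'a\b', 'a c')

# Inside a character set, \b means backspace (\x08) instead.
backspace = re.match(r'a[\b]c', 'a\x08c')

# To match a literal backslash followed by bc, escape the backslash
# for the regex as well; re.escape does that here.
literal = re.compile(re.escape('a\\bc'))
literal_match = literal.match('a\\bc')
```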

>>>> p3.match('a\\bc')
>>>> p3.match('abc')
>>>> p3.match('a\\\x62c')
>>>>


