Raw string substitution problem

MRAB python at mrabarnett.plus.com
Thu Dec 17 12:38:34 EST 2009


Alan G Isaac wrote:
> On 12/17/2009 11:24 AM, Richard Brodie wrote:
>> A raw string is not a distinct type from an ordinary string
>> in the same way byte strings and Unicode strings are. It
>> is merely a notation for constants, like writing integers
>> in hexadecimal.
>>
>> >>> (r'\n', u'a', 0x16)
>> ('\\n', u'a', 22)
> 
> 
> 
> Yes, that was a mistake.  But the problem remains::
> 
>         >>> re.sub('abc', r'a\nb\n.c\a', '123abcdefg') == re.sub('abc', 'a\\nb\\n.c\\a', '123abcdefg') == re.sub('abc', 'a\nb\n.c\a', '123abcdefg')
>         True
>         >>> r'a\nb\n.c\a' == 'a\\nb\\n.c\\a' == 'a\nb\n.c\a'
>         False
> 
> Why are the first two strings being treated as if they are the last one?
> That is, why isn't '\\' being processed in the obvious way?
> This still seems wrong.  Why isn't it?
> 
> More simply, consider::
> 
>         >>> re.sub('abc', '\\', '123abcdefg')
>         Traceback (most recent call last):
>           File "<stdin>", line 1, in <module>
>           File "C:\Python26\lib\re.py", line 151, in sub
>             return _compile(pattern, 0).sub(repl, string, count)
>           File "C:\Python26\lib\re.py", line 273, in _subx
>             template = _compile_repl(template, pattern)
>           File "C:\Python26\lib\re.py", line 260, in _compile_repl
>             raise error, v # invalid expression
>         sre_constants.error: bogus escape (end of line)
> 
> Why is this the proper handling of what one might think would be an
> obvious substitution?
> 
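Regular expressions and replacement strings have their own escaping
mechanism, which also uses backslashes. The string literal is processed
first by Python, and the resulting string is then processed again by the
re module. That is why a replacement consisting of a single backslash
('\\' as a string literal) is rejected as an incomplete escape.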
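
As a rough illustration of that second layer of escaping, here is a small
sketch written for Python 3 (the traceback quoted above is from Python 2.6,
where the exception type and message differ). The variable names are only
for illustration.

```python
import re

# A replacement that is a single backslash is an incomplete escape,
# so the re module refuses to compile it.
try:
    re.sub('abc', '\\', '123abcdefg')
    failed = False
except re.error:
    failed = True

# Escaping the backslash for the replacement template inserts one backslash.
doubled = re.sub('abc', r'\\', '123abcdefg')

# The return value of a replacement function is used as-is, without
# any escape processing, so a single backslash is fine here.
via_function = re.sub('abc', lambda m: '\\', '123abcdefg')

print(failed, repr(doubled), repr(via_function))
```

Passing a function as the replacement is one way to sidestep the template
escaping entirely when the replacement text is not under your control.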

Some of these regex escape sequences are the same as those of string
literals, e.g. \n represents a newline; others are different, e.g. \b in a
regex represents a word boundary and not a backspace as in a string
literal.
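
The \b difference can be seen with a short sketch like the following; the
sample strings are made up for the purpose.

```python
import re

# In an ordinary string literal, '\b' is a single backspace character (0x08).
backspace = '\b'

# In a raw string literal the backslash survives, so the regex sees \b
# and treats it as a word boundary.
word_boundary = re.findall(r'\bcat\b', 'cat concat cat')

# With the non-raw literal, the regex looks for backspace characters instead.
no_match = re.findall('\bcat\b', 'cat concat cat')
with_backspaces = re.findall('\bcat\b', 'x\bcat\by')

print(word_boundary, no_match, with_backspaces)
```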

You can match a newline in a regex either by using an actual newline
character ('\n' in a string literal) or by using the regex escape
sequence \n ('\\n' or r'\n' in a string literal). If you want a regex to
match an actual backslash followed by a letter 'n' then you need to
escape the backslash in the regex and then either use a raw string
literal or escape it again in a non-raw string literal.

     Match characters: <newline>
     Regex: \n
     Raw string literal: r'\n'
     Non-raw string literal: '\\n'

     Match characters: \n
     Regex: \\n
     Raw string literal: r'\\n'
     Non-raw string literal: '\\\\n'

     Replace with characters: <newline>
     Replacement: \n
     Raw string literal: r'\n'
     Non-raw string literal: '\\n'

     Replace with characters: \n
     Replacement: \\n
     Raw string literal: r'\\n'
     Non-raw string literal: '\\\\n'
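
The four cases above can be checked with a sketch along these lines
(Python 3; the sample text and variable names are assumptions chosen for
the example):

```python
import re

# Sample text holding one real newline and one backslash followed by 'n'.
text = 'one\ntwo\\nthree'

# Regex \n matches the newline; both literal spellings give the same pattern.
newline_hits = re.findall(r'\n', text)
same_pattern = (r'\n' == '\\n')

# Regex \\n matches a backslash followed by 'n'.
backslash_n_hits = re.findall(r'\\n', text)
same_pattern_escaped = (r'\\n' == '\\\\n')

# Replacement \n inserts a newline; replacement \\n inserts backslash + 'n'.
with_newline = re.sub('abc', r'\n', '123abcdefg')
with_backslash_n = re.sub('abc', r'\\n', '123abcdefg')

print(repr(newline_hits), repr(backslash_n_hits))
print(repr(with_newline), repr(with_backslash_n))
```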


