DiPierro, Massimo MDiPierro at
Tue Oct 16 20:37:04 CEST 2007

Thank you, this answers my question. I wanted to make sure it was actually designed this way.


From: Tim Chase [python.list at]
Sent: Tuesday, October 16, 2007 1:38 PM
To: DiPierro, Massimo
Cc: python-list at; Berthiaume, Andre
Subject: Re: re.sub

> Let me show you a very bad consequence of this...
> a=open('file1.txt','rb').read()
> b=re.sub('x',a,'x')
> open('file2.txt','wb').write(b)
> Now if file1.txt contains a \n or \" then file2.txt is not the
> same as file1.txt while it should be.

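That's functioning as designed.  The replacement string is a
template in which backslash escapes are processed, so if you want
the contents of file1.txt inserted literally, you have to escape
it before passing it in.  re.escape() looks like the obvious
tool, but it is meant for patterns and can leave stray
backslashes in a replacement.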

Or, you can specially treat newlines:

   b=re.sub('x', a.replace('\n', '\\n'), 'x')

or just escape the backslashes in the incoming replacement string:

   b=re.sub('x', a.replace('\\', '\\\\'), 'x')
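
A minimal sketch of the round trip, written for Python 3 and using
an in-memory string in place of file1.txt (the sample contents are
invented for illustration).  It compares the unescaped call with
the backslash-doubling fix, and also shows passing a function as
the replacement, which is an alternative not mentioned in this
thread:

```python
import re

# Hypothetical stand-in for the contents of file1.txt: a backslash
# followed by the letter n, then an escaped double quote.
a = 'line one\\nsaid \\"hi\\"'

# Unescaped: re.sub() processes backslash escapes in the replacement,
# so the two characters backslash + n become one real newline.
naive = re.sub('x', a, 'x')

# Doubling the backslashes first makes the replacement come out literally.
escaped = re.sub('x', a.replace('\\', '\\\\'), 'x')

# A function's return value is inserted as-is, with no escape processing.
via_function = re.sub('x', lambda match: a, 'x')

print(naive == a)         # False
print(escaped == a)       # True
print(via_function == a)  # True
```

The function form avoids the escaping question entirely, at the
cost of a slightly less obvious call.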

In the help for the RE module's syntax, this is explicitly noted:
If you're not using a raw string to express the pattern, remember
that Python also uses the backslash as an escape sequence in
string literals; if the escape sequence isn't recognized by
Python's parser, the backslash and subsequent character are
included in the resulting string. However, if Python would
recognize the resulting sequence, the backslash should be
repeated twice. This is complicated and hard to understand, so
it's highly recommended that you use raw strings for all but the
simplest expressions.

The short upshot:  "it's highly recommended that you use raw
strings for all but the simplest expressions."

Thus, the string that you pass as your regexp should be a regexp.
Not a "python interpretation of a regexp before the regex engine
gets to touch it".
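
As a small illustration of that difference (a sketch for Python 3;
the pattern and sample text are made up), compare what the regex
engine receives from a normal literal and from a raw one:

```python
import re

# In a normal literal, '\b' is a single backspace character.
# In a raw literal, r'\b' is a backslash followed by 'b', which the
# regex engine reads as a word boundary.
plain = '\bcat\b'
raw = r'\bcat\b'
doubled = '\\bcat\\b'   # same string as the raw literal, harder to read

print(len(plain), len(raw))              # 5 7
print(raw == doubled)                    # True
print(re.findall(plain, 'cat concat'))   # []
print(re.findall(raw, 'cat concat'))     # ['cat']
```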

