inserting \ in regular expressions
Dave Angel
d at davea.name
Wed Oct 26 16:47:54 EDT 2011
On 10/26/2011 03:48 PM, Ross Boylan wrote:
> I want to replace every \ and " (the two characters for backslash and
> double quotes) with a \ and the same character, i.e.,
> \ -> \\
> " -> \"
>
> I have not been able to figure out how to do that. The documentation
> for re.sub says "repl can be a string or a function; if it is a string,
> any backslash escapes in it are processed. That is, \n is converted to a
> single newline character, \r is converted to a carriage return, and so
> forth. Unknown escapes such as \j are left alone."
>
> \\ is apparently unknown, and so is left as is. So I'm unable to get a
> single \.
>
> Here are some tries in Python 2.5.2. The document suggested the result
> of a function might not be subject to the same problem, but it seems to
> be.
>>>> def f(m):
> ... return "\\"+m.group(1)
> ...
>>>> re.sub(r"([\\\"])", f, 'Silly " quote')
> 'Silly \\" quote'
> <SNIP>
>>>> re.sub(r"([\\\"])", "\\\\\\1", 'Silly " quote')
> 'Silly \\" quote'
>
> Or perhaps I'm confused about what the displayed results mean. If a
> string has a literal \, does it get shown as \\?
>
> I'd appreciate it if you cc me on the reply.
>
> Thanks.
> Ross Boylan
>
I can't really help on the regex aspect of your code, but I can tell you
a little about backslashes, quote literals, the interpreter, and python.
First, I'd skip the interactive interpreter and write your code to a file.
Then test it by running that file. The reason for that is that the
interactive interpreter is helpfully trying to reconstruct the string
you'd have to type in order to get that result. So while you may have
successfully turned a double backslash into a single one, the interpreter
helpfully does the inverse when it displays the value, and you can't see
whether you're right or not.
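To make that concrete, here is a small sketch that reruns the function-based attempt from the question and looks at the result three ways. It is written so that it should run under both Python 2 and Python 3; the variable name result is just for illustration.

```python
import re

def f(m):
    # One backslash followed by whatever character was matched.
    return "\\" + m.group(1)

result = re.sub(r"([\\\"])", f, 'Silly " quote')

print(result)        # the characters actually in the string
print(repr(result))  # what the interactive prompt echoes
print(len(result))   # one longer than the 13-character input
```

The first print shows a single backslash before the quote, so the substitution was doing what was asked all along. Only the repr() form doubles it.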
Next, always assign to variables, and test those variables on a separate
line with the regex. This is probably what your document meant when it
mentioned the result of a function.
Now some details about python.
When python compiles/interprets a quote literal, the syntax parsing has
to decide where the literal stops, so quotes are treated specially.
Sometimes you can sidestep the problem of embedding quotes inside
literals by using single quotes on the outside and double inside, or
vice versa. As you did on the 'Silly " quote' example.
But the more general way to put funny characters into a quote literal is
to escape each with a backslash. So there are a bunch of two-character
escapes. backslash-quote is how you can put either kind of quote into a
literal, regardless of what's being used to delimit it. backslash-n
gets a newline, which would similarly be bad syntax if you typed it
directly inside the quotes. backslash-t and some others are usually less
troublesome, but can be surprising. And backslash-backslash represents a
single backslash. There are also backslash codes to represent arbitrary
characters you might not have on your keyboard, and these may use
multiple characters after the backslash.
So write a bunch of lines like
a = 'this isn\'t a surprise'
print a
and experiment. Notice that if you use \n in such a string, the print
will put it on two lines. Likewise a \t is printed as an actual tab.
Now for a digression. The interpreter uses repr() to display strings.
You can experiment with that by doing
print a
print repr(a)
Notice the latter puts quotes around the string. They are NOT part of
the string object in a. And it re-escapes any embedded funny
characters, sometimes differently than the way you entered them.
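As a rough illustration of that re-escaping, the sketch below compares a string with its repr(). The use of ast.literal_eval to turn the repr back into a string is my addition, not something the thread mentions.

```python
import ast

a = 'this isn\'t a surprise'
shown = repr(a)

print(a)      # this isn't a surprise
print(shown)  # "this isn't a surprise"  (repr switched to double quotes)

# A tab entered as a hex escape comes back as backslash-t.
print(repr("\x09"))

# The repr is a literal that evaluates back to the original string.
print(ast.literal_eval(shown) == a)
```

Here the literal was typed with single quotes and a backslash-quote, but repr() picks double quotes so that no escape is needed. Same string object, different spelling.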
Now, once you're confident that you can write a literal to express any
possible string, try calling your regex.
print re.sub(a, b, c)
or whatever.
Now, one way to cheat on the string if you know you'll want to put
actual backslashes is to use the raw string. That works quite well
unless you want the string to end with a backslash. There isn't a way
to enter that as a single raw literal. You'd have to do something
like
a = r"strange\literal\with\some\stuff" + "\\"
My understanding is that no valid regex ends with a backslash, so this
may not affect you.
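A quick sketch of how the raw and ordinary spellings relate, with made-up variable names:

```python
# The same string written two ways.
plain = "strange\\literal\\with\\some\\stuff\\"
raw = r"strange\literal\with\some\stuff" + "\\"

print(plain == raw)
print(raw)

# In a raw literal the backslash is kept as-is, so r"\n" is two characters.
print(len(r"\n"))
print(len("\n"))

# Writing r"stuff\" is a syntax error: the final backslash would still
# keep the closing quote from ending the literal.
```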
Now there are other ways to acquire a string object. If you got it from
a raw_input() call, it doesn't need to be escaped, but it can't have an
embedded newline, since the enter key is how the input is completed. If
you read it from a file, it doesn't need to be escaped.
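For the file case, a small sketch along these lines shows that what comes back from a read is the characters themselves, with no escape processing. The temporary file and the sample text are assumptions made up for the demonstration.

```python
import os
import tempfile

# Put the 12 characters  C:\new\table  into a scratch file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as fh:
    fh.write("C:\\new\\table\n")

with open(path) as fh:
    line = fh.readline().rstrip("\n")
os.remove(path)

print(line)       # the \n and \t in the file are NOT turned into newline/tab
print(len(line))
```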
Now you're ready to see what other funny requirements the regex module
has. You will be escaping stuff for its purposes as well, and sometimes
that means your literal might have 4 or even more backslashes in a row.
But hopefully now you'll see how to separate the different problems.
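As one possible way of putting the layers together for the original question, here is a sketch, not a tested recommendation for Python 2.5 specifically. It uses raw strings for both the pattern and the replacement template, and \g<0> to refer to the whole match; the names text, pattern and escaped are only for illustration.

```python
import re

text = 'a \\ and a "'   # one backslash and one double quote

# Regex layer: inside a character class, \\ means a literal backslash.
pattern = r'[\\"]'
# Without the raw string, the Python layer doubles every backslash again.
print(pattern == "[\\\\\"]")

# Replacement layer: \\ is one backslash, \g<0> is whatever was matched.
escaped = re.sub(pattern, r"\\\g<0>", text)

print(text)
print(escaped)
```

Printing both strings, rather than echoing them at the prompt, shows each backslash and quote in the input gaining exactly one backslash in front of it.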
--
DaveA