[ python-Bugs-1518406 ] re '\' char interpretation problem

Fri Jul 7 00:55:02 CEST 2006

Bugs item #1518406, was opened at 2006-07-06 21:26
Message generated for change (Comment added) made by niemeyer
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1518406&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Regular Expressions
Group: Python 2.4
Status: Closed
Resolution: Invalid
Priority: 5
Submitted By: ollie oldham (ooldham)
Assigned to: Gustavo Niemeyer (niemeyer)
Summary: re '\' char interpretation problem

Initial Comment:
I've run across 2 problems with how the re module 
handles the '\' character.

Problem 1 does not match the re when it should have.
Problem 2 matches, when it should not have.

There is a short snippet of code attached that shows 
the problems I'm having, and the output as it occurs 
on my machine.

I'm running on Windows 2000
Python versions: 2.4b1 and 2.4.3c1 both act the same 
way.

Problem (1) : why does * work and not + ?
import re
rex = re.compile(r'[a-z]:\.*', re.IGNORECASE)
rey = re.compile(r'[a-z]:\.+', re.IGNORECASE)
path1 = r'D:\Logs'
print rex.match(path1) # Matches - as it should have.
print rey.match(path1) # FAILS to match - should have.

Problem (2) : match occurs on nonUncPath when it should not
import re
uncPath = r'\\someUNC\path'
nonUncPath = r'\nonUnc\path'
rew = re.compile('\\\\.+', re.IGNORECASE)
print rew.match(uncPath) # works as it should.
print rew.match(nonUncPath) # matches and it should NOT.

----------------------------------------------------------------------

>Comment By: Gustavo Niemeyer (niemeyer)
Date: 2006-07-06 22:55

Message:
Logged In: YES 
user_id=7887

Please, use a single way to report issues. Do not message
*and* add a comment to the bug.

I think you're missing the behavior of r'' in Python. It
changes the way the Python interpreter parses the string,
not the way the regular expression compiler/interpreter
works. r'\.' is precisely the same as '\\.', and both of
them really describe the string |\.|.

  >>> r'\.' == '\\.'
  True

  >>> print r'\.'
  \.

Escaping a dot in a regular expression makes it match a
literal dot. Please have a look at the re module
documentation and perhaps some general regular
expression info for more details.
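
The two layers described above (the Python string literal, then the
regular expression compiled from it) can be sketched as follows. This
is only an illustration; the names `dot` and `backslash` are
hypothetical and do not appear in the original report.

```python
import re

# Layer 1: the string literal. Both spellings produce the same
# two-character string: a backslash followed by a dot.
assert r'\.' == '\\.'
assert len(r'\.') == 2

# Layer 2: the regex compiler. It receives backslash-dot, which it
# reads as "a literal dot", not as "a backslash, then any character".
dot = re.compile(r'\.')
print(dot.match('.') is not None)   # True
print(dot.match('x') is not None)   # False

# To match a literal backslash, the regex itself needs two backslashes:
# r'\\' as a raw literal, or '\\\\' as a plain literal.
backslash = re.compile(r'\\')
assert r'\\' == '\\\\'
print(backslash.match('\\') is not None)  # True
```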

----------------------------------------------------------------------

Comment By: ollie oldham (ooldham)
Date: 2006-07-06 22:46

Message:
Logged In: YES 
user_id=649833

I beg to differ on problem 1)

Since 'r' was used in the definition of both the re and 
path, the '.' char is not being escaped (not supposed to be 
anyway).
And even if it is, then rex=re.compile('[a-z]:\\.+', 
re.IGNORECASE) should get me what I want (in textual form:: 
char a-z colon backslash with 1 or more trailing chars).
But that does not work either.

I beg to differ on item 2) as well:
Yes - '\\\\.+' is the equivalent of r'\\.+'
BUT I then read this as: 2 backslashes with 1 or more 
chars - NOT backslash with escaped '.'

----------------------------------------------------------------------

Comment By: Gustavo Niemeyer (niemeyer)
Date: 2006-07-06 21:36

Message:
Logged In: YES 
user_id=7887

1) r'[a-z]:\.+' should not match r'D:\Logs'. r'\.+' matches
one or more dots. There's no dot in this string.

2) '\\\\.+' is the equivalent of r'\\.+', and should match
anything that starts with a '\' and has at least one char
following it, which includes r'\nonUnc\path'.
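
Both points can be checked directly. The sketch below reuses the
patterns and paths from the report, then adds two rewrites that
express what the reporter appears to have intended. The names `drive`
and `unc` and their patterns are assumptions for illustration, not
something proposed in the thread.

```python
import re

path1 = r'D:\Logs'
unc_path = r'\\someUNC\path'
non_unc_path = r'\nonUnc\path'

# Patterns from the report: a colon followed by dots, not by a backslash.
rex = re.compile(r'[a-z]:\.*', re.IGNORECASE)  # zero or more dots
rey = re.compile(r'[a-z]:\.+', re.IGNORECASE)  # one or more dots
print(rex.match(path1).group())  # D:   (zero dots is enough)
print(rey.match(path1))          # None (no dot after the colon)

# The non-raw spelling tried later in the thread is the same pattern.
assert '[a-z]:\\.+' == r'[a-z]:\.+'

# Hypothetical rewrites: double the backslash inside the raw literal so
# the regex sees an escaped backslash followed by "any character".
drive = re.compile(r'[a-z]:\\.+', re.IGNORECASE)  # drive letter, colon, backslash, 1+ chars
unc = re.compile(r'\\\\.+')                       # two backslashes, 1+ chars
print(drive.match(path1).group())   # D:\Logs
print(unc.match(unc_path).group())  # \\someUNC\path
print(unc.match(non_unc_path))      # None
```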

----------------------------------------------------------------------
