[Python-bugs-list] [ python-Bugs-429357 ] non-greedy regexp duplicating match bug

noreply@sourceforge.net noreply@sourceforge.net
Tue, 26 Jun 2001 12:11:08 -0700


Bugs item #429357, was opened at 2001-06-01 09:29
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=429357&group_id=5470

Category: Regular Expressions
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Matthew Mueller (donut)
>Assigned to: Fredrik Lundh (effbot)
Summary: non-greedy regexp duplicating match bug

Initial Comment:
I found a strange bug: when a non-greedy group does not take part in the
match, it captures the rest of the string instead of being None.

#pyrebug.py:
import re
urlrebug=re.compile("""
	(.*?)://			#scheme
	(
		(.*?)			#user
		(?:
			:(.*)		#pass
		)?
	@)?
	(.*?)				#addr
	(?::([0-9]+))?			#port
	(/.*)?$				#path
""", re.VERBOSE)

testbad='foo://bah:81/pth'

print(urlrebug.match(testbad).groups())

Bug Output:
>python2.1 pyrebug.py       
('foo', None, 'bah:81/pth', None, 'bah', '81', '/pth')
>python-cvs pyrebug.py       
('foo', None, 'bah:81/pth', None, 'bah', '81', '/pth')

Good (expected) Output:
>python1.5 pyrebug.py       
('foo', None, None, None, 'bah', '81', '/pth')
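
For anyone rechecking this on a newer interpreter, here is a rough
Python 3 port of the script above. The pattern is unchanged; the names
URL_RE and split_url are made up for this sketch and are not part of the
original report. On an interpreter without the bug, group 3 should come
back as None, since the optional user:pass@ group never takes part in
the match.

```python
import re

# Same pattern as pyrebug.py; only the surrounding code is adapted.
URL_RE = re.compile(r"""
    (.*?)://            # 1: scheme
    (                   # 2: whole user:pass@ prefix (optional)
        (.*?)           # 3: user
        (?:
            :(.*)       # 4: pass
        )?
    @)?
    (.*?)               # 5: addr
    (?::([0-9]+))?      # 6: port
    (/.*)?$             # 7: path
""", re.VERBOSE)


def split_url(url):
    """Return the seven captured groups, or None if nothing matches.

    Hypothetical helper, introduced only for this example.
    """
    match = URL_RE.match(url)
    return match.groups() if match else None


if __name__ == "__main__":
    print(split_url("foo://bah:81/pth"))
```

Wrapping the match in a helper makes it easy to compare the tuple
against the "Good" output above from a test script.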



----------------------------------------------------------------------

Comment By: Matthew Mueller (donut)
Date: 2001-06-14 00:59

Message:
Logged In: YES 
user_id=65253

I think I understand what you are saying, and in the context
of the test, it doesn't seem too bad.  BUT, my original code
(and what I'd like to have) did not have the surrounding group.

So I'd just get: ('foo', 'bah:81/pth', None, 'bah', '81',
'/pth') 

Knowing the general ease of messing up regexes when writing
them, I'm sure you can imagine the pain I went through before
actually realizing it was a python bug :)
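
To make the variant without the surrounding group concrete, here is a
sketch of that pattern. The exact original code is not shown in this
report, so this is an assumption: it is the test pattern with the outer
user:pass@ parentheses made non-capturing, which leaves six groups. The
name URL_RE_NOGROUP is made up for the example. With the behaviour
reported above, the user group (now group 2) would hold 'bah:81/pth';
an interpreter without the bug should give None there.

```python
import re

# Assumed shape of the original pattern: the outer group is written as
# a non-capturing group, so only six groups are captured.
URL_RE_NOGROUP = re.compile(r"""
    (.*?)://            # 1: scheme
    (?:
        (.*?)           # 2: user
        (?:
            :(.*)       # 3: pass
        )?
    @)?
    (.*?)               # 4: addr
    (?::([0-9]+))?      # 5: port
    (/.*)?$             # 6: path
""", re.VERBOSE)

groups = URL_RE_NOGROUP.match("foo://bah:81/pth").groups()
user = groups[1]
print(groups)
print("user group took part in the match:", user is not None)
```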

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2001-06-13 10:12

Message:
Logged In: NO 

What's happening makes sense, on one level.
When the regex engine gets to the user:pass@ part

((.*?)(?::(.*))?@)?

which fills groups 2, 3, and 4, the .*? of group 3 has
to try at every character in the rest of the string before
admitting overall defeat. In doing that, the last time
that group 3 completes successfully locally, it has the
rest of the string matched. Of course, overall, group
three is enclosed within group 2, and when group two
couldn't complete successfully, the engine knows it can
skip group two (due to the ? modifying it), so it totally
bails on groups 2, 3 and 4 to continue with the rest of
the expression.

What you'd like to happen is when that "bailing" happens
for group 2, the enclosed groups 3 and 4 would get zeroed
out (since they didn't participate in the *final* overall
match). That makes sense, and is what I would expect to
happen.  However, what *is* happening is that group 3 is
keeping the string that *it* last matched (even though
that last match didn't contribute to the final, overall
match).
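
A much smaller pattern shows the same distinction between a group's
last local match and its part in the final match. This is only an
illustrative sketch: the pattern ((.*?):)?z and the helper
participating_groups are made up here and do not come from the report.
The helper lists only the groups that contributed to the final match,
using the fact that start(n) on a match object returns -1 for a group
that did not.

```python
import re

# Group 2 is a non-greedy group nested inside an optional group 1.
PATTERN = re.compile(r"((.*?):)?z")


def participating_groups(text):
    """Map group number -> captured text for groups in the final match.

    Match.start(n) returns -1 when group n exists in the pattern but
    did not contribute to the match.
    """
    match = PATTERN.match(text)
    if match is None:
        return None
    return {n: match.group(n)
            for n in range(1, PATTERN.groups + 1)
            if match.start(n) != -1}


print(participating_groups("z"))    # optional group is skipped
print(participating_groups("a:z"))  # optional group matches "a:"
```

On an interpreter that resets groups as described, the first call
should give an empty dict, because the engine skips group 1 entirely
after trying group 2 against the rest of the string.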

I'm not explaining this well -- I hope you can understand
it despite that. Sorry.

	Jeffrey


----------------------------------------------------------------------
