[Python-bugs-list] [ python-Bugs-429357 ] non-greedy regexp duplicating match bug

noreply@sourceforge.net noreply@sourceforge.net
Wed, 06 Nov 2002 08:39:22 -0800


Bugs item #429357, was opened at 2001-06-01 16:29
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=429357&group_id=5470

Category: Regular Expressions
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Matthew Mueller (donut)
Assigned to: Fredrik Lundh (effbot)
Summary: non-greedy regexp duplicating match bug

Initial Comment:
I found a weird bug: when a non-greedy match doesn't match anything,
its group duplicates the rest of the string instead of being None.

#pyrebug.py:
import re
urlrebug=re.compile("""
	(.*?)://			#scheme
	(
		(.*?)			#user
		(?:
			:(.*)		#pass
		)?
	@)?
	(.*?)				#addr
	(?::([0-9]+))?			#port
	(/.*)?$				#path
""", re.VERBOSE)

testbad='foo://bah:81/pth'

print urlrebug.match(testbad).groups()

Bug Output:
>python2.1 pyrebug.py       
('foo', None, 'bah:81/pth', None, 'bah', '81', '/pth')
>python-cvs pyrebug.py       
('foo', None, 'bah:81/pth', None, 'bah', '81', '/pth')

Good (expected) Output:
>python1.5 pyrebug.py       
('foo', None, None, None, 'bah', '81', '/pth')
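
For reference, the script above can be restated for Python 3 roughly as
follows. This is only a sketch and not part of the original report: the
names URL_RE and TEST_URL are made up here, and the final assertion
assumes an interpreter that already includes the fix recorded for this
item.

```python
import re

# Same pattern as pyrebug.py above; only the print call and the
# (hypothetical) variable names differ.
URL_RE = re.compile(r"""
    (.*?)://                # scheme
    (
        (.*?)               # user
        (?:
            :(.*)           # pass
        )?
    @)?
    (.*?)                   # addr
    (?::([0-9]+))?          # port
    (/.*)?$                 # path
""", re.VERBOSE)

TEST_URL = 'foo://bah:81/pth'

groups = URL_RE.match(TEST_URL).groups()
print(groups)

# The input has no "@", so the optional user/pass part (groups 2-4)
# takes no part in the match and should be reported as None.
assert groups == ('foo', None, None, None, 'bah', '81', '/pth')
```

If group 3 comes back as 'bah:81/pth' instead, the interpreter still
shows the behaviour described in this report.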



----------------------------------------------------------------------

>Comment By: Gustavo Niemeyer (niemeyer)
Date: 2002-11-06 16:39

Message:
Logged In: YES 
user_id=7887

This problem was fixed in the following CVS revisions: 
 
Lib/test/re_tests.py:1.30->1.31 
Lib/test/test_sre.py:1.37->1.38 
Misc/NEWS:1.511->1.512 
Modules/_sre.c:2.83->2.84 
 
Thank you! 
 

----------------------------------------------------------------------

Comment By: Matthew Mueller (donut)
Date: 2001-10-05 04:16

Message:
Logged In: YES 
user_id=65253

Ok, after poking and prodding the _sre.c code a bunch until
I (hopefully) understand what is happening, I've created a
patch.  It passes all existing re tests as well as new ones
I added for this bug.  (I've also made a patch for the
similar, but separate, bug #448951 which I will post there
shortly.)


----------------------------------------------------------------------

Comment By: Gregory Smith (gregsmith)
Date: 2001-08-30 16:26

Message:
Logged In: YES 
user_id=292741

This looks like the same bug I have reported (with a much simpler example)
as #448951 (missed this one before because I was looking for 'group').
What I found is consistent with Jeffrey's comments -
if you have a situation where an optional part is fully scanned before the
state machine can tell if it should actually be matched, the contained tentative
match(es) are stored in the group() even if the optional part turns out 
to fail. Presumably, such a case needs to be handled by going back and
deleting these after the s.m. determines that the optional part was not
matched. In my example, I mention a small modification to the test case where
the failure of the optional ? is decided one character later (at the end of the
() group, not beyond it);  this is enough to make it start working again.
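
The situation described above can be sketched with a much smaller
pattern. The pattern below is chosen here purely for illustration and is
not necessarily the example attached to #448951; SMALL_RE is a
hypothetical name. The inner group tentatively matches one character
before the engine discovers that the ":" is missing and the optional
part has to be dropped.

```python
import re

# Hypothetical reduced case: an optional outer group that wraps one
# inner capturing group.
SMALL_RE = re.compile(r'((.):)?z')

# On 'z', the inner (.) first consumes 'z' tentatively, then ':' fails,
# so the whole optional part is abandoned and 'z' matches on its own.
no_prefix = SMALL_RE.match('z').groups()

# On 'a:z', the optional part really does take part in the match.
with_prefix = SMALL_RE.match('a:z').groups()

print(no_prefix, with_prefix)

# Assumes an interpreter with the fix: abandoned groups read as None.
assert no_prefix == (None, None)
assert with_prefix == ('a:', 'a')
```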



----------------------------------------------------------------------

Comment By: Matthew Mueller (donut)
Date: 2001-06-14 07:59

Message:
Logged In: YES 
user_id=65253

I think I understand what you are saying, and in the context
of the test, it doesn't seem too bad.  BUT, my original code
(and what I'd like to have) did not have the surrounding group.

So I'd just get: ('foo', 'bah:81/pth', None, 'bah', '81',
'/pth') 

Knowing the general ease of messing up regexes when writing
them, I'm sure you can imagine the pain I went through before
actually realizing it was a Python bug :)
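
A variant without the surrounding group might look like the sketch
below. The exact original pattern is not shown in this thread, so the
non-capturing (?: ... ) wrapper and the name PLAIN_RE are assumptions
made for illustration only.

```python
import re

# Assumed reconstruction: the outer capturing group of the test case is
# replaced by a non-capturing one, leaving six groups instead of seven.
PLAIN_RE = re.compile(r"""
    (.*?)://                # scheme
    (?:
        (.*?)               # user
        (?:
            :(.*)           # pass
        )?
    @)?
    (.*?)                   # addr
    (?::([0-9]+))?          # port
    (/.*)?$                 # path
""", re.VERBOSE)

plain_groups = PLAIN_RE.match('foo://bah:81/pth').groups()
print(plain_groups)

# On an interpreter with the fix, the user group (2) is None rather
# than the stale 'bah:81/pth' quoted above.
assert plain_groups == ('foo', None, None, 'bah', '81', '/pth')
```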

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2001-06-13 17:12

Message:
Logged In: NO 

What's happening makes sense, on one level.
When the regex engine gets to the user:pass@ part

((.*?)(?::(.*))?@)?

which fills groups 2, 3, and 4, the .*? of group 3 has
to try at every character in the rest of the string before
admitting overall defeat. In doing that, the last time
that group 3 successfully completed locally, it has the
rest of the string matched. Of course, overall, group
three is enclosed within group 2, and when group two
couldn't complete successfully, the engine knows it can
skip group two (due to the ? modifying it), so it totally
bails on groups 2, 3 and 4 to continue with the rest of
the expression.

What you'd like to happen is when that "bailing" happens
for group 2, the enclosed groups 3 and 4 would get zeroed
out (since they didn't participate in the *final* overall
match). That makes sense, and is what I would expect to
happen.  However, what *is* happening is that group 3 is
keeping the string that *it* last matched (even though
that last match didn't contribute to the final, overall
match).

I'm not explaining this well -- I hope you can understand
it despite that. Sorry.

	Jeffrey
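
The user:pass@ fragment can also be looked at on its own. The sketch
below is an illustration only: USERINFO_RE and the sample inputs are
hypothetical, the list of candidates is a simplified model of what the
.*? tries rather than a trace of the real engine, and the assertions
assume an interpreter that includes the fix.

```python
import re

# The optional user:pass@ fragment quoted above, in isolation.
USERINFO_RE = re.compile(r'((.*?)(?::(.*))?@)?')

rest = 'bah:81/pth'

# Simplified model: the non-greedy .*? grows one character at a time,
# so the last candidate it tries is the entire remaining string.
candidates = [rest[:i] for i in range(len(rest) + 1)]
print(candidates[-1])

# No "@" follows any candidate, so the optional fragment is skipped:
# an empty match at position 0 with every group unset.
skipped = USERINFO_RE.match(rest)
assert skipped.end() == 0
assert skipped.groups() == (None, None, None)

# When an "@" is present, all three groups take part in the match.
taken = USERINFO_RE.match('user:pw@host')
assert taken.groups() == ('user:pw@', 'user', 'pw')
```

The last candidate printed is exactly the stale value that the affected
versions left in group 3.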


----------------------------------------------------------------------
