[Python-bugs-list] [Bug #127259] New re breaks on some '*?' matches

noreply@sourceforge.net
Tue, 02 Jan 2001 15:39:21 -0800


Bug #127259, was updated on 2001-Jan-01 20:37
Here is a current snapshot of the bug.

Project: Python
Category: Regular Expressions
Status: Open
Resolution: None
Bug Group: None
Priority: 5
Submitted by: nobody
Assigned to: effbot
Summary: New re breaks on some '*?' matches

Details: New Python library:

Python 2.0 (#2, Nov 29 2000, 07:33:50) 
[GCC 2.96 20000731 (Red Hat Linux 7.0)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> import re
>>> re.match("a[ ]*?\ (\d+)", "a   10")
<SRE_Match object at 0x81db748>
>>> re.match("a[ ]*?\ (\d+)", "a    10")

Old Python library:
Python 1.5.2 (#1, Aug 25 2000, 09:33:37)  [GCC 2.96 20000731
(experimental)] on linux-i386
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import re
>>> re.match("a[ ]*?\ (\d+)", "a   10")
<re.MatchObject instance at 80dc1e0>
>>> re.match("a[ ]*?\ (\d+)", "a    10")
<re.MatchObject instance at 80d9c88>
>>> 

I see no reason why the second call should not match.  (I.e.,
the old regular expression library seems correct to me.)

OK, "so don't do that": well, I encountered it in code that
autogenerates regular expressions, so it isn't always easy to
avoid.  Besides, the new behaviour doesn't look correct to me.

Thanks,
-Kevin (kevoc@bellatlantic.net)


Follow-Ups:

Date: 2001-Jan-02 15:39
By: nobody

Comment:
Two bugs in one: Kevin's problem was that SRE didn't always reset the
internal string pointer after a failed minimizing tail match.

Pearu's problem was that the meaning of \Z was (incorrectly) changed by
(?m) -- in other words, \Z behaved like $.
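
For illustration only, here is a minimal sketch of the intended difference
between $ and \Z under (?m), written for a current Python 3 interpreter
rather than the 1.5.2/2.0 builds discussed in this report. The names `text`,
`dollar_positions` and `z_positions` are made up for the example.

```python
import re

# Same subject string as in Pearu's report ('\012' is '\n').
text = "{}\n}\n"

# Under (?m), $ may match before each newline as well as at the very end.
dollar_positions = [m.start() for m in re.finditer(r"(?m)$", text)]

# \Z should ignore (?m) and match only at the very end of the string.
z_positions = [m.start() for m in re.finditer(r"(?m)\Z", text)]

print("$  matches at:", dollar_positions)
print("\\Z matches at:", z_positions)
```

If \Z behaved like $, the two lists would come out the same; on a fixed
engine the \Z list should contain only the end-of-string position.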

Both bugs have been fixed on my box; I'll check them in as soon as I've
fixed my CVS install...

</F>
-------------------------------------------------------

Date: 2001-Jan-02 09:56
By: pearu

Comment:
Here is another example where the new re produces a different (incorrect)
result compared to the Python 1.5.2 re module:

Python 1.5.2:
>>> re.match(r'(?ms).*?}\s*\Z(?P<rest>.*)','{}\012}\012').groupdict()
{'rest': ''}

Python 2.0:
>>> re.match(r'(?ms).*?}\s*\Z(?P<rest>.*)','{}\012}\012').groupdict()
{'rest': '\012}\012'}

which deviates from the definition of \Z:
\Z  Matches only at the end of the string.

Here the group 'rest' is used only for illustration; it should always be the
empty string. tim.one has suspected that this bug could be related to the
'*?' part, though it seems to me that it is rather the meaning of \Z that is
violated here.
Thanks,
Pearu <pearu@ioc.ee>
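
As a hedged sketch, the example above can be turned into a small regression
check. It assumes a current Python 3 interpreter; the names PATTERN, SUBJECT,
`match` and `rest` are chosen for the example and are not part of the report.

```python
import re

# Pattern and subject from the report; '\012' is the octal escape for '\n'.
PATTERN = r"(?ms).*?}\s*\Z(?P<rest>.*)"
SUBJECT = "{}\012}\012"

match = re.match(PATTERN, SUBJECT)

# Nothing can follow \Z, so the 'rest' group should always be empty.
rest = match.group("rest")
print(repr(rest))
```

An engine with the bug described here would leave a non-empty 'rest', because
\Z would be accepted before the first newline instead of only at the end.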
-------------------------------------------------------

Date: 2001-Jan-02 05:23
By: gvanrossum

Comment:
Also note that if you increase the number of spaces between 'a' and '10',
sre gives a match for an odd number of spaces only.
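
A minimal sketch for checking this across several space counts, assuming a
current Python 3 interpreter. The helper `matches_with_spaces` and the range
of counts are made up for the example; the pattern is the one from the report,
written as a raw string.

```python
import re

# Pattern from the report: minimal run of spaces, one literal space, digits.
PATTERN = re.compile(r"a[ ]*?\ (\d+)")

def matches_with_spaces(count):
    """Return True if 'a' + count spaces + '10' matches the reported pattern."""
    return PATTERN.match("a" + " " * count + "10") is not None

# A correct engine should match for every count >= 1, odd or even.
results = {count: matches_with_spaces(count) for count in range(1, 9)}
for count, ok in results.items():
    print(count, ok)
```

Building the subject string from an explicit count also sidesteps any
ambiguity about how many blanks a displayed example really contains.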
-------------------------------------------------------

Date: 2001-Jan-01 21:20
By: tim_one

Comment:
Reproduced and assigned to /F.  Agree it's a bug.  Also agree you shouldn't
do that <wink>.

SourceForge's treatment of whitespace is maddening.  Note that there
*should* be 3 blanks in "a   10" (first example) but 4 in "a    10"
(second, failing example).



-------------------------------------------------------

For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=127259&group_id=5470