[ #456742 ] Failing test case for .*?

Greg Chapman glchapman at earthlink.net
Mon Nov 5 08:32:29 EST 2001


On Sat, 3 Nov 2001 13:06:03 -0500, "Tim Peters" <tim.one at home.com> wrote:

>The bug is that the pattern should have found just the 'a1' tail, not all of
>s; it's the same bug in both cases:
>
>import re
>s = "a\nb\na1"
>
>m = re.search(r'[^\n]+?\d', s)
>print m and `m.group(0)`  # prints 'a\nb\na1'; should have been 'a1'
>
>m = re.search(r'([^\n]+?)\d', s)
>print m and `m.group(0)`  # also prints 'a\nb\na1'

I think this is the same bug I ran into in August (see report 456612).  Note
that newlines are not special for this:

>>> s = "acbca1"
>>> m = re.search(r"[^c]+?\d", s)
>>> m.group(0)
'acbca1'
>>> m = re.search(r"([^c]+?)\d", s)
>>> m.group(0)
'acbca1'
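
For comparison, here is a small script form of the same check. It is only a
sketch: the names `patterns` and `results` are mine, and the expected value
assumes an interpreter whose `re` module does not have this bug, in which
case both patterns should match just the 'a1' tail.

```python
import re

s = "acbca1"
patterns = [r"[^c]+?\d", r"([^c]+?)\d"]

# [^c]+? cannot cross a 'c', so the only possible match is the final "a1".
results = [re.search(p, s).group(0) for p in patterns]
for p, r in zip(patterns, results):
    print(p, repr(r))
```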

I did a little tracing through the SRE code, and it looks to me as though the
change labeled 133283 caused this.  I believe the change was made so that
minimizing repeats minimize the number of characters they match (rather than
the number of times the repeated pattern matches); see the test case (also
labeled 133283) added to re_tests.py.  This is done by finding the nearest
occurrence of the pattern following the minimizing repeat (the "\d" in this
example), and then (in theory) checking whether the intervening characters
match the minimally repeated pattern.  Unfortunately, it appears that in some
cases SRE skips the check of the intervening characters, so minimally repeated
groups end up matching anything.
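
To make the suspected failure mode concrete, here is a toy model in Python.
It is not the SRE code; `lazy_search`, `item_ok`, `tail_ok` and `check_gap`
are hypothetical names, and it only handles a one-character repeated item
followed by a one-character tail.  With the check of the intervening
characters enabled it finds 'a1'; with the check skipped it reproduces the
'acbca1' result shown above.

```python
def lazy_search(s, item_ok, tail_ok, check_gap=True):
    """Toy model of `item+?` followed by a single-character tail."""
    for start in range(len(s)):
        # Nearest tail character that leaves at least one character
        # for the minimizing repeat.
        end = next((j for j in range(start + 1, len(s)) if tail_ok(s[j])), None)
        if end is None:
            continue
        # The step that appears to be skipped: every intervening
        # character must match the repeated item.
        if check_gap and not all(item_ok(ch) for ch in s[start:end]):
            continue
        return s[start:end + 1]
    return None


def not_c(ch):
    return ch != "c"


good = lazy_search("acbca1", not_c, str.isdigit)
bad = lazy_search("acbca1", not_c, str.isdigit, check_gap=False)
print(repr(good), repr(bad))
```

If the gap check fails at the nearest tail, trying a later tail from the same
start cannot help, because the longer span would still contain the offending
character; that is why the model simply moves on to the next start position.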

I think this may also explain the matching of newlines in the absence of
re.DOTALL as reported in bug 477728.
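
I have not reproduced the pattern from that report here.  As an assumed
analogue, the sketch below uses "." in place of the character class; the
names `plain` and `dotall` are mine.  On an interpreter without this bug,
"." should refuse to cross a newline unless re.DOTALL is given.

```python
import re

s = "a\nb\na1"

# Without re.DOTALL, "." does not match "\n", so only the tail can match.
plain = re.search(r".+?\d", s).group(0)

# With re.DOTALL, "." may cross newlines, so the match starts at position 0.
dotall = re.search(r".+?\d", s, re.DOTALL).group(0)

print(repr(plain), repr(dotall))
```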

---
Greg Chapman



