[Python-bugs-list] [ python-Bugs-833106 ] Match delimited by ^ and $ doesn't catch everything

Thu Oct 30 10:36:48 EST 2003

Bugs item #833106, was opened at 2003-10-30 10:08
Message generated for change (Comment added) made by tim_one
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=833106&group_id=5470

Category: Regular Expressions
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Greg Kochanski (gpk)
>Assigned to: Nobody/Anonymous (nobody)
Summary: Match delimited by ^ and $ doesn't catch everything

Initial Comment:
import re

_sfp = re.compile(r'^(?:(.*?)([XY][0-9]))*(.*?)$')

print _sfp.search('testX1Y2').groups()
print _sfp.match('testX1Y2').groups()
print _sfp.search('testY1X2again').groups()
print _sfp.search('testX1').groups()
print _sfp.search('Y2').groups()

Yields

('', 'Y2', '')
('', 'Y2', '')
('', 'X2', 'again')
('test', 'X1', '')
('', 'Y2', '')

Note that in the first three outputs, the string 'test'
doesn't appear in any group.

Note also that 'X1' doesn't appear in any group in
the first two outputs.

The RE is anchored by ^ and $, so it should either match
the entire string or fail. It doesn't fail.
All the elements of the RE are in parentheses, so
every part of the text should show up in one group
or another.

----------------------------------------------------------------------

>Comment By: Tim Peters (tim_one)
Date: 2003-10-30 10:36

Message:

These results look correct to me.  When a capturing group 
matches more than once in the course of a match, only the 
last substring it matched is captured.  Start by understanding 
a simpler regexp:

>>> import re
>>> p = re.compile(r"^(a)+$")
>>> m = p.search("aaaaa")
>>> m.group(1)
'a'
>>> m.span(1)
(4, 5)
>>> # so group 1 last matched "aaaaa"[4:5], the last 'a' in the string

When you understand that, the behavior of your more-
complicated regexp should become clear.  In exactly 
analogous fashion, your groups capture the last substrings 
they matched:

>>> m = _sfp.search('testX1Y2')
>>> m.span(1)  # last matched empty string between 1 and Y
(6, 6)
>>> m.span(2)  # last matched Y2
(6, 8)
>>> m.span(3)  # last matched empty string at end of string
(8, 8)
>>> 
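
One way to see every repetition, rather than only the last one, is to
drop the outer repeated group and let finditer() walk the string.
The following is a minimal sketch, not part of the original report:
split_fields is a hypothetical helper name, and it is written for
Python 3 (print as a function) rather than 2.3.

```python
import re

# One (prefix, tag) unit of the original pattern, without the outer (...)*.
_piece = re.compile(r'(.*?)([XY][0-9])')

def split_fields(text):
    """Return ([(prefix, tag), ...], tail) covering every character of text.

    Hypothetical helper for illustration only.
    """
    pairs = []
    pos = 0
    for m in _piece.finditer(text):
        pairs.append(m.groups())   # one tuple per repetition, none overwritten
        pos = m.end()
    return pairs, text[pos:]       # whatever follows the last tag

print(split_fields('testX1Y2'))       # ([('test', 'X1'), ('', 'Y2')], '')
print(split_fields('testY1X2again'))  # ([('test', 'Y1'), ('', 'X2')], 'again')
```

Each (prefix, tag) pair corresponds to one pass through the repeated
group in the original pattern, and the tail plays the role of the
final (.*?).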

----------------------------------------------------------------------

Comment By: Greg Kochanski (gpk)
Date: 2003-10-30 10:31

Message:

Here's another example:

import re

_sfp = re.compile(r'([][][0-9])|(.*?)')

print _sfp.findall('testX1Y2')
print _sfp.findall('testY1X2again')
print _sfp.findall('testX1')
print _sfp.findall('Y2')

yields:

[('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''),
('', ''), ('', ''), ('', '')]
[('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''),
('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''),
('', ''), ('', '')]
[('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''),
('', '')]
[('', ''), ('', ''), ('', '')]

Which is weird in many dimensions.
First of all, the RE I specified could be interpreted to match
an infinite number of times.  After all, it contains a
(.*?) element, which should be happiest with a zero-length
match, so one can stack up an infinite number of zero-length
matches in any finite string.

So, why is it finite?

Probably it should throw some kind of exception in
the compile phase.

----------------------------------------------------------------------
