[Python-bugs-list] [ python-Bugs-833106 ] Match delimited by ^ and $ doesn't catch everything
SourceForge.net
noreply at sourceforge.net
Thu Oct 30 10:36:48 EST 2003
Bugs item #833106, was opened at 2003-10-30 10:08
Message generated for change (Comment added) made by tim_one
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=833106&group_id=5470
Category: Regular Expressions
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Greg Kochanski (gpk)
>Assigned to: Nobody/Anonymous (nobody)
Summary: Match delimited by ^ and $ doesn't catch everything
Initial Comment:
import re
_sfp = re.compile(r'^(?:(.*?)([XY][0-9]))*(.*?)$')
print _sfp.search('testX1Y2').groups()
print _sfp.match('testX1Y2').groups()
print _sfp.search('testY1X2again').groups()
print _sfp.search('testX1').groups()
print _sfp.search('Y2').groups()
Yields
('', 'Y2', '')
('', 'Y2', '')
('', 'X2', 'again')
('test', 'X1', '')
('', 'Y2', '')
Note that in the first three outputs, the string 'test'
doesn't appear in any group.
Note also that 'X1' doesn't appear in any group in
the first two outputs.
The RE is delimited by ^ and $, so it should match
everything or fail. It doesn't fail.
All the elements of the RE are in parentheses, so
everything should fall into one group or another.
Thus, all the text should show up in one group
or another.
----------------------------------------------------------------------
>Comment By: Tim Peters (tim_one)
Date: 2003-10-30 10:36
Message:
Logged In: YES
user_id=31435
These results look correct to me. When a capturing group
matches more than once in the course of a match, only the
last substring it matched is captured. Start by understanding
a simpler regexp:
>>> import re
>>> p = re.compile(r"^(a)+$")
>>> m = p.search("aaaaa")
>>> m.group(1)
'a'
>>> m.span(1)
(4, 5)
>>> # so group 1 last matched "aaaaa"[4:5], the last 'a' in the string
When you understand that, the behavior of your more
complicated regexp should become clear. In exactly
analogous fashion, your groups capture the last substrings
they matched:
>>> m = _sfp.search('testX1Y2')
>>> m.span(1) # last matched empty string between 1 and Y
(6, 6)
>>> m.span(2) # last matched Y2
(6, 8)
>>> m.span(3) # last matched empty string at end of string
(8, 8)
>>>
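If the goal is to see every repetition rather than only the last one,
one possible approach is to iterate over the inner pattern with
finditer() instead of repeating a capturing group. This is only a
sketch, written in Python 3 syntax (the report used Python 2.3); the
helper name split_tokens is hypothetical and is not part of the report
or of the re module.

```python
import re

# Inner pattern from the report: lazily matched text followed by X or Y and a digit.
_token = re.compile(r'(.*?)([XY][0-9])')

def split_tokens(text):
    """Return ([(prefix, token), ...], tail) for every token found in text.

    Hypothetical helper for illustration only.
    """
    pieces = []
    pos = 0
    for m in _token.finditer(text):
        # Each finditer() match is a separate match object, so no
        # earlier capture is overwritten by a later repetition.
        pieces.append((m.group(1), m.group(2)))
        pos = m.end()
    # Whatever follows the last token plays the role of the trailing (.*?).
    return pieces, text[pos:]

print(split_tokens('testX1Y2'))
print(split_tokens('testY1X2again'))
```

With this, 'test' and 'X1' from 'testX1Y2' are both kept, because each
iteration's groups are read before the next match is attempted.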
----------------------------------------------------------------------
Comment By: Greg Kochanski (gpk)
Date: 2003-10-30 10:31
Message:
Logged In: YES
user_id=6290
Here's another example:
import re
_sfp = re.compile(r'([][][0-9])|(.*?)')
print _sfp.findall('testX1Y2')
print _sfp.findall('testY1X2again')
print _sfp.findall('testX1')
print _sfp.findall('Y2')
yields:
[('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''),
('', ''), ('', ''), ('', '')]
[('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''),
('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''),
('', ''), ('', '')]
[('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''),
('', '')]
[('', ''), ('', ''), ('', '')]
Which is weird in many dimensions.
First of all, the RE I specified could be interpreted to match
an infinite number of times. After all, it contains a
(.*?) element, which should be happiest with a zero-length
match, so one can stack up an infinite number of zero-length
matches in any finite string.
So, why is it finite?
Probably it should throw some kind of exception in
the compile phase.
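On why the result is finite: as far as I understand it, the scanning
functions report an empty match at most once per position and then
move forward, so the number of empty matches is bounded by
len(string) + 1. The sketch below, in Python 3 syntax, uses a pattern
that can only match the empty string in the test input, because the
handling of patterns that may also match non-empty text (such as
(.*?)) has varied between Python versions. The helper name
empty_match_spans is hypothetical.

```python
import re

def empty_match_spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text.

    Hypothetical helper for illustration only.
    """
    return [m.span() for m in re.finditer(pattern, text)]

# 'x*' can match only the empty string here, since 'abc' contains no 'x'.
# One empty match is reported per position, including the end of the string.
print(empty_match_spans(r'x*', 'abc'))

# An 8-character string therefore yields 9 empty matches,
# the same count as the first findall() result above.
print(len(empty_match_spans(r'x*', 'testX1Y2')))
```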
----------------------------------------------------------------------