[Python-bugs-list] [Bug #128830] re is greedy with non-greedy operator
noreply@sourceforge.net
Tue, 16 Jan 2001 01:07:59 -0800
Bug #128830, was updated on 2001-Jan-15 03:31
Here is a current snapshot of the bug.
Project: Python
Category: Regular Expressions
Status: Closed
Resolution: Wont Fix
Bug Group: Not a Bug
Priority: 5
Submitted by: beroul
Assigned to: effbot
Summary: re is greedy with non-greedy operator
Details: In the program below, the pattern "<!--.*?-->" is used to match an
SGML comment. Despite the use of the non-greedy operator '?', re fails to
find the shortest possible match, which would be the comment preceding
"<!ELEMENT bar..."; instead, it uses all the text preceding "<!ELEMENT
bar..." as the match for the comment pattern.
---
import re
dtd_text = """
<!--
The oranges attribute.
-->
<!ATTLIST foo
oranges CDATA #IMPLIED
>
<!--
The bar element.
-->
<!ELEMENT bar
(#PCDATA)
>
"""
element_pattern = re.compile(r"(?P<comment><!--.*?-->\s+)"
                             r"(?P<tag_text><!ELEMENT"
                             r"\s+.*?>)",
                             re.DOTALL)
match = element_pattern.search(dtd_text)
if match:
    print("Matched comment:")
    print("----------------")
    print(match.group("comment"))
    print("Matched tag text:")
    print("-----------------")
    print(match.group("tag_text"))
else:
    print("No match found.")
Follow-Ups:
Date: 2001-Jan-16 01:07
By: effbot
Comment:
You forgot to read the documentation for the search method: "Scan through
string looking for A LOCATION WHERE THIS REGULAR EXPRESSION PRODUCES A
MATCH, and return a corresponding MatchObject instance" (my emphasis).
The "*?" construct does exactly what the documentation says: it matches as
few characters as possible, at THAT location.
But sure, the documentation can always be improved.
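
One possible way to get the result the submitter wanted, given these semantics, is to stop the comment group from running across a "-->" terminator, so that the match attempt at the leftmost "<!--" fails and the search moves on. The sketch below assumes Python 3; the name tempered_pattern and the negative-lookahead rewrite are illustrative assumptions, not something proposed in this bug report.

```python
import re

dtd_text = """
<!--
The oranges attribute.
-->
<!ATTLIST foo
oranges CDATA #IMPLIED
>
<!--
The bar element.
-->
<!ELEMENT bar
(#PCDATA)
>
"""

# Hypothetical variant of the reported pattern: (?:(?!-->).)* consumes one
# character at a time but refuses to step onto the start of a "-->"
# terminator, so the comment group cannot swallow the ATTLIST declaration.
tempered_pattern = re.compile(
    r"(?P<comment><!--(?:(?!-->).)*-->\s+)"
    r"(?P<tag_text><!ELEMENT\s+.*?>)",
    re.DOTALL,
)

match = tempered_pattern.search(dtd_text)
print(match.group("comment"))   # the comment directly before <!ELEMENT bar
print(match.group("tag_text"))
```

With this change the attempt starting at the first comment cannot succeed, because that comment is followed by "<!ATTLIST" rather than "<!ELEMENT", so the engine advances to the second comment.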
-------------------------------------------------------
Date: 2001-Jan-15 17:10
By: beroul
Comment:
OK, in that case, it might be helpful to clarify the documentation, which
says that when you use ".*?", "as few characters as possible will be
matched", i.e. that you will in fact get the shortest possible match.
-------------------------------------------------------
Date: 2001-Jan-15 13:50
By: tim_one
Comment:
FYI, POSIX semantics also matches at the leftmost-possible position (and
then finds the longest (or shortest) possible match at that point, without
regard to the ordering of alternatives etc -- it's the latter point where
the semantics differ from Perl/Python/Emacs/etc). I don't know of any
regexp implementation that acts the way beroul expected. He could use
re.findall() to find all the comments and then pick the shortest himself,
though.
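
A minimal sketch of the re.findall() suggestion, assuming Python 3 and a condensed one-line version of the DTD text from the report (the variable names text, comments and shortest are made up for illustration):

```python
import re

# Condensed stand-in for the DTD text in the report.
text = ("<!-- The oranges attribute. --> <!ATTLIST foo oranges CDATA #IMPLIED> "
        "<!-- The bar element. --> <!ELEMENT bar (#PCDATA)>")

# Each non-overlapping comment is found separately, scanning left to right.
comments = re.findall(r"<!--.*?-->", text, re.DOTALL)

# Picking the shortest one is then an ordinary Python operation.
shortest = min(comments, key=len)

print(comments)
print(shortest)
```

Note that this picks the shortest comment anywhere in the text, which is not necessarily the one preceding a given declaration.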
-------------------------------------------------------
Date: 2001-Jan-15 09:04
By: effbot
Comment:
Python's RE search method doesn't look for the shortest possible match, it
looks for the *first* possible match.
(or in other words, Python provides Perl-style semantics, not POSIX
semantics)
-------------------------------------------------------
Date: 2001-Jan-15 03:49
By: beroul
Comment:
Actually, here's a minimal example: given the string "<a><b>foo", the
pattern "<.*?>foo" will match the entire string, when it should match only
"<b>foo".
-------------------------------------------------------
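The minimal example in the comment above can be reproduced as follows. This is a sketch assuming Python 3; the second pattern, which uses a negated character class, is an illustrative alternative and was not suggested in the thread.

```python
import re

text = "<a><b>foo"

# The search succeeds at the leftmost "<" (index 0). The non-greedy ".*?"
# then expands only as far as it must for the rest of the pattern to match.
lazy = re.search(r"<.*?>foo", text)
print(lazy.start(), lazy.group(0))    # 0 <a><b>foo

# Hypothetical alternative: forbid "<" and ">" inside the tag, so the
# attempt at index 0 fails and the search moves on to "<b>".
tight = re.search(r"<[^<>]*>foo", text)
print(tight.start(), tight.group(0))  # 3 <b>foo
```
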
For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=128830&group_id=5470