[Tutor] re.findall() weirdness. [looks like a bug!]
Danny Yoo
dyoo@hkn.eecs.berkeley.edu
Tue, 26 Jun 2001 18:43:20 -0700 (PDT)
On Wed, 27 Jun 2001, Dan Tropp wrote:
> I tried these in my python shell. Why do the last two give what they do?
>
> >>> print re.findall('<.*?>','<a> </a> <a> </a>')
> ['<a>', '</a>', '<a>', '</a>']
> >>> print re.findall('<.*?>','<1> </2> \n<3> </4>')
> ['<1>', '</2>', '<3>', '</4>']
> >>> print re.findall('<.*?>','<1> </2> \n<3> </4>', re.I|re.S)
> []
> >>> print re.findall('<.*?>','<1> </2> \n<3> </4>', re.I)
> ['</2>', '<3>', '</4>']
Now this is curious, because according to the documentation at:
http://python.org/doc/current/lib/Contents_of_Module_re.html
re.findall() is only supposed to take in two arguments. In fact, in
Python 1.5.2, Python complains that:
###
# in Python 1.5.2:
>>> print re.findall('<.*?>','<1> </2> \n<3> </4>', re.I)
Traceback (innermost last):
  File "<stdin>", line 1, in ?
TypeError: too many arguments; expected 2, got 3
###
Let me check if the same behavior happens in 2.1:
###
# in Python 2.1
>>> re.findall('<.*?>','<1> </2> \n<3> </4>', re.I)
['</2>', '<3>', '</4>']
###
Now that is weird! This looks like it might be a bug. Let's take a look
at the source code, to see why it's doing that.
###
## source code in sre.py
def findall(pattern, string, maxsplit=0):
    """Return a list of all non-overlapping matches in the string.
    If one or more groups are present in the pattern, return a
    list of groups; this will be a list of tuples if the pattern
    has more than one group.
    Empty matches are included in the result."""
    return _compile(pattern, 0).findall(string, maxsplit)
###
Weird! findall() in its current incarnation does take in a third
argument, contrary to the HTML documentation. But this makes no sense to
me. Why should findall need a maxsplit parameter, when maxsplit is
something that the split()ing operator works with? This really looks like
a bug to me.
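One possible reading of Dan's output, not confirmed here against the underlying C code: if that extra argument ends up being used as a starting offset into the string, then re.I (integer value 2) would skip the first two characters, and re.I|re.S (2|16 = 18) would start at the very end of this 18-character test string. The sketch below only illustrates that idea. It assumes a Python where a compiled pattern's findall() accepts a start position as its second argument, and findall_from() is a hypothetical helper, not something in sre.py.

```python
import re

text = '<1> </2> \n<3> </4>'
tag = re.compile('<.*?>')

# Hypothetical helper: use a flag constant's integer value as a start
# offset, which is what the observed results are consistent with.
def findall_from(offset):
    return tag.findall(text, int(offset))

print(len(text))                   # length of the test string
print(findall_from(re.I))          # search starts at index 2
print(findall_from(re.I | re.S))   # search starts at index 18
```

If this reading is right, the missing '<1>' and the empty list both come from where the search starts, not from the flags changing how the pattern matches.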
Hmmm... well, the definition of findall() is adjacent to split(), so
perhaps someone made a mistake and accidentally added maxsplit as an
argument. I believe that the corrected code in sre.py should be:
###
def findall(pattern, string):
    """Return a list of all non-overlapping matches in the string.
    If one or more groups are present in the pattern, return a
    list of groups; this will be a list of tuples if the pattern
    has more than one group.
    Empty matches are included in the result."""
    return _compile(pattern, 0).findall(string)
###
instead.
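If the aim was to apply flags such as re.I or re.S, one way that does not depend on a third argument to re.findall() is to give the flags to re.compile() and then call findall() on the compiled object. A minimal sketch using Dan's test string (the variable names are just for illustration):

```python
import re

text = '<1> </2> \n<3> </4>'

# Flags go to re.compile(); findall() then only needs the string.
tag = re.compile('<.*?>', re.I | re.S)
matches = tag.findall(text)
print(matches)
```

Because the pattern is non-greedy, re.S should not change the result for this string: each match still stops at the first '>'.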
Ever since June 1, 2000, the findall() code in sre.py has contained this
weird behavior:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Lib/sre.py?rev=1.5&content-type=text/vnd.viewcvs-markup
and even in the current development sources, it still has it!
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Lib/sre.py?rev=1.25.2.1&content-type=text/vnd.viewcvs-markup
Dan, I think we should report this to the Implementors and see what they
think about it. Good catch! *grin* Do you want to submit this to
SourceForge?