[Tutor] re.findall() weirdness. [looks like a bug!]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Tue, 26 Jun 2001 18:43:20 -0700 (PDT)


On Wed, 27 Jun 2001, Dan Tropp wrote:

> I tried these in my python shell. Why do the last two give what they do?
> 
> >>> print re.findall('<.*?>','<a> </a> <a> </a>')
> ['<a>', '</a>', '<a>', '</a>']
> >>> print re.findall('<.*?>','<1> </2> \n<3> </4>')
> ['<1>', '</2>', '<3>', '</4>']
> >>> print re.findall('<.*?>','<1> </2> \n<3> </4>', re.I|re.S)
> []
> >>> print re.findall('<.*?>','<1> </2> \n<3> </4>', re.I)
> ['</2>', '<3>', '</4>']

Now this is curious, because according to the documentation at:

    http://python.org/doc/current/lib/Contents_of_Module_re.html

re.findall() is only supposed to take two arguments.  In fact, in
Python 1.5.2, Python complains that:

###
# in Python 1.5.2:
>>> print re.findall('<.*?>','<1> </2> \n<3> </4>', re.I)
Traceback (innermost last):
  File "<stdin>", line 1, in ?
TypeError: too many arguments; expected 2, got 3
###


Let me check if the same behavior happens in 2.1:

###
# in Python 2.1
>>> re.findall('<.*?>','<1> </2> \n<3> </4>', re.I)
['</2>', '<3>', '</4>']
###

Now that is weird!  This looks like it might be a bug.  Let's take a look
at the source code, to see why it's doing that.


###
## source code in sre.py
def findall(pattern, string, maxsplit=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more groups are present in the pattern, return a
    list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, 0).findall(string, maxsplit)
###

Weird!  findall() in its current incarnation does take a third
argument, contrary to the HTML documentation.  But this makes no sense to
me.  Why should findall() need a maxsplit parameter, when maxsplit is
something that only split() works with?  This really looks like a bug to
me.
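
Here is a guess at why Dan saw those particular results.  It assumes
(I have not verified this against the 2.1 internals) that the stray
third argument ends up as the starting position of the search in the
compiled pattern's findall().  If so, the flag constants are simply being
read as string offsets: re.I is 2, and re.I|re.S is 18, which happens to
be exactly the length of the test string.  The sketch below uses
print() as a function so that it runs on a current interpreter, where
compiled patterns accept a start position as their second argument.  The
variable names are made up for illustration.

```python
import re

text = '<1> </2> \n<3> </4>'
tag = re.compile('<.*?>')

# The flag constants are plain integers underneath.
offset_i = int(re.I)           # 2
offset_is = int(re.I | re.S)   # 2 | 16 == 18

# Assumption: the extra argument is treated as a start position.
# Starting at offset 2 skips the '<' of '<1>', so '<1>' is never found.
from_pos_2 = tag.findall(text, offset_i)

# Offset 18 is the end of the 18-character string: nothing left to scan.
from_pos_18 = tag.findall(text, offset_is)

print(from_pos_2)    # ['</2>', '<3>', '</4>']
print(from_pos_18)   # []
```

Both results match what Dan's last two calls printed, which is at least
consistent with the flags being mistaken for an offset.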


Hmmm... well, the definition of findall() is adjacent to split(), so
perhaps someone made a mistake and accidentally added maxsplit as an
argument.  I believe that the corrected code in sre.py should be:

###
def findall(pattern, string):
    """Return a list of all non-overlapping matches in the string.

    If one or more groups are present in the pattern, return a
    list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, 0).findall(string)
###

instead.
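
If the intent is to let callers pass flags such as re.I, another option
would be a wrapper that hands them to the compile step rather than to the
search.  This is only a sketch of that idea: the name findall_with_flags
and its flags parameter are hypothetical, and it is not what sre.py
contains.

```python
import re

def findall_with_flags(pattern, string, flags=0):
    """Hypothetical helper: return all non-overlapping matches.

    The flags are applied when the pattern is compiled, so they can
    never be mistaken for a position or a split count.
    """
    return re.compile(pattern, flags).findall(string)

text = '<1> </2> \n<3> </4>'
print(findall_with_flags('<.*?>', text))
print(findall_with_flags('<.*?>', text, re.I | re.S))
```

With the flags routed this way, both calls should report all four tags,
since neither re.I nor re.S changes what this non-greedy pattern matches
in that string.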

Ever since June 1, 2000, the findall() code in sre.py has contained this
weird behavior:

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Lib/sre.py?rev=1.5&content-type=text/vnd.viewcvs-markup

and even in the current development sources, it still has it!

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Lib/sre.py?rev=1.25.2.1&content-type=text/vnd.viewcvs-markup


Dan, I think we should report this to the Implementors and see what they
think about it.  Good catch!  *grin*  Do you want to submit this to
SourceForge?