Must be a bug in the re module [was: Why this result with the re module]

MRAB python at mrabarnett.plus.com
Tue Nov 2 23:01:17 EDT 2010


On 03/11/2010 01:41, Yingjie Lan wrote:
>> From: John Bond<lists at asd-group.com>
>> Subject: Re: Why this result with the re module
>> To: "Yingjie Lan"<lanyjie at yahoo.com>
>> Cc: python-list at python.org
>> Date: Tuesday, November 2, 2010, 8:09 PM
>> On 2/11/2010 12:19 PM, Yingjie Lan
>> wrote:
>>>> From: John Bond<lists at asd-group.com>
>>>> Subject: Re: Why this result with the re module
>>> Firstly, thanks a lot for your patient explanation.
>>> this time I have understood all your points
>> perfectly.
>>>
>>> Secondly, I'd like to clarify some of my points,
>> which
>>> did not get through because of my poor presentation.
>>>
>>> I suggested findall return a tuple of
>> re.MatchObject(s),
>>> with each MatchObject instance representing a match.
>>> This is consistent with the re.match() function
>> anyway.
>>> And it will eliminate the need of returning tuples,
>>> and it is much more precise and information rich.
>>>
>>> If that's not possible, and a tuple must be returned,
>>> then the whole match (not just subgroups) should
>>> always be included as the first element in the tuple,
>>> as that's group(0) or '\0'. Less surprise would
>> arise.
>>>
>>> Finally, it seems to me the algo for findall is
>> WRONG.
>>>
>>> To re.findall('(.a.)*', 'Mary has a lamb'),
>>> by reason of greediness of '*', and the requirement
>>> of non-overlapping, it should go like this
>>> (suppose an '' is at the beginning and at the end,
>>> and between two consecutive characters there is
>>> one and only one empty string ''. To show the
>>> match of empty strings clearly,
>>> I am concatenating each repeated match below):
>>>
>>> Steps for re.findall('(.a.)*', 'Mary has a lamb'):
>>>
>>> step 1: Match '' + 'Mar' + '' (gready!)
>>> step 2: skip 'y'
>>> step 3: Match ''
>>> step 4: skip ' '
>>> step 5: Match ''+'has'+' a '+'lam'+'' (greedy!)
>>> step 6: skip 'b'
>>> step 7: Match ''
>>>
>>> So there should be exactly 4 matches in total:
>>>
>>> 'Mar', '', 'has a lam', ''
>>>
>>> Also, the matches above shows
>>> that if a repeated subgroup only captures
>>> the last match, the subgroup (.a.)*
>>> should always capture '' here (see steps
>>> 1, 3, 5, 7) above.
>>>
>>> Yet the execution in Python results in 6 matches!
>>> And, the capturing subgroup with repetition
>>> sometimes got the wrong guy.
>>>
>>> So I believe the algorithm for findall must be WRONG.
>>>
>>> Regards,
>>>
>>> Yingjie
>> At a guess, I'd say what is happening is something like
>> this:
>>
>> Steps for re.findall('(.a.)*', 'Mary has a lamb'):
>>
>> step 1: Match 'Mar' at string index 0
>> step 2: Match '' at string index 3 (before 'y')
>> step 3: skip 'y'
>> step 4: Match '' at string index 4 (before ' ')
>> step 5: skip ' '
>> step 6: Match 'has a lam' at string index 5
>> step 7: Match '' at string index 14 (before 'b')
>> step 8: skip 'b'
>> step 9: Match '' at string index 15 (before EOS)
>> <EOS reached>
>>
>>
>> matches: ('Mar', '', '', 'has a lam', '', '')
>> returns: ['Mar', '', '', 'lam', '', ''] (*)
>>
>> (*) "has a " lost due to not being last repetition at that
>> match point
>>
>> Which seems about right to me! Greediness has nothing to do
>> with it, except that it causes 'has a lam' to be matched in
>> one match, instead of as three separate matches (of 'has', '
>> a ' and 'lam').
>
> Disagree in this case, where the whole regex
> matches an empty string. Greadiness will match
> as much as possible. So it will also match
> the empty strings between consecutive
> characters as much as possible, once
> we have properly defined all the unique
> empty strings. Because of greadiness,
> fewer matches should be found. In this
> case, it should find only 4 matches
> (shown in my steps) instead of 6 matches
> (shown in your steps).
>
Attempt 1 starts at the beginning of the string:
[Mar]y has a lamb
Matches once, returns 'Mar'

Attempt 2 starts where the previous match ended:
Mar[]y has a lamb
Matches an empty string, returns ''

Attempt 3 starts where the previous match ended:
Mary[] has a lamb
Matches an empty string, returns ''

Attempt 4 starts where the previous match ended:
Mary [has][ a ][lam]b
Matches 3 times, returns 'lam'

Attempt 5 starts where the previous match ended:
Mary has a lam[]b
Matches an empty string, returns ''

Attempt 6 starts where the previous match ended:
Mary has a lamb[]
Matches an empty string, returns ''

The result is therefore ['Mar', '', '', 'lam', '', '']



More information about the Python-list mailing list