Why this result with the re module
lists at asd-group.com
Tue Nov 2 17:09:59 CET 2010
On 2/11/2010 12:19 PM, Yingjie Lan wrote:
>> From: John Bond<lists at asd-group.com>
>> Subject: Re: Why this result with the re module
> Firstly, thanks a lot for your patient explanation.
> this time I have understood all your points perfectly.
> Secondly, I'd like to clarify some of my points, which
> did not get through because of my poor presentation.
> I suggested findall return a tuple of re.MatchObject(s),
> with each MatchObject instance representing a match.
> This is consistent with the re.match() function anyway.
> And it will eliminate the need of returning tuples,
> and it is much more precise and information rich.
> If that's not possible, and a tuple must be returned,
> then the whole match (not just subgroups) should
> always be included as the first element in the tuple,
> as that's group(0) or '\0'. Less surprise would arise.
> Finally, it seems to me the algo for findall is WRONG.
> To re.findall('(.a.)*', 'Mary has a lamb'),
> by reason of greediness of '*', and the requirement
> of non-overlapping, it should go like this
> (suppose an '' is at the beginning and at the end,
> and between two consecutive characters there is
> one and only one empty string ''. To show the
> match of empty strings clearly,
> I am concatenating each repeated match below):
> Steps for re.findall('(.a.)*', 'Mary has a lamb'):
> step 1: Match '' + 'Mar' + '' (gready!)
> step 2: skip 'y'
> step 3: Match ''
> step 4: skip ' '
> step 5: Match ''+'has'+' a '+'lam'+'' (greedy!)
> step 6: skip 'b'
> step 7: Match ''
> So there should be exactly 4 matches in total:
> 'Mar', '', 'has a lam', ''
> Also, the matches above shows
> that if a repeated subgroup only captures
> the last match, the subgroup (.a.)*
> should always capture '' here (see steps
> 1, 3, 5, 7) above.
> Yet the execution in Python results in 6 matches!
> And, the capturing subgroup with repetition
> sometimes got the wrong guy.
> So I believe the algorithm for findall must be WRONG.
At a guess, I'd say what is happening is something like this:
Steps for re.findall('(.a.)*', 'Mary has a lamb'):
step 1: Match 'Mar' at string index 0
step 2: Match '' at string index 3 (before 'y')
step 3: skip 'y'
step 4: Match '' at string index 4 (before ' ')
step 5: skip ' '
step 6: Match 'has a lam' at string index 5
step 7: Match '' at string index 14 (before 'b')
step 8: skip 'b'
step 9: Match '' at string index 15 (before EOS)
matches: ('Mar', '', '', 'has a lam', '', '')
returns: ['Mar', '', '', 'lam', '', ''] (*)
(*) "has a " lost due to not being last repetition at that match point
Which seems about right to me! Greediness has nothing to do with it, except that it causes 'has a lam' to be matched in one match, instead of as three separate matches (of 'has', ' a ' and 'lam').
More information about the Python-list