Why this result with the re module

Yingjie Lan lanyjie at yahoo.com
Tue Nov 2 13:19:15 CET 2010


> From: John Bond <lists at asd-group.com>
> Subject: Re: Why this result with the re module

Firstly, thanks a lot for your patient explanation.
this time I have understood all your points perfectly.

Secondly, I'd like to clarify some of my points, which 
did not get through because of my poor presentation.

I suggested findall return a tuple of re.MatchObject(s),
with each MatchObject instance representing a match.
This is consistent with the re.match() function anyway.
And it will eliminate the need of returning tuples,
and it is much more precise and information rich.

If that's not possible, and a tuple must be returned,
then the whole match (not just subgroups) should
always be included as the first element in the tuple,
as that's group(0) or '\0'. Less surprise would arise.

Finally, it seems to me the algo for findall is WRONG.

To re.findall('(.a.)*', 'Mary has a lamb'), 
by reason of greediness of '*', and the requirement 
of non-overlapping, it should go like this
(suppose an '' is at the beginning and at the end,
and between two consecutive characters there is
one and only one empty string ''. To show the
match of empty strings clearly,
I am concatenating each repeated match below):

Steps for re.findall('(.a.)*', 'Mary has a lamb'):

step 1: Match '' + 'Mar' + '' (gready!)
step 2: skip 'y'
step 3: Match ''
step 4: skip ' '
step 5: Match ''+'has'+' a '+'lam'+'' (greedy!)
step 6: skip 'b'
step 7: Match ''

So there should be exactly 4 matches in total:

'Mar', '', 'has a lam', ''

Also, the matches above shows
that if a repeated subgroup only captures 
the last match, the subgroup (.a.)* 
should always capture '' here (see steps 
1, 3, 5, 7) above.

Yet the execution in Python results in 6 matches!
And, the capturing subgroup with repetition 
sometimes got the wrong guy.

So I believe the algorithm for findall must be WRONG.

Regards,

Yingjie



      



More information about the Python-list mailing list