Why this result with the re module

John Bond lists at asd-group.com
Tue Nov 2 01:30:59 EDT 2010


On 2/11/2010 4:31 AM, Yingjie Lan wrote:
> Hi, I am rather confused by these results below.
> I am not a re expert at all. the module version
> of re is 2.2.1 with python 3.1.2
>
> >>> import re
> >>> re.findall('.a.', 'Mary has a lamb') #OK
> ['Mar', 'has', ' a ', 'lam']
> >>> re.findall('(.a.)*', 'Mary has a lamb') #??
> ['Mar', '', '', 'lam', '', '']
> >>> re.findall('(.a.)+', 'Mary has a lamb') #??
> ['Mar', 'lam']
>
>
> Thanks in advance for any comments.
>
> Yingjie
>
>
>

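It's because you're using capturing groups, and because of how they work 
- specifically, a group that is repeated and matches more than once only 
keeps the LAST thing it matched. And when the pattern contains a group, 
findall returns the group's value rather than the whole match.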

For example, take your last example (the one with '+') and make the 
group non-capturing:

re.findall('(?:.a.)+', 'Mary has a lamb')

['Mar', 'has a lam']

That shows you there are two matches:
1) a three-character one at the start of the string (matching one 
occurrence of '.a.'), and
2) a nine-character one a bit later in the string (matching three 
occurrences of '.a.')
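
To see where those two matches sit in the string, a small sketch with 
re.finditer may help. It yields match objects rather than plain strings, 
so the positions are available. The variable names are only for 
illustration:

```python
import re

text = 'Mary has a lamb'

# finditer yields match objects, so we can see where each match sits.
spans = [(m.group(0), m.span()) for m in re.finditer('(?:.a.)+', text)]

for matched, (start, end) in spans:
    # matched text, start index, end index, length
    print(repr(matched), start, end, end - start)

# 'Mar' 0 3 3
# 'has a lam' 5 14 9
```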

Turn that back into a capturing group:

re.findall('(.a.)+', 'Mary has a lamb')

['Mar', 'lam']

You still have the same two matches as before, but in using the 
capturing group you're telling findall to return the group's value for 
each match (not what's actually matched overall).  That doesn't affect 
the first result as it matched a single occurrence of what's in the 
group ('Mar').

But the second one matched three occurrences of what's in the group 
('has', ' a ', and 'lam'), and the nature of capturing groups is that 
they only return the last match, so the second returned value is now 
just 'lam'.
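
The difference between the overall match and the group's value can be 
put side by side with match objects: group(0) is the whole match and 
group(1) is whatever the group held after its final repetition. The 
second half is one possible workaround (my suggestion, not the only 
option) if you want findall to hand back the whole run:

```python
import re

text = 'Mary has a lamb'

# group(0) is the whole match; group(1) is what the capturing group
# held after its final repetition.
pairs = [(m.group(0), m.group(1)) for m in re.finditer('(.a.)+', text)]
print(pairs)
# [('Mar', 'Mar'), ('has a lam', 'lam')]

# One way to get the whole run back from findall: make the repeated
# group non-capturing and wrap the repetition in a capturing group.
whole = re.findall('((?:.a.)+)', text)
print(whole)
# ['Mar', 'has a lam']
```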

So - see if you can explain the first "problematic" result now.
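
If you want to check your answer afterwards, the sketch below prints 
the span and group value of every match for the '*' pattern. Two things 
to keep in mind when reading its output: with '*' the pattern is allowed 
to match zero characters, and in that case the group took no part in the 
match, so group(1) is None (findall reports that as ''). I have left the 
output out on purpose, and the exact handling of empty matches may 
differ between Python versions:

```python
import re

text = 'Mary has a lamb'

# With '*', the pattern may also match an empty string, so look at the
# spans as well as the group values.
results = [(m.span(), m.group(1)) for m in re.finditer('(.a.)*', text)]

for span, group in results:
    print(span, repr(group))
```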




