Why this result with the re module
John Bond
lists at asd-group.com
Tue Nov 2 01:30:59 EDT 2010
On 2/11/2010 4:31 AM, Yingjie Lan wrote:
> Hi, I am rather confused by these results below.
> I am not a re expert at all. the module version
> of re is 2.2.1 with python 3.1.2
>
> >>> import re
> >>> re.findall('.a.', 'Mary has a lamb') #OK
> ['Mar', 'has', ' a ', 'lam']
> >>> re.findall('(.a.)*', 'Mary has a lamb') #??
> ['Mar', '', '', 'lam', '', '']
> >>> re.findall('(.a.)+', 'Mary has a lamb') #??
> ['Mar', 'lam']
>
>
> Thanks in advance for any comments.
>
> Yingjie
>
>
>
It's because you're using capturing groups, and because of how they work
- specifically, a repeated capturing group only keeps the LAST repetition
it matched (when it matches more than once).
For example, take the second problematic example (the '+' one) and make
the group non-capturing:
re.findall('(?:.a.)+', 'Mary has a lamb')
['Mar', 'has a lam']
That shows you there are two matches:
1) a three-character one at the start of the string (matching one
occurrence of '.a.'), and
2) a nine-character one a bit later in the string (matching three
occurrences of '.a.')
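If you want to see where those two matches sit in the string, here is a
small sketch using re.finditer, which yields match objects instead of
plain strings. The variable names (text, spans) are just mine for
illustration.

```python
import re

text = 'Mary has a lamb'

# Record (start, end, matched text) for each overall match of the
# non-capturing pattern.
spans = [(m.start(), m.end(), m.group(0))
         for m in re.finditer('(?:.a.)+', text)]

print(spans)  # [(0, 3, 'Mar'), (5, 14, 'has a lam')]
```

The second match runs from index 5 to 14, i.e. nine characters made up
of three back-to-back repetitions of '.a.'.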
Turn that back into a capturing group:
re.findall('(.a.)+', 'Mary has a lamb')
['Mar', 'lam']
You still have the same two matches as before, but by using a
capturing group you're telling findall to return the group's value for
each match (not what was actually matched overall). That doesn't affect
the first result, as it matched a single occurrence of what's in the
group ('Mar').
But the second one matched three occurrences of what's in the group
('has', ' a ', and 'lam'), and the nature of capturing groups is that
they only keep the last repetition, so the second returned value is now
just 'lam'.
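To put the overall match and the group's value side by side, something
like the following should do. It is only a sketch, and the names (text,
pairs) are mine; group(0) is the whole match, group(1) is whatever the
first capturing group held when the match finished.

```python
import re

text = 'Mary has a lamb'

# For each match, pair the full matched text with the final value
# of capturing group 1.
pairs = [(m.group(0), m.group(1))
         for m in re.finditer('(.a.)+', text)]

print(pairs)  # [('Mar', 'Mar'), ('has a lam', 'lam')]
```

findall with a single capturing group hands back only the second item
of each pair, which is why you see ['Mar', 'lam'].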
So - see if you can explain the first "problematic" result now.
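If you'd like to check your explanation afterwards, one way (just a
sketch, the variable names are mine) is to run the '*' pattern with a
non-capturing group and compare it with the capturing version. Keep in
mind that '*' also allows zero repetitions, so it can match the empty
string at positions where '.a.' doesn't fit.

```python
import re

text = 'Mary has a lamb'

# Overall matches: the group is non-capturing, so findall returns
# the whole matched text each time (possibly empty).
overall = re.findall('(?:.a.)*', text)

# Group values: findall returns what group 1 last captured, or ''
# when the group did not take part in the match.
captured = re.findall('(.a.)*', text)

print(overall)   # ['Mar', '', '', 'has a lam', '', '']
print(captured)  # ['Mar', '', '', 'lam', '', '']
```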