the buggy regex in Python

MRAB python at mrabarnett.plus.com
Thu Nov 25 11:30:12 EST 2010


On 25/11/2010 11:32, Yingjie Lan wrote:
> I know many experts will say I don't have understanding...but let me pay this up front as my tuition.
>
> Here are some puzzling results I got (I am using Python 3; I suppose Python 2 gives similar results).
>
> When I do the following, I get an exception:
> >>> re.findall('(d*)*', 'adb')
> >>> re.findall('((d)*)*', 'adb')
>
A repeated repeat can cause problems if what is repeated can match an
empty string. The "re" module tries to protect itself by forbidding
such a regex. The "regex" module (available from PyPI) accepts that
regex and returns a result.
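
If you want to check how your own interpreter treats these patterns, a
minimal sketch along these lines should do. The helper name compiles_ok
is my own invention, and whether the nested-repeat patterns are accepted
depends on the Python version, so no result is shown for them.

```python
import re

def compiles_ok(pattern):
    """Return True if the stdlib re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised for patterns that re refuses to compile.
        return False

# The three patterns from the question.
for pattern in ('(d*)*', '((d)*)*', '((.d.)*)*'):
    print(repr(pattern), compiles_ok(pattern))
```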

> When I do this, I am fine but the result is wrong:
> >>> re.findall('((.d.)*)*', 'adb')
> [('', 'adb'), ('', '')]
>
> Why is it wrong?
>
> The first match of groups:
> ('', 'adb')
> indicates the outer group ((.d.)*) captured
> the empty string, while the inner group (.d.)
> captured 'adb', so the outer group must have
> captured the empty string at the end of the
> provided string 'adb'.
>
> Once we have matched the final empty string '',
> there should be no more matches, but we got
> another match ('', '')!!!
>
> So, findall matched the empty string at
> the end of the string twice!!!
>
re.findall performs multiple searches, each starting where the previous
one finished. The first match started at the start of the string and
finished at its end. The second match started at that point (the end of
the string) and found another match, ending at the end of the string.
It tried to match a third time, but that failed because it would have
matched an empty string again (it's not allowed to return 2 contiguous
empty strings).
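
To see the individual searches, re.finditer exposes where each match
starts and ends. This is only a sketch: the helper name spans is made up
for illustration, and I use the simpler pattern 'a*' for the shown
results because they are easy to verify by hand.

```python
import re

def spans(pattern, text):
    """Return the (start, end) span of each successive match."""
    return [m.span() for m in re.finditer(pattern, text)]

# 'a*' first matches 'aa' (0 to 2), then matches the empty string at
# the end of the text (2 to 2); the search then stops.
print(spans('a*', 'aa'))       # [(0, 2), (2, 2)]
print(re.findall('a*', 'aa'))  # ['aa', '']

# The pattern from the question; inspect the spans on your own version.
print(spans('((.d.)*)*', 'adb'))
```

The second span being empty and sitting at the end of the text is the
"extra" match that findall reports.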

> Isn't this a bug?
>
No, but it can be confusing at times! :-)


