Bug in regular expressions?

Christophe Delord christophe.delord at free.fr
Fri May 17 11:55:27 EDT 2002


Hi,

I thought that regular expressions were greedy, so that the longest match is returned by match().
Consider these expressions: 'a|aa', 'aa|a' and 'aa?'
Each of them can match both 'a' and 'aa', so they should be equivalent.
When applied to 'aa', match() only matches the first 'a' when using the first regular expression ('a|aa').

>>> import re
>>> p=re.compile('a|aa')
>>> p.match('aa').span()
(0, 1)                           <- 'aa' (2 chars) should have been matched?
>>> p=re.compile('aa|a')
>>> p.match('aa').span()
(0, 2)                           <- ok, two characters have been matched
>>> p=re.compile('aa?')
>>> p.match('aa').span()
(0, 2)                           <- ok
>>> 

So A|B and B|A are not always equivalent. When both A and B match, B is ignored even if it would match a longer text.
Is this a bug in the re module?
Is there a way to tell re to be "totally greedy"?
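
To illustrate what I mean by "totally greedy", here is a minimal sketch. It assumes a hypothetical helper, longest_match (my own name, not part of the re module), that tries each alternative separately and keeps the longest match. It only works when the alternatives are available as separate patterns, so it is not a general replacement for '|'.

```python
import re

def longest_match(alternatives, text):
    """Hypothetical helper: try each alternative on its own at the start
    of text and return the match object that ends furthest to the right
    (None if no alternative matches)."""
    best = None
    for pattern in alternatives:
        m = re.match(pattern, text)
        if m is not None and (best is None or m.end() > best.end()):
            best = m
    return best

# The order of the alternatives no longer changes the result.
print(longest_match(['a', 'aa'], 'aa').span())
print(longest_match(['aa', 'a'], 'aa').span())
```

With this helper, both orderings report the same span on 'aa', which is the behaviour I expected from 'a|aa'.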

Thanks,

--
Christophe Delord
http://christophe.delord.free.fr/


