Nothing to repeat

Martin Gregorie martin at address-in-sig.invalid
Sun Jan 9 13:05:46 EST 2011


On Sun, 09 Jan 2011 16:49:35 +0000, Tom Anderson wrote:

> 
> Any thoughts on what i should do? Do i have to bite the bullet and apply
> some cleverness in my pattern generation to avoid situations like this?
>
This sort of works:
 
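import re

# Each token is "spa" followed by zero or more "m"; the outer group repeats it.
p = re.compile(r"(spam*)*")
with open("test.txt") as f:
    for line in f:
        print("input line: %s" % line.strip())
        # findall returns the text of group 1 for every match, including
        # the empty matches found where the pattern matches nothing.
        for m in p.findall(line):
            if m != "":
                print("==> %s" % m)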

when I feed it 
=======================test.txt===========================
a line with no match
spa should match
spam should match
so should all of spaspamspammspammm
and so should all of spa spam spamm spammm
no match again.
=======================test.txt===========================

it produces: 

input line: a line with no match
input line: spa should match
==> spa
input line: spam should match
==> spam
input line: so should all of spaspamspammspammm
==> spammm
input line: and so should all of spa spam spamm spammm
==> spa
==> spam
==> spamm
==> spammm
input line: no match again.

so obviously there's a problem with greedy matching where there are no 
separators between adjacent matching strings. I tried non-greedy 
matching, e.g. r'(spam*?)*', but this was worse, so I'll be interested to 
see how the real regex mavens do it.
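
One possible reading of the output above, offered as a guess rather than a 
definitive diagnosis: findall reports the contents of group 1, and a 
repeated group keeps only its last repetition, so the run of adjacent 
tokens collapses to the final one. The sketch below assumes the goal is 
simply to list every "spa" + zero-or-more "m" token; dropping the outer 
repetition is my own substitution, not something taken from the original 
pattern generator.

```python
import re

# Assumption: we want every "spa" + zero-or-more "m" token listed,
# even when the tokens are adjacent. The line is taken from test.txt above.
line = "so should all of spaspamspammspammm"

# With the repeated group, findall reports group 1, which holds only the
# last repetition of the group within each overall match.
repeated = [m for m in re.findall(r"(spam*)*", line) if m != ""]

# Without the outer repetition, every token is a separate match.
separate = re.findall(r"spam*", line)

print(repeated)
print(separate)
```

Here the first list contains only "spammm", while the second contains 
"spa", "spam", "spamm" and "spammm". Whether losing the outer repetition 
is acceptable depends on how the generated patterns are used.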


-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |


