a regular expression question

Sat Mar 22 02:31:38 EST 2003

I suppose this isn't really a Python question so much as a regular
expression question, but I'm using Python to do it, so... I'm trying
to parse link data from a webpage that looks like this:

<a href="foo1">1</a> abc <a href="foo2">2</a> def <a href="foo3">3</a>
ghi <a href="foo4">4</a> jkl

With the regular expressions below (where the variable 'text' holds the
sample above), re1 captures the numbers, but not the text that follows
each link.  Why is that?

If I use re2, it works, but obviously it only gets the odd-numbered
links, since matches cannot overlap.  Is there a way to modify re1 to
capture the text, or is there a way to get overlapping matches out of
Python's re engine somehow?

>>> import re
>>> re1 = re.compile("<a .*?>([0-9]+?)</a>(.*?)")
>>> matches = re.findall(re1,text)
>>> matches
[('1', ''), ('2', ''), ('3', ''), ('4', '')]
>>> re2 = re.compile("<a .*?>([0-9]+?)</a>(.*?)<a")
>>> matches = re.findall(re2,text)
>>> matches
[('1', ' abc '), ('3', ' ghi ')]
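
A note on what the session above shows, plus one possible workaround,
offered as a sketch rather than a definitive answer. In re1 the trailing
non-greedy group (.*?) is the last thing in the pattern, so it is free to
match as little as possible, which is the empty string. In re2 the
trailing "<a" forces the group to grow, but it also consumes the start of
the next link. A lookahead assertion checks for the next tag without
consuming it, so no overlapping is needed. The name re3 and the single-line
form of 'text' below are assumptions made for illustration:

```python
import re

# Assumption: the sample from the post joined onto one line, so that '.'
# never has to cross a newline (otherwise re.DOTALL would be needed).
text = ('<a href="foo1">1</a> abc <a href="foo2">2</a> def '
        '<a href="foo3">3</a> ghi <a href="foo4">4</a> jkl')

# (?=<a |$) is a lookahead: it requires the next "<a " (or the end of the
# text) to follow, but leaves it unconsumed for the next match.
re3 = re.compile(r'<a .*?>([0-9]+?)</a>(.*?)(?=<a |$)')

matches = re3.findall(text)
print(matches)
# [('1', ' abc '), ('2', ' def '), ('3', ' ghi '), ('4', ' jkl')]
```

If the real page spreads the links over several lines, passing re.DOTALL
to re.compile would probably be needed as well.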

Thanks