a regular expression question

Alex Martelli aleax at aleax.it
Sat Mar 22 09:14:21 CET 2003


Luke wrote:

> I suppose this isn't really a python question as much a R.E. question,
> but I'm using python to do it, so... I'm trying to parse link data
> from a webpage that looks like this:
> 
> <a href="foo1">1</a> abc <a href="foo2">2</a> def <a href="foo3">3</a>
> ghi <a href="foo4">4</a> jkl

Using REs to parse HTML, when Python already offers wonderful tools
such as HTMLParser to do that, is rather absurd, of course.

> With a regular expression like below (where the variable 'text' is the
> sample above), re1 saves the numbers, but not the text.  Why is that?
   ...
> >>> re1 = re.compile("<a .*?>([0-9]+?)</a>(.*?)")

I don't understand the question.  You're matching one or more digits
quite explicitly with [0-9]+? -- why would you expect that to match
any non-digits?  BTW, you may as well use + rather than +? here --
making the repetition non-greedy makes no difference, since the very
next thing the pattern requires is a non-digit (the < of </a>).  After
the </a> you match "zero or more of any character, non-greedy" and the
pattern ends there.  A non-greedy match with nothing after it to force
it to grow takes as few characters as it can, so of course the last
group will always match zero characters.
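
To see this behaviour concretely, here is a minimal sketch that runs
the original pattern over the sample.  The variable name `text` comes
from the question; joining the two quoted sample lines with a single
space and the name `original_matches` are assumptions made for
illustration.

```python
import re

# Sample from the question; the two quoted lines are assumed to be
# joined by a single space.
text = ('<a href="foo1">1</a> abc <a href="foo2">2</a> def '
        '<a href="foo3">3</a> ghi <a href="foo4">4</a> jkl')

# The original pattern: the trailing (.*?) is non-greedy and nothing
# follows it, so it is satisfied by matching zero characters.
re1 = re.compile("<a .*?>([0-9]+?)</a>(.*?)")

original_matches = re1.findall(text)
print(original_matches)
# [('1', ''), ('2', ''), ('3', ''), ('4', '')]
```

The digits are captured, but the second group is the empty string in
every match.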

If I divine correctly what you're trying to do, then:

"<a[^>]*>([0-9]+)</a>([^<]*)"

may come closer to your purposes.  [^>]* means, zero or more
characters that aren't right-angle-brackets; and similarly
[^<]* means, zero or more that aren't left-angle-brackets.
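
As a rough check of the suggested pattern, the sketch below applies it
to the same sample.  The names `re2`, `matches` and `pairs` are
hypothetical, and the sample lines are again assumed to be joined by a
single space.

```python
import re

# Sample from the question; the two quoted lines are assumed to be
# joined by a single space.
text = ('<a href="foo1">1</a> abc <a href="foo2">2</a> def '
        '<a href="foo3">3</a> ghi <a href="foo4">4</a> jkl')

# [^>]* stops at the end of the opening tag; [^<]* runs up to the
# start of the next tag (or to the end of the string).
re2 = re.compile("<a[^>]*>([0-9]+)</a>([^<]*)")

matches = re2.findall(text)
print(matches)
# [('1', ' abc '), ('2', ' def '), ('3', ' ghi '), ('4', ' jkl')]

# Strip the surrounding whitespace if only the words are wanted.
pairs = [(number, trailing.strip()) for number, trailing in matches]
print(pairs)
# [('1', 'abc'), ('2', 'def'), ('3', 'ghi'), ('4', 'jkl')]
```

Because [^<]* is greedy but cannot cross a left-angle-bracket, each
second group now holds the text up to the next tag.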

But you're still better off using HTMLParser or the like.
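
For comparison, one possible way to do the same job with HTMLParser is
sketched below.  It assumes a current Python 3, where the class lives
in the html.parser module (in 2003 it was the HTMLParser module); the
class name `LinkCollector` and the names `links` and `result` are
hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect [href, link text, text following the link] per <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs.
            self.links.append([dict(attrs).get("href"), "", ""])
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if not self.links:
            return  # text before the first link is ignored
        # Slot 1 is the link text, slot 2 the text after </a>.
        # Data may arrive in several chunks, so accumulate it.
        slot = 1 if self._in_anchor else 2
        self.links[-1][slot] += data


# Sample from the question; the two quoted lines are assumed to be
# joined by a single space.
text = ('<a href="foo1">1</a> abc <a href="foo2">2</a> def '
        '<a href="foo3">3</a> ghi <a href="foo4">4</a> jkl')

parser = LinkCollector()
parser.feed(text)
parser.close()  # flush any buffered trailing text

result = [(href, label, trailing.strip())
          for href, label, trailing in parser.links]
print(result)
```

Unlike the regular expression, this also copes with attributes in any
order, different quoting styles, and link text that is not purely
digits.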


Alex




