Diez B. Roggisch deets at
Mon Jan 19 15:50:16 CET 2009

gervaz wrote:

> Hi all, I need to find all the addresses in an HTML source page. I'm
> using:
> 'href="(?P<url>[^"]+)">(<b>)?(?P<name>[^</a>]+)(</b>)?</a>'
> but the [^</a>]+ pattern matches any string not containing <, /, a,
> or >, although I only want to exclude the word "</a>". How can I
> specify: 'do not match the string "blabla"'?

You should consider using BeautifulSoup or lxml's error-tolerant HTML parser
to work with HTML documents.

Sooner or later your regex-based processing is bound to fail, as documents
get more complicated. Better to use the right tool for the job.
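
For completeness, the literal question ("match anything except the string
</a>") can be answered with a negative lookahead instead of a character
class. The following is only a sketch: the names LINK_RE and find_links and
the sample markup are made up for illustration. It also shows how quickly
such a pattern stops matching once the markup varies a little.

```python
import re

# (?:(?!</a>|</b>).)+ means "one or more characters, none of which starts
# the literal string </a> or </b>". This replaces the character class
# [^</a>]+, which excludes the single characters <, /, a and >.
LINK_RE = re.compile(
    r'href="(?P<url>[^"]+)">(?:<b>)?(?P<name>(?:(?!</a>|</b>).)+)(?:</b>)?</a>'
)

def find_links(html):
    """Return (url, name) pairs for every match of LINK_RE in html."""
    return [(m.group("url"), m.group("name")) for m in LINK_RE.finditer(html)]

# Hypothetical sample markup the pattern was written for.
simple = '<a href="/docs">Read the docs</a> <a href="/faq"><b>FAQ</b></a>'

# Equally valid HTML that the pattern misses: single-quoted attribute
# values, and an extra attribute between href and the closing ">".
tricky = "<a href='/docs'>single quotes</a> <a href=\"/faq\" class=\"x\">extra</a>"

print(find_links(simple))
print(find_links(tricky))
```

The first call finds both links, the second finds none, even though both
inputs are ordinary HTML.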

The code should look like this (untested):

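# BeautifulSoup 3 import; with the bs4 package use "from bs4 import BeautifulSoup"
from BeautifulSoup import BeautifulSoup
html = """<html><a href="">link</a></html>"""

res = []
soup = BeautifulSoup(html)
for tag in soup.findAll("a"):
    # tag.get() avoids a KeyError on anchors that have no href attribute
    if tag.get("href", "").startswith(""):
        # assumed loop body (missing from the original post): keep the tag
        res.append(tag)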

Not so hard, and *much* more robust.
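
If installing a third-party library is not an option, the same idea can be
sketched with html.parser from the Python 3 standard library. This is an
illustration under assumptions, not a replacement for BeautifulSoup or lxml:
the names LinkCollector and extract_links are hypothetical, and the sketch
does not try to handle nested or unclosed <a> elements.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only record text while we are inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def extract_links(html):
    """Parse html and return a list of (href, link text) tuples."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

# Hypothetical sample input: single quotes, an extra attribute, a nested <b>.
sample = "<a href='/docs'>single quotes</a> <a href=\"/faq\" class=\"x\"><b>FAQ</b></a>"
print(extract_links(sample))
```

Quoting style, attribute order and nested inline tags are handled by the
parser, so none of them needs special treatment in the code.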

