Regexp
Diez B. Roggisch
deets at nospam.web.de
Mon Jan 19 09:50:16 EST 2009
gervaz wrote:
> Hi all, I need to find all the address in a html source page, I'm
> using:
> 'href="(?P<url>http://mysite.com/[^"]+)">(<b>)?(?P<name>[^</a>]+)(</
> b>)?</a>'
> but the [^</a>]+ pattern retrieve all the strings not containing <
> or / or a etc, although I just not want the word "</a>". How can I
> specify: 'do not search the string "blabla"?'
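To answer the literal question first: a character class like [^</a>] only excludes single characters, so it cannot mean "anything but the string </a>". A negative lookahead can. A minimal sketch, assuming the markup looks exactly like your example (LINK_RE and find_links are names made up here for illustration):

```python
import re

# (?:(?!</a>).)+? consumes one character at a time, but only while
# the string "</a>" does not start at the current position.
LINK_RE = re.compile(
    r'href="(?P<url>http://mysite\.com/[^"]+)">'
    r'(?:<b>)?'
    r'(?P<name>(?:(?!</a>).)+?)'
    r'(?:</b>)?</a>',
    re.DOTALL,
)


def find_links(html):
    """Return a list of (url, name) tuples for every match in html."""
    return [(m.group("url"), m.group("name")) for m in LINK_RE.finditer(html)]
```

This still depends on attribute order, double quotes and the optional <b> wrapper being exactly as shown, which is why the advice below applies.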
You should consider using BeautifulSoup or lxml's error-tolerant parser to
work with HTML documents.
Sooner or later your regex-based processing is bound to fail, as documents
get more complicated. Better to use the right tool for the job.
The code should look like this (untested):
from BeautifulSoup import BeautifulSoup
html = """<html><a href="http://mysite.com/foobar/baz">link</a></html>"""
res = []
soup = BeautifulSoup(html)
for tag in soup.findAll("a"):
    # tag.get() avoids a KeyError on anchors that have no href attribute
    if tag.get("href", "").startswith("http://mysite.com"):
        res.append(tag["href"])
Not so hard, and *much* more robust.
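If installing a third-party package is not an option, roughly the same thing can probably be done with the standard library's html.parser module (Python 3). A sketch under that assumption, not tried on real-world pages; LinkCollector and collect_links are names invented here, and it also keeps the link text that the original regex captured as "name":

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (url, text) pairs for <a> tags whose href starts with a prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []     # finished (url, text) pairs
        self._href = None   # href of the matching <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href (or with an empty one) are skipped.
            href = dict(attrs).get("href") or ""
            if href.startswith(self.prefix):
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> arrives here as well.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(html, prefix="http://mysite.com"):
    parser = LinkCollector(prefix)
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling collect_links(html) on the example string from above should give one pair, the URL together with the text "link".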
Diez