Regexp

Peter Otten __peter__ at web.de
Mon Jan 19 09:53:30 EST 2009


gervaz wrote:

> Hi all, I need to find all the addresses in an HTML source page. I'm
> using:
> 'href="(?P<url>http://mysite.com/[^"]+)">(<b>)?(?P<name>[^</a>]+)(</b>)?</a>'
> but the [^</a>]+ pattern matches only characters other than <, /, a
> and >, although I just want to exclude the string "</a>". How can I
> specify 'do not match the string "blabla"'?
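
To answer the literal question first: a character class such as [^</a>] excludes single characters, never a whole substring. One common way to say "any text up to, but not including, </a>" is a negative lookahead checked before every character. What follows is only a sketch, assuming Python's re module; the LINK_RE name and the sample page are made up for illustration and are not from the original post.

```python
import re

# Hypothetical sample input, made up for illustration.
page = (
    '<a href="http://mysite.com/a.html"><b>First a/b link</b></a> '
    '<a href="http://other.com/x.html">Elsewhere</a> '
    '<a href="http://mysite.com/c.html">Second</a>'
)

LINK_RE = re.compile(
    r'href="(?P<url>http://mysite\.com/[^"]+)">'
    r'(?:<b>)?'
    # Consume one character at a time, but only while the text at the
    # current position does not start with "</a>" or "</b>".
    r'(?P<name>(?:(?!</a>|</b>).)+)'
    r'(?:</b>)?</a>'
)

links = [(m.group("url"), m.group("name")) for m in LINK_RE.finditer(page)]
for url, name in links:
    print(url, name)
```

Note that "." does not match a newline unless re.DOTALL is passed, so link text spanning several lines would need that flag. The pattern also still depends on the exact attribute order and quoting of the markup, which is the usual weakness of regular expressions on HTML.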

Have you considered BeautifulSoup?

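from BeautifulSoup import BeautifulSoup
from urlparse import urlparse

# page holds the HTML source as a string
# calling the soup is a shortcut for findAll("a")
for a in BeautifulSoup(page)("a"):
    try:
        href = a["href"]
    except KeyError:
        # <a> tag without an href attribute, e.g. a named anchor
        pass
    else:
        url = urlparse(href)
        # keep only links that point at the desired host
        if url.hostname == "mysite.com":
            print href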
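
The snippet above is written for Python 2 and the BeautifulSoup 3 package. On Python 3 the same idea can be sketched with the standard library alone; HrefCollector and the sample page below are hypothetical names and data chosen for illustration, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag that points at one host."""

    def __init__(self, hostname):
        super().__init__()
        self.hostname = hostname
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; href may be absent.
        href = dict(attrs).get("href")
        if href and urlparse(href).hostname == self.hostname:
            self.hrefs.append(href)


# Hypothetical sample input, made up for illustration.
page = (
    '<a name="top">no href here</a>'
    '<a href="http://mysite.com/a.html"><b>First</b></a>'
    '<a href="http://other.com/x.html">Elsewhere</a>'
    '<a href="http://mysite.com/c.html">Second</a>'
)

collector = HrefCollector("mysite.com")
collector.feed(page)
collector.close()
for href in collector.hrefs:
    print(href)
```

As in the BeautifulSoup version, comparing the parsed hostname is stricter than a prefix test on the raw string, so a link to http://mysite.com.example.org/ would not slip through.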

Peter


