paul at boddie.org.uk
Tue Oct 18 19:32:28 CEST 2005
Thorsten Kampe wrote:
> For simple things like that "BeautifulSoup" might be overkill.
I've used SGMLParser with some success before, although the SAX-style
processing is objectionable to many people. One alternative is to use
libxml2dom and to parse documents as HTML:
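
  import libxml2dom, urllib

  url = 'http://www.python.org'
  # urllib.urlopen (Python 2) returns a file-like object; html=1 asks for
  # HTML parsing rather than XML parsing.
  doc = libxml2dom.parse(urllib.urlopen(url), html=1)
  # Select every anchor element in the document.
  anchors = doc.xpath("//a")
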
Currently, the parseURI function in libxml2dom doesn't do HTML parsing,
mostly because I haven't yet figured out what combination of parsing
options has to be set to make it happen, but a combination of urllib
and libxml2dom should perform adequately. In the above example, you'd
process the nodes in the anchors list to get the desired results.
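
For comparison, here is a rough sketch of the SAX-style approach mentioned
at the top. It does not use SGMLParser or libxml2dom; it assumes a current
Python 3 and its standard html.parser module instead. The AnchorCollector
class and the sample markup are made up purely for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical helper: records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Made-up sample markup: two links and one named anchor without an href.
sample = (
    '<p><a href="/about/">About</a> '
    '<a name="top">no link here</a> '
    '<a href="http://www.python.org/doc/">Docs</a></p>'
)

collector = AnchorCollector()
collector.feed(sample)
collector.close()
print(collector.hrefs)
```

The event-handler style is what puts some people off, but for pulling out
a handful of attributes it stays fairly short.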