HTMLParser fragility
Rene Pijlman
reply.in.the.newsgroup at my.address.is.invalid
Wed Apr 5 06:45:44 EDT 2006
Lawrence D'Oliveiro:
>I've been using HTMLParser to scrape Web sites. The trouble with this
>is, there's a lot of malformed HTML out there. Real browsers have to be
>written to cope gracefully with this, but HTMLParser does not.
There are two solutions to this:
1. Tidy the source before parsing it.
http://www.egenix.com/files/python/mxTidy.html
2. Use something more forgiving, like BeautifulSoup.
http://www.crummy.com/software/BeautifulSoup/
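
To give a rough idea of what "forgiving" parsing looks like in practice, here is a minimal sketch. It is not BeautifulSoup or mxTidy: to keep it installable-free it uses html.parser from the standard library of current Python 3, where HTMLParser no longer raises on malformed markup the way the 2006 version did. The names LinkCollector, extract_links and SAMPLE are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags.

    Only start tags are inspected, so unclosed or badly nested
    elements do not matter for this particular job.
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all href values found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical malformed input: unclosed <p>, <b> and <a>,
# plus an unquoted attribute value.
SAMPLE = '<p>First<b>bold <a href=one.html>one<p>Second <a href="two.html">two</a>'

if __name__ == "__main__":
    print(extract_links(SAMPLE))
```

A scraper that only needs start tags and attributes can get by like this. Anything that depends on the tree structure (which element contains which) still needs a tidying step or a tree-building parser such as the two options above.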
--
René Pijlman