[Chicago] BeautifulSoup gone bad
Martin Maney
maney at two14.net
Fri Mar 13 00:04:57 CET 2009
On Thu, Mar 12, 2009 at 10:31:51AM -0500, Kumar McMillan wrote:
> http://codespeak.net/lxml/lxmlhtml.html
Where it says:
    The normal HTML parser is capable of handling broken HTML, but for
    pages that are far enough from HTML to call them 'tag soup', it may
    still fail to parse the page. A way to deal with this is ElementSoup,
    which deploys the well-known BeautifulSoup parser to build an lxml
    HTML tree.
So when you need to parse nasty real-world web pages, you'll be using
BeautifulSoup anyway. I only ever seem to need to scrape really nasty
pages, I think. :-(
> What's really nice is that you can use full xpath expressions on
> crummy, poorly-formed HTML (the language of the Web!). For a while
> lxml was a bit unstable and hard to build on Mac but as of recent
> versions I have not had any problems.
XPath has never appealed to me, though I suppose it's just the bee's
knees for the right applications.
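
For anyone who hasn't seen the "parse the soup leniently, then query the
tree with a path expression" idea in action, here is a rough sketch. It
deliberately uses only the standard library (html.parser plus
xml.etree.ElementTree), not lxml or BeautifulSoup, so it is a toy stand-in
for what those libraries do far more carefully. The names SoupTreeBuilder,
parse_soup, VOID_TAGS and the "document" wrapper element are made up for
illustration, and the tag-closing rule is a simple guess, not the HTML
spec's recovery algorithm.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag (a short, illustrative list only).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SoupTreeBuilder(HTMLParser):
    """Build an ElementTree from sloppy HTML, guessing where tags close."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")  # hypothetical wrapper element
        self.stack = [self.root]            # chain of currently open elements

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as empty strings.
        attrib = {name: value or "" for name, value in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close everything back to the nearest matching open tag.
        # A stray end tag with no matching open tag is simply ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a child element belongs in that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_soup(text):
    """Parse possibly broken HTML and return the wrapper element."""
    builder = SoupTreeBuilder()
    builder.feed(text)
    builder.close()  # flush any buffered trailing text
    return builder.root


# Unclosed <p>, <a> and <b>, plus a stray </i>: typical tag soup.
messy = '<p>First <a href="/one">link<p>Second <b>bold <a href="/two">other</a></i>'
tree = parse_soup(messy)

# ElementTree understands only a limited subset of XPath, but it is enough
# to pull every link target out of the recovered tree.
hrefs = [a.get("href") for a in tree.findall(".//a[@href]")]
print(hrefs)
```

The point is only that once some forgiving parser has produced a tree, a
one-line path query replaces a lot of hand-written searching. lxml offers
the same style of query through the .xpath() method on its elements, with
full XPath 1.0 rather than ElementTree's subset.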
--
To be alive, is that not to be
again and again surprised? -- Nicholas van Rijn