[Chicago] BeautifulSoup gone bad
Ian Bicking
ianb at colorstudy.com
Wed Mar 18 05:52:56 CET 2009
On Thu, Mar 12, 2009 at 6:04 PM, Martin Maney <maney at two14.net> wrote:
> On Thu, Mar 12, 2009 at 10:31:51AM -0500, Kumar McMillan wrote:
>> http://codespeak.net/lxml/lxmlhtml.html
>
> Where it says:
>
> The normal HTML parser is capable of handling broken HTML, but for
> pages that are far enough from HTML to call them 'tag soup', it may
> still fail to parse the page. A way to deal with this is ElementSoup,
> which deploys the well-known BeautifulSoup parser to build an lxml
> HTML tree.
>
> So when you need to parse nasty real-world web pages, you'll be using
> BeautifulSoup anyway. I only ever seem to need to scrape really nasty
> pages, I think. :-(
There are a handful of pages out there where BS does better. In many
cases lxml does better. lxml handles almost all of the HTML found in
the wild; it's not a picky parser at all.
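A quick sketch (not from the original discussion) of what that tolerance looks like in practice: lxml's HTML parser builds a tree even from markup with unclosed tags and stray close tags, and `lxml.html.soupparser` is the BeautifulSoup-backed fallback mentioned above.

```python
# Sketch: lxml's tolerant HTML parser handling "tag soup".
# The markup below is hypothetical: unclosed <p>, <b>, and <a> tags,
# plus a stray </i> with no matching open tag.
from lxml import html

soup = '<p>one <b>bold <p>two</i> <a href="/x">link'
doc = html.fromstring(soup)

# lxml recovers anyway: implicit closes are inserted, the stray </i>
# is dropped, and the tree is fully navigable.
links = [a.get('href') for a in doc.iter('a')]
print(links)               # the anchor survives with its href intact
print(doc.text_content())  # all the text is still reachable

# If a page really does defeat this parser, lxml.html.soupparser.fromstring
# offers the same API but uses BeautifulSoup to build the tree.
```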
>> What's really nice is that you can use full xpath expressions on
>> crummy, poorly-formed HTML (the language of the Web!). For a while
>> lxml was a bit unstable and hard to build on Mac but as of recent
>> versions I have not had any problems.
>
> xpath has never appealed to me, though I suppose it's just the bee's
> knees for the right applications.
lxml also supports CSS selectors. For many tasks these are simpler than
XPath, e.g. "div.menu a" (all anchors inside <div> elements with the
class "menu"):
http://css2xpath.appspot.com/?format=html&css=div.menu+a
--
Ian Bicking | http://blog.ianbicking.org