[Chicago] BeautifulSoup gone bad

Ian Bicking ianb at colorstudy.com
Wed Mar 18 05:52:56 CET 2009


On Thu, Mar 12, 2009 at 6:04 PM, Martin Maney <maney at two14.net> wrote:
> On Thu, Mar 12, 2009 at 10:31:51AM -0500, Kumar McMillan wrote:
>> http://codespeak.net/lxml/lxmlhtml.html
>
> Where it says:
>
>  The normal HTML parser is capable of handling broken HTML, but for
>  pages that are far enough from HTML to call them 'tag soup', it may
>  still fail to parse the page. A way to deal with this is ElementSoup,
>  which deploys the well-known BeautifulSoup parser to build an lxml
>  HTML tree.
>
> So when you need to parse nasty real-world web pages, you'll be using
> BeautifulSoup anyway.  I only ever seem to need to scrape really nasty
> pages, I think.  :-(

There are a handful of pages out there where BS does better; in many
cases lxml does better.  lxml handles almost all of the HTML found in
the wild -- it's not a picky parser at all.
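A minimal sketch of what I mean, assuming lxml is installed -- the input
below is deliberately broken tag soup (unclosed <p> tags, a bare <div>),
and lxml.html still builds a usable tree from it:

```python
from lxml import html

# Deliberately malformed markup: no closing tags anywhere.
soup = "<html><body><p>one<p>two<div>three"
doc = html.fromstring(soup)

# lxml repairs the tree: both <p> elements are recovered as siblings,
# and the <div> is closed for us.
paragraphs = [p.text for p in doc.findall(".//p")]
```

Only pages much further gone than this need the ElementSoup/BeautifulSoup
fallback mentioned above.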

>> What's really nice is that you can use full xpath expressions on
>> crummy, poorly-formed HTML (the language of the Web!).  For a while
>> lxml was a bit unstable and hard to build on Mac but as of recent
>> versions I have not had any problems.
>
> xpath has never appealed to me, though I suppose it's just the bee's
> knees for the right applications.

lxml also supports CSS selectors.  For many things these are simpler
than XPath, e.g. "div.menu a" (all anchors inside <div> elements with
the class "menu"):
http://css2xpath.appspot.com/?format=html&css=div.menu+a

-- 
Ian Bicking  |  http://blog.ianbicking.org
