[Chicago] BeautifulSoup gone bad

Atul Varma varmaa at gmail.com
Wed Mar 18 07:10:34 CET 2009


Is anyone interested in accessing Firefox's DOM structures through Python?
It's doable with mozrunner/jsbridge and Xvfb and doesn't require any funky
compilation steps (unlike PyXPCOM)... If people are interested I could
probably whip up a screencast or something.  That said, though, it's
probably not super fast, as jsbridge works over TCP/IP.

- Atul

On Tue, Mar 17, 2009 at 9:52 PM, Ian Bicking <ianb at colorstudy.com> wrote:

> On Thu, Mar 12, 2009 at 6:04 PM, Martin Maney <maney at two14.net> wrote:
> > On Thu, Mar 12, 2009 at 10:31:51AM -0500, Kumar McMillan wrote:
> >> http://codespeak.net/lxml/lxmlhtml.html
> >
> > Where it says:
> >
> >  The normal HTML parser is capable of handling broken HTML, but for
> >  pages that are far enough from HTML to call them 'tag soup', it may
> >  still fail to parse the page. A way to deal with this is ElementSoup,
> >  which deploys the well-known BeautifulSoup parser to build an lxml
> >  HTML tree.
> >
> > So when you need to parse nasty real-world web pages, you'll be using
> > BeautifulSoup anyway.  I only ever seem to need to scrape really nasty
> > pages, I think.  :-(
>
> There are a handful of pages out there where BeautifulSoup does better.
> In many cases lxml does better.  lxml handles almost all of the HTML
> found in the wild; it's not a picky parser at all.
>
> >> What's really nice is that you can use full xpath expressions on
> >> crummy, poorly-formed HTML (the language of the Web!).  For a while
> >> lxml was a bit unstable and hard to build on Mac but as of recent
> >> versions I have not had any problems.
> >
> > xpath has never appealed to me, though I suppose it's just the bee's
> > knees for the right applications.
>
> lxml also supports CSS selectors.  For many things this is simpler,
> e.g. "div.menu a" (all anchors inside <div> elements with the class
> "menu"):
> http://css2xpath.appspot.com/?format=html&css=div.menu+a
>
> --
> Ian Bicking  |  http://blog.ianbicking.org
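To make the CSS-versus-XPath point above concrete, here is a rough sketch of what translating "div.menu a" into XPath can look like. The helper simple_css_to_xpath is hypothetical and handles only tag names, a single .class, and the descendant combinator; it is not how lxml or css2xpath implement the translation. The sloppy HTML snippet is made up for illustration, and the last part runs only if lxml happens to be installed.

```python
def simple_css_to_xpath(selector):
    """Translate a tiny CSS subset to XPath (hypothetical helper).

    Supported: tag names, one ".class" per step, and the descendant
    combinator (whitespace). Anything else is out of scope here.
    """
    steps = []
    for part in selector.split():
        tag, _, cls = part.partition(".")
        step = tag or "*"  # ".menu" alone means "any element"
        if cls:
            # Match the class as a whole word inside the class attribute.
            step += (
                "[contains(concat(' ', normalize-space(@class), ' '), "
                f"' {cls} ')]"
            )
        steps.append(step)
    return "descendant-or-self::" + "/descendant::".join(steps)


xpath = simple_css_to_xpath("div.menu a")
print(xpath)

# Optional: run the generated XPath against sloppy HTML with lxml.
try:
    import lxml.html  # third-party; assumed installed for this part
except ImportError:
    lxml = None

if lxml is not None:
    # Made-up example markup: unclosed <a> and <p> tags.
    sloppy = (
        '<div class="menu"><a href="/one">One<a href="/two">Two</div>'
        '<p><a href="/three">Three'
    )
    doc = lxml.html.fromstring(sloppy)
    for anchor in doc.xpath(xpath):
        print(anchor.get("href"))
```

In day-to-day use you would not write the translation yourself: lxml elements have a cssselect() method (backed by the separate cssselect package in recent lxml releases), so doc.cssselect("div.menu a") does the same job in one call.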