[Baypiggies] HTML Parsers (n00b)

Fri Jan 29 01:58:13 CET 2010

lxml is awesome. Don't be fooled by the name: it handles HTML very well,
even malformed HTML.

Ian Bicking did a great comparison a couple of years ago, and it still stands:
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

and an update:
http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/

Basically: lxml is fast as hell (it uses libxml2 under the hood), has a low
memory footprint, and is very forgiving of wacky HTML, more so than Beautiful Soup.
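
To give a feel for the "forgiving" part, here is a minimal sketch, assuming
lxml is installed. The broken markup and the variable names are made up for
illustration; they are not from either of the posts above.

```python
from lxml import html

# Deliberately malformed input: neither <p> is closed, <b> is never closed,
# and there is no <html>/<body> wrapper.
broken = "<div><p>First paragraph<p>Second <b>bold text</div>"

# fromstring() parses the fragment and repairs the tree as it goes.
root = html.fromstring(broken)

# After repair, each <p> is a separate element we can query with XPath.
paragraphs = [p.text_content() for p in root.xpath("//p")]
print(paragraphs)
```

The same tree can also be searched with CSS selectors or walked like an
ElementTree, so once the parse succeeds the rest is ordinary tree handling.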

I think pyquery actually uses lxml under the hood, or at least libxml2.

Alec

On Thu, Jan 28, 2010 at 3:43 PM, Max Slimmer <max at theslimmers.net> wrote:

>
> I like lxml
> max
>
>
> On Thu, Jan 28, 2010 at 3:23 PM, Kimball Bighorse <kbighorse at yahoo.com> wrote:
>
>> I'm looking at Beautiful Soup, html5lib and pyquery. Is there anything else
>> I should be aware of?
>>
>> Many thanks,
>>
>> Kimball
>> _______________________________________________
>> Baypiggies mailing list
>> Baypiggies at python.org
>> To change your subscription options or unsubscribe:
>> http://mail.python.org/mailman/listinfo/baypiggies
>>
>
>