Any equivalent to Ruby's 'hpricot' html/xpath/css selector package?
Stefan Behnel
stefan_ml at behnel.de
Tue Dec 30 08:26:37 EST 2008
Bruno Desthuilliers wrote:
>> However, what makes it really useful is that it does a good job of
>> handling the "broken" html that is so commonly found on the web.
>
> BeautifulSoup ?
> http://pypi.python.org/pypi/BeautifulSoup/3.0.7a
>
> possibly with ElementSoup ?
> http://pypi.python.org/pypi/ElementSoup/rev452
It's actually debatable whether BeautifulSoup is any better than
lxml/libxml2 at parsing broken HTML, as lxml tends to tidy things up pretty
well. The only major difference is in encoding detection, for which you can
also use a separate tool like chardet:
http://chardet.feedparser.org/
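
To make the combination concrete, here is a minimal sketch of how the two could be wired together. It assumes the third-party packages lxml and chardet are both installed; the helper name parse_broken_html and the sample markup are made up for illustration, and the "utf-8" fallback is my own assumption, not something either library prescribes.

```python
# Sketch only: assumes lxml and chardet are installed (both third-party).
import chardet
import lxml.html


def parse_broken_html(raw_bytes):
    """Guess the encoding with chardet, then let lxml repair the markup.

    Hypothetical helper, not part of either library.
    """
    guess = chardet.detect(raw_bytes)         # dict with an 'encoding' key
    encoding = guess["encoding"] or "utf-8"   # assumed fallback if no guess
    text = raw_bytes.decode(encoding, errors="replace")
    return lxml.html.fromstring(text)


# Deliberately sloppy input: unclosed <p> and <b>, no closing </html>.
broken = b"<html><body><p>First<p>Second <b>bold</body>"
root = parse_broken_html(broken)

# XPath works on the repaired tree just as on well-formed input.
paragraphs = [p.text_content() for p in root.xpath("//p")]
print(paragraphs)
```

The same tree also supports CSS selectors through lxml's cssselect support, which is probably the closest match to what hpricot offers.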
Stefan