python fast HTML data extraction library

Aahz aahz at
Sat Jul 25 17:20:51 CEST 2009

In article <37da38d2-09a8-4fd2-94b4-5feae9675dcd at>,
Filip  <pinkeen at> wrote:
>I tried to fix that with BeautifulSoup plus regexp filtering of the
>particular cases I encountered. That was slow, and after my data
>scraper had run for some time a lot of new problems (exceptions from
>the XPath parser) started showing up. Not to mention that BeautifulSoup
>stripped almost all of the content from some heavily broken pages
>(a 50+ KiB page stripped down to a few hundred bytes). Character
>encoding conversion was hell too: even UTF-8 pages had some
>non-standard characters causing issues.

Have you tried lxml?
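
For illustration only, here is a minimal sketch of what that might look
like, assuming lxml is installed. The sample markup and the
extract_links helper are made up for this example and are not taken
from the original scraper. It feeds deliberately broken HTML to
lxml.html and runs an XPath query on the recovered tree:

```python
from lxml import html

# Hypothetical sample input: unclosed <p> and <b>, no closing </body> or </html>.
BROKEN = b"<html><body><p>First <b>bold<p>Second <a href='/x'>link</a>"


def extract_links(raw_bytes):
    """Return (text, href) pairs for every <a href=...> in possibly broken HTML."""
    # lxml.html uses libxml2's recovering HTML parser, so malformed
    # markup is repaired into a tree instead of raising an exception.
    doc = html.fromstring(raw_bytes)
    return [
        (a.text_content().strip(), a.get("href"))
        for a in doc.xpath("//a[@href]")
    ]


links = extract_links(BROKEN)
print(links)
```

Passing the raw bytes rather than an already decoded string leaves
encoding detection to the parser, which may sidestep some of the
conversion trouble described above. Whether it copes with your
particular pages is something you would have to test.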
Aahz (aahz at           <*>

"At Resolver we've found it useful to short-circuit any doubt and just
refer to comments in code as 'lies'. :-)"
--Michael Foord paraphrases Christian Muirhead on python-dev, 2009-03-22
