python fast HTML data extraction library

Aahz aahz at pythoncraft.com
Sat Jul 25 17:20:51 CEST 2009


In article <37da38d2-09a8-4fd2-94b4-5feae9675dcd at k1g2000yqf.googlegroups.com>,
Filip  <pinkeen at gmail.com> wrote:
>
>I tried to fix that with BeautifulSoup + regexp filtering of some
>particular cases I encountered. That was slow, and after running my
>data scraper for some time a lot of new problems (exceptions from
>the XPath parser) were showing up. Not to mention that BeautifulSoup
>stripped almost all of the content from some heavily broken pages
>(a 50+ KiB page stripped down to a few hundred bytes). Character
>encoding conversion was hell too: even UTF-8 pages had some
>non-standard characters causing issues.

Have you tried lxml?
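
For illustration, a minimal sketch of what that might look like. The
sample markup and the variable names are made up for this example and
are not taken from your scraper; the explicit encoding is an assumption
about the input, since lxml otherwise has to guess it.

```python
import lxml.html

# Hypothetical broken input: unclosed <p>, <b>, and <div> tags, plus a
# non-ASCII character, delivered as UTF-8 bytes.
broken = (
    "<html><body>"
    "<div class='item'><p>First<p>Second <b>bold</div>"
    "<div class='item'>Caf\u00e9"
).encode("utf-8")

# Assumption: the page really is UTF-8, so tell the parser up front.
parser = lxml.html.HTMLParser(encoding="utf-8")
doc = lxml.html.fromstring(broken, parser=parser)

# The parser repairs the tree, so XPath queries work on the result.
texts = [div.text_content().strip()
         for div in doc.xpath("//div[@class='item']")]

for text in texts:
    print(text)
```

Both lxml.html.fromstring() and the xpath() method are part of lxml
itself; whether the recovered tree matches what you expect on your
heavily broken pages is something you would have to check page by page.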
-- 
Aahz (aahz at pythoncraft.com)           <*>         http://www.pythoncraft.com/

"At Resolver we've found it useful to short-circuit any doubt and just        
refer to comments in code as 'lies'. :-)"
--Michael Foord paraphrases Christian Muirhead on python-dev, 2009-03-22


