python fast HTML data extraction library
aahz at pythoncraft.com
Sat Jul 25 17:20:51 CEST 2009
In article <37da38d2-09a8-4fd2-94b4-5feae9675dcd at k1g2000yqf.googlegroups.com>,
Filip <pinkeen at gmail.com> wrote:
>I tried to fix that with BeautifulSoup plus regexp filtering of some
>particular cases I encountered. That was slow, and after running my
>data scraper for some time, a lot of new problems (exceptions from
>the XPath parser) showed up. Not to mention that BeautifulSoup
>stripped almost all of the content from some heavily broken pages
>(a 50+ KiB page stripped down to a few hundred bytes). Character
>encoding conversion was hell too: even UTF-8 pages had some
>non-standard characters causing issues.
Have you tried lxml?
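For what it's worth, here is a minimal sketch of what that could look like, assuming lxml is installed. The sample markup and the extract_items helper are made up for illustration and are not taken from your scraper. lxml.html sits on top of libxml2's recovering HTML parser, so unclosed tags like the ones below should not raise an exception.

```python
# Minimal sketch: pull text out of deliberately broken HTML with lxml.
# BROKEN and extract_items() are hypothetical, for illustration only.
from lxml import html

# Unclosed <p> and <b> tags, and a final <div> that is never closed.
BROKEN = (
    b"<html><body>"
    b"<div class='item'><p>First<p>Second <b>bold</div>"
    b"<div class='item'>Third"
)


def extract_items(raw_bytes):
    """Return the text of every <div class="item"> in a possibly broken page."""
    # html.fromstring() parses leniently and repairs the tree where it can.
    doc = html.fromstring(raw_bytes)
    # XPath runs against the repaired tree; text_content() flattens child tags.
    return [div.text_content().strip()
            for div in doc.xpath("//div[@class='item']")]


items = extract_items(BROKEN)
print(items)
```

Passing the raw bytes rather than a decoded string lets the parser work out the encoding itself, which may help with the conversion problems you describe, though I have not tested it against your pages.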
Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/
"At Resolver we've found it useful to short-circuit any doubt and just
refer to comments in code as 'lies'. :-)"
--Michael Foord paraphrases Christian Muirhead on python-dev, 2009-03-22