Overcoming python performance penalty for multicore CPU

Mon Feb 8 11:43:10 EST 2010

On Mon, 2010-02-08 at 01:10 -0800, Paul Rubin wrote:
> Stefan Behnel <stefan_ml at behnel.de> writes:
> > Well, if multi-core performance is so important here, then there's a pretty
> > simple thing the OP can do: switch to lxml.
> >
> > http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
> 
> Well, lxml uses libxml2, a fast XML parser written in C, but AFAIK it
> only works on well-formed XML.  The point of Beautiful Soup is that it
> works on all kinds of garbage hand-written legacy HTML with mismatched
> tags and other sorts of errors.  Beautiful Soup is slower because it's
> full of special cases and hacks for that reason, and it is written in
> Python.  Writing something that complex in C to handle so much
> potentially malicious input would be quite a lot of work, and it would
> be very difficult to ensure it was really safe.  Look at the many
> browser vulnerabilities we've seen over the years due to that sort of
> problem, for example.  But, for web crawling, you really do need to
> handle the messy and wrong HTML properly.
> 

Actually, lxml has an HTML parser that copes pretty well with the level
of brokenness one finds most often on the web. And when it does fall
down, it's easy to integrate BeautifulSoup as a slow backup for when
things go really wrong (as J Kenneth King mentioned earlier):

http://codespeak.net/lxml/lxmlhtml.html#parsing-html

In my experience, however, I haven't yet had to parse anything that
lxml couldn't handle.
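
For what it's worth, the "fast parser first, slow backup second"
arrangement can be sketched roughly as follows. This is only a minimal
illustration, not a tested crawler component: parse_with_fallback and
default_parsers are hypothetical helper names made up for this example,
and it assumes lxml (and, for the backup, BeautifulSoup) may or may not
be installed. The only library calls used are lxml.html.fromstring and
lxml.html.soupparser.fromstring.

```python
def parse_with_fallback(markup, parsers):
    """Try each (name, parse) pair in order.

    Returns (name, tree) from the first parser that succeeds.
    Hypothetical helper, written for illustration only.
    """
    errors = []
    for name, parse in parsers:
        try:
            tree = parse(markup)
        except Exception as exc:  # different parsers raise different errors
            errors.append((name, exc))
            continue
        if tree is not None:
            return name, tree
        errors.append((name, ValueError("parser returned None")))
    raise ValueError("no parser could handle the input: %r" % (errors,))


def default_parsers():
    """Fast lxml HTML parser first, BeautifulSoup-backed parser second.

    Either entry is skipped if the corresponding package is missing.
    """
    parsers = []
    try:
        import lxml.html
        parsers.append(("lxml.html", lxml.html.fromstring))
    except ImportError:
        pass
    try:
        # Importing soupparser fails if BeautifulSoup is not installed.
        from lxml.html import soupparser
        parsers.append(("soupparser", soupparser.fromstring))
    except ImportError:
        pass
    return parsers


if __name__ == "__main__":
    page = "<html><body><p>unclosed <b>tags<p>everywhere</body>"
    available = default_parsers()
    if available:
        name, tree = parse_with_fallback(page, available)
        print("parsed with", name)
    else:
        print("neither lxml nor BeautifulSoup is installed")
```

Since the slow parser only runs when the fast one raises or returns
nothing, the common case stays on the C code path. One caveat: a
forgiving parser rarely raises on bad markup, so a real crawler would
probably also want some sanity check on the resulting tree before
deciding not to fall back.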
-- 
John Krukoff <jkrukoff at ltgc.com>
Land Title Guarantee Company