Overcoming python performance penalty for multicore CPU

Paul Rubin no.email at nospam.invalid
Mon Feb 8 04:10:07 EST 2010


Stefan Behnel <stefan_ml at behnel.de> writes:
> Well, if multi-core performance is so important here, then there's a pretty
> simple thing the OP can do: switch to lxml.
>
> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Well, lxml uses libxml2, a fast XML parser written in C, but AFAIK it
only works on well-formed XML.  The point of Beautiful Soup is that it
works on all kinds of garbage hand-written legacy HTML with mismatched
tags and other sorts of errors.  Beautiful Soup is slower for exactly
that reason: it is full of special cases and hacks, and it is written
in Python.  Writing something that complex in C, to handle that much
potentially malicious input, would be a lot of work, and it would be
very difficult to ensure it was really safe.  Look at the many browser
vulnerabilities we've seen over the years due to that sort of problem,
for example.  But for web crawling, you really do need to handle messy
and wrong HTML properly.
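
To make the contrast concrete, here is a minimal sketch (mine, not from
the original discussion; it assumes lxml and the current bs4 package are
installed, whereas the 2010-era Beautiful Soup was imported simply as
"BeautifulSoup"):

    from lxml import etree
    from bs4 import BeautifulSoup

    tag_soup = "<html><body><p>unclosed paragraph<b>mismatched</i> tags"

    # libxml2's strict XML parser rejects the malformed markup outright.
    try:
        etree.fromstring(tag_soup)
    except etree.XMLSyntaxError as exc:
        print("lxml (XML mode) gave up:", exc)

    # Beautiful Soup's pure-Python parser repairs the tag soup into a
    # navigable tree instead of raising, which is what a crawler needs.
    soup = BeautifulSoup(tag_soup, "html.parser")
    print(soup.p.get_text())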



