Overcoming python performance penalty for multicore CPU

J Kenneth King james at agentultra.com
Mon Feb 8 11:21:07 EST 2010


Paul Rubin <no.email at nospam.invalid> writes:

> Stefan Behnel <stefan_ml at behnel.de> writes:
>> Well, if multi-core performance is so important here, then there's a pretty
>> simple thing the OP can do: switch to lxml.
>>
>> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>
> Well, lxml uses libxml2, a fast XML parser written in C, but AFAIK it
> only works on well-formed XML.  The point of Beautiful Soup is that it
> works on all kinds of garbage hand-written legacy HTML with mismatched
> tags and other sorts of errors.  Beautiful Soup is slower because it's
> full of special cases and hacks for that reason, and it is written in
> Python.  Writing something that complex in C to handle so much
> potentially malicious input would be a lot of work, and it would be
> very difficult to ensure it was really safe.  Look at the many
> browser vulnerabilities we've seen over the years due to that sort of
> problem, for example.  But, for web crawling, you really do need to
> handle the messy and wrong HTML properly.

If the difference is great enough, you might get a benefit from
analyzing all pages with lxml and throwing invalid pages into a bucket
for later processing with BeautifulSoup.
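A minimal sketch of that two-pass idea. To keep the example self-contained it uses the stdlib xml.etree as a stand-in for the strict, fast parser; in practice you would call lxml there and hand the bucket to BeautifulSoup. The function name, the (url, html) pair format, and the bucket structure are illustrative assumptions, not from this thread:

```python
import xml.etree.ElementTree as ET

def fast_pass(pages):
    """Parse each page with a strict parser; bucket failures.

    `pages` is an iterable of (url, html) pairs.  Returns a dict of
    successfully parsed trees plus a list of pages that need the
    slower, lenient parser (BeautifulSoup in the suggestion above).
    """
    parsed = {}
    bucket = []  # invalid pages, for later BeautifulSoup processing
    for url, html in pages:
        try:
            # Stand-in for lxml: a strict parser that rejects
            # malformed markup outright.
            parsed[url] = ET.fromstring(html)
        except ET.ParseError:
            bucket.append((url, html))
    return parsed, bucket

pages = [
    ("http://example.com/ok", "<html><body><p>fine</p></body></html>"),
    ("http://example.com/bad", "<html><body><p>broken</body>"),
]
parsed, bucket = fast_pass(pages)
# The well-formed page parses; the page with the mismatched
# <p> tag lands in the bucket for the lenient parser.
```

Since most real pages are close enough to well-formed, the fast parser handles the bulk of the crawl and the slow Python-level parsing is paid only for the bucket.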
