Overcoming python performance penalty for multicore CPU

Stefan Behnel stefan_ml at behnel.de
Mon Feb 8 03:59:42 EST 2010


Paul Rubin, 04.02.2010 02:51:
> John Nagle writes:
>> Analysis of each domain is
>> performed in a separate process, but each process uses multiple
>> threads to read and process several web pages simultaneously.
>>
>>    Some of the threads go compute-bound for a second or two at a time as
>> they parse web pages.  
> 
> You're probably better off using separate processes for the different
> pages.  If I remember, you were using BeautifulSoup, which while very
> cool, is pretty doggone slow for use on large volumes of pages.  I don't
> know if there's much that can be done about that without going off on a
> fairly messy C or C++ coding adventure.  Maybe someday someone will do
> that.

Well, if multi-core performance is so important here, then there's a pretty
simple thing the OP can do: switch to lxml.

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
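
For illustration, here is a minimal sketch of what the switch could look
like while keeping the existing one-thread-per-page design. It assumes the
third-party lxml package is installed. The helper names (extract_links,
extract_links_from_pages) and the sample pages are made up for this
example and are not taken from the OP's code. As far as I know, lxml
releases the GIL while it parses, as long as each thread uses the default
parser or its own parser, so parsing in threads should be able to use more
than one core; measure that on your own workload before relying on it.

```python
# Sketch only: requires the third-party lxml package (pip install lxml).
from concurrent.futures import ThreadPoolExecutor

import lxml.html


def extract_links(html_text):
    """Parse one HTML document and return the href values of its <a> tags."""
    doc = lxml.html.fromstring(html_text)
    return doc.xpath("//a/@href")


def extract_links_from_pages(pages, max_workers=4):
    """Parse several pages in worker threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_links, pages))


# Hypothetical sample pages standing in for downloaded web pages.
SAMPLE_PAGES = [
    '<html><body><a href="http://example.com/a">A</a></body></html>',
    '<html><body><p>text</p><a href="/b">B</a><a href="/c">C</a></body></html>',
]
```

Calling extract_links_from_pages(SAMPLE_PAGES) returns one list of href
strings per page, in the same order as the input.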

Stefan


