Overcoming python performance penalty for multicore CPU
Stefan Behnel
stefan_ml at behnel.de
Mon Feb 8 03:59:42 EST 2010
Paul Rubin, 04.02.2010 02:51:
> John Nagle writes:
>> Analysis of each domain is
>> performed in a separate process, but each process uses multiple
>> threads to read and process several web pages simultaneously.
>>
>> Some of the threads go compute-bound for a second or two at a time as
>> they parse web pages.
>
> You're probably better off using separate processes for the different
> pages. If I remember, you were using BeautifulSoup, which while very
> cool, is pretty doggone slow for use on large volumes of pages. I don't
> know if there's much that can be done about that without going off on a
> fairly messy C or C++ coding adventure. Maybe someday someone will do
> that.
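Well, if multi-core performance is so important here, then there's a pretty
simple thing the OP can do: switch to lxml. It is built on the C library
libxml2 and releases the GIL while parsing, so parser threads can actually
run in parallel on several cores. For a speed comparison with BeautifulSoup
and other parsers, see: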
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
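
For illustration, here is a minimal sketch of how parsing in threads might
look. It is not the OP's code: the function names, the sample pages and the
worker count are made up for the example, and it assumes lxml is installed.
If it is not, the sketch falls back to the stdlib parser, which keeps the GIL
and so gains nothing from the threads.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

try:
    import lxml.html  # third-party package, assumed to be installed
    HAVE_LXML = True
except ImportError:
    HAVE_LXML = False


class _LinkCounter(HTMLParser):
    """Stdlib fallback, used only when lxml is missing; it holds the GIL."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Count only anchors that actually carry an href attribute.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.count += 1


def count_links(page_source):
    """Return the number of <a href=...> elements in one HTML page."""
    if HAVE_LXML:
        # lxml frees the GIL while libxml2 parses the string.
        root = lxml.html.fromstring(page_source)
        return len(root.xpath("//a[@href]"))
    counter = _LinkCounter()
    counter.feed(page_source)
    counter.close()
    return counter.count


def count_links_in_pages(pages, workers=4):
    """Parse the pages in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_links, pages))


# Hypothetical sample pages standing in for downloaded documents.
SAMPLE_PAGES = [
    "<html><body><a href='/a'>a</a><a href='/b'>b</a></body></html>",
    "<html><body><p>no links here</p></body></html>",
    "<html><body><a name='anchor'>x</a><a href='/c'>c</a></body></html>",
]

if __name__ == "__main__":
    print(count_links_in_pages(SAMPLE_PAGES))
```

Only the parse step runs without the GIL; whatever the threads do with the
resulting tree in Python code is still serialised, so the gain depends on how
much of the per-page time is spent in the parser.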
Stefan