Re: [lxml-dev] Looking for performance tips for soupparser

10 Jan 2010


      On December 31, 2009, Stefan Behnel wrote:
...
...
> > Would any of you have some tips to share on speeding things up with
> > soupparser?  How hard would it be to make elements conform to the
> > pickling protocol?
>
> I'd use the normal HTML parser instead, and only fall back to using the
> soupparser when things go really wrong (whatever that means in your case).
>
> Another thing you can do (assuming that caching is helpful in your case),
> is to parse the documents using soupparser and serialise them into the
> cache. Then parse them from the cache using the normal HTML parser
> (preferably with "recover=False") when you need them. A serialise-parse
> cycle is several times faster than a new parser run of BeautifulSoup, so if
> you need the documents multiple times, this will speed things up.

I implemented both ideas and it resulted in at least a 10-fold speedup.
Thanks a lot!
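
A minimal sketch of the two ideas, for anyone reading this in the archives.
It is an illustration under assumptions, not the actual implementation:
the helper names (parse_fast, parse_soup, serialise, parse_with_fallback,
load_cached), the looks_ok check, and the SHA-1 file cache layout are all
made up here. "Things going really wrong" is application-specific, so it
is left to a caller-supplied predicate. The input is assumed to be bytes.

```python
import hashlib
from pathlib import Path


def parse_fast(data, recover=True):
    """Parse with lxml's regular (libxml2-based) HTML parser."""
    from lxml import html  # third-party, imported lazily
    return html.document_fromstring(data, parser=html.HTMLParser(recover=recover))


def parse_soup(data):
    """Parse with the slower, more lenient BeautifulSoup front end."""
    from lxml.html import soupparser  # requires BeautifulSoup to be installed
    return soupparser.fromstring(data)


def serialise(root):
    """Serialise a tree to bytes; the default output is ASCII with
    character references, so it can be re-parsed without encoding guesses."""
    from lxml import html
    return html.tostring(root)


def parse_with_fallback(data, looks_ok=lambda root: True,
                        fast=parse_fast, slow=parse_soup):
    """Idea 1: try the normal parser first, use soupparser only on failure.

    looks_ok is a hypothetical hook for deciding that the fast result
    is unusable for your application.
    """
    try:
        root = fast(data)
        if looks_ok(root):
            return root
    except Exception:  # sketch only; real code could catch lxml.etree.LxmlError
        pass
    return slow(data)


def load_cached(data, cache_dir, fast=parse_fast, slow=parse_soup,
                dump=serialise):
    """Idea 2: pay for BeautifulSoup once, then re-parse the cleaned-up
    serialisation with the normal parser on later requests."""
    key = hashlib.sha1(data).hexdigest()  # assumed cache key: hash of raw bytes
    path = Path(cache_dir) / (key + ".html")
    if path.exists():
        # Cached output is already well formed, so strict parsing is used here.
        return fast(path.read_bytes(), recover=False)
    root = slow(data)
    path.write_bytes(dump(root))
    return root
```

The parsers are passed in as parameters only so that they can be swapped
or stubbed out; calling parse_with_fallback(raw_bytes) or
load_cached(raw_bytes, "/some/cache/dir") uses the lxml defaults. If the
strict re-parse ever rejects a cached document, dropping recover=False is
the obvious workaround.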

-- 
Yannick Gingras
http://ygingras.net
http://confoo.ca -- track coordinator
http://montrealpython.org -- lead organizer