On December 31, 2009, Stefan Behnel wrote:
Would any of you have some tips to share on speeding things up with soupparser? How hard would it be to make elements conform to the pickling protocol?
I'd use the normal HTML parser instead, and only fall back to using the soupparser when things go really wrong (whatever that means in your case).
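A minimal sketch of that fallback, for illustration only. The helper name parse_with_fallback is hypothetical, and treating a parser exception as "things going really wrong" is an assumption; your own criterion may be different (for example, a sanity check on the resulting tree).

```python
from lxml import etree, html


def parse_with_fallback(data):
    """Parse HTML with lxml's own parser; use soupparser only on failure.

    Hypothetical helper: "failure" here just means the normal parser
    raised an exception, which is an assumption about your use case.
    """
    try:
        return html.fromstring(data)
    except (etree.ParserError, etree.XMLSyntaxError):
        # Imported lazily so BeautifulSoup is only needed on the slow path.
        from lxml.html import soupparser
        return soupparser.fromstring(data)


if __name__ == "__main__":
    root = parse_with_fallback("<html><body><p>Hi</p></body></html>")
    print(root.findtext(".//p"))
```

Since the normal parser recovers from most broken markup by default, the soupparser branch should only be reached for the rare documents it rejects outright.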
Another thing you can do (assuming that caching is helpful in your case) is to parse the documents using soupparser and serialise them into the cache. Then parse them from the cache using the normal HTML parser (preferably with "recover=False") when you need them. A serialise-parse cycle is several times faster than a new parser run of BeautifulSoup, so if you need the documents multiple times, this will speed things up.
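The serialise-parse cycle could look roughly like the sketch below. The names to_cache and from_cache are hypothetical, and the choice of UTF-8 bytes as the cache format is an assumption; where the bytes are stored (file, database, memcached) is left open.

```python
from lxml import html


def to_cache(root):
    """Serialise a parsed tree to UTF-8 bytes suitable for a cache."""
    return html.tostring(root, encoding="utf-8")


def from_cache(blob):
    """Re-parse cached bytes with the normal HTML parser.

    recover=False makes the parser raise instead of silently repairing,
    so a corrupted cache entry is noticed rather than hidden.
    """
    parser = html.HTMLParser(recover=False, encoding="utf-8")
    return html.fromstring(blob, parser=parser)


if __name__ == "__main__":
    # In real use the tree would come from soupparser.fromstring();
    # the normal parser is used here only to keep the demo dependency-free.
    root = html.fromstring("<html><body><p>Hi</p></body></html>")
    blob = to_cache(root)
    print(from_cache(blob).findtext(".//p"))
```

The cached value is a plain bytes object, so it can be pickled or stored as-is even though the elements themselves cannot.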
I implemented both ideas and it resulted in at least a 10-fold speedup. Thanks a lot!

--
Yannick Gingras
http://ygingras.net
http://confoo.ca -- track coordinator
http://montrealpython.org -- lead organizer