[lxml-dev] Looking for performance tips for soupparser

Hi, first of all, I have to say that I really like soupparser. Thanks a lot for it. I use it a lot for data mining on a somewhat large document collection that I often revisit to try new ideas. Soupparser is fast, but I put a lot of strain on it, so I was looking for ways to speed things up. My first idea was to use beaker to cache the root Element object of every document to disk. Unfortunately, Element instances are not pickleable, so I have to look for something else. Would any of you have some tips to share on speeding things up with soupparser? How hard would it be to make elements conform to the pickling protocol?

--
Yannick Gingras
http://ygingras.net
http://confoo.ca -- track coordinator
http://montrealpython.org -- lead organizer
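A minimal sketch of the setup described above, assuming lxml.html.soupparser with BeautifulSoup installed; the input file name and the helper function are made up for illustration:

    import pickle
    from lxml.html import soupparser

    def parse_with_soup(path):
        # soupparser.fromstring() drives BeautifulSoup underneath to cope
        # with badly broken markup and hands back an lxml Element tree.
        with open(path, 'rb') as f:
            return soupparser.fromstring(f.read())

    root = parse_with_soup('some_page.html')  # hypothetical input file

    # Caching the parsed tree directly does not work: lxml Elements are
    # thin proxies over C-level libxml2 structures and do not implement
    # the pickle protocol, so this typically fails with a TypeError.
    try:
        pickle.dumps(root)
    except (TypeError, pickle.PicklingError) as exc:
        print("cannot pickle the root element:", exc)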

Yannick Gingras, 31.12.2009 17:11:
> Soupparser is fast but I put a lot of strain on it so I was looking
> for ways to speed things up.

Erm, no, not really. It uses BeautifulSoup as a parser backend, which really isn't that fast: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
I'd use the normal HTML parser instead, and only fall back to using the soupparser when things go really wrong (whatever that means in your case).

Another thing you can do (assuming that caching is helpful in your case) is to parse the documents using soupparser and serialise them into the cache. Then parse them from the cache using the normal HTML parser (preferably with "recover=False") when you need them. A serialise-parse cycle is several times faster than a new parser run of BeautifulSoup, so if you need the documents multiple times, this will speed things up.

Stefan
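A rough sketch of both suggestions, assuming lxml.html plus a BeautifulSoup install for the fallback; the cache layout and helper names are illustrative, and catching lxml's parse errors is just one possible reading of "things go really wrong":

    import os
    from lxml import etree
    import lxml.html
    from lxml.html import soupparser

    CACHE_DIR = 'parsed_cache'  # hypothetical cache location

    def parse_html(text):
        # Try the fast libxml2-based HTML parser first; only fall back to
        # the (much slower) BeautifulSoup backend when the parse fails
        # outright.  Other "went really wrong" heuristics are possible.
        try:
            return lxml.html.fromstring(text)
        except etree.LxmlError:
            return soupparser.fromstring(text)

    def load_document(doc_id, raw_text):
        path = os.path.join(CACHE_DIR, doc_id + '.html')
        if os.path.exists(path):
            # Cache hit: the serialised tree is already well-formed, so a
            # strict reparse (recover=False) is safe and much faster than
            # running BeautifulSoup again.
            parser = lxml.html.HTMLParser(recover=False)
            with open(path, 'rb') as f:
                return lxml.html.fromstring(f.read(), parser=parser)
        # Cache miss: parse once (soupparser only if needed) and store the
        # serialised result for the next run.
        root = parse_html(raw_text)
        if not os.path.isdir(CACHE_DIR):
            os.makedirs(CACHE_DIR)
        with open(path, 'wb') as f:
            f.write(lxml.html.tostring(root))
        return root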

On December 31, 2009, Stefan Behnel wrote:
I implemented both ideas and it resulted in at least a 10-fold speedup. Thanks a lot!

--
Yannick Gingras
http://ygingras.net
http://confoo.ca -- track coordinator
http://montrealpython.org -- lead organizer
