etree.HTML suddenly eats all RAM and sets all CPU to WAIT

I'm parsing about 20 docs/second with the code below, and it runs well until, suddenly, the RAM is flooded completely. Not incrementally, but all at once, after about five to ten minutes (depending on the docs/second load). Code:

    # module-level imports used by the method below
    from lxml import etree
    import sys

    def parse_og(self, data):
        """ lxml parsing """
        try:
            tree = etree.HTML(data)
            m = tree.xpath("//meta[@property]")
            for i in m:
                y = i.attrib['property']
                x = i.attrib['content']
                self.rj[y] = x
        except Exception:
            print 'lxml error: ', sys.exc_info()[1:3]
            pass

data is the first 50 kB of an HTML document.

Is there some cleanup I should be doing? Or could this be a bug in lxml or the underlying C lib? Thank you.

OS               : Ubuntu 12.10 (AWS)
Python           : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree       : (3, 1, 0, 0)
libxml used      : (2, 7, 8)
libxml compiled  : (2, 7, 8)
libxslt used     : (1, 1, 26)
libxslt compiled : (1, 1, 26)
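
For reference, here is a stripped-down loop one could run outside the pipeline to check whether etree.HTML on its own reproduces the jump. This is only a sketch of the check, not code from the application above; the file name sample.html and the loop count are placeholders, and the peak-RSS readout uses the standard resource module.

    # Repro sketch: re-parse the same truncated document many times and watch
    # the process's peak RSS. sample.html and the loop count are placeholders.
    import resource
    from lxml import etree

    with open('sample.html') as f:
        data = f.read()[:50 * 1024]   # mimic the "first 50kb" truncation

    for n in xrange(1, 100001):
        tree = etree.HTML(data)
        rj = {}
        for el in tree.xpath("//meta[@property]"):
            rj[el.get('property')] = el.get('content')
        if n % 1000 == 0:
            # on Linux, ru_maxrss is reported in kilobytes
            print n, 'peak RSS (kB):', resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

If peak RSS climbs steadily in this loop as well, the growth is coming from the parsing itself rather than from the rest of the pipeline.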
participants (1)
- Knut Ole Sjøli