etree.HTML suddenly eats all RAM and sets all CPU to WAIT

I'm parsing about 20 docs/second with the code below, and it runs well until, suddenly, the RAM is flooded completely. Not incrementally, but all at once, after about five to ten minutes (depending on the docs/second load). Code:

    # module-level imports used by the method below
    from lxml import etree
    import sys

    def parse_og(self, data):
        """ lxml parsing """
        try:
            tree = etree.HTML(data)
            m = tree.xpath("//meta[@property]")
            for i in m:
                y = i.attrib['property']
                x = i.attrib['content']
                self.rj[y] = x
        except Exception:
            print 'lxml error: ', sys.exc_info()[1:3]
            pass

data is the first 50 kB of an HTML document.

Is there some cleanup I should be doing? Or could this be a bug in lxml or the underlying C lib? Thank you.

OS               : Ubuntu 12.10 (AWS)
Python           : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree       : (3, 1, 0, 0)
libxml used      : (2, 7, 8)
libxml compiled  : (2, 7, 8)
libxslt used     : (1, 1, 26)
libxslt compiled : (1, 1, 26)
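
For reference, here is a stripped-down loop one could run outside the pipeline to check whether etree.HTML on its own reproduces the jump. This is only a sketch of the check, not code from the application above; the file name sample.html and the loop count are placeholders, and the peak-RSS readout uses the standard resource module.

    # Repro sketch: re-parse the same truncated document many times and watch
    # the process's peak RSS. sample.html and the loop count are placeholders.
    import resource
    from lxml import etree

    with open('sample.html') as f:
        data = f.read()[:50 * 1024]   # mimic the "first 50kb" truncation

    for n in xrange(1, 100001):
        tree = etree.HTML(data)
        rj = {}
        for el in tree.xpath("//meta[@property]"):
            rj[el.get('property')] = el.get('content')
        if n % 1000 == 0:
            # on Linux, ru_maxrss is reported in kilobytes
            print n, 'peak RSS (kB):', resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

If peak RSS climbs steadily in this loop as well, the growth is coming from the parsing itself rather than from the rest of the pipeline.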
participants (1)
- Knut Ole Sjøli