Hi, please note that HTML mails don't pass well through public mailing lists. I had to reformat your code example to make it readable (and I hope I got it right). Please use plain text mails instead.

Martin Mueller, 26.11.2012 00:01:
I have used lxml to extract attribute values specifying lemma and pos tag from some 2000 TEI encoded texts. The program has to chew its way through ~160 million instances of the following type of w element
<w lem="be" pos="vbz" reg="is" spe="is" xml:id="K000039_000-000220">is</w>
The <w> elements are leaf nodes occurring at varying levels from the root, with a typical XPath like TEI/text/body/div/div/p/w.
The critical piece of code in the program went like this:
filein = open(filename, 'r')
text = filein.read()
root = etree.fromstring(text)
What you do here is: you read the entire file into memory, then you pass it into the parser to let it build the entire XML tree in memory. Given that you only need one element at a time, this is a huge waste of resources.
for element in root.iter(tei+'w'):
I'd shorten the entire code so far into using iterparse():

for _, element in etree.iterparse(filename, tag=tei+'w', remove_blank_text=True):
    ... do stuff with the element ...
    element.clear()  # free memory that's no longer needed
lemma = element.get('lem')
pos = element.get('pos')
spelling = element.get('spe')
entry = '\t'.join((filenumber, lemma, pos, spelling))
eccodictionary[entry] += 1
Not much to improve here. Maybe consider encoding the entry right here, before putting it into the dict. Byte strings tend to eat less memory than Unicode strings by a factor of 2 or 4, depending on your system. That's been fixed in Python 3.3, but you don't have that here.
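To make the iterparse() advice above concrete, here is a minimal, runnable sketch of the pattern (the sample document and attribute values are made up for illustration, and lxml is assumed to be installed). The sibling-deletion loop after clear() is a common extra step: clear() empties the current element, but the already-processed siblings would otherwise stay attached to the partially built tree.

```python
import io
import collections
from lxml import etree

# A tiny stand-in for one of the TEI files (hypothetical content).
TEI_NS = '{http://www.tei-c.org/ns/1.0}'
sample = b'''<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>
<w lem="be" pos="vbz" spe="is">is</w>
<w lem="be" pos="vbz" spe="is">is</w>
<w lem="cat" pos="nn1" spe="cat">cat</w>
</p></body></text></TEI>'''

counts = collections.Counter()
# Only <w> end events are delivered thanks to the tag filter.
for _, element in etree.iterparse(io.BytesIO(sample), tag=TEI_NS + 'w'):
    entry = '\t'.join((element.get('lem'),
                       element.get('pos'),
                       element.get('spe')))
    counts[entry] += 1
    # Free the element itself, then drop any already-processed
    # siblings so the in-memory tree stops growing.
    element.clear()
    while element.getprevious() is not None:
        del element.getparent()[0]

print(counts['be\tvbz\tis'])  # -> 2
```

With a real corpus you would pass the file path to iterparse() directly instead of a BytesIO object, so lxml handles opening and closing the file itself.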
filein.close()
This line should have come right after reading the file contents. Resources should be freed as early as possible. It's not needed, however, if you let lxml open the file for you by passing it the file path. On a related note, you should also look at the 'with' statement in Python, which allows you to clean up resources safely even in the face of exceptions.
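As a small runnable sketch of the 'with' pattern (the file here is a throwaway created just for the demo):

```python
import os
import tempfile

# Create a throwaway file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'input.xml')
with open(path, 'w') as f:
    f.write('<doc/>')

# 'with' guarantees the file is closed as soon as the block ends,
# even if an exception is raised inside it.
with open(path) as filein:
    text = filein.read()

print(filein.closed)  # -> True
print(text)           # -> <doc/>
```

The file handle is released immediately after the block, which is exactly the "free resources as early as possible" point above.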
for eccoitem in sorted(eccodictionary):
    print >>fileout, eccoitem.encode('utf-8'), '\t', eccodictionary[eccoitem]
Recent versions of Python (including 2.7) have a Counter class in the collections module which you might want to use (or might already be using) here. An alternative (assuming that you have access to unixish tools) would be to just write out each entry as it comes in (e.g. into a file "outfile.txt") instead of collecting them in memory, and then run

    sort outfile.txt | uniq -c

to get the counts. The sort command is extremely efficient and handles even huge files that don't fit into memory. However, seeing your numbers below, I don't think it's really necessary.
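A hedged sketch of the Counter idea (the entries are made up; in the real program they would come from the <w> attributes):

```python
from collections import Counter

# Counter behaves like a dict that defaults missing keys to 0,
# so no manual initialisation of new entries is needed.
eccodictionary = Counter()
for entry in ('be\tvbz\tis', 'be\tvbz\tis', 'cat\tnn1\tcat'):
    eccodictionary[entry] += 1

# most_common() sorts by frequency; plain sorted() over the keys
# gives the lexical order used when writing the output file.
print(eccodictionary.most_common(1))  # -> [('be\tvbz\tis', 2)]
```

Counter also accepts any iterable directly (Counter(entries)), which replaces the whole counting loop in one call.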
I am using lxml 2.3.4, which is part of the Enthought Python 2.7 distribution. It took more than twelve hours to execute, and I had to run it in three batches because the program would slow down after a while. I don't know whether this is an intrinsic scale problem or whether there are better ways of writing the code. I ran it on a desktop Macintosh with plenty of memory (20GB). I noticed that as the program went on, Python seemed to hog more and more memory, showing 400 MB after a few texts, then creeping up to 2.5 GB.
In theory, each file is closed after it has been processed. In its three passes through different sets of files, its output files grew to 165MB, 153MB, and 39MB. The program clearly slowed down as Python's memory use increased, and I don't know why memory use increased so sharply, nor whether the slow-down is caused by the memory hogging. At the beginning it would process a file of one megabyte in one or two seconds, but after it had gone through 100 or so files it seemed slower by a factor of 3-5.
I'll be grateful for any advice. I have no idea whether I have run up against my limits as a programmer (very likely), the limits of Python, or the limits of lxml.
Neither those of Python nor of lxml, but certainly those of your machine's resources. Using them more efficiently will help you solve this problem.

Stefan