lxml memory allocation failed
This is about a problem I thought I had solved. I use lxml to update linguistically annotated TEI texts, where each word is wrapped in a <w> element. In a typical workflow, corrections exist as a dictionary whose keys are xml:ids. The script loops through a text; if an xml:id is a dictionary key, a simple function replaces the attribute values in the <w> element. At the end of each run, the tree of the text is written to a new document. There is a final function, "sort and indent", appended below, which applies some formatting to the text so that attributes appear in alphabetical order and the text is consistently indented.

This function worked for a while, and then it started acting up. It would run through some texts, but after a short interval (sometimes 30 seconds, sometimes five minutes) it would exit with an error message like:

  File "/users/martinmueller/Dropbox/earlyprint/eebochron/1473-1623/159/a/159-ane-A21328.xml", line 27
  lxml.etree.XMLSyntaxError: Memory allocation failed, line 27, column 19

If you started the run again from the file on which it had exited, it would process that file properly but stumble again some files later. If you remove the function, the error disappears, which supports two conclusions:

1. The error has something to do with that function.
2. It has nothing to do with the way the function handles any individual text.

When I reported this problem at an earlier time, I think Stefan advised me to introduce some step that would clear memory after each single text. I remember that this didn't work, and I gave up on the function, thinking that perhaps there was some problem with the way Python, Anaconda, and lxml interacted. Recently I got rid of Anaconda and updated to the latest versions of Python and lxml. The problem disappeared for a little while, but then it reappeared. So I again tried two ways of clearing memory: I added either "tree = None" or "del tree" as the last command for any given file. This made no difference.

The most plausible explanation for this behaviour is that there is some cumulative effect which aborts the program when it crosses some threshold, and that the way I reset doesn't work. Oddly enough, while I have been writing this email, the script has run through 343 texts in 645 seconds and is still chugging away. One of the texts is very long, from which I gather that the cumulative length of the texts processed between failures is unlikely to be the cause of failure. It finally failed after 400 texts and 721 seconds. The next run failed after 44 texts and 41 seconds.

The error message refers to an lxml.etree.XMLSyntaxError. I looked up some failure points in particular texts but couldn't see any pattern. Besides, the point of failure is never a point of failure the second time round. The problem must have something to do with the little function below. If you drop it, lxml will process thousands or tens of thousands of texts without any problem, building one tree, doing stuff with it, and dropping it for another tree. But what is it about this code that works perfectly on any individual text yet has a cumulative effect that typically leads to failure after two or three dozen texts?
def sort_and_indent(elem, level: int = 0):
    attrib = elem.attrib
    if len(attrib) > 1:
        attributes = sorted(attrib.items())
        attrib.clear()
        attrib.update(attributes)
    i = "\n" + " " * level
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + " "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:
            sort_and_indent(elem, level + 1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i
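In outline, the per-file loop described above might look something like this. This is a rough sketch rather than the actual script: the helper names and the shape of the corrections dictionary (an xml:id mapped to new attribute values) are reconstructions from the description, not code from the thread.

from lxml import etree

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
TEI_W = "{http://www.tei-c.org/ns/1.0}w"

def process_file(path, corrections, out_path):
    tree = etree.parse(path)
    for w in tree.iter(TEI_W):
        xmlid = w.get(XML_ID)
        if xmlid in corrections:
            # replace the attribute values recorded for this word
            for name, value in corrections[xmlid].items():
                w.set(name, value)
    sort_and_indent(tree.getroot())
    tree.write(out_path, encoding="utf-8", xml_declaration=True)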
On 30 Apr 2022, at 22:35, Martin Mueller wrote:
When I reported that problem at an earlier time, I think that Stefan advised me to introduce some step that would clear memory after each single text.
How are you reading the source file? If it really is a large file then you could do worse than read using iterparse and write using xmlfile. If you're not using iterparse then clearing elements won't make much difference, but it does when you use it. You have to learn when to use it, though, because it's easy to clear too aggressively and find it has cleared elements before you've processed them. This is especially important where you have recursive functions like yours.

But you should also provide more information about memory use: how much memory does your system have, and how much memory is the Python process using when it crashes?

Charlie

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
Düsseldorf
D-40489
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226
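A minimal sketch of the iterparse-and-clear pattern described above, assuming the TEI <w> elements and the corrections dictionary from Martin's earlier message; the write step (e.g. inside an etree.xmlfile block) is only indicated by a comment:

from lxml import etree

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
TEI_W = "{http://www.tei-c.org/ns/1.0}w"

def stream_words(path, corrections):
    # iterparse yields each <w> as soon as its end tag is seen,
    # so the whole document never has to be held in memory at once
    for event, elem in etree.iterparse(path, events=("end",), tag=TEI_W):
        xmlid = elem.get(XML_ID)
        if xmlid in corrections:
            for name, value in corrections[xmlid].items():
                elem.set(name, value)
        # ... write the element out here, e.g. via etree.xmlfile ...
        # clear only once the element is fully processed; clearing earlier
        # (e.g. inside a recursive helper) can destroy data you still need
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]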
Hi Martin,
On 30 Apr 2022, at 21:35, Martin Mueller <martinmueller@northwestern.edu> wrote:
for elem in elem:
    sort_and_indent(elem, level + 1)
I'd never write the above code, i.e. using the same variable name for the for loop's iterator as its input. The ambiguity of what the "elem" variable refers to after this point feels too dangerous; I don't trust Python's scoping rules to do what you expect. Personally I would use:

    for child_elem in elem:
        sort_and_indent(child_elem, level + 1)

As a sanity check on this, the following:

    elem = [1, 2, 3, 4, 5]
    for e in elem:
        print(e)
    print(elem)

gives the following output:

    ▶ ./elem.py
    1
    2
    3
    4
    5
    [1, 2, 3, 4, 5]

In contrast:

    elem = [1, 2, 3, 4, 5]
    for elem in elem:
        print(elem)
    print(elem)

gives:

    ▶ ./elem.py
    1
    2
    3
    4
    5
    5

That final '5' shows Python's scoping isn't as tight as you hope.

Kind regards,

aid
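For reference, here is a sketch of the function with the loop variable renamed as suggested above. The extra last_child variable reproduces what the shadowed name did implicitly (the last child's tail carries the parent's closing indentation); whether the shadowing has anything to do with the memory failures is not settled in this thread.

def sort_and_indent(elem, level: int = 0):
    # sort attributes alphabetically when there is more than one
    attrib = elem.attrib
    if len(attrib) > 1:
        attributes = sorted(attrib.items())
        attrib.clear()
        attrib.update(attributes)
    i = "\n" + " " * level
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + " "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        last_child = None
        for child_elem in elem:
            sort_and_indent(child_elem, level + 1)
            last_child = child_elem
        # the original set this via the shadowed loop variable
        if not last_child.tail or not last_child.tail.strip():
            last_child.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i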
participants (3):
- Adrian Bool
- Charlie Clark
- Martin Mueller