This is about a problem I thought I had solved.
I use lxml to update linguistically annotated TEI texts, in which each word is wrapped in a <w> element. In a typical workflow, the corrections live in a dictionary keyed by xml:id. The script loops through
a text; whenever a word's xml:id is a key in the dictionary, a simple function replaces the attribute values on that <w> element.
At the end of each run, the tree of the text is written to a new document. There is a final function, sort_and_indent (appended below), which applies some formatting to the text so that attributes appear in
alphabetical order and the text is consistently indented.
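To make the workflow concrete, here is a minimal sketch of the correction step as I understand it from the description above. The names (corrections, the sample attributes) are my own illustrations, not the actual script:

```python
from lxml import etree

# xml:id lives in the predefined XML namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# hypothetical corrections dictionary: xml:id -> replacement attribute values
corrections = {
    "w001": {"lemma": "love", "pos": "vvb"},
}

doc = etree.fromstring(
    '<text><w xml:id="w001" lemma="loue" pos="n1">loue</w></text>'
)

# loop through the words; if the xml:id is a dictionary key,
# replace the attribute values on that <w> element
for w in doc.iter("w"):
    key = w.get(XML_ID)
    if key in corrections:
        for name, value in corrections[key].items():
            w.set(name, value)
```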
This function worked for a while, and then it started acting up. It would run through some texts, but after a short interval (sometimes 30 seconds, sometimes five minutes) it would exit with an error message
like
File "/users/martinmueller/Dropbox/earlyprint/eebochron/1473-1623/159/a/159-ane-A21328.xml", line 27
lxml.etree.XMLSyntaxError: Memory allocation failed, line 27, column 19
If you start the run again from the file on which it exited, it processes that file properly but stumbles again some files later. If you remove the function, the error disappears. Which supports two
conclusions: the function itself is implicated, and the failure is cumulative rather than tied to any particular file.
When I reported this problem some time ago, I think Stefan advised me to introduce a step that would clear memory after each text. I remember that this didn't work, and I gave up on the
function, thinking that perhaps there was some problem with the way Python, Anaconda, and lxml interacted.
Recently I got rid of Anaconda and updated to the latest versions of Python and lxml. The problem disappeared for a little while, but then it reappeared. So I tried again two ways of clearing memory: I added
either "tree = None" or "del tree" as the last command for any given file.
This made no difference. The most plausible explanation for this behaviour is that there is some cumulative effect that aborts the program when it crosses a threshold, and the way I reset doesn't work.
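One thing I have since wondered about: if the per-file work happens at module level, "tree = None" or "del tree" only drops one reference, and anything else still pointing at the tree keeps it alive. A sketch of scoping the work inside a function instead, so every reference goes away when the function returns (the helper name process_file is my own, and whether this actually cures the failure is an open question):

```python
import gc
from lxml import etree

def process_file(path_in, path_out):
    # all references to the tree are local to this function,
    # so they disappear when it returns
    tree = etree.parse(path_in)
    # ... apply corrections and sort_and_indent here in the real script ...
    tree.write(path_out, encoding="utf-8", xml_declaration=True)

file_pairs = []  # (input, output) path tuples; filled in the real script

for path_in, path_out in file_pairs:
    process_file(path_in, path_out)
    gc.collect()  # also break any leftover reference cycles from this file
```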
Oddly enough, while I have been writing this email, the script ran through 343 texts in 645 seconds and kept chugging away. One of those texts is very long, from which I gather that the cumulative length
of the texts processed between failures is unlikely to be the cause of failure. It finally failed after 400 texts and 721 seconds. The next run failed after 44 texts and 41 seconds.
The error message refers to an lxml.etree.XMLSyntaxError. I looked up some failure points in particular texts but couldn't see any pattern. Besides, the point of failure is never a point of failure the second
time round.
The problem must have something to do with the little function below. If you drop it, lxml will process thousands or tens of thousands of texts without any problem, building one tree, doing stuff with it, and
dropping it for another tree. But what is it about this code that works perfectly on any individual text yet has a cumulative effect that typically leads to failure after two or three dozen texts?
def sort_and_indent(elem, level: int = 0):
    attrib = elem.attrib
    if len(attrib) > 1:
        attributes = sorted(attrib.items())
        attrib.clear()
        attrib.update(attributes)
    i = "\n" + " " * level
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + " "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:
            sort_and_indent(elem, level + 1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i
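For what it's worth, lxml itself (since version 4.5, if I remember correctly) ships a built-in etree.indent() that does the whitespace half of what sort_and_indent does, leaving only the attribute sorting to be done by hand. A minimal sketch of using the built-in instead:

```python
from lxml import etree

root = etree.fromstring("<text><w pos='n1' lemma='loue'>loue</w></text>")

# rewrites .text/.tail whitespace in place, two spaces per nesting level
etree.indent(root, space="  ")

pretty = etree.tostring(root, encoding="unicode")
```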