lxml memory allocation failed
This is about a problem I thought I had solved. I use lxml to update linguistically annotated TEI texts, where each word is wrapped in a <w> element. In a typical workflow, corrections exist as a dictionary whose keys are xml:ids. The script loops through a text; if an xml:id is a dictionary key, a simple function replaces the attribute values in the <w> element. At the end of each run, the tree of the text is written to a new document. There is a final function, "sort and indent", appended below, which applies some formatting to the text so that attributes appear in alphabetical order and the text is consistently indented.

This function worked for a while, and then it started acting up. It would run through some texts, but after a short interval (sometimes 30 seconds, sometimes five minutes) it would exit with an error message like:

  File "/users/martinmueller/Dropbox/earlyprint/eebochron/1473-1623/159/a/159-ane-A21328.xml", line 27
  lxml.etree.XMLSyntaxError: Memory allocation failed, line 27, column 19

If you started the run again from the file on which it had exited, it would process that file properly but stumble again some files later. If you remove the function, the error disappears, which supports two conclusions:

1. The error has something to do with that function.
2. It has nothing to do with the way the function handles any individual text.

When I reported this problem at an earlier time, I think Stefan advised me to introduce some step that would clear memory after each single text. I remember that this didn't work, and I gave up on the function, thinking that perhaps there was some problem with the way Python, Anaconda, and lxml interacted. Recently I got rid of Anaconda and updated to the latest versions of Python and lxml. The problem disappeared for a little while, but then it reappeared. So I again tried two ways of clearing memory: I added either "tree = None" or "del tree" as the last command for any given file. This made no difference.

The most plausible explanation for this behaviour is that there is some cumulative effect which aborts the program when it crosses some threshold, and that the way I reset doesn't work. Oddly enough, while I have been writing this email, the script has run through 343 texts in 645 seconds and is still chugging away. One of the texts is very long, from which I gather that the cumulative length of the texts processed between failures is unlikely to be the cause of failure. It finally failed after 400 texts and 721 seconds. The next run failed after 44 texts and 41 seconds.

The error message refers to an lxml.etree.XMLSyntaxError. I looked up some failure points in particular texts but couldn't see any pattern. Besides, the point of failure is never a point of failure the second time round. The problem must have something to do with the little function below. If you drop it, lxml will process thousands or tens of thousands of texts without any problem, building one tree, doing stuff with it, and dropping it for another tree. But what is it about this code that works perfectly on any individual text yet has a cumulative effect that typically leads to failure after two or three dozen texts?
def sort_and_indent(elem, level: int = 0):
    attrib = elem.attrib
    if len(attrib) > 1:
        attributes = sorted(attrib.items())
        attrib.clear()
        attrib.update(attributes)
    i = "\n" + " " * level
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + " "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:
            sort_and_indent(elem, level + 1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i
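In outline, the per-file loop described above might look something like this. This is a rough sketch rather than the actual script: the helper names and the shape of the corrections dictionary (an xml:id mapped to new attribute values) are reconstructions from the description, not code from the thread.

from lxml import etree

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
TEI_W = "{http://www.tei-c.org/ns/1.0}w"

def process_file(path, corrections, out_path):
    tree = etree.parse(path)
    for w in tree.iter(TEI_W):
        xmlid = w.get(XML_ID)
        if xmlid in corrections:
            # replace the attribute values recorded for this word
            for name, value in corrections[xmlid].items():
                w.set(name, value)
    sort_and_indent(tree.getroot())
    tree.write(out_path, encoding="utf-8", xml_declaration=True)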
On 30 Apr 2022, at 22:35, Martin Mueller wrote:
When I reported that problem at an earlier time, I think that Stefan advised me to introduce some step that would clear memory after each single text.
How are you reading the source file? If it really is a large file then you could do worse than read using iterparse and write using xmlfile. If you're not using iterparse then clearing elements won't make much difference, but it does when you use it. You have to learn when to use it, though, because it's easy to clear too aggressively and find it has cleared elements before you've processed them. This is especially important where you have recursive functions like yours.

But you should also provide more information about memory use: how much memory does your system have, and how much memory is the Python process using when it crashes?

Charlie

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
Düsseldorf
D-40489
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226
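A minimal sketch of the iterparse-and-clear pattern described above, assuming the TEI <w> elements and the corrections dictionary from Martin's earlier message; the write step (e.g. inside an etree.xmlfile block) is only indicated by a comment:

from lxml import etree

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
TEI_W = "{http://www.tei-c.org/ns/1.0}w"

def stream_words(path, corrections):
    # iterparse yields each <w> as soon as its end tag is seen,
    # so the whole document never has to be held in memory at once
    for event, elem in etree.iterparse(path, events=("end",), tag=TEI_W):
        xmlid = elem.get(XML_ID)
        if xmlid in corrections:
            for name, value in corrections[xmlid].items():
                elem.set(name, value)
        # ... write the element out here, e.g. via etree.xmlfile ...
        # clear only once the element is fully processed; clearing earlier
        # (e.g. inside a recursive helper) can destroy data you still need
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]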
Hi Martin,
On 30 Apr 2022, at 21:35, Martin Mueller <martinmueller@northwestern.edu> wrote:
for elem in elem:
    sort_and_indent(elem, level + 1)
I'd never write the above code, i.e. using the same variable name for the for loop's iterator as its input. The ambiguity of what the "elem" variable refers to after this point feels too dangerous; I don't trust Python's scoping rules to do what you expect. Personally I would use:

    for child_elem in elem:
        sort_and_indent(child_elem, level + 1)

As a sanity check on this, the following:

    elem = [1, 2, 3, 4, 5]
    for e in elem:
        print(e)
    print(elem)

gives the following output:

    ▶ ./elem.py
    1
    2
    3
    4
    5
    [1, 2, 3, 4, 5]

In contrast:

    elem = [1, 2, 3, 4, 5]
    for elem in elem:
        print(elem)
    print(elem)

gives:

    ▶ ./elem.py
    1
    2
    3
    4
    5
    5

That final '5' shows Python's scoping isn't as tight as you hope.

Kind regards,

aid
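For reference, here is a sketch of the function with the loop variable renamed as suggested above. The extra last_child variable reproduces what the shadowed name did implicitly (the last child's tail carries the parent's closing indentation); whether the shadowing has anything to do with the memory failures is not settled in this thread.

def sort_and_indent(elem, level: int = 0):
    # sort attributes alphabetically when there is more than one
    attrib = elem.attrib
    if len(attrib) > 1:
        attributes = sorted(attrib.items())
        attrib.clear()
        attrib.update(attributes)
    i = "\n" + " " * level
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + " "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        last_child = None
        for child_elem in elem:
            sort_and_indent(child_elem, level + 1)
            last_child = child_elem
        # the original set this via the shadowed loop variable
        if not last_child.tail or not last_child.tail.strip():
            last_child.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i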
participants (3):
- Adrian Bool
- Charlie Clark
- Martin Mueller