A "Memory allocation failed" problem

I reported the following problem some months ago, but didn't get (or missed) an answer. Here it is again. I'm not sure whether it is in fact an lxml problem, but it only occurs in one particular lxml script. That script ran without problems for about a year, but suddenly stopped working. It will now run properly through any individual file, but when run on a sequence of files it will fail after a dozen or so files with a "memory allocation failed" message. If you start from the file on which it failed, it will process that file properly, but fail after processing some more files with the same error message.

I run Python 3.7 in a conda environment in PyCharm. The failure is produced by a function that sorts attributes alphabetically and indents a TEI XML file in which every token is wrapped in a <w> element that contains between three and eight attributes. The files get edited a lot. We keep the attributes sorted to make it easier to recognize substantive changes or additions.

    def sort_and_indent(elem, level: int = 0):
        attrib = elem.attrib
        if len(attrib) > 1:
            attributes = sorted(attrib.items())
            attrib.clear()
            attrib.update(attributes)

        i = "\n" + " " * level

        if len(elem):
            if not elem.text or not elem.text.strip():
                elem.text = i + " "
            if not elem.tail or not elem.tail.strip():
                elem.tail = i
            for elem in elem:
                sort_and_indent(elem, level + 1)
            if not elem.tail or not elem.tail.strip():
                elem.tail = i
        else:
            if level and (not elem.tail or not elem.tail.strip()):
                elem.tail = i

When the function fails, it produces this error message:

    /Users/martinmueller/.conda/envs/earlyprintprocessing/bin/python /Users/martinmueller/Dropbox/earlyprintprocessing/rewriteree.py
    Traceback (most recent call last):
      File "/Users/martinmueller/Dropbox/earlyprintprocessing/rewriteree.py", line 71, in <module>
        do_etree(filename, item, counter)
      File "/Users/martinmueller/Dropbox/earlyprintprocessing/rewriteree.py", line 49, in do_etree
        tree = etree.parse(filename, parser)
      File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
      File "src/lxml/parser.pxi", line 1840, in lxml.etree._parseDocument
      File "src/lxml/parser.pxi", line 1866, in lxml.etree._parseDocumentFromURL
      File "src/lxml/parser.pxi", line 1770, in lxml.etree._parseDocFromFile
      File "src/lxml/parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
      File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
      File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
      File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
      File "/users/martinmueller/dropbox/eebochron/1470-1600/159/a/159-adp-A12229.xml", line 10137
    lxml.etree.XMLSyntaxError: Memory allocation failed, line 10137, column 24

The error message appears to be generated by lxml, but it may not be an lxml problem. I checked memory usage in the Activity Monitor of my Mac, which has 64 GB of memory. Memory usage by Python goes beyond 2 GB, but the point of failure doesn't seem to be related to the reported memory usage: it keeps running at well over 2 GB in one batch of files, but in another run it fails at well below 2 GB.

I cannot associate the onset of this problem with any particular event. I thought it could have something to do with a PyCharm update, but I just ran the script outside of PyCharm with Python 3.9 and lxml 4.6.2 and got the same error. In running the script twice from the first file in a batch, I noticed that it failed at exactly the same point in the same file.
But it cannot be a function of that file, because if you run the program starting with that file, it processes that file properly. It does appear, however, that something cumulative is going on: in moving from one file to the next, the script does not start from scratch, but keeps, or fails to clear, some memory that causes a failure when a trigger point is reached. I'd be very grateful for any help or advice on where to look for it.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University

Martin Mueller wrote on 17.03.21 at 18:35:
> for elem in elem:
>     sort_and_indent(elem, level + 1)
> if not elem.tail or not elem.tail.strip():
>     elem.tail = i

The last part reads as a bit dangerous, since it overwrites the "elem" variable in the loop. It probably works OK; it just requires at least a second look to understand what it does. And it's risky if you ever end up adding more functionality at the end of the function that still needs the original "elem" value.
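For illustration, an equivalent loop that leaves "elem" bound to the parent might look like this (a sketch, not code from the thread; "last_child" is an invented name):

    # Same behaviour as the recipe above, but without rebinding "elem":
    # the final tail fix-up still applies to the last child, which gets
    # the parent-level indent so the closing tag lines up.
    last_child = None
    for child in elem:
        sort_and_indent(child, level + 1)
        last_child = child
    if last_child is not None and (not last_child.tail or not last_child.tail.strip()):
        last_child.tail = i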
> lxml.etree.XMLSyntaxError: Memory allocation failed, line 10137, column 24

The first thing I notice here is that the failure is not in the function that you showed us, but already at the point where it parses the file.
> The error message appears to be generated by lxml, but it may not be an lxml problem. I checked memory usage in the Activity Monitor of my Mac, which has 64 GB of memory. Memory usage by Python goes beyond 2 GB, but the point of failure doesn't seem to be related to the reported memory usage: it keeps running at well over 2 GB in one batch of files, but in another run it fails at well below 2 GB.
Not sure how Macs are set up here, but try to make sure that you are using a 64-bit Python installation and not a 32-bit one. A 64-bit system would output this:

    python3 -c 'import sys; print(sys.maxsize)'
    9223372036854775807

or the same for Python 2.x:

    python2 -c 'import sys; print(sys.maxint)'
    9223372036854775807
> I cannot associate the onset of this problem with any particular event. I thought it could have something to do with a PyCharm update, but I just ran the script outside of PyCharm with Python 3.9 and lxml 4.6.2 and got the same error. In running the script twice from the first file in a batch, I noticed that it failed at exactly the same point in the same file. But it cannot be a function of that file, because if you run the program starting with that file it will process it properly.
Is there anything special about the file that it's trying to parse here? How big are these files (uncompressed)?
> It does appear, however, that something cumulative is going on: in moving from one file to the next, the script does not start from scratch, but keeps, or fails to clear, some memory that causes a failure when a trigger point is reached.
What happens to the data after parsing and processing one file? Does it get cleaned up before parsing the next one? You might need to "del" some variables before starting the next loop iteration, to make sure that the XML tree really gets released *before* parsing the next one, and not just by overwriting the variable *after* parsing it. That's a common issue with automatic memory management that can easily (and needlessly) lead to twice the memory usage for a program.

Stefan
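In code, that suggestion might look like the following sketch for a per-file loop of the kind implied by the traceback (the file list, parser options, and write arguments are assumptions, not details from the original script):

    from lxml import etree

    filenames = ["a.xml", "b.xml"]   # hypothetical batch of files
    parser = etree.XMLParser()       # the real script's parser options are unknown

    for filename in filenames:
        tree = etree.parse(filename, parser)
        sort_and_indent(tree.getroot())   # the function shown above
        tree.write(filename, encoding="utf-8", xml_declaration=True)
        # Drop the reference now, so the whole tree is freed *before*
        # the next etree.parse() call rather than only when "tree" is
        # overwritten in the next iteration.
        del tree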

Thank you, Stefan, for this advice. I shared it with a friend, who is much more expert than I. He suggested a variant of your advice: adding "tree = None" after the tree.write command whenever files are processed in a loop. This appears to work.
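In loop form, that variant amounts to the following (again a sketch mirroring the one above; only the "tree = None" line comes from the report):

    for filename in filenames:
        tree = etree.parse(filename, parser)
        sort_and_indent(tree.getroot())
        tree.write(filename, encoding="utf-8", xml_declaration=True)
        tree = None  # rebinding to None releases the tree, much like "del tree"

Rebinding to None and "del" have the same effect here: the last reference to the parsed tree disappears before the next iteration parses a new file.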
