Fwd: Lxml memory usage
junos-conf-root.xml <https://drive.google.com/file/d/1mFGxoExLIE7DopNx3uHGdHvQsqHBPFAn/view?usp=drive_web>

Hi All,

I'm chasing an elusive memory leak and it might be related to lxml. I hope you can help me to understand it better. When I parse a large XML file and let it get garbage collected, memory is not freed up. E.g. when I run the following code:

```python
import logging
import psutil
import os
import humanize
import gc

LOGGER = logging.getLogger(__name__)

def get_memory_usage(process: psutil.Process) -> int:
    with process.oneshot():
        return process.memory_full_info().data

def log_mem_diff(process: psutil.Process, message: str) -> int:
    usage = get_memory_usage(process)
    LOGGER.error(f"{message}: {humanize.naturalsize(usage)}")
    return usage

process = psutil.Process(os.getpid())

import xml.etree as etree
import xml.etree.ElementTree

def build_tree(xml):
    tree = etree.ElementTree.fromstring(xml)
    log_mem_diff(process, "In_scope")
    # tree goes out of scope here

# import lxml.etree as etree
#
# def build_tree(xml):
#     parser = etree.XMLParser(remove_blank_text=True, collect_ids=False)
#     tree = etree.XML(xml, parser)
#     log_mem_diff(process, "In_scope")

with open("junos-conf-root.xml", "r") as f:
    xml = f.read()

for i in range(0, 5):
    build_tree(xml)
    log_mem_diff(process, "before gc")
    gc.collect()
    log_mem_diff(process, "after gc")
```

I get:

```
In_scope: 1.4 GB
before gc: 1.4 GB
after gc: 1.4 GB
In_scope: 1.7 GB
before gc: 1.7 GB
after gc: 1.7 GB
In_scope: 1.7 GB
before gc: 1.7 GB
after gc: 1.7 GB
In_scope: 1.7 GB
before gc: 1.7 GB
after gc: 1.7 GB
In_scope: 1.7 GB
before gc: 1.7 GB
after gc: 1.7 GB
```

This is not a leak per se, but it behaves unexpectedly in that:

1. memory usage goes up,
2. running the GC doesn't reduce it,
3. running the code again, it doesn't keep going up.

I'm trying to understand this behavior. Could you be of assistance?
```
Python           : sys.version_info(major=3, minor=8, micro=9, releaselevel='final', serial=0)
lxml.etree       : (4, 6, 3, 0)
libxml used      : (2, 9, 10)
libxml compiled  : (2, 9, 10)
libxslt used     : (1, 1, 34)
libxslt compiled : (1, 1, 34)
```

Wouter

--
Wouter De Borger
Chief Architect, Inmanta
+32479474994
wouter.deborger@inmanta.com
www.inmanta.com
Kapeldreef 60, 3001 Heverlee
On Wed, 2021-06-02 at 16:43 +0200, Wouter De Borger wrote:
I'm chasing an elusive memory leak and it might be related to lxml. I hope you can help me to understand it better. When I parse a large XML file, and let it get garbage collected, memory is not freed up:
I suspect the issue here is "freed". What you demonstrate does not show that the memory is not being freed; if it is freed, it is available to the process for re-use. You are measuring the allocation OF THE PROCESS [allocated], not the memory being USED.

malloc + free does not typically return memory to the OS; it only makes it free in the address space OF THE PROCESS. This is not a bug; it is how modern operating systems [UNIX/Linux at least] operate within the generous address spaces of 32- and 64-bit processors.

If monitoring the working size of a process were what I wanted, I would choose rss instead of data in the process object. But no available value is really "memory used".
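The point that "freed" memory stays available to the process can be seen without lxml at all. A minimal sketch, assuming a Unix-like OS where the stdlib `resource` module is available (the 200 MiB size is illustrative; `ru_maxrss` is KiB on Linux and bytes on macOS):

```python
import resource

def peak_rss() -> int:
    # ru_maxrss is the peak resident set size of this process
    # (KiB on Linux, bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def touch(buf: bytearray) -> None:
    # Write one byte per page so the pages actually become resident.
    for i in range(0, len(buf), 4096):
        buf[i] = 1

blob = bytearray(200 * 1024 * 1024)  # allocate ~200 MiB
touch(blob)
first_peak = peak_rss()

del blob  # freed to the allocator; the OS may or may not get it back
blob = bytearray(200 * 1024 * 1024)  # freed space is reused or remapped
touch(blob)
second_peak = peak_rss()

# The peak barely grows: the second allocation did not need another
# 200 MiB on top of the first, because the "freed" memory was usable.
assert second_peak < first_peak * 1.5
```

The allocated-vs-used distinction is exactly why `memory_full_info().data` keeps reporting the high-water mark even after the trees are garbage collected.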
_______________________________________________ lxml - The Python XML Toolkit mailing list -- lxml@python.org
To unsubscribe send an email to lxml-leave@python.org
https://mail.python.org/mailman3/lists/lxml.python.org/
Member address: awilliam@whitemice.org
-- Adam Tauno Williams <mailto:awilliam@whitemice.org> GPG D95ED383 OpenGroupware Developer <http://www.opengroupware.us/>
Hi Adam,

Some more news on this: I ran valgrind, and it is indeed as you suggest: the memory is not freed to the OS. One of my colleagues found this: http://xmlsoft.org/xmlmem.html

    You may encounter that your process using libxml2 does not have a
    reduced memory usage although you freed the trees. This is because
    libxml2 allocates memory in a number of small chunks. When freeing one
    of those chunks, the OS may decide that giving this little memory back
    to the kernel will cause too much overhead and delay the operation. As
    all chunks are this small, they get actually freed but not returned to
    the kernel. On systems using glibc, there is a function call
    "malloc_trim" from malloc.h which does this missing operation (note
    that it is allowed to fail). *Thus, after freeing your tree you may
    simply try "malloc_trim(0);"* to really get the memory back. If your
    OS does not provide malloc_trim, try searching for a similar function.
I added this code:

```python
import ctypes

def trim_memory() -> int:
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)
```

This seems to fix it! Perhaps it would be good if lxml would do this by default?

Wouter
On Thu, 2021-06-03 at 10:19 +0200, Wouter De Borger wrote:
Hi Adam, Some more news on this: I ran valgrind, and it is indeed as you suggest: the memory is not freed to the OS. One of my colleagues found this: http://xmlsoft.org/xmlmem.html
With huge address spaces (64-bit) the question is whether it is worth the bother. The current implementation is as it is because the answer is rarely "yes".

I maintain a workflow engine in Python - which regularly uses lxml to grind 4GB XML files - that runs the chains of actions in subprocesses; when a subprocess is complete, all its memory is released to the OS. In the meantime, if there is actual memory pressure, the OS can page (swap) - if there are lots of unused pages they can get shuttled out of RAM, then marked as free when the process dies. A modern Linux kernel is ridiculously efficient at this. The Python multiprocessing module is excellent for creating schemes of worker processes.
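The worker-process pattern described above can be sketched with the stdlib `multiprocessing` module. This is only an illustration (the tiny generated XML file stands in for a real multi-GB document); `maxtasksperchild=1` makes the pool retire the worker after each task, so every page the parser allocated goes back to the OS when the child exits:

```python
import multiprocessing as mp
import os
import tempfile
import xml.etree.ElementTree as ET

def count_elements(path: str) -> int:
    # All parsing allocations happen inside the worker process.
    root = ET.parse(path).getroot()
    return sum(1 for _ in root.iter())

def parse_in_subprocess(path: str) -> int:
    # maxtasksperchild=1: the worker exits after one task, returning
    # all of its memory to the OS; only the small result crosses back.
    with mp.Pool(processes=1, maxtasksperchild=1) as pool:
        return pool.apply(count_elements, (path,))

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as f:
        f.write("<root><a/><b><c/></b></root>")
        path = f.name
    try:
        print(parse_in_subprocess(path))  # root, a, b, c -> prints 4
    finally:
        os.unlink(path)
```

The parent process never parses the document itself, so its footprint stays flat regardless of what libxml2's allocator does inside the worker.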
I added this code:

```python
import ctypes

def trim_memory() -> int:
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)
```

This seems to fix it! Perhaps it would be good if lxml would do this by default?
It wouldn't be portable, which is likely an argument against it. I'd assume also that it could take some time in a synchronous fashion.

--
Adam Tauno Williams <mailto:awilliam@whitemice.org> GPG D95ED383
OpenGroupware Developer <http://www.opengroupware.us/>
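A more defensive variant of the `trim_memory` helper from earlier in the thread, addressing the portability concern, might look like this. This is a sketch, not anything lxml provides: the platform check and the symbol lookup guard against non-Linux systems and non-glibc libcs (musl, for instance, has no `malloc_trim`):

```python
import ctypes
import ctypes.util
import sys

def trim_memory() -> bool:
    """Best-effort malloc_trim(0); returns False where unavailable."""
    if not sys.platform.startswith("linux"):
        return False  # malloc_trim is a glibc extension
    libc_name = ctypes.util.find_library("c") or "libc.so.6"
    try:
        libc = ctypes.CDLL(libc_name)
        # On libcs without malloc_trim, the attribute lookup raises.
        return bool(libc.malloc_trim(0))
    except (OSError, AttributeError):
        return False
```

Note that glibc's `malloc_trim(0)` itself returns 1 only when memory was actually released, so callers should treat a `False` result as "nothing trimmed", not as an error.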
participants (2)
-
Adam Tauno Williams
-
Wouter De Borger