Re: [lxml-dev] Need feedback on Memory Errors

The runtime vs. compile-time library difference went unnoticed (I missed the 500 lb. gorilla) until my ride home last night, even though it was right in front of me. The long ride home is often when things that elude me come together. I was concerned I had opened myself up to justifiably harsh scrutiny. Thanks for kindly confirming.

Also, thanks for the helpful insights. I generally run top to see what my code is using, but I will add prstat to my monitoring. I will recompile with the correct env vars and retest.

I was considering the possible memory footprint of the current implementation but wanted to finish version 1. I will try switching to cElementTree and compare it to the current code. I am also going to investigate an event-driven parse_and_append approach, since lxml provides such a mechanism, and I believe that could reduce memory usage drastically.

Thanks for the very useful feedback and have a good weekend.

Marc

-----Original Message-----
From: lxml-dev-bounces@codespeak.net [mailto:lxml-dev-bounces@codespeak.net] On Behalf Of Stefan Behnel
Sent: Thursday, November 04, 2010 6:01 PM
To: Graff, Marc
Cc: lxml-dev@codespeak.net
Subject: Re: [lxml-dev] Need feedback on Memory Errors

Hi,

Graff, Marc, 04.11.2010 19:41:
I just finished an app that parses a large XML file "FeedA" and appends another smaller file fragmentB to the tree from FeedA for an XPath-specified parent node. All seems fine when processing a file less than 500MB but anything larger results in one of two errors.
You may not be aware of it, but this is huge. If that's just the size of the serialised XML, this means that the in-memory tree representation is several times that size, easily 10x or more. Depending on the text-to-tag ratio in the content, it may well reach the size of your available memory. Check the size of the Python process while it's building the tree; prstat is your friend.
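
As a rough complement to watching the process with prstat or top, the sketch below reads the peak resident memory from inside the process before and after building a tree. It is only an illustration: it uses the standard library's xml.etree.ElementTree on a small generated document instead of lxml and the real feed, and the helper names (peak_rss, build_sample_xml, measure_parse) are made up here. The unit of ru_maxrss is platform-dependent (kilobytes on Linux, bytes on macOS), and some systems may not report it at all, in which case an external tool remains the better option.

```python
import io
import resource
import xml.etree.ElementTree as ET


def peak_rss():
    """Peak resident set size of this process, in platform-dependent units."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def build_sample_xml(n_items):
    """Generate a small stand-in document (hypothetical structure)."""
    items = "".join("<item id='%d'>text %d</item>" % (i, i) for i in range(n_items))
    return ("<feed>%s</feed>" % items).encode("utf-8")


def measure_parse(data):
    """Parse `data`; return (serialised size, peak RSS before, peak RSS after)."""
    before = peak_rss()
    tree = ET.parse(io.BytesIO(data))
    after = peak_rss()
    # Touch the tree so it is still alive when the second reading is taken.
    assert tree.getroot() is not None
    return len(data), before, after


if __name__ == "__main__":
    size, before, after = measure_parse(build_sample_xml(200000))
    print("serialised bytes:", size)
    print("peak RSS before/after parse:", before, after)
```

Comparing the growth in peak memory against the serialised size gives a first estimate of the tree-to-file ratio for a given kind of content.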
All libs were built from src in my home dir, and LD_LIBRARY_PATH reflects the home dir lib. Not sure if that will distort the following lib details:
lxml.etree: (2, 2, 8, 0)
libxml used: (2, 7, 7)
libxml compiled: (2, 6, 23)
libxslt used: (1, 1, 26)
libxslt compiled: (1, 1, 15)
Try to build against the libraries that you use at runtime. lxml has several bug work-arounds and compile time adaptations for the various library versions. A major discrepancy between the version used at compile time and runtime, such as in your case, may have unexpected side effects. You can pass the path to the configuration scripts (xml2-config and xslt-config in the bin directories of the install destinations) using the XML2_CONFIG and XSLT_CONFIG environment variables.
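
A small check along these lines can flag such a discrepancy after a rebuild. This is a sketch, not part of lxml: version_mismatches and lxml_versions are hypothetical helper names, while the four version attributes read inside lxml_versions are the ones lxml.etree exposes. The sample data repeats the values reported in this thread.

```python
def version_mismatches(versions):
    """Return the names of libraries whose runtime and compiled versions differ.

    `versions` maps a library name to a (runtime, compiled) pair of tuples.
    """
    return sorted(
        name for name, (used, compiled) in versions.items() if used != compiled
    )


def lxml_versions():
    """Collect the version tuples that lxml exposes (requires lxml installed)."""
    from lxml import etree

    return {
        "libxml2": (etree.LIBXML_VERSION, etree.LIBXML_COMPILED_VERSION),
        "libxslt": (etree.LIBXSLT_VERSION, etree.LIBXSLT_COMPILED_VERSION),
    }


# The values reported in this thread.
reported = {
    "libxml2": ((2, 7, 7), (2, 6, 23)),
    "libxslt": ((1, 1, 26), (1, 1, 15)),
}

if __name__ == "__main__":
    print(version_mismatches(reported))
```

After rebuilding against the runtime libraries, calling version_mismatches(lxml_versions()) should return an empty list.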
There should be ample memory. This is running on a Solaris M5000 with 96GB of memory and ulimit is unlimited. The FeedA test file contains valid XML and is the same test file for the 512MB, 768MB, 1.5GB and 3GB file tests.
Just over 500MB and the app returns a MemoryError in serializer.pxi.
The serialiser needs to reallocate additional memory step by step while it's doing its work. Normally, the OS handles this by enlarging the allocated area without copying. However, if available memory runs low, memory fragmentation may force the allocation of a completely new, very large memory area into which the previously allocated memory is copied, and that allocation may easily fail since memory is already low. So even if there is some memory left in the system, it may not be enough to satisfy the memory allocation scheme at hand. Remember that your output alone is 500-3000 MB in one single piece of memory, and libxml2 can't know in advance that it will need that much.

So, please monitor the memory consumption of the process. If you are really running out of memory, one thing you can try is to switch to cElementTree (xml.etree.cElementTree). It has a somewhat lighter memory footprint, which may be just enough to make a difference here (although likely not for 3GB of XML). It also has fewer features than lxml.etree (the gap is a bit smaller in Py2.7/ET1.3), but currently, your only real problem seems to be the memory requirement.

Stefan
_______________________________________________
lxml-dev mailing list
lxml-dev@codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev
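
For the event-driven approach mentioned at the top of this thread, a minimal sketch might look like the following. It is not the code from the original app: stream_append is a hypothetical name, the example uses the standard library's xml.etree.ElementTree (lxml.etree offers an iterparse function as well), and it assumes that the root element has no attributes or namespaces and that the target parent is a direct child of the root. The idea is to write each completed child of the root to the output and then drop it, so that neither the full tree nor the full serialised output is held in memory at once.

```python
import io
import xml.etree.ElementTree as ET


def stream_append(source, fragment, parent_tag, out):
    """Copy `source` to `out`, appending `fragment` to each `parent_tag` child.

    Simplifying assumptions: the root has no attributes or namespaces, and
    the target parent is a direct child of the root.
    """
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth == 1:
                # Emit the root's opening tag by hand; its children follow.
                root = elem
                out.write(("<%s>" % root.tag).encode("utf-8"))
        else:
            depth -= 1
            if depth == 1:
                # A complete child of the root is now in memory.
                if elem.tag == parent_tag:
                    elem.append(fragment)
                out.write(ET.tostring(elem))
                # Drop the child so the tree does not keep growing.
                root.remove(elem)
            elif depth == 0:
                out.write(("</%s>" % root.tag).encode("utf-8"))


if __name__ == "__main__":
    src = io.BytesIO(b"<feed><a>1</a><target><x/></target><b>2</b></feed>")
    out = io.BytesIO()
    stream_append(src, ET.fromstring("<extra>new</extra>"), "target", out)
    print(out.getvalue().decode("ascii"))
```

With real files, source and out would be file objects opened in binary mode. Whether this is enough for a 3GB feed depends on how large the individual children of the root are, since each one is still built in full before it is written.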