Mailman 3 Re: [lxml-dev] Need feedback on Memory Errors - lxml - The Python XML Toolkit

Nov. 5, 2010

      ...
I just finished an app that parses a large xml file "FeedA" and appends
another smaller file fragmentB to the tree from FeedA for an xpath
specified parent node.  All seems fine when processing a file less
The runtime vs. compile time lib difference went unrealized (missed the
500 lb. gorilla) until my ride home last night even though it was
right in front of me.  The long ride home is often when things that
elude me come together.  I was concerned I opened myself up to
justifiably harsh scrutiny.  Thanks for kindly confirming.  Also thanks
for the helpful insights.  I generally run top to see what my code is
using but will add prstat to my monitoring.  I will recompile with
the correct env vars and retest.

I was considering the possible memory footprint of the current
implementation but wanted to finish version 1.  I will try switching to
cElementTree and compare to the current code.  I am also going to
investigate an event driven parse_and_append approach since lxml provides
such a mechanism and I believe that could reduce memory usage
drastically.
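
A minimal sketch of what such an event driven parse_and_append could
look like. It uses the standard library's xml.etree.ElementTree.iterparse
(lxml.etree.iterparse has a similar interface) and assumes the parent
nodes are direct children of the root and that no namespaces are in
play. The function name, the tag names and the sample data are
hypothetical, not taken from the real app.

```python
import copy
import io
import xml.etree.ElementTree as ET


def stream_append(source, sink, parent_tag, fragment_xml):
    """Copy source to sink one top-level record at a time, appending a
    copy of the fragment to every record whose tag equals parent_tag.

    Simplifications: root attributes and text are dropped, namespaces
    are not handled, and whitespace between records may be lost.
    """
    fragment = ET.fromstring(fragment_xml)
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)          # first event: start of the root element
    sink.write("<%s>" % root.tag)
    depth = 0                        # nesting depth below the root
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 0:               # a direct child of the root just closed
            if elem.tag == parent_tag:
                elem.append(copy.deepcopy(fragment))
            sink.write(ET.tostring(elem, encoding="unicode"))
            root.clear()             # release the records written so far
    sink.write("</%s>" % root.tag)


# Tiny in-memory stand-ins for FeedA and fragmentB (hypothetical data).
feed_a = io.BytesIO(b"<feed><other/><target><x/></target></feed>")
output = io.StringIO()
stream_append(feed_a, output, "target", "<b>new</b>")
print(output.getvalue())
```

Only one top-level record is held in memory at a time, so the peak
footprint should track the largest record rather than the whole feed.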

Thanks for the very useful feedback and have a good weekend.
    Marc

-----Original Message-----
From: lxml-dev-bounces@codespeak.net
[mailto:lxml-dev-bounces@codespeak.net] On Behalf Of Stefan Behnel
Sent: Thursday, November 04, 2010 6:01 PM
To: Graff, Marc
Cc: lxml-dev@codespeak.net
Subject: Re: [lxml-dev] Need feedback on Memory Errors

Hi,

Graff, Marc, 04.11.2010 19:41:
... than 500MB but anything larger results in one of two errors.
You may not be aware of it, but this is huge. If that's just the size of
the serialised XML, this means that the in-memory tree representation is
several times that size, easily 10x or more. Depending on the
text-to-tag ratio in the content, it may well reach the size of your
available memory.

Check the size of the Python process while it's building the tree,
prstat is your friend.
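
Besides watching the process from outside, the peak size can also be
read from inside the script. This is only a sketch: it relies on the
Unix-only resource module, the unit of ru_maxrss differs by platform
(kilobytes on Linux, bytes on macOS), and whether the value is
meaningful on a given system is an assumption to verify. The sample
document is made up.

```python
import resource
import xml.etree.ElementTree as ET


def peak_rss():
    """Peak resident set size of this process, in platform units."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


before = peak_rss()
# Build a tree from a synthetic document (hypothetical stand-in for FeedA).
root = ET.fromstring("<feed>" + "<item>text</item>" * 100000 + "</feed>")
after = peak_rss()
print("peak RSS before: %d, after: %d" % (before, after))
```

Comparing the growth against the size of the input file gives a rough
idea of the tree-to-file ratio for a particular feed.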
...
All libs were built from src in my home dir and LD_LIBRARY_PATH reflects
the home dir lib.  Not sure if that will distort the following lib
details
lxml.etree:        (2, 2, 8, 0)
libxml used:       (2, 7, 7)
libxml compiled:   (2, 6, 23)
libxslt used:      (1, 1, 26)
libxslt compiled:  (1, 1, 15)
Try to build against the libraries that you use at runtime. lxml has
several bug work-arounds and compile time adaptations for the various
library versions. A major discrepancy between the version used at
compile time and runtime, such as in your case, may have unexpected side
effects.
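
A small sketch for spotting such a discrepancy programmatically. The
helper name is hypothetical; the version tuples in the first check are
the ones reported in this thread. The second part reads the version
attributes that lxml.etree exposes and is skipped when lxml is not
installed.

```python
def mismatched(versions):
    """Return the library names whose runtime and compile-time versions
    differ, given {name: (used, compiled)}."""
    return [name for name, (used, compiled) in sorted(versions.items())
            if tuple(used) != tuple(compiled)]


# Values reported earlier in this thread.
reported = {
    "libxml2": ((2, 7, 7), (2, 6, 23)),
    "libxslt": ((1, 1, 26), (1, 1, 15)),
}
print(mismatched(reported))

try:
    from lxml import etree
except ImportError:
    etree = None                     # lxml not available, skip the live check

if etree is not None:
    local = {
        "libxml2": (etree.LIBXML_VERSION, etree.LIBXML_COMPILED_VERSION),
        "libxslt": (etree.LIBXSLT_VERSION, etree.LIBXSLT_COMPILED_VERSION),
    }
    print(mismatched(local))
```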

You can pass the path to the configuration scripts (xml2-config and
xslt-config in the bin directories of the install destinations) using
the XML2_CONFIG and XSLT_CONFIG environment variables.
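
As a sketch, the environment for such a build could be prepared like
this. The helper name and the install prefix are hypothetical, and it
assumes both libraries were installed under the same prefix. The build
command in the trailing comment is likewise an assumption about how the
lxml source build is started.

```python
import os


def lxml_build_env(prefix, base_env=None):
    """Return a copy of the environment with XML2_CONFIG and XSLT_CONFIG
    pointing at the config scripts under prefix/bin."""
    env = dict(os.environ if base_env is None else base_env)
    env["XML2_CONFIG"] = os.path.join(prefix, "bin", "xml2-config")
    env["XSLT_CONFIG"] = os.path.join(prefix, "bin", "xslt-config")
    return env


# Hypothetical prefix; an empty base environment keeps the output short.
env = lxml_build_env("/home/user/local", base_env={})
for name in sorted(env):
    print("%s=%s" % (name, env[name]))

# Assumed build invocation, run from the lxml source directory:
#   subprocess.check_call([sys.executable, "setup.py", "build"],
#                         env=lxml_build_env("/home/user/local"))
```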
...
There should be ample memory.  This is running on a Solaris M5000 with
96GB of memory and unlimit is unlimited.  The FeedA test file contains
valid xml and is the same test file for 512MB, 768MB, 1.5GB and 3GB file
tests.
Just over 500MB and the app returns a MemoryError on the serializer.pxi.
The serialiser needs to reallocate additional memory step by step while
it's doing its work. Normally, the OS handles this by enlarging the
allocated area and without copying. However, if the available memory
runs low, memory fragmentation may trigger the allocation of a
completely new memory area of very large size to copy the previously
allocated memory into, which may easily fail since memory is low
already. So even if there is some memory left in the system, it may not
be enough to satisfy the memory allocation scheme at hand. Remember that
your output alone is 500-3000 MB in one single piece of memory, and
libxml2 can't know in advance that it will need that much.
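
To illustrate the difference between one large output buffer and
incremental output, here is a sketch using the standard library's
ElementTree (lxml.etree trees offer a write() method with a similar
interface). It assumes the result is meant to end up in a file; whether
this helps in a given app depends on how the output is consumed. The
sample tree and file name are made up.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Small synthetic tree (hypothetical stand-in for the merged feed).
root = ET.Element("feed")
for i in range(1000):
    ET.SubElement(root, "item").text = str(i)
tree = ET.ElementTree(root)

# Variant 1: the whole document becomes one bytes object in memory.
as_bytes = ET.tostring(root)

# Variant 2: the tree is written to the file piece by piece, so no
# single object has to hold the complete serialised output.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "feed_out.xml")
    with open(path, "wb") as handle:
        tree.write(handle)
    size_on_disk = os.path.getsize(path)

print(len(as_bytes), size_on_disk)
```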

So, please monitor the memory consumption of the process. If you are
really running out of memory, one thing you can try is to switch to
cElementTree (xml.etree.cElementTree). It has a somewhat lighter memory
footprint which may just be enough to make a difference here (although
likely not for 3GB of XML). It also has fewer features than lxml.etree
(the gap is a bit smaller in Py2.7/ET1.3), but currently, your only real
problem seems to be the memory requirement.
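
A sketch of an import that tries cElementTree and falls back to the
plain ElementTree module. On later Python versions the fallback is
what runs: xml.etree.cElementTree was removed in Python 3.9, and since
Python 3.3 xml.etree.ElementTree picks up the C accelerator on its own.
The sample document is made up.

```python
try:
    import xml.etree.cElementTree as ET   # Python 2.x and Python 3 before 3.9
except ImportError:
    import xml.etree.ElementTree as ET    # uses the C accelerator when available

# The parsing API is the same either way.
doc = ET.fromstring("<feed><item>1</item><item>2</item></feed>")
values = [item.text for item in doc.findall("item")]
print(values)
```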

Stefan
_______________________________________________
lxml-dev mailing list
lxml-dev@codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev
