Memory usage with iterparse()

Hi, I'm trying to use iterparse to parse a large XML file without bringing it all into memory at once. However, I notice that the position of the element in the XML file makes a huge difference in how much memory gets used. In this case, I'm parsing a 41MB XML file with one "<venue>" element near the top of the file and others at the bottom. When parsing the "<venue>" element near the top of the file, memory usage is low (~6MB RSS) but by the time we get down to the bottom of the file it jumps to ~150MB. Any ideas how to avoid this?

[rhoover@localhost]$ ps -o "rss,pmem,vsz" -p 8665
   RSS %MEM    VSZ
  5876  0.5  11520
[rhoover@localhost]$ ps -o "rss,pmem,vsz" -p 8665
   RSS %MEM    VSZ
149744 14.4 155688

import time
from lxml import etree

def do_it(elem):
    time.sleep(10)

def fast_iter(context, func):
    for event, elem in context:
        func(elem)
        elem.clear()
    del context

context = etree.iterparse("vtop.xml", events=['end'], tag='venue')
fast_iter(context, do_it)

Simplified the example to the bare minimum.

context = etree.iterparse("vtop.xml", events=['end'], tag='venue')
for event, elem in context:
    time.sleep(10)
    elem.clear()

On Mon, Apr 4, 2011 at 5:05 PM, Roger Hoover <roger.hoover@gmail.com> wrote:

Anyone? This behavior seems to largely defeat the purpose of the iterparse() functionality.

On Mon, Apr 4, 2011 at 5:11 PM, Roger Hoover <roger.hoover@gmail.com> wrote:

Note that iterparse() *builds the tree* (http://lxml.de/parsing.html#iterparse-and-iterwalk), so it seems logical that memory usage grows the more you've read from your input file. Maybe one of the other parsing options suits your use case better, e.g. the target parser (http://lxml.de/parsing.html#the-target-parser-interface)?

Holger
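[Editor's sketch of the target parser interface Holger points to: the parser never builds a tree, it calls methods on a target object as it reads. This minimal example uses the stdlib's xml.etree.ElementTree so it is self-contained; lxml's etree.XMLParser accepts the same target= argument. The <venue>-counting task and class name are illustrative only.]

```python
import xml.etree.ElementTree as ET  # lxml's etree.XMLParser accepts the same target= argument

class VenueCounter:
    """Target object: the parser calls these methods instead of building a tree."""
    def __init__(self):
        self.count = 0

    def start(self, tag, attrib):   # opening tag seen
        if tag == 'venue':
            self.count += 1

    def end(self, tag):             # closing tag seen
        pass

    def data(self, text):           # character data seen
        pass

    def close(self):                # end of document; this return value
        return self.count           # becomes the result of parser.close()

parser = ET.XMLParser(target=VenueCounter())
parser.feed(b"<root><venue/><other/><venue/></root>")
result = parser.close()
print(result)  # 2
```

Because no tree is kept at all, memory stays flat no matter where in the file the interesting elements appear.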

Roger Hoover, 05.04.2011 02:11:
Sure. You should be aware that iterparse() actually builds an in-memory tree for you. So even if you only intercept on one single element, it will still build the entire tree up to that point. If that tree is too large for you, then your use case doesn't match the "tag" option well. Instead, do a simple "if elem.tag == X" test inside of the loop and clear elements that you do not deem interesting (usually large children of the root element) in order to trade speed for memory savings.

Stefan
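[Editor's sketch of the pattern Stefan describes: no tag filter, an explicit tag test in the loop, and clearing of uninteresting subtrees. The stdlib's xml.etree.ElementTree is used so the sketch is self-contained; its iterparse() shares the basic event interface with lxml's. The <venue>/<id>/<filler> element names are illustrative only.]

```python
import io
import xml.etree.ElementTree as ET  # lxml.etree.iterparse shares this basic interface

xml = (b"<root>"
       b"<venue><id>1</id></venue>"
       b"<filler>lots of uninteresting data</filler>"
       b"<venue><id>2</id></venue>"
       b"</root>")

ids = []
for event, elem in ET.iterparse(io.BytesIO(xml), events=('end',)):
    if elem.tag == 'venue':
        ids.append(elem.find('id').text)  # handle the interesting subtree...
        elem.clear()                      # ...then free its children
    elif elem.tag == 'filler':
        elem.clear()                      # not interesting: discard immediately

print(ids)  # ['1', '2']
```

Note that only top-level children of the root are cleared here; clearing every element on its 'end' event would wipe <id> before its parent <venue> is handled.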

Thank you Stefan and Holger for the responses. I was thinking that it only constructed the part of the tree at the tag or below. The issue in my case is that the tags I'm looking for are at the bottom of the file, so by the time I receive the first event, a massive tree has been brought into memory.

As you've suggested, I could skip the tag parameter and free up pieces of the tree that I don't care about along the way. If I do that, it seems like the usage pattern becomes nearly identical to the xml.dom.pulldom API.

On Fri, Apr 8, 2011 at 12:14 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:

The best case would be if iterparse() could support a list of tags, just like it handles a list of events. Since that's not the case (at least yet), I removed the tag filter and keep state myself. CPU usage spikes to near 100%, but this approach is still vastly (~20x) faster than xml.dom.pulldom.

FYI, for the 41MB file I'm testing with:

lxml.etree.iterparse

real    0m3.616s
user    0m3.360s
sys     0m0.135s

xml.dom.pulldom

real    0m53.115s
user    0m51.240s
sys     0m1.098s

The code structure:

context = etree.iterparse("full.xml", events=['start', 'end'])
inVenue = False
for event, elem in context:
    if not inVenue and event == 'start' and elem.tag == 'venue':
        inVenue = True
    elif inVenue and event == 'end' and elem.tag == 'venue':
        inVenue = False
        print "FOUND VENUE %s" % elem.find('id').text
    if not inVenue:
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

On Fri, Apr 8, 2011 at 1:09 PM, Roger Hoover <roger.hoover@gmail.com> wrote:

Roger Hoover, 12.04.2011 01:13:
The best case would be if iterparse() could support a list of tags just like it handles a list of events.
Hmm, interesting. This would really make a nice and clean extension, even for all iterators that accept a 'tag' argument. Care to file a feature request?
Cool, thanks for the numbers. I would have expected that, but it's always better to have other people come up with them than the project itself.
Have you tried intercepting only on 'end' events instead? That tends to be a lot faster (well, depending on the subtree-size to total size ratio), and you can still use findall() or XPath to find the "venue" elements in interesting subtrees to work with them. You'd have to test for more tags, though, and clear()/del them after the fact, in order to keep the memory noise low.

Stefan

On Mon, Apr 11, 2011 at 11:12 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Sure thing. I just filed it on github. Is that the right place? https://github.com/lxml/lxml/issues/3
They're even better now with your suggestion.
Great idea, thanks. This cut the time almost in half and kept memory low.

events=['start', 'end']

real    0m3.616s
user    0m3.360s
sys     0m0.135s

events=['end']

real    0m1.864s
user    0m1.664s
sys     0m0.098s

context = etree.iterparse("full.xml", events=['end'])
for event, elem in context:
    if elem.tag == 'venue':
        print "FOUND VENUE %s" % elem.find('id').text
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    elif elem.tag == 'event':
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

Roger Hoover, 12.04.2011 19:07:
That's fine. The original bug tracker is on launchpad, but the one on github works a lot better. Eventually, it would be good to move everything over, but that suffers from the obvious "needs an effort" problem...

Stefan
participants (3)

- jholg@gmx.de
- Roger Hoover
- Stefan Behnel