Trying to parse a HUGE(1gb) xml file
Stefan Behnel
stefan_ml at behnel.de
Tue Dec 21 03:31:50 EST 2010
spaceman-spiff, 20.12.2010 21:29:
> I am sorry I left out what exactly I am trying to do.
>
> 0. Goal: I am looking for a specific element. There are several tens to hundreds of occurrences of that element in the 1 GB XML file.
> The contents of the XML are just a dump of config parameters from a packet switch (although IMHO the contents of the XML don't matter).
>
> I need to detect them and then, for each one, copy all the content between the element's start and end tags and create a smaller XML file.
Then cElementTree's iterparse() is your friend. It allows you to basically
iterate over the XML tags while it's building an in-memory tree from them.
That way, you can either remove subtrees from the tree if you don't need
them (to save memory) or otherwise handle them in any way you like, such as
serialising them into a new file (and then deleting them).
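A rough sketch of that approach, not a tuned implementation: it uses
xml.etree.ElementTree, which has included the C accelerator formerly
imported as cElementTree since Python 3.3. The function name
extract_elements, the out_pattern parameter and the file naming are made
up for illustration, and it assumes that only the outermost occurrence of
the tag should be written when matches are nested.

```python
import xml.etree.ElementTree as ET


def extract_elements(source, tag, out_pattern="part_{:04d}.xml"):
    """Write each outermost <tag> subtree in `source` to its own file.

    `source` is a file name or a binary file object. Returns the list of
    file names written. All names here are illustrative assumptions.
    """
    written = []
    depth = 0    # number of <tag> elements that are currently open
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if root is None:
            root = elem  # the very first "start" event delivers the root
        if event == "start":
            if elem.tag == tag:
                depth += 1
            continue
        if elem.tag == tag:
            depth -= 1
            if depth == 0:
                name = out_pattern.format(len(written))
                elem.tail = None  # do not copy text that follows the end tag
                ET.ElementTree(elem).write(
                    name, encoding="utf-8", xml_declaration=True)
                written.append(name)
        if depth == 0:
            # Outside any match: drop everything collected under the root
            # so far, so memory use stays bounded by the largest match.
            root.clear()
    return written
```

Calling extract_elements("dump.xml", "port") (both names hypothetical)
would write part_0000.xml, part_0001.xml and so on into the current
directory.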
Also note that the iterparse implementation in lxml.etree allows you to
specify a tag name to restrict the iterator to these tags. That's usually a
lot faster, but it also means that you need to take more care to clean up
the parts of the tree that the iterator stepped over. Depending on your
requirements and the amount of manual code optimisation that you want to
invest, either cElementTree or lxml.etree may perform better for you.
It seems that you already found the article by Liza Daly about high
performance XML processing with Python. Give it another read; it has a
couple of good hints and examples that will help you here.
Stefan