Python parsing XML file problem with SAX
aahz at pythoncraft.com
Tue Aug 24 17:37:59 CEST 2010
In article <mailman.1895.1281422126.1673.python-list at python.org>,
Stefan Behnel <stefan_ml at behnel.de> wrote:
>Christian Heimes, 10.08.2010 01:39:
>> On 10.08.2010 01:20, Aahz wrote:
>>> The docs say, "Parses an XML section into an element tree incrementally".
>>> Sure sounds like it retains the entire parsed tree in RAM. Not good.
>>> Again, how do you parse an XML file larger than your available memory
>>> using something other than SAX?
>> The document at
>> http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ explains it
>> one way.
>> The iterparser approach is ingenious but it doesn't work for every XML
>> format. Let's say you have a 10 GB XML file with one million <part/>
>> tags. An iterparser doesn't load the entire document. Instead it
>> iterates over the file and yields (for example) one million ElementTrees,
>> one for each <part/> tag and its children. You can get the nice API of
>> ElementTree with the memory efficiency of a SAX parser if you follow
>> "Listing 4".
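
For the archive, here is a minimal sketch of that idea using only the
standard library. It is not a copy of "Listing 4" from the article; it is
a stdlib variant that grabs the root element from the first "start" event
and clears it after each <part/> has been handled. The helper name
iter_parts, the tag "part", and the sample document are made up for
illustration, and the sketch assumes every <part/> is a direct child of
the root.

```python
import io
import xml.etree.ElementTree as ET


def iter_parts(source, tag="part"):
    """Yield each completed <tag> element, then drop it from the tree.

    Hypothetical helper; assumes the elements of interest are direct
    children of the root element.
    """
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the "start" of the root element.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Detach everything parsed so far from the root so that the
            # processed elements can be garbage-collected.
            root.clear()


# Tiny in-memory stand-in for a very large file.
sample = io.BytesIO(
    b"<parts>"
    b"<part id='1'><name>bolt</name></part>"
    b"<part id='2'><name>nut</name></part>"
    b"</parts>"
)
names = [part.findtext("name") for part in iter_parts(sample)]
print(names)
```

Each element must be fully consumed before the generator is resumed,
because resuming it clears the root and with it the element just yielded.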
>In the very common case that you are interested in all children of the root
>element, it's even enough to intercept on the specific tag name (lxml.etree
>has an option for that, but an 'if' block will do just fine in ET) and just
>".clear()" the child element at the end of the loop body. That results in
>very fast and simple code, but will leave the tags in the tree while only
>removing their content and attributes. That usually works well enough for
>several tens of thousands of elements, especially when using cElementTree.
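
A rough sketch of that simpler variant, again with the standard library
only: an 'if' on the tag name, then .clear() on the child at the end of
the loop body. The function name count_parts and the sample data are
invented for illustration, and the tag filter that lxml.etree offers is
not used here. As described above, the emptied tags stay attached to the
root.

```python
import io
import xml.etree.ElementTree as ET


def count_parts(source, tag="part"):
    """Count <tag> elements, clearing each one after it is processed.

    Hypothetical helper; returns the count and the root element so the
    leftover empty tags can be inspected.
    """
    count = 0
    root = None
    # By default iterparse reports only "end" events.
    for event, elem in ET.iterparse(source):
        if elem.tag == tag:
            count += 1  # real processing of the element would go here
            # Drops children, text and attributes, but the now-empty
            # element itself stays in the root's list of children.
            elem.clear()
        # The last "end" event belongs to the root element.
        root = elem
    return count, root


sample = io.BytesIO(
    b"<parts>" + b"<part id='1'><name>bolt</name></part>" * 3 + b"</parts>"
)
count, root = count_parts(sample)
print(count, len(root))
```

The second number printed is the count of empty <part/> shells still
hanging off the root, which is the memory that this approach does not
give back.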
Thanks to both of you!
Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/
"...if I were on life-support, I'd rather have it run by a Gameboy than a
Windows box." --Cliff Wells