Trying to parse a HUGE(1gb) xml file

Stefan Behnel stefan_ml at behnel.de
Tue Dec 21 03:16:21 EST 2010


Adam Tauno Williams, 20.12.2010 20:49:
> On Mon, 2010-12-20 at 11:34 -0800, spaceman-spiff wrote:
>> This is a rather long post, but I wanted to include all the details &
>> everything I have tried so far myself, so please bear with me & read
>> the entire boringly long post.
>> I am trying to parse a ginormous (~1 GB) XML file.
>
> Do that hundreds of times a day.
>
>> 0. I am a python & XML n00b, so I have been relying on the excellent
>> beginner book DIP (Dive_Into_Python3) by MP (Mark Pilgrim).... (Mark, if
>> you are reading this, you are AWESOME & so is your witty & humorous
>> writing style)
>> 1. Almost all examples of parsing XML in python I have seen start off with these 4 lines of code:
>> import xml.etree.ElementTree as etree

Try

     import xml.etree.cElementTree as etree

instead. Note the leading "c", which hints at the C implementation of 
ElementTree. It's much faster and much more memory-friendly than the Python 
implementation.
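A common idiom for portable code is to try the C-accelerated module first and fall back to the pure-Python one when it is unavailable, along these lines:

```python
try:
    # C-accelerated implementation: much faster, far less memory
    import xml.etree.cElementTree as etree
except ImportError:
    # fall back to the pure-Python implementation
    import xml.etree.ElementTree as etree
```

Either way, the rest of the code can use `etree` unchanged, since both modules expose the same API.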


>> tree = etree.parse('*path_to_ginormous_xml*')
>> root = tree.getroot()  #my huge xml has 1 root at the top level
>> print root
>
> Yes, this is a terrible technique;  most examples are crap.
>
>> 2. In the 2nd line of code above, as Mark explains in DIP, the parse
>> function builds&  returns a tree object, in-memory(RAM), which
>> represents the entire document.
>> I tried this code, which works fine for a small file (~1 MB), but when I
>> run this simple 4-line py code in a terminal for my HUGE target file
>> (1 GB), nothing happens.
>> In a separate terminal, I run the top command, & I can see a python
>> process with memory (the VIRT column) increasing from 100 MB all the
>> way up to 2100 MB.
>
> Yes, this is using DOM.  DOM is evil and the enemy, full-stop.

Actually, ElementTree is not "DOM"; it's modelled after the XML Infoset. 
While I agree that DOM is, well, maybe not "the enemy", but not exactly 
beautiful either, ElementTree is really a good thing, likely also in this case.


>> I am guessing that, as this happens (over the course of 20-30 mins), the
>> tree representing the document is being slowly built in memory, but even
>> after 30-40 mins, nothing happens.
>> I don't get an error, segfault, or out-of-memory exception.
>
> You need to process the document as a stream of elements; aka SAX.

IMHO, this is the worst advice you can give.
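For what it's worth, ElementTree itself offers a streaming middle ground between building the whole tree and raw SAX callbacks: `iterparse()` delivers elements incrementally, and clearing each one after processing keeps memory bounded. A minimal sketch, with a tiny in-memory document standing in for the 1 GB file:

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical sample data standing in for the huge file; a real file
# object (opened in binary mode) works the same way.
data = io.BytesIO(b"<root><item id='1'/><item id='2'/><item id='3'/></root>")

count = 0
for event, elem in etree.iterparse(data, events=("end",)):
    if elem.tag == "item":
        count += 1      # process the element here
        elem.clear()    # then free it, so memory stays bounded
print(count)  # prints 3
```

This keeps the convenient ElementTree API while never holding more than a small part of the document in memory at once.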

Stefan




More information about the Python-list mailing list