Trying to parse a HUGE(1gb) xml file
Stefan Behnel
stefan_ml at behnel.de
Tue Dec 21 03:16:21 EST 2010
Adam Tauno Williams, 20.12.2010 20:49:
> On Mon, 2010-12-20 at 11:34 -0800, spaceman-spiff wrote:
>> This is a rather long post, but i wanted to include all the details &
>> everything i have tried so far myself, so please bear with me & read
>> the entire boringly long post.
>> I am trying to parse a ginormous ( ~ 1gb) xml file.
>
> Do that hundreds of times a day.
>
>> 0. I am a python & xml n00b, so have been relying on the excellent
>> beginner book DIP (Dive_Into_Python3 by MP (Mark Pilgrim).... Mark, if
>> u are reading this, you are AWESOME & so is your witty & humorous
>> writing style)
>> 1. Almost all examples of parsing xml in python I have seen start off with these 4 lines of code.
>> import xml.etree.ElementTree as etree
Try
import xml.etree.cElementTree as etree
instead. Note the leading "c", which hints at the C implementation of
ElementTree. It's much faster and much more memory-friendly than the Python
implementation.
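A common drop-in pattern (just a sketch) is to fall back gracefully where
the accelerated module isn't available:

    try:
        import xml.etree.cElementTree as etree   # C accelerated implementation
    except ImportError:
        import xml.etree.ElementTree as etree    # pure Python fallback

The rest of the code stays exactly the same; only the import changes.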
>> tree = etree.parse('*path_to_ginormous_xml*')
>> root = tree.getroot() #my huge xml has 1 root at the top level
>> print root
>
> Yes, this is a terrible technique; most examples are crap.
>
>> 2. In the 2nd line of code above, as Mark explains in DIP, the parse
>> function builds& returns a tree object, in-memory(RAM), which
>> represents the entire document.
>> I tried this code, which works fine for a small file (~1MB), but when I
>> run this simple 4-line py code in a terminal for my HUGE target file
>> (1GB), nothing happens.
>> In a separate terminal, I run the top command, & I can see a python
>> process, with memory (the VIRT column) increasing from 100MB all the
>> way up to 2100MB.
>
> Yes, this is using DOM. DOM is evil and the enemy, full-stop.
Actually, ElementTree is not "DOM"; it's modelled after the XML Infoset.
While I agree that DOM is, well, maybe not "the enemy", it's not exactly
beautiful either. ElementTree is really a good thing, likely also in this case.
>> I am guessing, as this happens (over the course of 20-30 mins), the
>> tree representing the document is being slowly built in memory, but even
>> after 30-40 mins, nothing happens.
>> I don't get an error, seg fault or out_of_memory exception.
>
> You need to process the document as a stream of elements; aka SAX.
IMHO, this is the worst advice you can give.
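The usual alternative (a sketch, untested; 'record' and the path below are
placeholders for whatever the file actually contains) is ElementTree's own
iterparse(), which streams the document but still hands you Element objects:

    import xml.etree.cElementTree as etree

    context = etree.iterparse('path_to_ginormous_xml', events=('start', 'end'))
    _, root = next(context)   # the first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == 'record':
            # ... process elem here ...
            root.clear()      # prune handled children so memory stays flat

Memory stays roughly constant because processed subtrees are thrown away as
soon as they are handled, and the code still reads like ElementTree code
rather than a SAX handler.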
Stefan