[Tutor] Trying to parse a HUGE(1gb) xml file in python

Alan Gauld alan.gauld at btinternet.com
Tue Dec 21 02:13:08 CET 2010

"ashish makani" <ashish.makani at gmail.com> wrote

> I am looking for a specific element..there are several 10s/100s occurrences
> of that element in the 1gb file.
> I need to detect them & then for each 1, i need to copy all the content b/w
> the element's start & end tags & create a smaller xml

This is exactly what SAX and its kin are for. If you wanted to manipulate
the xml data and recreate the original file, a tree-based parser is better,
but for this kind of one-shot processing SAX will be much, much faster.

The concept is simple enough if you have ever used awk to process
text files (or the Python HTMLParser). You define a function that gets
triggered when the parser detects a matching tag.
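
A minimal sketch of that idea, assuming Python 3 and the standard xml.sax
module. The class name ElementExtractor, the extract() helper and the tag
name used in the usage note are made up for illustration; they are not from
the original question. The handler copies each matching element, with
everything between its start and end tags, into a small XML string:

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class ElementExtractor(xml.sax.ContentHandler):
    """Collect each <target> element, with its content, as an XML string."""

    def __init__(self, target):
        super().__init__()
        self.target = target    # tag name to look for (hypothetical)
        self.depth = 0          # nesting depth inside the current match
        self.buffer = None      # text buffer for the fragment being copied
        self.writer = None      # XMLGenerator writing into the buffer
        self.fragments = []     # finished fragments, one per match

    def startElement(self, name, attrs):
        # Start copying when an outermost matching tag opens.
        if self.depth == 0 and name == self.target:
            self.buffer = io.StringIO()
            self.writer = XMLGenerator(self.buffer, encoding="utf-8")
        if self.writer is not None:
            self.depth += 1
            self.writer.startElement(name, attrs)

    def characters(self, content):
        # Text is only copied while we are inside a matching element.
        if self.writer is not None:
            self.writer.characters(content)

    def endElement(self, name):
        if self.writer is None:
            return
        self.writer.endElement(name)
        self.depth -= 1
        if self.depth == 0:
            # The matching element has closed: keep the fragment.
            self.fragments.append(self.buffer.getvalue())
            self.buffer = self.writer = None


def extract(source, target):
    """Parse source (a filename or file object) and return the fragments."""
    handler = ElementExtractor(target)
    xml.sax.parse(source, handler)
    return handler.fragments
```

For the real file you would call something like extract("big.xml", "item")
(both names hypothetical) and write each fragment to its own file. The
parser streams through the input, so only the fragments are held in memory,
not the whole 1gb tree.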

> My hardware setup : I have a win7 pro box with 8gb of RAM & intel core2 quad
> cpuq9400.
> On this i am running sun virtualbox(3.2.12), with ubuntu 10.10(maverick) as
> guest os, with 23gb disk space & 2gb(2048mb) ram, assigned to the guest
> ubuntu os.

Obviously running the code in the virtual machine is limiting your
ability to deal with the data but in this case you would be pushing
hard to build the entire tree in RAM anyway so it probably doesn't

> 4. I then investigated some streaming libraries, but am confused - there is
> SAX[http://en.wikipedia.org/wiki/Simple_API_for_XML] ,

> Which one is the best for my situation ?

I've only used sax - I tried minidom once but couldn't get it to work
as I wanted so went back to sax... There are lots of examples of
xml parsing using sax, both in Python and Java - just google.

> Should i instead just open the file, & use reg ex to look for the element i
> need ?

Unless the xml is very simple you would probably find yourself
creating a bigger problem. Regexes are not good at handling the
kinds of recursive data structures found in SGML-based languages.
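
A small illustration of the problem, using a made-up fragment in which an
element contains another element of the same name. The tag name and the
pattern are assumptions for the example only. A non-greedy regex stops at
the first closing tag it sees, so it cuts the outer element short:

```python
import re

# Hypothetical input: an <item> nested inside another <item>.
xml_text = "<item>outer <item>inner</item> tail</item>"

# A typical "grab everything between the tags" pattern.
pattern = re.compile(r"<item>(.*?)</item>", re.DOTALL)
matches = pattern.findall(xml_text)

# The match ends at the inner closing tag, so " tail" is lost and the
# captured text contains an unbalanced opening tag.
print(matches)
```

A greedy pattern has the opposite fault: with several sibling elements it
swallows everything from the first start tag to the last end tag. A parser
that tracks nesting depth avoids both.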


Alan Gauld
Author of the Learn to Program web site
