[Tutor] Trying to parse a HUGE(1gb) xml file in python
steve at pearwood.info
Mon Dec 20 22:19:51 CET 2010
ashish makani wrote:
> Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
I sympathize with you. I wonder who thought that building a 1GB XML file
was a good thing.
Forget about using any XML parser that reads the entire file into
memory. By the time that 1GB of text is read and parsed, you will
probably have a tree of roughly 6-8GB (estimated) in memory.
> My hardware setup : I have a win7 pro box with 8gb of RAM & intel core2 quad
In order to access 8GB of RAM, you'll be running a 64-bit OS, correct?
In that case, expect the memory usage of the XML object to roughly
double, to an estimated 12-16GB.
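The figures above are guesses, but the expansion factor can be measured on a small sample before committing to a parser. The following is a minimal sketch, assuming Python 3.4 or later (for tracemalloc); the sample document and the function name tree_expansion_ratio are made up for illustration, and the ratio will vary with the shape of the real data.

```python
import tracemalloc
import xml.etree.ElementTree as ET


def tree_expansion_ratio(xml_text):
    """Return traced bytes held by the parsed tree divided by bytes of XML text."""
    data = xml_text.encode("utf-8")
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    root = ET.fromstring(data)  # builds the whole tree in memory
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    # `root` is still alive here, so (after - before) approximates the tree size.
    return (after - before) / len(data)


# Hypothetical sample: many small records, as large data dumps often have.
sample = "<records>" + "".join(
    "<record id='%d'><name>item %d</name></record>" % (i, i)
    for i in range(10000)
) + "</records>"

ratio = tree_expansion_ratio(sample)
print("in-memory tree is about %.1fx the size of the text" % ratio)
```

Multiplying the measured ratio by the size of the real file gives a rough idea of whether a full in-memory parse could ever fit.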
> I am guessing, as this happens (over the course of 20-30 mins), the tree
> representing is being slowly built in memory, but even after 30-40 mins,
> nothing happens.
It's probably not finished. Leave it another hour or so and you'll get
an out of memory error.
> 4. I then investigated some streaming libraries, but am confused - there is
> SAX[http://en.wikipedia.org/wiki/Simple_API_for_XML] , the iterparse
> interface[http://effbot.org/zone/element-iterparse.htm], & several other
> options ( minidom)
> Which one is the best for my situation ?
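You absolutely need to use a streaming library. Note that minidom is not
one: it builds the whole document in memory, so it is out. iterparse also
builds the tree as it goes, so it is no use to you unless you clear each
element once you have finished with it. I believe you should use SAX, but
that's about my limit of knowledge of streaming XML parsers.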
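To make the SAX suggestion concrete, here is a minimal sketch using the standard library's xml.sax module. The handler only counts occurrences of one tag and keeps nothing else, so memory use should not grow with the file size. The class name TagCounter, the helper count_tag and the tag name "record" are assumptions for illustration, not anything taken from the original poster's file.

```python
import io
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Count occurrences of one tag; nothing else is stored."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.count = 0

    def startElement(self, name, attrs):
        # Called once for every opening tag as the parser streams the input.
        if name == self.wanted:
            self.count += 1


def count_tag(source, wanted):
    """`source` may be a filename or an open file-like object."""
    handler = TagCounter(wanted)
    xml.sax.parse(source, handler)
    return handler.count


# Tiny in-memory stand-in for the real file.
demo = io.BytesIO(b"<records><record/><record/><other/></records>")
print(count_tag(demo, "record"))
```

For the real file, pass its path instead of the BytesIO object. Collecting text needs a characters() method on the handler as well, which is where SAX code tends to get more fiddly.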
> Should i instead just open the file, & use reg ex to look for the element i
> need ?
That's likely to need less memory than building a parse tree, but still
a huge amount of memory if you read the whole file into a single string.
And you don't know how complex the XML is. In general you *can't*
correctly parse arbitrary XML with regular expressions (although you can
for simple examples). Stick with the right tool for the job, the
streaming XML library.
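For completeness: iterparse only ends up holding the whole tree if processed elements are left attached to it. A minimal sketch of the clear-as-you-go idiom follows, assuming the interesting elements are direct children of the root; the function name iter_records and the tag name "record" are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag):
    """Yield (attributes, text) for each `tag` element, then discard it."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield dict(elem.attrib), elem.text
            # Drop the finished child so the tree never grows with the file.
            root.clear()


# Tiny in-memory stand-in for the real file.
demo = io.BytesIO(
    b"<records><record id='1'>a</record><record id='2'>b</record></records>"
)
for attrs, text in iter_records(demo, "record"):
    print(attrs, text)
```

Without the root.clear() call the loop still works, but the root keeps a reference to every record seen so far and memory grows just as it would with a full parse.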