[Tutor] Trying to parse a HUGE(1gb) xml file in python

Luke Paireepinart rabidpoobear at gmail.com
Mon Dec 20 21:42:33 CET 2010


If you can assume a well-formatted file, I would just parse it linearly; that should be much faster. If the XML is already in human-readable form, read the file in as lines; otherwise read it in blocks, append them to a list, and do a join() when you have a whole match.
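
A rough sketch of the line-based variant, in case it helps. The function name and the "item" tag used in the usage note are made up for illustration. It assumes the target element is never nested inside itself, is never self-closing, and that at most one target element starts on any given line. It is plain string matching, not a real XML parser.

```python
import re


def extract_elements(lines, tag):
    """Yield the raw text of every <tag>...</tag> span in an iterable of lines.

    Assumptions (not checked): the element is not nested inside itself,
    is not self-closing, and no more than one such element starts per line.
    """
    # Match "<tag" followed by whitespace or ">", so "<items>" is not
    # mistaken for "<item>".
    start_re = re.compile(r"<%s[\s>]" % re.escape(tag))
    close_tag = "</%s>" % tag
    chunk = None  # None means we are currently outside a target element
    for line in lines:
        if chunk is None:
            match = start_re.search(line)
            if match is None:
                continue
            chunk = []
            line = line[match.start():]  # drop anything before the start tag
        end = line.find(close_tag)
        if end == -1:
            chunk.append(line)  # still inside the element, keep collecting
        else:
            chunk.append(line[:end + len(close_tag)])
            yield "".join(chunk)  # a whole match: join the collected pieces
            chunk = None
```

A file object iterates line by line, so something like `for text in extract_elements(open(path), "item")` only ever holds one element's worth of text in memory; each `text` can then be written out to its own small file.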

-----------------------------
Sent from a mobile device with a bad e-mail client.
-----------------------------

On Dec 20, 2010, at 2:08 PM, ashish makani <ashish.makani at gmail.com> wrote:

> 
> Hi Python Tutor folks
> This is a rather long post, but i wanted to include all the details & everything i have tried so far myself, so please bear with me & read the entire boringly long post.
> 
> Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
> 
> I am looking for a specific element. There are several tens or hundreds of occurrences of that element in the 1gb file.
> 
> I need to detect them & then, for each one, i need to copy all the content between the element's start & end tags & create a smaller xml file.
> 
> 
> 0. I am a python & xml n00b, & have been relying on the excellent beginner book DIP (Dive Into Python 3) by MP (Mark Pilgrim).... Mark, if u are reading this, you are AWESOME & so is your witty & humorous writing style.
> 
> My hardware setup : I have a win7 pro box with 8gb of RAM & an intel core2 quad q9400 cpu.
> On this i am running sun virtualbox(3.2.12), with ubuntu 10.10(maverick) as guest os, with 23gb disk space & 2gb(2048mb) ram, assigned to the guest ubuntu os.
> 
> 
> 1. Almost all examples of parsing xml in python that i have seen start off with these 4 lines of code.
> 
> import xml.etree.ElementTree as etree
> tree = etree.parse('*path_to_ginormous_xml*')
> root = tree.getroot()  #my huge xml has 1 root at the top level
> print root
> 
> 2. In the 2nd line of code above, as Mark explains in DIP, the parse function builds & returns a tree object, in-memory(RAM), which represents the entire document.
> I tried this code, which works fine for a small ( ~ 1MB) file, but when i run this simple 4 line py code in a terminal for my HUGE target file (1GB), nothing happens.
> In a separate terminal, i run the top command, & i can see a python process, with memory (the VIRT column) increasing from 100MB all the way up to 2100MB.
> 
> I am guessing that as this happens (over the course of 20-30 mins), the tree representing the document is being slowly built in memory, but even after 30-40 mins, nothing happens.
> I don't get an error, seg fault or out_of_memory exception.
> 
> 3. I also tried using lxml, but an lxml tree is much more expensive, as it retains more info about a node's context, including references to its parent.
> 
> [http://www.ibm.com/developerworks/xml/library/x-hiperfparse/]
> 
> When i ran the same 4-line code above, but with lxml's ElementTree implementation (using the import below in line 1 of the code above)
> import lxml.etree as lxml_etree
> 
> i can see the memory consumption of the python process (which is running the code) shoot up to ~ 2700mb & then python (or the os ?) kills the process as it nears the total system memory (2gb)
> 
> I ran the code from 1 terminal window (screenshot :http://imgur.com/ozLkB.png)
> & ran top from another terminal (http://imgur.com/HAoHA.png)
> 
> 4. I then investigated some streaming libraries, but am confused - there is SAX [http://en.wikipedia.org/wiki/Simple_API_for_XML], the iterparse interface [http://effbot.org/zone/element-iterparse.htm], & several other options (e.g. minidom)
> 
> Which one is the best for my situation ?
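
For reference, one of the streaming options listed above is iterparse from the standard library. Below is a minimal sketch of how it could pull out each matching element without keeping the whole tree in memory. It assumes Python 3; the function name and the "item" tag are hypothetical. It only frees memory each time a match is found, so a very long run of non-matching elements would still accumulate until the next match.

```python
import io
import xml.etree.ElementTree as etree


def iter_matching(source, tag):
    """Yield each <tag> element from source, serialized as a string.

    source is a filename or a file object; tag is a hypothetical
    element name chosen by the caller.
    """
    context = iter(etree.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            elem.tail = None  # do not serialize whitespace that follows the element
            yield etree.tostring(elem, encoding="unicode")
            root.clear()  # drop everything built so far to keep memory flat


# Small in-memory demonstration; a real run would pass the path of the big file.
sample = io.BytesIO(
    b"<root><skip>a</skip><item id='1'><a>x</a></item>\n<item>y</item></root>"
)
pieces = list(iter_matching(sample, "item"))
```

Each string in `pieces` is a standalone fragment that can be written to its own smaller xml file.
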
> 
> Should i instead just open the file, & use reg ex to look for the element i need ?
> 
> 
> 
> Any & all code_snippets/wisdom/thoughts/ideas/suggestions/feedback/comments/ of the Python tutor community would be greatly appreciated.
> Plz feel free to email me directly too.
> 
> thanks a ton
> 
> cheers
> ashish
> 
> email : 
> ashish.makani
> domain:gmail.com
> 
> p.s.
> Other useful links on xml parsing in python
> 0. http://diveintopython3.org/xml.html
> 1. http://stackoverflow.com/questions/1513592/python-is-there-an-xml-parser-implemented-as-a-generator
> 2. http://codespeak.net/lxml/tutorial.html
> 3. https://groups.google.com/forum/?hl=en&lnk=gst&q=parsing+a+huge+xml#!topic/comp.lang.python/CMgToEnjZBk
> 4. http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
> 5. http://effbot.org/zone/element-index.htm
>    http://effbot.org/zone/element-iterparse.htm
> 6. SAX : http://en.wikipedia.org/wiki/Simple_API_for_XML
> 