[Tutor] Trying to parse a HUGE(1gb) xml file in python

ashish makani ashish.makani at gmail.com
Tue Dec 21 03:37:02 CET 2010

Thanks Luke, Steve, Brett, Lloyd & Alan
for your prompt responses & sharing your wisdom.

I <3 the python community... You(We ?) folks are AWESOME

I cross-posted this query on comp.lang.python
I bet most of you hang @ c.l.p too, but just in case, here is the link to
the discussion at c.l.p

Thanks again for the amazing help & advice


On Mon, Dec 20, 2010 at 5:13 PM, Alan Gauld <alan.gauld at btinternet.com> wrote:

> "ashish makani" <ashish.makani at gmail.com> wrote
>> I am looking for a specific element..there are several 10s/100s
>> occurrences of that element in the 1gb file.
>> I need to detect them & then for each 1, i need to copy all the content
>> b/w the element's start & end tags & create a smaller xml
> This is exactly what sax and its kin are for. If you wanted to manipulate
> the xml data and recreate the original file, a tree-based parser is better,
> but for this kind of one-shot processing SAX will be much, much faster.
> The concept is simple enough if you have ever used awk to process
> text files (or the Python HTMLParser). You define a function that gets
> triggered when the parser detects a matching tag.
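
A minimal sketch of the callback idea described above, using the standard library's xml.sax. The class name ElementExtractor, the helper extract() and the element name "item" are made up for illustration; they are not from the thread. It assumes the wanted elements are few enough that their serialized text fits in memory, even though the whole 1gb file never does.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class ElementExtractor(xml.sax.ContentHandler):
    """Collect every <target> element, with its contents, as a small XML string."""

    def __init__(self, target):
        super().__init__()
        self.target = target   # hypothetical element name to look for
        self.depth = 0         # > 0 while we are inside a target element
        self.buffer = None
        self.writer = None
        self.fragments = []    # one serialized string per occurrence

    def startElement(self, name, attrs):
        # Start capturing only on an outermost occurrence of the target.
        if self.depth == 0 and name == self.target:
            self.buffer = io.StringIO()
            self.writer = XMLGenerator(self.buffer, encoding="utf-8")
        if self.writer is not None:
            self.depth += 1
            self.writer.startElement(name, attrs)

    def characters(self, content):
        if self.writer is not None:
            self.writer.characters(content)

    def endElement(self, name):
        if self.writer is None:
            return
        self.writer.endElement(name)
        self.depth -= 1
        if self.depth == 0:
            # The target element just closed: keep the fragment, stop capturing.
            self.fragments.append(self.buffer.getvalue())
            self.buffer = self.writer = None


def extract(source, target):
    """source is a filename or a binary file object; returns a list of strings."""
    handler = ElementExtractor(target)
    xml.sax.parse(source, handler)
    return handler.fragments
```

Something like extract("huge.xml", "item") would then give one string per occurrence, and each string could be written out as its own smaller xml file. Writing each fragment to disk inside endElement instead of appending it to a list would keep memory use flat if the fragments themselves are large.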
>> My hardware setup : I have a win7 pro box with 8gb of RAM & intel core2
>> quad cpu q9400.
>> On this i am running sun virtualbox(3.2.12), with ubuntu 10.10(maverick)
>> as guest os, with 23gb disk space & 2gb(2048mb) ram, assigned to the guest
>> ubuntu os.
> Obviously running the code in the virtual machine is limiting your
> ability to deal with the data but in this case you would be pushing
> hard to build the entire tree in RAM anyway so it probably doesn't
> matter.
>> 4. I then investigated some streaming libraries, but am confused - there
>> is SAX[http://en.wikipedia.org/wiki/Simple_API_for_XML] ,
>> Which one is the best for my situation ?
> I've only used sax - I tried minidom once but couldn't get it to work
> as I wanted so went back to sax... There are lots of examples of
> xml parsing using sax, both in Python and Java - just google.
>> Should i instead just open the file, & use reg ex to look for the element
>> i need ?
> Unless the xml is very simple you would probably find yourself
> creating a bigger problem. Regexes are not good at handling the
> kinds of recursive data structures found in SGML-based
> languages.
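
A small illustration of the nesting problem mentioned above, on made-up input (the element name "item" and both sample strings are hypothetical). Neither the non-greedy nor the greedy pattern tracks which closing tag belongs to which opening tag.

```python
import re

# An element that contains another element of the same name.
nested = "<item>outer <item>inner</item> tail</item>"

# Non-greedy: stops at the FIRST closing tag, so the outer element is cut short.
lazy = re.compile(r"<item>(.*?)</item>", re.DOTALL)
lazy_match = lazy.search(nested).group(0)

# Two sibling elements side by side.
siblings = "<item>a</item><item>b</item>"

# Greedy: runs on to the LAST closing tag, so two elements are glued into one match.
greedy = re.compile(r"<item>(.*)</item>", re.DOTALL)
greedy_match = greedy.search(siblings).group(0)

print(lazy_match)
print(greedy_match)
```

The first match is left with an unbalanced opening tag, and the second swallows both siblings. A real parser keeps a stack (or a depth counter) for exactly this reason.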
> HTH,
> --
> Alan Gauld
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor


"We act as though comfort and luxury were the chief requirements of life,
when all that we need to make us happy is something to be enthusiastic
about."
-- Albert Einstein