[Tutor] Trying to parse a HUGE(1gb) xml file in python
ashish.makani at gmail.com
Tue Dec 21 03:37:02 CET 2010
Thanks Luke, Steve, Brett, Lloyd & Alan
for your prompt responses & sharing your wisdom.
I <3 the python community... You(We ?) folks are AWESOME
I cross-posted this query on comp.lang.python
I bet most of you hang @ c.l.p too, but just in case, here is the link to
the discussion at c.l.p
Thanks again for the amazing help & advice
On Mon, Dec 20, 2010 at 5:13 PM, Alan Gauld <alan.gauld at btinternet.com> wrote:
> "ashish makani" <ashish.makani at gmail.com> wrote
> I am looking for a specific element..there are several 10s/100s
>> of that element in the 1gb file.
>> I need to detect them & then for each 1, i need to copy all the content
>> between the element's start & end tags & create a smaller xml
> This is exactly what sax and its kin are for. If you wanted to manipulate
> the xml data and recreate the original file, tree-based parsing is better, but
> for this kind of one-shot processing SAX will be much, much faster.
> The concept is simple enough if you have ever used awk to process
> text files. (or the Python HTMLParser) You define a function that gets
> triggered when the parser detects a matching tag.
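A minimal sketch of the callback idea described above, written for Python 3 with the standard-library xml.sax module. It is not code from this thread: the class name ElementExtractor, the helper extract() and the tag name "record" are made up for illustration, and the tiny in-memory sample stands in for the real 1gb file. The handler only keeps text while it is inside a matching element, so memory use should stay small no matter how large the input is.

```python
import io
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class ElementExtractor(xml.sax.ContentHandler):
    """Collect every occurrence of one element as a small XML string."""

    def __init__(self, target):
        super().__init__()
        self.target = target    # tag name to look for (hypothetical)
        self.depth = 0          # greater than 0 while inside a match
        self.parts = []         # pieces of the fragment being rebuilt
        self.fragments = []     # finished fragments, one per match

    def startElement(self, name, attrs):
        # Start recording at the target tag, keep recording for its children.
        if name == self.target or self.depth:
            self.depth += 1
            attr_text = "".join(
                " %s=%s" % (key, quoteattr(value))
                for key, value in attrs.items()
            )
            self.parts.append("<%s%s>" % (name, attr_text))

    def characters(self, content):
        # Text may arrive in several chunks; escape each one as it comes.
        if self.depth:
            self.parts.append(escape(content))

    def endElement(self, name):
        if self.depth:
            self.parts.append("</%s>" % name)
            self.depth -= 1
            if self.depth == 0:
                # The outermost matching element just closed.
                self.fragments.append("".join(self.parts))
                self.parts = []


def extract(source, target):
    """Parse a filename or file-like object and return matching fragments."""
    handler = ElementExtractor(target)
    xml.sax.parse(source, handler)
    return handler.fragments


# Stand-in for the big file; a path string would work the same way.
sample = io.BytesIO(
    b"<root><skip>x</skip>"
    b"<record id='1'><name>a &amp; b</name></record>"
    b"<record id='2'/></root>"
)
fragments = extract(sample, "record")
for fragment in fragments:
    print(fragment)
```

With only tens or hundreds of matches, holding the fragments in a list is probably fine; each one could just as well be written to its own small file inside endElement.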
> My hardware setup : I have a win7 pro box with 8gb of RAM & intel core2
>> On this i am running sun virtualbox(3.2.12), with ubuntu 10.10(maverick)
>> guest os, with 23gb disk space & 2gb(2048mb) ram, assigned to the guest
>> ubuntu os.
> Obviously running the code in the virtual machine is limiting your
> ability to deal with the data but in this case you would be pushing
> hard to build the entire tree in RAM anyway so it probably doesn't
> 4. I then investigated some streaming libraries, but am confused - there
>> SAX[http://en.wikipedia.org/wiki/Simple_API_for_XML] ,
> Which one is the best for my situation ?
> I've only used sax - I tried minidom once but couldn't get it to work
> as I wanted so went back to sax... There are lots of examples of
> xml parsing using sax, both in Python and Java - just google.
> Should i instead just open the file, & use reg ex to look for the element
>> need ?
> Unless the xml is very simple you would probably find yourself
> creating a bigger problem. Regexes are not good at handling the
> kinds of recursive data structures found in SGML-based
> languages.
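A small illustration of the problem with regexes, using a made-up tag name "item" and made-up sample strings (nothing here comes from the actual file). Neither the non-greedy nor the greedy pattern can pair an opening tag with its own closing tag once elements nest or repeat.

```python
import re

# A nested element: the outer <item> contains another <item>.
nested = "<item>outer <item>inner</item> tail</item>"

# Non-greedy match stops at the first closing tag it sees,
# so the outer element is cut short.
lazy = re.search(r"<item>(.*?)</item>", nested).group(1)

# Two sibling elements, no nesting at all.
siblings = "<item>one</item><item>two</item>"

# Greedy match runs to the last closing tag,
# so the two siblings are merged into one match.
greedy = re.search(r"<item>(.*)</item>", siblings).group(1)

print(repr(lazy))
print(repr(greedy))
```

If the element being looked for never nests inside itself and has no attributes, a regex might get by, but a streaming parser avoids having to make that assumption.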
> Alan Gauld
> Author of the Learn to Program web site
"We act as though comfort and luxury were the chief requirements of life,
when all that we need to make us happy is something to be enthusiastic
-- Albert Einstein