[Tutor] Trying to parse a HUGE(1gb) xml file in python
David Hutto
smokefloat at gmail.com
Tue Dec 21 09:49:04 CET 2010
On Tue, Dec 21, 2010 at 3:44 AM, Stefan Behnel <stefan_ml at behnel.de> wrote:
> [note that this has also been posted to comp.lang.python and discussed
> separately over there]
>
> Steven D'Aprano, 20.12.2010 22:19:
>>
>> ashish makani wrote:
>>
>>> Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
>>
>> I sympathize with you. I wonder who thought that building a 1GB XML file
>> was a good thing.
David Mertz, Ph.D.
Comparator, Gnosis Software, Inc.
June 2003
http://gnosis.cx/publish/programming/xml_matters_29.html
That was just the first result from this search:
http://www.google.com/search?client=ubuntu&channel=fs&q=parsing+gigabyte+xml+python&ie=utf-8&oe=utf-8
>>
>> Forget about using any XML parser that reads the entire file into memory.
>> By the time that 1GB of text is read and parsed, you will probably have
>> something about 6-8GB (estimated) in size.
>
> The in-memory size is highly dependent on the data, specifically the
> text-to-structure ratio. If it's a lot of text content, the difference to
> the serialised tree will be small. If it's a lot of structure with tiny bits
> of text content, the in-memory size of the tree will be a lot larger.
>
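
A rough way to see that text-to-structure effect for yourself. This is only a sketch: the helper name tree_to_text_ratio and both sample documents are made up for illustration, and tracemalloc only counts allocations that go through Python's allocators, so treat the numbers as indicative rather than exact.

```python
import tracemalloc
import xml.etree.ElementTree as ET


def tree_to_text_ratio(xml_bytes):
    """Traced bytes held by the parsed tree, divided by the serialised size.

    Hypothetical helper for illustration only.
    """
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    root = ET.fromstring(xml_bytes)  # the whole tree is built in memory
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del root  # the tree was only kept alive for the measurement
    return (after - before) / len(xml_bytes)


# One element holding a lot of text: little structure per byte.
text_heavy = b"<r>" + b"x" * 1_000_000 + b"</r>"
# Many tiny empty elements: almost nothing but structure.
structure_heavy = b"<r>" + b"<a/>" * 50_000 + b"</r>"

print("text-heavy ratio:     ", tree_to_text_ratio(text_heavy))
print("structure-heavy ratio:", tree_to_text_ratio(structure_heavy))
```

The second ratio should come out well above the first, which is the point being made above: how much a 1GB file blows up in memory depends on what is in it.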
>
>>> I am guessing, as this happens (over the course of 20-30 mins), the tree
>>> representing is being slowly built in memory, but even after 30-40 mins,
>>> nothing happens.
>>
>> It's probably not finished. Leave it another hour or so and you'll get an
>> out of memory error.
>
> Right, if it gets into wild swapping, it can slow down almost to a halt,
> even though the XML parsing itself tends to have pretty good memory locality
> (but the ever growing in-memory tree obviously doesn't).
>
>
>>> 4. I then investigated some streaming libraries, but am confused - there is
>>> SAX [http://en.wikipedia.org/wiki/Simple_API_for_XML], the iterparse
>>> interface [http://effbot.org/zone/element-iterparse.htm], & several other
>>> options (minidom).
>>>
>>> Which one is the best for my situation ?
>>
>> You absolutely need to use a streaming library. element-iterparse still
>> builds the tree, so that's no use to you.
>
> Wrong. iterparse() allows you to cut branches in the tree while it's
> growing, that's exactly what it's there for.
>
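
For what it's worth, the branch cutting looks roughly like this with the standard library's xml.etree.ElementTree. A minimal sketch only: the function name count_records, the tag "record" and the sample document are all invented, and it assumes the repeated elements are direct children of the root.

```python
import io
import xml.etree.ElementTree as ET


def count_records(source, tag="record"):
    """Count <tag> elements in `source` without keeping them in memory.

    Hypothetical helper; assumes <tag> elements sit directly under the root.
    """
    count = 0
    # Request "start" events as well, so the first event hands us the root.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1  # real code would process `elem` here
            # Cut the finished branches off the root so the tree stays small.
            root.clear()
    return count


# Made-up stand-in for the 1GB file; a filename works in place of BytesIO.
sample = io.BytesIO(
    b"<records>"
    b"<record id='1'><name>a</name></record>"
    b"<record id='2'><name>b</name></record>"
    b"<record id='3'><name>c</name></record>"
    b"</records>"
)
print(count_records(sample))
```

Without the root.clear() call the whole tree would still pile up in memory, which is presumably where the "still builds the tree" impression comes from.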
>
>> I believe you should use SAX or
>> minidom, but that's about my limit of knowledge of streaming XML parsers.
>
> With "minidom" being even worse advice than SAX - SAX would at least
> solve the problem, whereas minidom wouldn't because of its intolerable
> memory requirements.
>
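
And for comparison, the same kind of job done with SAX from the standard library. Again just a sketch under assumptions: RecordCounter, count_records_sax and the tag "record" are illustrative names, not anything from the original poster's data.

```python
import io
import xml.sax


class RecordCounter(xml.sax.ContentHandler):
    """Counts start tags named `tag`; no tree is ever built."""

    def __init__(self, tag="record"):
        super().__init__()
        self.tag = tag
        self.count = 0

    def startElement(self, name, attrs):
        # Called once per opening tag as the parser streams through the input.
        if name == self.tag:
            self.count += 1


def count_records_sax(source, tag="record"):
    """Hypothetical helper: `source` may be a filename or a file object."""
    handler = RecordCounter(tag)
    xml.sax.parse(source, handler)
    return handler.count


sample = io.BytesIO(
    b"<records><record><name>a</name></record>"
    b"<record><name>b</name></record></records>"
)
print(count_records_sax(sample))
```

Memory stays flat because nothing is retained between callbacks, but you have to track any state you need (current element, collected text) yourself in the handler.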
> Stefan
--
They're installing the breathalyzer on my email account next week.