10GB XML Blows out Memory, Suggestions?

Paul McGuire ptmcg at austin.rr._bogus_.com
Tue Jun 6 09:56:14 EDT 2006


<axwack at gmail.com> wrote in message
news:1149594519.098115.8980 at u72g2000cwu.googlegroups.com...
> I wrote a program that takes an XML file into memory using Minidom. I
> found out that the XML document is 10gb.
>
> I clearly need SAX or something else?
>

You clearly need something instead of XML.

This sounds like a case where a prototype, which worked for the developer's
simple test data set, blows up in the face of real user/production data.
XML adds lots of overhead for nested structures, when in fact, the actual
meat of the data can be relatively small.  Note also that this XML overhead
is directly related to the verbosity of the XML designer's choice of tag
names, and whether the designer was predisposed to using XML elements over
attributes.  Imagine a record structure for a 3D coordinate point (described
here in no particular coding language):

struct ThreeDimPoint:
    xValue : integer,
    yValue : integer,
    zValue : integer

Directly translated to XML gives:

<ThreeDimPoint>
    <xValue>4</xValue>
    <yValue>5</yValue>
    <zValue>6</zValue>
</ThreeDimPoint>

This expands 3 integers to a whopping 101 characters.  Throw in namespaces
for good measure, and you inflate the data even more.

Many Java folks treat XML attributes as anathema, but look how this cuts
down the data inflation:

<ThreeDimPoint xValue="4" yValue="5" zValue="6"/>

This is only 50 characters counting the line ending, or *only* about 4 times
the size of the contained data (12 bytes, assuming 4-byte integers).
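
For anyone who wants to check the arithmetic, here is a small sketch that
measures both renderings against the same three integers packed as raw 4-byte
values.  The variable names are mine, and I am assuming the element form is
counted with its indentation and internal line breaks but no trailing one; the
attribute form comes to 49 characters on its own, 50 with a line ending.

```python
import struct

# The two XML renderings from the post, joined with newlines
# (indentation included, no trailing newline).
element_form = "\n".join([
    "<ThreeDimPoint>",
    "    <xValue>4</xValue>",
    "    <yValue>5</yValue>",
    "    <zValue>6</zValue>",
    "</ThreeDimPoint>",
])
attribute_form = '<ThreeDimPoint xValue="4" yValue="5" zValue="6"/>'

# Three 4-byte little-endian signed integers, packed with no padding.
packed = struct.pack("<3i", 4, 5, 6)

sizes = {
    "elements": len(element_form),
    "attributes": len(attribute_form),
    "packed": len(packed),
}

# Report each size and how many times larger it is than the packed data.
for name, size in sizes.items():
    print(f"{name:10s} {size:4d} bytes  {size / len(packed):.1f}x")
```

The element form works out to roughly 8 times the packed size, and the
attribute form to roughly 4 times.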

Try zipping your 10GB file, and see what kind of compression you get - I'll
bet it's close to 30:1.  If so, convert the data to a real data storage
medium.  Even a SQLite database table should do better, and you can ship it
around just like a file (you just can't open it up like a text file).
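
Here is a rough sketch of both steps, using only the standard library.  It is
not a drop-in solution: I know nothing about the real document, so the
ThreeDimPoint/xValue/yValue/zValue names are borrowed from the example above,
the points table and both function names are made up, and I am assuming the
records are direct children of the root element and always hold integers.  The
first function streams the file through zlib to estimate the compression
ratio; the second uses ElementTree's iterparse to copy records into SQLite
without building the whole tree in memory.

```python
import sqlite3
import xml.etree.ElementTree as ET
import zlib


def compression_ratio(path, chunk_size=1 << 20):
    """Stream a file through zlib and return raw size / compressed size."""
    compressor = zlib.compressobj()
    raw = packed = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            raw += len(chunk)
            packed += len(compressor.compress(chunk))
    packed += len(compressor.flush())
    return raw / packed if packed else 0.0


def xml_to_sqlite(xml_path, db_path, batch_size=10000):
    """Copy <ThreeDimPoint> records into a SQLite table; return the row count.

    Assumes (hypothetically) that every record is a direct child of the
    root element and contains integer xValue/yValue/zValue children.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS points (x INTEGER, y INTEGER, z INTEGER)"
    )
    insert = "INSERT INTO points VALUES (?, ?, ?)"
    rows, total = [], 0

    context = ET.iterparse(xml_path, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element

    for event, elem in context:
        if event == "end" and elem.tag == "ThreeDimPoint":
            rows.append(tuple(
                int(elem.findtext(tag))
                for tag in ("xValue", "yValue", "zValue")
            ))
            # Drop the records parsed so far so memory use stays flat.
            root.clear()
            if len(rows) >= batch_size:
                conn.executemany(insert, rows)
                total += len(rows)
                rows = []

    # Write whatever is left in the final partial batch.
    conn.executemany(insert, rows)
    total += len(rows)
    conn.commit()
    conn.close()
    return total
```

Something like compression_ratio("big.xml") followed by
xml_to_sqlite("big.xml", "points.db") would be the intended use, with both
file names standing in for your own.  Batching the inserts keeps the number of
round trips to SQLite down; the batch size is a guess, not a tuned value.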

-- Paul




