Trying to parse a HUGE (1GB) XML file

Roy Smith roy at panix.com
Sat Dec 25 14:41:29 EST 2010


In article <mailman.285.1293297695.6505.python-list at python.org>,
 Adam Tauno Williams <awilliam at whitemice.org> wrote:

> XML works extremely well for large datasets.

Barf.  I'll agree that there are some nice points to XML.  It is 
portable.  It is (to a certain extent) human readable, and in a pinch 
you can use standard text tools to do ad-hoc queries (e.g., grep for a 
particular entry).  And, yes, there are plenty of toolsets for dealing 
with XML files.
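
In fact, that kind of ad-hoc query is really just a line scan.  A 
rough Python sketch (the file name and tag here are made up):

    import sys

    # Poor man's grep: print every line mentioning a given tag.
    with open("feed.xml") as f:
        for line in f:
            if "<Parental-Advisory>" in line:
                sys.stdout.write(line)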

On the other hand, the verbosity is unbelievable.  I'm currently working 
with a data feed we get from a supplier in XML.  Every day we get 
incremental updates of about 10-50 MB each.  The total data set at this 
point is 61 GB.  It's got stuff like this in it:

        <Parental-Advisory>FALSE</Parental-Advisory>

That's 54 bytes (counting the leading whitespace and the line 
terminator) to store a single bit of information.  I'm all for 
human-readable formats, but bloating the data by a factor of 432 is 
rather excessive.  Of course, that's an extreme example.  A more 
efficient example would be:

        <Id>1173722</Id>

which is 26 bytes to store an integer that fits in four bytes of 
binary.  That's only a bloat factor of 6-1/2.
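
A quick sanity check on those numbers in Python (assuming 8 spaces 
of indentation and a CRLF terminator, which is how the figures 
above work out):

    # Reconstruct the two example lines exactly as counted.
    flag = "        <Parental-Advisory>FALSE</Parental-Advisory>\r\n"
    ident = "        <Id>1173722</Id>\r\n"

    print(len(flag))     # 54 bytes for one bit -> 54 * 8 = 432x
    print(len(ident))    # 26 bytes for a 4-byte int -> 26 / 4 = 6.5x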

Of course, one advantage of XML is that with so much redundant text, it 
compresses well.  We typically see gzip compression ratios of 20:1.  
But, that just means you can archive them efficiently; you can't do 
anything useful until you unzip them.
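
To be fair, the unzipping can at least happen on the fly, which 
also sidesteps holding the whole tree in memory.  A minimal sketch 
using only the stdlib (the file name and tag are invented):

    import gzip
    import xml.etree.ElementTree as ET

    # Stream-decompress the feed and parse it incrementally.
    with gzip.open("updates.xml.gz") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == "Id":
                print(elem.text)
            elem.clear()    # drop each element's contents as we go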


