Trying to parse a HUGE(1gb) xml file
Roy Smith
roy at panix.com
Sat Dec 25 14:41:29 EST 2010
In article <mailman.285.1293297695.6505.python-list at python.org>,
Adam Tauno Williams <awilliam at whitemice.org> wrote:
> XML works extremely well for large datasets.
Barf. I'll agree that there are some nice points to XML. It is
portable. It is (to a certain extent) human readable, and in a pinch
you can use standard text tools to do ad-hoc queries (e.g., grep for a
particular entry). And, yes, there are plenty of toolsets for dealing
with XML files.
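(By ad-hoc queries I mean nothing fancier than this sort of sketch; the
file name and the entry I'm searching for are made-up examples:)

    # Scan the feed as plain text for one particular entry.
    # "feed.xml" and the <Id> value are hypothetical.
    with open("feed.xml") as f:
        for line in f:
            if "<Id>1173722</Id>" in line:
                print(line.strip())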
On the other hand, the verbosity is unbelievable. I'm currently working
with a data feed we get from a supplier in XML. Every day we get
incremental updates of about 10-50 MB each. The total data set at this
point is 61 GB. It's got stuff like this in it:
<Parental-Advisory>FALSE</Parental-Advisory>
That's 54 bytes to store a single bit of information. I'm all for
human-readable formats, but bloating the data by a factor of 432 is
rather excessive. Of course, that's an extreme example. A more
efficient example would be:
<Id>1173722</Id>
which is 26 bytes to store an integer that would fit in 4 bytes of
binary. That's only a bloat factor of 6-1/2.
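If you want to check my arithmetic, a couple of len() calls will do it.
This sketch assumes the feed indents each element eight spaces and uses
CRLF line endings, which is what makes the totals come out to 54 and 26
bytes:

    # Back-of-the-envelope bloat figures for the two examples above.
    # The eight-space indent and CRLF endings are assumptions about
    # the feed's formatting.
    flag = " " * 8 + "<Parental-Advisory>FALSE</Parental-Advisory>\r\n"
    print(len(flag))         # 54 bytes for one boolean
    print(len(flag) * 8)     # 432 bits on the wire vs. 1 bit of payload

    ident = " " * 8 + "<Id>1173722</Id>\r\n"
    print(len(ident))        # 26 bytes for one small integer
    print(len(ident) / 4.0)  # 6.5x the size of a 4-byte binary int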
Of course, one advantage of XML is that with so much redundant text, it
compresses well. We typically see gzip compression ratios of 20:1.
But, that just means you can archive them efficiently; you can't do
anything useful until you unzip them.
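(If you want to check the ratio yourself, the gzip module will do it;
"update.xml" here stands in for one of the daily files:)

    import gzip
    import os
    import shutil

    # Compress one daily update and report raw size / compressed size.
    # "update.xml" is a hypothetical file name.
    raw = os.path.getsize("update.xml")
    with open("update.xml", "rb") as src:
        with gzip.open("update.xml.gz", "wb") as dst:
            shutil.copyfileobj(src, dst)
    print(float(raw) / os.path.getsize("update.xml.gz"))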