Trying to parse a HUGE(1gb) xml file

BartC bc at freeuk.com
Tue Dec 28 07:56:46 EST 2010



"Stefan Behnel" <stefan_ml at behnel.de> wrote in message
news:mailman.335.1293516506.6505.python-list at python.org...
> Roy Smith, 28.12.2010 00:21:
>> To go back to my earlier example of
>>
>>          <Parental-Advisory>FALSE</Parental-Advisory>
>>
>> using 432 bits to store 1 bit of information, stuff like that doesn't
>> happen in marked-up text documents.  Most of the file is CDATA (do they
>> still use that term in XML, or was that an SGML-ism only?).  The markup
>> is a relatively small fraction of the data.  I'm happy to pay a factor
>> of 2 or 3 to get structured text that can be machine processed in useful
>> ways.  I'm not willing to pay a factor of 432 to get tabular data when
>> there's plenty of other much more reasonable ways to encode it.
>
> If the above only appears once in a large document, I don't care how much
> space it takes. If it appears all over the place, it will compress down to
> a couple of bits, so I don't care about the space, either.
>
> It's readability that counts here. Try to reverse engineer a binary format
> that stores the above information in 1 bit.

The above typically won't get much below 2 bytes (as one character plus a
separator, e.g. in a comma-delimited format). So it's more like 27:1 if you're
going to stay with a text format.
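
As a quick sanity check of that arithmetic, here is a minimal Python sketch.
It assumes the 432-bit figure counts the 44-character element plus 10
characters of indentation (54 bytes in total), and it uses a hypothetical
two-byte field "F," to stand in for the comma-delimited version:

```python
# The quoted element, with the 10 characters of indentation assumed above.
xml_line = " " * 10 + "<Parental-Advisory>FALSE</Parental-Advisory>"

# Hypothetical comma-delimited encoding: one character plus a separator.
csv_field = "F,"

xml_bits = len(xml_line.encode("ascii")) * 8
csv_bits = len(csv_field.encode("ascii")) * 8

print(xml_bits)              # 432
print(xml_bits // csv_bits)  # 27
```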

Still, that's 27 times as much as it need be. Readability is fine, but why
does the full, expanded, human-readable textual format have to be stored on
disk too, and for every single instance?
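
For anyone who wants to measure the quoted compression claim rather than
argue about it, something like the following rough sketch would do. The
"document" here is an assumption made purely for illustration: the same
element repeated 100,000 times, which is the best case for a compressor. A
real file with varying values will compress less well.

```python
import zlib

# Hypothetical document: one element repeated many times (best case).
line = "<Parental-Advisory>FALSE</Parental-Advisory>\n"
count = 100_000
raw = (line * count).encode("ascii")

# Compress at the highest zlib level and compare sizes.
packed = zlib.compress(raw, 9)
bits_per_element = len(packed) * 8 / count

print(len(raw), len(packed), round(bits_per_element, 2))
```

The compressed file is no longer readable in a text editor, of course, so
this only addresses the space on disk, not the readability argument.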

What if the 'Parental-Advisory' tag was even longer? Just how long do these
things have to get before even the advocates here admit that it's getting
ridiculous?

Isn't it possible for XML to define a shorter alias for these tags? Isn't
there a shortcut available for </Parental-Advisory> in simple examples like
this (I seem to remember something like this)?

And why not use 1 and 0 for TRUE and FALSE? Even the consumer appliances in
my house have 1 and 0 on their power switches, and those have the advantage
of being internationally recognised.

-- 
Bartc 



