Trying to parse a HUGE(1gb) xml file

BartC bc at freeuk.com
Sat Dec 25 19:11:20 EST 2010


"Adam Tauno Williams" <awilliam at whitemice.org> wrote in message 
news:mailman.287.1293319780.6505.python-list at python.org...
> On Sat, 2010-12-25 at 22:34 +0000, Nobody wrote:
>> On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
>> >> XML works extremely well for large datasets.
>> One advantage it has over many legacy formats is that there are no
>> inherent 2^31/2^32 limitations. Many binary formats inherently cannot
>> support files larger than 2GiB or 4GiB due to the use of 32-bit offsets in
>> indices.
>
> And what legacy format has support for code pages, namespaces, schema
> verification, or comments?  None.
>
>> > Of course, one advantage of XML is that with so much redundant text, it
>> > compresses well.  We typically see gzip compression ratios of 20:1.
>> > But, that just means you can archive them efficiently; you can't do
>> > anything useful until you unzip them.
>> XML is typically processed sequentially, so you don't need to create a
>> decompressed copy of the file before you start processing it.
>
> Yep.
>
>> If file size is that much of an issue,
>
> Which it isn't.

Only if you're prepared to squander resources that could be put to better 
use.

XML is so redundant, anyone (even me :-) could probably spend an afternoon 
coming up with a compression scheme to reduce it to a fraction of its size.
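
As a rough illustration of how much redundancy there is to remove, here is a 
minimal sketch that uses the standard zlib module rather than a hand-rolled 
scheme. The sample document and the names (record, xml_text) are made up for 
the example, and the ratio it prints depends entirely on the data, so treat 
it as an experiment rather than a benchmark.

```python
import zlib

# Hypothetical sample data: a small record repeated many times, which is
# typical of the tag redundancy being discussed.
record = "<record><name>item</name><value>{}</value></record>"
xml_text = "<records>" + "".join(record.format(i) for i in range(10000)) + "</records>"
raw = xml_text.encode("utf-8")

# Compress with the general-purpose DEFLATE algorithm at maximum level.
packed = zlib.compress(raw, 9)
ratio = len(raw) / len(packed)
print(f"{len(raw)} bytes -> {len(packed)} bytes (about {ratio:.1f}:1)")

# The round trip must give back exactly the original bytes.
assert zlib.decompress(packed) == raw
```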

It can even be a custom format, provided you also send along the few dozen 
lines of Python (or whatever language) needed to decompress. Although if 
it's done properly, it might be possible to create an XML library that works 
directly on the compressed format, and as a drop-in replacement for a 
conventional library.

That will likely save time and memory.
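
A library that reads a custom compressed format directly is beyond a quick 
sketch, but the same idea can be tried today with plain gzip: the parser 
reads from the decompressing stream, so no decompressed copy is ever written 
to disk. The function name count_elements and its parameters are 
hypothetical, chosen only for this example.

```python
import gzip
import xml.etree.ElementTree as ET

def count_elements(source, tag):
    """Count <tag> elements in gzip-compressed XML, parsing it as a stream.

    source is a filename or a binary file object holding gzip data.
    """
    count = 0
    with gzip.open(source, "rb") as stream:
        # iterparse pulls data from the stream in chunks and reports each
        # element once its closing tag has been read.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag == tag:
                count += 1
            # Drop the element's text, attributes and children once seen.
            # The root still keeps the emptied child objects, so very large
            # files may need extra pruning on top of this.
            elem.clear()
    return count
```

Calling count_elements("big.xml.gz", "record") would then walk the file 
sequentially; the filename and tag here are placeholders.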

Anyway, there seem to be existing schemes for binary XML, which indicates 
that some people do think it is an issue.

I'm just concerned at the waste of computer power. (I used to think HTML was 
bad, for example repeating the same long-winded font name hundreds of times 
over in the same document. And PDF: years ago I was sent a 1MB document for 
a modem. Perhaps some substantial user manual for it? No, just a simple 
diagram showing how to plug it into the phone socket!)

-- 
Bartc 



