Trying to parse a HUGE(1gb) xml file

Nobody nobody at nowhere.com
Sat Dec 25 17:34:02 EST 2010


On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:

>> XML works extremely well for large datasets.

One advantage it has over many legacy formats is that there are no
inherent 2^31/2^32 limitations. Many binary formats cannot support files
larger than 2 GiB or 4 GiB because they use 32-bit offsets in their
indices.

> Of course, one advantage of XML is that with so much redundant text, it 
> compresses well.  We typically see gzip compression ratios of 20:1.  
> But, that just means you can archive them efficiently; you can't do 
> anything useful until you unzip them.

XML is typically processed sequentially, so you don't need to create a
decompressed copy of the file before you start processing it.
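As a rough sketch of what that looks like in Python (the file name and
element tag here are made up for illustration): gzip.open() gives you a
file object that decompresses on the fly, and ElementTree's iterparse()
reads from it incrementally, so neither the decompressed data nor the
full tree ever has to fit in memory at once.

```python
import gzip
import xml.etree.ElementTree as ET

# Build a small gzipped XML file to demonstrate with (hypothetical data).
with gzip.open("records.xml.gz", "wt", encoding="utf-8") as f:
    f.write("<records>")
    for i in range(1000):
        f.write("<record id='%d'/>" % i)
    f.write("</records>")

# Stream-parse it: gzip decompresses on the fly, iterparse yields
# elements as their end tags are seen, and elem.clear() frees each
# element once we're done with it, keeping memory use bounded.
count = 0
with gzip.open("records.xml.gz", "rb") as f:
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == "record":
            count += 1
            elem.clear()

print(count)
```

The same pattern works for a multi-gigabyte file: only one record's worth
of tree is alive at any moment.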

If file size is that much of an issue, eventually we'll see a standard for
compressing XML. That could easily produce smaller files than a dedicated
binary format run through a general-purpose compressor, because a
widely-used format such as XML attracts far more optimisation effort than
any application-specific format ever will.




More information about the Python-list mailing list