Trying to parse a HUGE (1 GB) XML file
Nobody
nobody at nowhere.com
Sat Dec 25 17:34:02 EST 2010
On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
>> XML works extremely well for large datasets.
One advantage it has over many legacy formats is that there are no
inherent 2^31/2^32 limitations. Many binary formats inherently cannot
support files larger than 2 GiB or 4 GiB due to the use of 32-bit offsets in
indices.
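As a quick illustration of where those limits come from (a sketch, not taken
from any particular file format): an index that stores byte offsets as 32-bit
integers simply cannot represent a position past 4 GiB (unsigned) or 2 GiB
(signed).

```python
import struct

# An index storing byte offsets as unsigned 32-bit integers ("<I")
# cannot point past 2**32 - 1 bytes, i.e. just under 4 GiB.
max_unsigned = 2**32 - 1
print(max_unsigned)        # 4294967295

# Packing an offset beyond that limit fails outright:
try:
    struct.pack("<I", 2**32)
except struct.error:
    print("offset too large for a 32-bit index field")

# A signed 32-bit offset ("<i") tops out even earlier, at 2 GiB:
max_signed = 2**31 - 1
print(max_signed)          # 2147483647
```

XML has no such baked-in offsets, so the 1 GB file in the subject line is
nowhere near any format-level ceiling.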
> Of course, one advantage of XML is that with so much redundant text, it
> compresses well. We typically see gzip compression ratios of 20:1.
> But, that just means you can archive them efficiently; you can't do
> anything useful until you unzip them.
XML is typically processed sequentially, so you don't need to create a
decompressed copy of the file before you start processing it.
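A minimal sketch of that point in Python, using `gzip` for on-the-fly
decompression and `xml.etree.ElementTree.iterparse` for incremental parsing
(the in-memory document and the `<record>` element name are made up for the
example; in practice you would pass a real `.xml.gz` file):

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Build a small gzip-compressed XML document in memory to stand in
# for the 1 GB file (hypothetical <record> elements for illustration).
xml_bytes = (b"<root>"
             + b"".join(b"<record id='%d'/>" % i for i in range(1000))
             + b"</root>")
buf = io.BytesIO(gzip.compress(xml_bytes))

# GzipFile yields a file-like object that decompresses as it is read;
# iterparse consumes it incrementally, so the full decompressed document
# never has to exist on disk or in memory at once.
count = 0
with gzip.GzipFile(fileobj=buf) as f:
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == "record":
            count += 1
        elem.clear()   # discard processed elements to keep memory flat

print(count)   # 1000
```

Memory stays roughly constant regardless of how large the compressed input
is, which is exactly why the 20:1 archives don't need to be unzipped to a
temporary file first.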
If file size is that much of an issue, we'll eventually see a standard for
compressing XML. Such a scheme could easily produce smaller files than a
dedicated format compressed with a general-purpose algorithm, because a
format as widely used as XML justifies far more optimisation effort than any
application-specific format ever will.