Trying to parse a HUGE (1GB) XML file

Adam Tauno Williams awilliam at whitemice.org
Sat Dec 25 18:29:15 EST 2010


On Sat, 2010-12-25 at 22:34 +0000, Nobody wrote:
> On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
> >> XML works extremely well for large datasets.
> One advantage it has over many legacy formats is that there are no
> inherent 2^31/2^32 limitations. Many binary formats inherently cannot
> support files larger than 2GiB or 4GiB due to the use of 32-bit offsets in
> indices.

And what legacy format has support for code pages, namespaces, schema
validation, or comments?  None.
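
And that 2^32 ceiling is easy to demonstrate in a couple of lines (a
minimal sketch; any format that stores offsets as 32-bit unsigned
integers hits the same wall):

    import struct

    # A 32-bit unsigned offset tops out at 2**32 - 1 (~4 GiB);
    # packing anything past that fails outright.
    struct.pack("<I", 2**32 - 1)      # fine: the last addressable byte
    try:
        struct.pack("<I", 2**32)      # one byte past the limit
    except struct.error as exc:
        print("offset overflow:", exc)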

> > Of course, one advantage of XML is that with so much redundant text, it 
> > compresses well.  We typically see gzip compression ratios of 20:1.  
> > But, that just means you can archive them efficiently; you can't do 
> > anything useful until you unzip them.
> XML is typically processed sequentially, so you don't need to create a
> decompressed copy of the file before you start processing it.

Yep.
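
For instance, ElementTree's iterparse() will read straight from a gzip
stream, so a compressed file can be processed element by element with
flat memory use (a minimal sketch; the file name and the "record" tag
are hypothetical stand-ins for real data):

    import gzip
    import xml.etree.ElementTree as ET

    # Stream-parse a gzipped XML file without ever writing a
    # decompressed copy to disk.
    with gzip.open("dataset.xml.gz", "rb") as f:
        for event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == "record":
                print(elem.findtext("id"))  # real work goes here
                elem.clear()  # drop the element; keep memory flat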

> If file size is that much of an issue,

Which it isn't.

>  eventually we'll see a standard for
> compressing XML. That could easily result in smaller files than a
> dedicated format compressed with a general-purpose algorithm, as a
> widely-used format such as XML merits more effort than any
> application-specific format.

Agreed; and there already is a standard compression scheme -
HTTP compression [supported by every modern web server] - so the data is
compressed at the only point where it matters [during transfer].
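
For example (a minimal Python 3 stdlib sketch; the URL is hypothetical,
and whether the payload actually arrives gzipped depends on the server's
configuration):

    import gzip
    import urllib.request
    import xml.etree.ElementTree as ET

    req = urllib.request.Request(
        "http://example.com/huge-dataset.xml",
        headers={"Accept-Encoding": "gzip"},
    )
    with urllib.request.urlopen(req) as resp:
        stream = resp
        if resp.headers.get("Content-Encoding") == "gzip":
            # Decompress on the fly; nothing is unpacked to disk.
            stream = gzip.GzipFile(fileobj=resp)
        for event, elem in ET.iterparse(stream, events=("end",)):
            elem.clear()  # parse-and-discard; real handling goes here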

Again: "XML works extremely well for large datasets".



