[Tutor] Trying to parse a HUGE(1gb) xml file in python

Stefan Behnel stefan_ml at behnel.de
Tue Dec 21 16:03:09 CET 2010


Alan Gauld, 21.12.2010 15:11:
> "Stefan Behnel" wrote
>>> And I thought a 1G file was extreme... Do these people stop to think that
>>> with XML as much as 80% of their "data" is just description (i.e. the tags)?
>>
>> As I already said, it compresses well. In run-length compressed XML
>> files, the tags can easily take up a negligible amount of space compared
>> to the more widely varying data content
>
> I understand how compression helps with the data transmission aspect.
>
>> compress rather well). And depending on how fast your underlying storage
>> is, decompressing and parsing the file may still be faster than parsing a
>> huge uncompressed file directly.
>
> But I don't understand how uncompressing a file before parsing it can
> be faster than parsing the original uncompressed file?

I didn't say "uncompressing a file *before* parsing it". I meant 
uncompressing the data *while* parsing it. Just as the data has to be 
decoded for parsing anyway, decompressing it is simply one more step in 
that pipeline. Depending on how your I/O speed compares to your 
decompression speed, it can be faster to read the smaller compressed 
data and decompress it into the parser on the fly. lxml.etree (or 
rather libxml2) does that for you internally, for example, when it 
detects compressed input while parsing from a file.
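
For illustration, a minimal sketch of the same streaming approach with 
the standard library (the file name and the "record" tag are just 
placeholders):

import gzip
import xml.etree.ElementTree as ET

# gzip.open() returns a file-like object that decompresses on the fly,
# and iterparse() reads from it incrementally, so the uncompressed
# document never has to exist on disk or in memory as a whole.
with gzip.open("huge.xml.gz", "rb") as f:
    for event, elem in ET.iterparse(f):
        if elem.tag == "record":
            # ... process the complete <record> element here ...
            elem.clear()  # then drop its content to keep memory flat

With lxml you don't even need the gzip wrapper; parsing the compressed 
file directly is enough, since libxml2 detects the compression itself.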

Note that these performance differences are tricky to prove in 
benchmarks. Repeating a benchmark usually means that the file already 
sits in the operating system's page cache after the first run, so 
reading it becomes essentially free and the decompression overhead 
dominates the second run. That's not what you will see in a clean 
(cold-cache) run or for files too large to cache, though.
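
A quick way to see the effect yourself (file names are placeholders; 
use files that are actually large):

import gzip
import time
import xml.etree.ElementTree as ET

def parse_plain(path):
    for event, elem in ET.iterparse(path):
        elem.clear()

def parse_gzipped(path):
    # decompress into the parser on the fly
    with gzip.open(path, "rb") as f:
        for event, elem in ET.iterparse(f):
            elem.clear()

# Time each variant twice: the first pass reads from disk, the second
# is served from the OS page cache, which makes the compressed variant
# look worse than it would be for a genuinely cold read.
for label, parse, path in [("plain", parse_plain, "huge.xml"),
                           ("gzip", parse_gzipped, "huge.xml.gz")]:
    for run in (1, 2):
        start = time.time()
        parse(path)
        print(label, "run", run, "%.2fs" % (time.time() - start))

On Linux you can drop the page cache between timings (as root: 
echo 3 > /proc/sys/vm/drop_caches) to get repeatable cold-cache numbers.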

Stefan
