[Tutor] Trying to parse a HUGE(1gb) xml file in python

Walter Prins wprins at gmail.com
Tue Dec 21 22:06:34 CET 2010


On 21 December 2010 14:11, Alan Gauld <alan.gauld at btinternet.com> wrote:

> But I don't understand how uncompressing a file before parsing it can
> be faster than parsing the original uncompressed file?
>

Because of I/O overhead.  It's not that the parsing itself is faster, of
course (it is what it is); it's that the total time for
(read + decompress + parse) can beat plain (read + parse), because the time
saved reading the smaller compressed file from disk more than covers the time
spent decompressing it in RAM.  Generally speaking, compared to your CPU and
memory, the disk is the I/O bottleneck, though of course it does depend on
exactly how much data we're talking about, how fast your CPU is, etc.
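For the original poster's problem (a huge XML file), a minimal sketch of the
idea might look like the following.  The file name 'huge.xml.gz' and the
'record' tag are just placeholders for illustration; iterparse streams the
document, so memory use stays modest even for very large files:

    import gzip
    import xml.etree.ElementTree as ET

    # Stream-parse a gzip-compressed XML file without loading it all
    # into RAM.  gzip.open returns a file-like object that iterparse
    # can read from directly.
    with gzip.open('huge.xml.gz', 'rb') as f:
        for event, elem in ET.iterparse(f):
            if elem.tag == 'record':
                # ... process elem here ...
                elem.clear()  # free the element once it has been handled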

In general computing this is less of an issue nowadays than it was a few
years ago, and the gains can, as you say, be small or sometimes not so small,
depending on exactly how much data you've got, how well it compresses, how
fast/efficient the decompressor is, how slow your I/O channel is, etc., but
the point nevertheless stands.  Case in point: it's perhaps interesting to
note that this technique is used regularly on the web in general -- many web
servers actually serve their HTML content as gzip-compressed streams
(DEFLATE, an LZ77-based format), since (as above) it's quicker to compress,
transfer and decompress than it is to just transfer the data uncompressed.
(And, of course, thanks to zlib + urllib, one can even use this feature from
Python should you wish to do so.)
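As a rough sketch of that last point (written for Python 3's urllib.request;
the urllib2 of the day works along the same lines), with a placeholder URL:

    import urllib.request
    import zlib

    # Ask the server for gzip-compressed content and decompress it
    # ourselves.  'http://example.com/' is a placeholder.
    req = urllib.request.Request('http://example.com/',
                                 headers={'Accept-Encoding': 'gzip'})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        if resp.headers.get('Content-Encoding') == 'gzip':
            # 16 + MAX_WBITS tells zlib to expect a gzip wrapper
            body = zlib.decompress(body, 16 + zlib.MAX_WBITS)
    print(len(body))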

Anyway, just my $0.02!

Walter