Trying to parse a HUGE (1GB) XML file

Tim Harig usernet at ilthio.net
Sun Dec 26 04:01:54 EST 2010


On 2010-12-26, Nobody <nobody at nowhere.com> wrote:
> On Sun, 26 Dec 2010 01:05:53 +0000, Tim Harig wrote:
>
>>> XML is typically processed sequentially, so you don't need to create a
>>> decompressed copy of the file before you start processing it.
>> 
>> Sometimes XML is processed sequentially.  When the markup footprint is
>> large enough it must be.  Quite often, as in the case of the OP, you only
>> want to extract a small piece out of the total data.  In those cases,
>> being forced to read all of the data sequentially is both inconvenient
>> and a performance penalty unless there is some way to address the data you
>> want directly.
>
> OTOH, formats designed for random access tend to be more limited in their
> utility. You can only perform random access based upon criteria which
> match the format's indexing. Once you step outside that, you often have to
> walk the entire file anyhow.
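
For concreteness, the sequential processing being discussed might look
like the following minimal sketch, using xml.etree.ElementTree.iterparse
from the standard library.  The file name huge.xml, the <item> record
tag, and the id attribute are all hypothetical stand-ins for the OP's
actual data:

    import xml.etree.ElementTree as ET

    # Stream the document; an "end" event fires once an element has been
    # fully parsed, so the whole 1GB tree is never held in memory at once.
    for event, elem in ET.iterparse("huge.xml", events=("end",)):
        if elem.tag == "item":
            if elem.get("id") == "target":
                print(ET.tostring(elem, encoding="unicode"))
                break
            # Drop the element's children so memory use stays roughly flat.
            elem.clear()

The win there is memory, not speed: every byte of the file still gets
parsed on the way to the record you want.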

That may or may not be true.  Even if you have to walk through a large
number of top-level elements, there may be an advantage to being able to
jump directly to the next element rather than parsing through the entire
current element once you have determined it isn't the one you are looking
for.  To be fair, this may be premature optimization that doesn't take
into account how the hard drive buffers reads; but I suspect there is a
threshold where the amount of data skipped starts to outweigh the penalty
of seeking beyond the drive's read-ahead buffer.
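
To illustrate what direct addressing can buy, here is a rough two-pass
sketch: the first pass records the byte offset of each top-level record's
start tag, after which any record is one seek away instead of a full
parse.  Again, huge.xml, the <item> tag, and record number 42 are
hypothetical:

    import re

    # Pass 1: scan once, recording the byte offset of every record's
    # start tag (assumes a start tag never spans a line break).
    offsets = []
    pattern = re.compile(rb"<item\b")
    with open("huge.xml", "rb") as f:
        pos = 0
        for line in f:
            for match in pattern.finditer(line):
                offsets.append(pos + match.start())
            pos += len(line)

    # Pass 2 (or any later run, if the offsets are saved): seek straight
    # to the Nth record rather than parsing everything before it.
    with open("huge.xml", "rb") as f:
        f.seek(offsets[42])
        record = f.read(4096)  # read enough bytes to cover one record

Whether that seek actually beats a buffered sequential read is exactly
the drive-buffering question raised above.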


