Trying to parse a HUGE(1gb) xml file

Tim Harig usernet at ilthio.net
Sun Dec 26 15:15:16 EST 2010


On 2010-12-26, Stefan Behnel <stefan_ml at behnel.de> wrote:
> Tim Harig, 26.12.2010 10:22:
>> On 2010-12-26, Stefan Behnel wrote:
>>> Tim Harig, 26.12.2010 02:05:
>>>> On 2010-12-25, Nobody wrote:
>>>>> On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
>>>>>> Of course, one advantage of XML is that with so much redundant text, it
>>>>>> compresses well.  We typically see gzip compression ratios of 20:1.
>>>>>> But, that just means you can archive them efficiently; you can't do
>>>>>> anything useful until you unzip them.
>>>>>
>>>>> XML is typically processed sequentially, so you don't need to create a
>>>>> decompressed copy of the file before you start processing it.
>>>>
>>>> Sometimes XML is processed sequentially.  When the markup footprint is
>>>> large enough it must be.  Quite often, as in the case of the OP, you only
>>>> want to extract a small piece out of the total data.  In those cases, being
>>>> forced to read all of the data sequentially is both inconvenient and a
>>>> performance penalty unless there is some way to address the data you want
>>>> directly.
>>>  [...]
>>> If you do it a lot, you will have to find a way to make the access
>>> efficient for your specific use case. So the file format doesn't matter
>>> either, because the data will most likely end up in a fast data base after
>>> reading it in sequentially *once*, just as in the case above.
>>
>> If the data is just going to end up in a database anyway; then why not
>> send it as a database to begin with and save the trouble of having to
>> convert it?
>
> I don't think anyone would object to using a native format when copying 
> data from one database 1:1 to another one. But if the database formats are 
> different on both sides, it's a lot easier to map XML formatted data to a 
> given schema than to map a SQL dump, for example. Matter of use cases, not 
> of data size.

Your argument keeps hinging on the assumption that I should want to dump
the data into a database in the first place.  Very often I don't.
I just want to rip out the small portion of information that happens to
be important to me.  I may not even want to archive my little piece of
the information once I have processed it.
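
For illustration, a rough sketch of that kind of extraction (the file
name and the "record" and "price" element names are invented here):
iterparse() streams the document, and it will read from a gzip file
object directly, so the archive never needs to be decompressed to disk.

    import gzip
    import xml.etree.ElementTree as ET

    def extract_prices(path):
        # Stream the document instead of building the whole tree.
        source = gzip.open(path) if path.endswith(".gz") else open(path, "rb")
        with source:
            for event, elem in ET.iterparse(source, events=("end",)):
                if elem.tag == "record":
                    yield elem.findtext("price")
                    elem.clear()  # discard elements we are done with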

Even assuming that I want to dump all the data into a database,
walking through a bunch of database records to translate them into the
schema for another database is no more difficult than walking through a
bunch of XML elements.  In fact, it is even easier since I can use the
relational model to reconstruct the information in an organization that
better fits how the data is actually structured in my database instead
of being constrained by how somebody else wanted to organize their XML.
There is no need to "map a[sic] SQL dump."
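
To make the comparison concrete, here is a rough sketch of that kind of
record walking with sqlite3; the table and column names are made up for
the example, and the point is only that the rows can be re-assembled
into whatever schema the target database actually uses:

    import sqlite3

    src = sqlite3.connect("vendor_dump.db")
    dst = sqlite3.connect("local.db")
    dst.execute("CREATE TABLE IF NOT EXISTS parts"
                " (sku TEXT, vendor TEXT, price REAL)")
    # Re-shape the source's layout into my own schema as the rows stream by.
    rows = src.execute("SELECT p.sku, v.name, p.price"
                       " FROM parts p JOIN vendors v ON v.id = p.vendor_id")
    dst.executemany("INSERT INTO parts VALUES (?, ?, ?)", rows)
    dst.commit()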

XML is great when the data set is small enough that parsing the whole
tree has negligible cost.  I can choose whether I want to parse it
sequentially or use XPath/DOM/ElementTree etc. to make it appear as
though I am making random accesses.  When the data set grows so large
that parsing it is expensive, I lose that choice even if my use case
would otherwise prefer a random access paradigm.  When that happens,
there are better ways of communicating the data that don't force me
into using a high overhead method of extracting it.
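
For a small document the choice is real enough; a sketch, with an
invented catalog file and invented element names:

    import xml.etree.ElementTree as ET

    # With the whole tree in memory, any element is one query away.
    tree = ET.parse("catalog.xml")   # cheap only while the file is small
    for price in tree.findall(".//item[@currency='USD']/price"):
        print(price.text)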

The problem is that XML has become such a de facto standard that it is
used automatically, without thought, even when there are much better
alternatives available.


