Trying to parse a HUGE(1gb) xml file

Stefan Behnel stefan_ml at behnel.de
Sun Dec 26 04:44:35 EST 2010


Tim Harig, 26.12.2010 10:22:
> On 2010-12-26, Stefan Behnel wrote:
>> Tim Harig, 26.12.2010 02:05:
>>> On 2010-12-25, Nobody wrote:
>>>> On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
>>>>> Of course, one advantage of XML is that with so much redundant text, it
>>>>> compresses well.  We typically see gzip compression ratios of 20:1.
>>>>> But, that just means you can archive them efficiently; you can't do
>>>>> anything useful until you unzip them.
>>>>
>>>> XML is typically processed sequentially, so you don't need to create a
>>>> decompressed copy of the file before you start processing it.
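For illustration, a rough sketch of that kind of streaming parse with the
stdlib; the file name and the <record> tag below are invented:

    import gzip
    import xml.etree.ElementTree as ET

    # iterparse() pulls events from the decompressed stream incrementally,
    # so the unpacked document never has to exist on disk or in memory
    # as a whole.
    with gzip.open('huge.xml.gz', 'rb') as f:
        for event, elem in ET.iterparse(f):
            if elem.tag == 'record':
                # handle one record here, e.g. read elem.findtext('name')
                elem.clear()   # then drop the subtree to keep memory flat

lxml.etree.iterparse() accepts a file object in the same way.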
>>>
>>> Sometimes XML is processed sequentially.  When the markup footprint is
>>> large enough it must be.  Quite often, as in the case of the OP, you only
>>> want to extract a small piece out of the total data.  In those cases, being
>>> forced to read all of the data sequentially is both inconvenient and a
>>> performance penalty unless there is some way to address the data you want
>>> directly.
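An incremental parse at least lets you stop as soon as you have what you
need, though you still pay for scanning everything up to that point.
Another sketch with made-up names:

    import xml.etree.ElementTree as ET

    # Stop as soon as the one piece we want has been seen; everything
    # before it still gets parsed, which is the cost being discussed.
    for event, elem in ET.iterparse('huge.xml'):
        if elem.tag == 'title':
            print(elem.text)
            break
        elem.clear()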
>>  [...]
>> If you do it a lot, you will have to find a way to make the access
>> efficient for your specific use case. So the file format doesn't matter
>> either, because the data will most likely end up in a fast data base after
>> reading it in sequentially *once*, just as in the case above.
>
> If the data is just going to end up in a database anyway, then why not
> send it as a database to begin with and save the trouble of having to
> convert it?

I don't think anyone would object to using a native format when copying 
data 1:1 from one database to another. But if the database formats differ 
on the two sides, it's a lot easier to map XML-formatted data to a given 
schema than to map a SQL dump, for example. It's a matter of use cases, 
not of data size.
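
To make that concrete, a rough sketch of reading a made-up XML format 
sequentially *once* and mapping it onto a target table with the stdlib's 
sqlite3 (element and column names are assumptions for the example):

    import sqlite3
    import xml.etree.ElementTree as ET

    db = sqlite3.connect('target.db')
    db.execute('CREATE TABLE IF NOT EXISTS person (name TEXT, email TEXT)')

    # One sequential pass over the XML; each <person> element is mapped
    # onto a row of the target schema.
    for event, elem in ET.iterparse('people.xml'):
        if elem.tag == 'person':
            db.execute('INSERT INTO person VALUES (?, ?)',
                       (elem.findtext('name'), elem.findtext('email')))
            elem.clear()

    db.commit()
    db.close()

The mapping itself is a few lines of Python against the XML structure, 
independent of whatever SQL dialect the data originally came from.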

Stefan



