Trying to parse a HUGE(1gb) xml file
Stefan Behnel
stefan_ml at behnel.de
Sun Dec 26 04:44:35 EST 2010
Tim Harig, 26.12.2010 10:22:
> On 2010-12-26, Stefan Behnel wrote:
>> Tim Harig, 26.12.2010 02:05:
>>> On 2010-12-25, Nobody wrote:
>>>> On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
>>>>> Of course, one advantage of XML is that with so much redundant text, it
>>>>> compresses well. We typically see gzip compression ratios of 20:1.
>>>>> But, that just means you can archive them efficiently; you can't do
>>>>> anything useful until you unzip them.
>>>>
>>>> XML is typically processed sequentially, so you don't need to create a
>>>> decompressed copy of the file before you start processing it.
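For illustration, a minimal sketch of that kind of streaming setup, using
gzip and cElementTree from the stdlib (the file and element names here are
made up; lxml.etree would work the same way):

    import gzip
    from xml.etree import cElementTree as etree

    # iterparse() reads from any file-like object, so the gzip stream is
    # decompressed on the fly -- no decompressed copy of the file on disk.
    f = gzip.open('dump.xml.gz', 'rb')     # hypothetical file name
    count = 0
    for event, elem in etree.iterparse(f):
        if elem.tag == 'record':           # hypothetical element name
            count += 1                     # do something useful here
            elem.clear()                   # keep memory usage flat
    f.close()
    print(count)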
>>>
>>> Sometimes XML is processed sequentially. When the markup footprint is
>>> large enough it must be. Quite often, as in the case of the OP, you only
>>> want to extract a small piece out of the total data. In those cases, being
>>> forced to read all of the data sequentially is both inconvenient and a
>>> performance penalty unless there is some way to address the data you want
>>> directly.
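For the one-off case, the usual sequential recipe looks something like
this -- a sketch, assuming lxml is available, with made-up file and tag
names:

    from lxml import etree     # assumption: lxml is installed

    def extract(path, tag):
        # Scan the file once, yield only the elements of interest, and
        # throw everything away again as soon as it has been processed.
        for event, elem in etree.iterparse(path, tag=tag):
            yield elem.text
            elem.clear()                          # free this element ...
            while elem.getprevious() is not None: # ... and earlier siblings
                del elem.getparent()[0]

    for value in extract('huge.xml', 'price'):
        print(value)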
>> [...]
>> If you do it a lot, you will have to find a way to make the access
>> efficient for your specific use case. So the file format doesn't matter
>> either, because the data will most likely end up in a fast database after
>> reading it in sequentially *once*, just as in the case above.
>
> If the data is just going to end up in a database anyway, then why not
> send it as a database to begin with and save the trouble of having to
> convert it?
I don't think anyone would object to using a native format when copying
data 1:1 from one database to another. But if the database formats differ
on the two sides, it's a lot easier to map XML-formatted data to a given
schema than to map a SQL dump, for example. It's a matter of use cases, not
of data size.
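To make that concrete, a rough sketch of such a one-off sequential import
into a given schema -- the 'book' table, the file name and the field names
are invented for illustration:

    import sqlite3
    from xml.etree import cElementTree as etree

    # One sequential pass over the XML, mapping each record onto the
    # target schema; later lookups then hit the indexed database directly.
    conn = sqlite3.connect('books.db')
    conn.execute('CREATE TABLE IF NOT EXISTS book '
                 '(isbn TEXT PRIMARY KEY, title TEXT, price REAL)')

    for event, elem in etree.iterparse('books.xml'):
        if elem.tag == 'book':
            conn.execute('INSERT OR REPLACE INTO book VALUES (?, ?, ?)',
                         (elem.get('isbn'),
                          elem.findtext('title'),
                          float(elem.findtext('price', '0'))))
            elem.clear()

    conn.commit()

    # from here on: direct access instead of another sequential scan
    row = conn.execute('SELECT title FROM book WHERE isbn = ?',
                       ('0123456789',)).fetchone()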
Stefan