[Tutor] Trying to parse a HUGE(1gb) xml file in python

David Hutto smokefloat at gmail.com
Tue Dec 21 11:29:07 CET 2010


On Tue, Dec 21, 2010 at 5:19 AM, Stefan Behnel <stefan_ml at behnel.de> wrote:
> Alan Gauld, 21.12.2010 10:58:
>>
>> "David Hutto" wrote
>>>
>>>
>>> http://www.google.com/search?client=ubuntu&channel=fs&q=parsing+gigabyte+xml+python&ie=utf-8&oe=utf-8
>>
>> Eeek! One of the listings says:
>>
>>> 22 Jan 2009 ... Stripping Illegal Characters from XML in Python
>>
>> ... I'd be asking Python to process 6.4 gigabytes of CSV into
>> 6.5 gigabytes of XML 1. ..... In fact, what happened was that
>> the parsing didn't work and the whole db was ...
>>
>> And I thought a 1G file was extreme... Do these people stop to think that
>> with XML, as much as 80% of their "data" is just description (i.e. the tags)?
>
> As I already said, it compresses well. In run-length compressed XML files,
> the tags can easily take up a negligible amount of space compared to the
> more widely varying data content (although that also commonly tends to
> compress rather well). And depending on how fast your underlying storage is,
> decompressing and parsing the file may still be faster than parsing a huge
> uncompressed file directly. So, again, the sheer uncompressed file size is
> *not* a very interesting argument.
>
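
For reference, the streaming approach described above might look roughly
like the following sketch. This is a minimal example, not Stefan's actual
code; the file name data.xml.gz and the <record> tag are made up:

import gzip
import xml.etree.ElementTree as ET

# Decompress and parse in one streaming pass; iterparse yields each
# element as its closing tag arrives, so the whole document never has
# to fit in memory at once.
with gzip.open("data.xml.gz", "rb") as f:
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == "record":
            pass  # handle one record here
        elem.clear()  # discard parsed content to keep memory flat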

However, could they (as mentioned elsewhere, and by others in another
form) mitigate the damage by using shorter tags exclusively? And the
compression covers the formatting as well, even the tags, correct?
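
To put a rough number on that tag question, here is a quick experiment
sketch (the tag names and record count are invented for illustration):

import zlib

# The same 10000 records serialized twice: verbose tags vs. one-letter
# tags. Only the numeric content actually varies between records.
verbose = "".join("<measurement><value>%d</value></measurement>" % i
                  for i in range(10000)).encode("ascii")
terse = "".join("<m><v>%d</v></m>" % i for i in range(10000)).encode("ascii")

print(len(verbose), len(zlib.compress(verbose)))
print(len(terse), len(zlib.compress(terse)))

The raw sizes differ by a factor of a few, but the compressed sizes come
out much closer, since the endlessly repeated tag strings cost almost
nothing once deflate has seen them a few times.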

