[Tutor] Trying to parse a HUGE(1gb) xml file in python

David Hutto smokefloat at gmail.com
Tue Dec 21 11:29:07 CET 2010


On Tue, Dec 21, 2010 at 5:19 AM, Stefan Behnel <stefan_ml at behnel.de> wrote:
> Alan Gauld, 21.12.2010 10:58:
>>
>> "David Hutto" wrote
>>>
>>>
>>> http://www.google.com/search?client=ubuntu&channel=fs&q=parsing+gigabyte+xml+python&ie=utf-8&oe=utf-8
>>
>> Eeek! One of the listings says:
>>
>>> 22 Jan 2009 ... Stripping Illegal Characters from XML in Python
>>
>> ... I'd be asking Python to process 6.4 gigabytes of CSV into
>> 6.5 gigabytes of XML 1. ..... In fact, what happened was that
>> the parsing didn't work and the whole db was ...
>>
>> And I thought a 1G file was extreme... Do these people stop to think that
>> with XML, as much as 80% of their "data" is just description (i.e. the tags)?
>
> As I already said, it compresses well. In run-length compressed XML files,
> the tags can easily take up a negligible amount of space compared to the
> more widely varying data content (although that also commonly tends to
> compress rather well). And depending on how fast your underlying storage is,
> decompressing and parsing the file may still be faster than parsing a huge
> uncompressed file directly. So, again, the sheer uncompressed file size is
> *not* a very interesting argument.
>
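
For reference, the streaming approach described above might look roughly
like the following sketch. This is a minimal example, not Stefan's actual
code; the file name data.xml.gz and the <record> tag are made up:

import gzip
import xml.etree.ElementTree as ET

# Decompress and parse in one streaming pass; iterparse yields each
# element as its closing tag arrives, so the whole document never has
# to fit in memory at once.
with gzip.open("data.xml.gz", "rb") as f:
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == "record":
            pass  # handle one record here
        elem.clear()  # discard parsed content to keep memory flat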

However, could they (as mentioned elsewhere, and by others in another
form) mitigate the damage by using shorter tags exclusively? And the
compression covers the formatting as well, even the tags, correct?
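
To put a rough number on that tag question, here is a quick experiment
sketch (the tag names and record count are invented for illustration):

import zlib

# The same 10000 records serialized twice: verbose tags vs. one-letter
# tags. Only the numeric content actually varies between records.
verbose = "".join("<measurement><value>%d</value></measurement>" % i
                  for i in range(10000)).encode("ascii")
terse = "".join("<m><v>%d</v></m>" % i for i in range(10000)).encode("ascii")

print(len(verbose), len(zlib.compress(verbose)))
print(len(terse), len(zlib.compress(terse)))

The raw sizes differ by a factor of a few, but the compressed sizes come
out much closer, since the endlessly repeated tag strings cost almost
nothing once deflate has seen them a few times.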

