Trying to parse a HUGE(1gb) xml file

Roy Smith roy at panix.com
Tue Dec 28 11:02:59 EST 2010


In article <ifcmru$abn$1 at news.eternal-september.org>,
 "BartC" <bc at freeuk.com> wrote:

> Still, that's 27 times as much as it need be. Readability is fine, but why
> does the full, expanded, human-readable textual format have to be stored on
> disk too, and for every single instance?

Well, I know the answer to that one.  The particular XML feed I'm 
working with is a dump from an SQL database.  The element names in the 
XML are exactly the same as the column names in the SQL database.

The difference is that in the database, the string 
"Parental-Advisory" appears in exactly one place, in some schema 
metadata table.  In the XML, it appears (doubled!) once per row.
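A quick sketch of the doubling, using made-up rows and the stdlib's 
ElementTree (the actual feed's schema isn't shown here): each row 
carries the column name twice, once in the open tag and once in the 
close tag, so the name count grows linearly with the row count.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows from a database dump.  The column name
# "Parental-Advisory" lives in one schema metadata table in the
# database, but the XML repeats it as <tag> and </tag> per row.
rows = [{"Parental-Advisory": "false"}, {"Parental-Advisory": "true"}]

root = ET.Element("rows")
for row in rows:
    row_elem = ET.SubElement(root, "row")
    for column, value in row.items():
        ET.SubElement(row_elem, column).text = value

xml_text = ET.tostring(root, encoding="unicode")
# Twice per row (open + close tag): 4 occurrences for 2 rows.
print(xml_text.count("Parental-Advisory"))
```

Two rows print 4; a million-row dump would carry the name two million 
times, which is where the bloat comes from.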

It's still obscene.  The fact that I understand the cause of the 
obscenity doesn't make it any less so.

Another problem with XML is that some people don't use real XML tools to 
write their XML files.  DTD?  What's that?  So you end up with tag soup 
that the real XML tools can't parse on the other end.
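For a concrete illustration (with an invented two-tag snippet, not the 
actual feed): a browser might shrug off a missing close tag, but a 
conforming XML parser is required to reject it as not well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag soup: <b> is opened but never closed, so this is
# not well-formed XML, and a real XML parser must refuse it.
soup = "<a><b>text</a>"

error = None
try:
    ET.fromstring(soup)
except ET.ParseError as err:
    error = err

print("not well-formed:", error)
```

This is the "other end" problem: once a producer emits soup like this, 
no strict parser downstream can read it, regardless of how forgiving 
the tool that wrote it was.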



More information about the Python-list mailing list