[XML-SIG] XML processing

Sun Feb 15 09:11:26 CET 2009

Hi,

Bill Kinnersley wrote:
> Can anyone add any substance to this remark?  With today's typical
> system RAM of 2GB to 3GB, is it even worth consideration any more that a
> document might not fit in memory?

That much RAM does let you parse pretty large documents. But think of
handling more than one document in parallel. In that case, you'd still want
to make sure things don't hit the swap disk.

> Offhand I'd guess the size of the XML file and the size of the DOM tree
> would be in the same ballpark.  So unless I've got more than 500MB of
> XML to read, I'm clear.  Right or wrong?

Wrong. The stdlib's minidom in particular is terribly memory hungry. Fredrik
has some benchmarks and memory size hints on his cElementTree page.

http://effbot.org/zone/celementtree.htm#benchmarks

Here are some other benchmarks from Ian Bicking on HTML parsers:

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Sadly, I do not know of any direct comparison of lxml.etree and
cElementTree regarding memory usage, but my guess is that cET is still a
bit better than lxml.etree (which is impressively memory friendly already).
A quick comparison for a 3.4MB XML file with a lot of text and very short
tag names (the Old Testament in English) gave me almost exactly the same
time for parsing. When done, I had a 17MB Python interpreter for lxml.etree
and a 10MB interpreter for cET. Depending on your XML, this may change in
either direction, as both optimise their time and memory usage very differently.
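
For anyone who wants to repeat this kind of comparison on their own data,
here is a rough sketch. It is not how the numbers above were taken (those
are interpreter sizes). It assumes a recent Python 3, where tracemalloc is
available and xml.etree.ElementTree uses the C accelerator that used to be
cElementTree. The sample document and the helper names are made up for
illustration. As far as I know, tracemalloc only sees memory allocated
through Python's own allocators, so it would miss what libxml2 allocates
for lxml.etree; for that you would have to look at the process size.

```python
import tracemalloc
import xml.etree.ElementTree as ET
from xml.dom import minidom


def make_sample_xml(n_items=2000):
    # Hypothetical test document: short tag names, a little text per element.
    items = "".join("<v n='%d'>some text %d</v>" % (i, i)
                    for i in range(n_items))
    return ("<root>" + items + "</root>").encode("utf-8")


def tree_size_factor(parse, data):
    # Warm-up call so that lazily imported parser modules are not counted.
    parse(b"<r/>")
    tracemalloc.start()
    tree = parse(data)  # keep a reference so the tree stays alive
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    # Memory held after parsing, relative to the serialised size.
    return current / len(data)


data = make_sample_xml()
factors = {
    "ElementTree": tree_size_factor(ET.fromstring, data),
    "minidom": tree_size_factor(minidom.parseString, data),
}
for name, factor in factors.items():
    print("%-12s %.1f x serialised size" % (name, factor))
```

Replace make_sample_xml() with the bytes of a real file to see the factor
for your own documents.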

For minidom, I get about 60MB, where Fredrik got 80MB. That's still about a
factor of 17-23 compared to the serialised XML file, whereas lxml and cET
end up with a factor of 3-5. Your assumption that you can use a system with
3GB of RAM to parse a 500MB XML file into an in-memory tree can easily turn
out wrong for XML files with more tags and shorter text content (say,
numbers), or for documents in non-European languages.
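
The arithmetic behind those factors, scaled up naively to a 500MB file,
looks like this. The linear scaling is an assumption, and the interpreter
sizes include the interpreter itself, so take the results as ballpark
figures only.

```python
FILE_MB = 3.4  # size of the serialised test file

# Interpreter sizes after parsing, in MB, as quoted above.
measured_mb = {
    "cET": 10,
    "lxml.etree": 17,
    "minidom (my run)": 60,
    "minidom (Fredrik)": 80,
}


def projected_mb(tree_mb, target_file_mb=500):
    # Naive linear scaling: assumes the same factor holds for a bigger file.
    return tree_mb / FILE_MB * target_file_mb


for name, mb in measured_mb.items():
    print("%-18s factor %4.1f -> about %5.0f MB for a 500MB file"
          % (name, mb / FILE_MB, projected_mb(mb)))
```

With these inputs, cET comes out at roughly 1.5GB and lxml.etree at roughly
2.5GB, while minidom lands somewhere around 9-12GB, well beyond a 3GB
machine.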

Stefan