minidom memory performance

Geoff Gerrietts geoff at gerrietts.net
Fri May 23 15:19:25 EDT 2003


Quoting Peter Hansen (peter at engcorp.com):
> I'd say yes.  Anyway, it's very unlikely that using the DOM on a
> 28MB XML file is a good idea.  Using SAX might be more effective, or
> a much lighter-weight tree-based approach like PyRXP.

Quoting John Wilson (tug at wilson.co.uk):
> Generally large XML files are many repetitions of the same complex
> element. It might be possible to do some black magic which builds a
> minidom instance for each of these elements, process the instance
> then junk it and build a minidom instance for the next element. A
> better approach is to use SAX and process the document on the fly.

I think DOM was chosen for ease of use after the fact, to meet a tight
deadline. SAX is considerably harder to work with, just in terms of
the corner cases the programmer is expected to handle. 4DOM's
cDomlette yielded better memory performance, but has byte-alignment
problems when dealing with Unicode on the Alpha. I've recommended
PyRXP, but PyRXP's license (GPL) may present some problems; the work
is being developed to support a client's application, and the client
has not (to my understanding) given consent to open source any of the
code. I have not yet suggested pulldom, but that might work for them.
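
For what it's worth, here is a rough sketch of how pulldom could do the
"build a small DOM per element, process it, junk it" trick John
describes. It assumes modern Python 3 and a made-up document shape
(a <records> root holding repeated <record> elements, each with a
<name> child); I have not seen the real format, so every element name
here is hypothetical.

```python
import io
from xml.dom import pulldom

# Hypothetical sample: many repetitions of the same small element.
SAMPLE = (
    "<records>"
    + "".join(f'<record id="{i}"><name>item{i}</name></record>' for i in range(3))
    + "</records>"
).encode("utf-8")


def iter_records(stream, tag="record"):
    """Yield one small DOM subtree per matching element, never the whole tree."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            # Pull just this element's children into a DOM fragment.
            events.expandNode(node)
            yield node


def text_of(element):
    """Join the text children of an element (text may arrive in pieces)."""
    return "".join(
        child.data for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


def collect_names(stream):
    names = []
    for record in iter_records(stream):
        name = record.getElementsByTagName("name")[0]
        names.append(text_of(name))
        # 'record' goes out of scope here, so its subtree can be freed.
    return names


names = collect_names(io.BytesIO(SAMPLE))
```

Each record is an ordinary minidom element once expanded, so the
existing DOM-style processing code could mostly be reused on it.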

SAX may be the simplest way; it's just harder to write the code to do
what little really needs to be done.
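
To illustrate the extra bookkeeping, here is a minimal SAX sketch,
assuming modern Python 3 and a hypothetical document of repeated
<record> elements with a <name> child (the real format is unknown to
me). Note how the handler has to track by hand where it is in the
document and stitch character data back together.

```python
import io
import xml.sax

# Hypothetical sample: many repetitions of the same small element.
SAMPLE = (
    "<records>"
    + "".join(f'<record id="{i}"><name>item{i}</name></record>' for i in range(3))
    + "</records>"
).encode("utf-8")


class NameCollector(xml.sax.ContentHandler):
    """Collect the text of every <name> element while streaming."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False   # manual state: are we inside <name>?
        self._chunks = []       # characters() may deliver text in pieces

    def startElement(self, tag, attrs):
        if tag == "name":
            self._in_name = True
            self._chunks = []

    def characters(self, content):
        if self._in_name:
            self._chunks.append(content)

    def endElement(self, tag):
        if tag == "name":
            self.names.append("".join(self._chunks))
            self._in_name = False


def sax_names(stream):
    handler = NameCollector()
    xml.sax.parse(stream, handler)
    return handler.names


result = sax_names(io.BytesIO(SAMPLE))
```

Memory use stays flat regardless of file size, but every new thing
you want out of the document means more state flags in the handler.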

I am reporting all this second-hand, keep in mind, so any perceived
stupidity may be the fault of my translation.

Quoting Peter Hansen (peter at engcorp.com):
>  What's in this massive file?  Can it just be processed chunk by
>  chunk?

Quoting John Wilson (tug at wilson.co.uk):
> Can you give us some idea of the document structure and what you want to do
> with it?

Again, this isn't my document, or my client. I don't know what kind of
details I can give you. I suspect that it's correct to say the
document is a collection of smaller, repeated elements, but I've never
seen the data format.

I'm not really looking to have the problem solved -- it's not my
problem to solve, though I'm always willing to pass pointers along to
my friend. I was just curious whether that's really for real: 10000%
overhead in minidom. That's pretty spectacular, even for a
memory-heavy API like DOM.
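
If anyone wants to check the ratio themselves, something like the
following sketch would do, assuming a modern Python 3 with the
tracemalloc module. The sample document is synthetic and hypothetical,
and tracemalloc only sees allocations made through Python's own
allocator, so treat whatever number comes out as a rough estimate and
not a measurement of the real 28MB file.

```python
import tracemalloc
from xml.dom import minidom


def make_sample(count):
    """Build a synthetic document of repeated elements (hypothetical shape)."""
    body = "".join(
        f'<record id="{i}"><name>item{i}</name></record>' for i in range(count)
    )
    return f"<records>{body}</records>".encode("utf-8")


def minidom_overhead(xml_bytes):
    """Return (peak traced bytes, peak / input size) for one minidom parse."""
    tracemalloc.start()
    try:
        doc = minidom.parseString(xml_bytes)
        _, peak = tracemalloc.get_traced_memory()
        doc.unlink()  # break reference cycles so the tree can be freed
    finally:
        tracemalloc.stop()
    return peak, peak / len(xml_bytes)


data = make_sample(2000)
peak, ratio = minidom_overhead(data)
print(f"input: {len(data)} bytes, peak traced: {peak} bytes, ratio: {ratio:.1f}x")
```

Multiply the ratio by 100 to compare against the percentage figure
quoted in this thread; the result will vary with document shape and
Python version.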

Anyway, thanks for the comments.

--G.

-- 
Geoff Gerrietts <geoff at gerrietts dot net>
-rw-rw-rw-:         permissions of the beast




