Hi Ian,

Ian Bicking wrote:
> For the curious, I've attached some benchmarks. These are preliminary;
> I'm putting together the numbers for my HTML talk at PyCon.
Those /are/ pretty impressive numbers. Go get some lxml ads up at PyCon. :)
> One thing that I'd like to test is the memory use for documents. To do
> this, I'm parsing about 4.5MB of documents, keeping them in memory, and
> looking at the VSZ/RSS sizes reported by ps before and after. I don't
> think this is the right/best way to do it. For instance, transient
> memory use by some parsers makes Python grab a bunch of memory, but it
> might be free after parsing and usable for other things. Also, I don't
> know whether VSZ/RSS is a valid measure at all; I get the impression it
> isn't. And the increases I'm seeing for lxml seem too small: the
> process should grow by at least 4.5MB, right? lxml can't be that much
> more efficient than the serialized form of these files.
:) Didn't you see the code snippet in lxml's parser that sneaks all
documents into dark memory?

I noticed that you calculate the initial size /after/ parsing in the
--serialize case. If I move it before the parsing, I get reasonable
numbers for lxml: +17MB for 2.5MB of documents on a 32-bit machine. I
don't mind having a bit of setup-time memory in those numbers, as the
absolute numbers are dominated by the document size. They depend very
much on your specific documents anyway (the ratio of text to tags, for
example), so if two libraries are close here, either of them might win
for a specific input. And if they are far apart, well, then it's obvious
enough which one is better. A megabyte more or less doesn't tell you
anything.
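To make that concrete, here's roughly what I mean (just a quick sketch,
not code from your benchmark: it reads VmRSS from /proc, so Linux only,
and parse() and files stand in for whatever parser and input set is
under test):

    import os

    def rss_kb():
        # Resident set size as reported in /proc (Linux); this is the
        # same RSS number that ps and top show, in kB.
        for line in open('/proc/self/status'):
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

    baseline = rss_kb()                # measure *before* parsing
    trees = [parse(f) for f in files]  # keep all trees alive
    print(rss_kb() - baseline)         # growth while the trees exist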
> Another clear indication that we're measuring transient stuff is that
> when using the BeautifulSoup or html5 parser to build an lxml document,
> the memory use increases substantially. So any ideas on how to test
> memory would be much appreciated.
That's somewhat hard to do across libraries. For example, the way the
ElementSoup parser (i.e. BS on top of lxml) works is: it parses the
document with BS and then recursively translates the tree into an lxml
tree. So you temporarily use about twice the memory. To catch the
maximum amount of memory used, you'd have to intercept the tree builder
at the very end, before the BS tree is released, and measure there. I'd
just run it a couple of times and watch top while it's running; that
way you can figure out something close to the maximum yourself.

On the other hand, I don't know if temporary memory is of that much
value for a comparison. If it takes more space while parsing, so what?
You'll likely keep the document tree in memory much longer than the
parsing takes, so that's the dominating factor.
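To illustrate the two-trees situation, here is a stripped-down version
of what such a converter does (just a sketch against the BeautifulSoup
3 API, not the actual ElementSoup code; it glosses over comments and
encoding, and assumes the input markup is in a string called html):

    from BeautifulSoup import BeautifulSoup, NavigableString
    from lxml import etree

    def soup_to_etree(tag):
        # Copy one BS tag into a new lxml element, attributes included.
        element = etree.Element(tag.name, dict(tag.attrs))
        last = None
        for child in tag.contents:
            if isinstance(child, NavigableString):
                # Text maps to .text/.tail in the ElementTree model.
                if last is None:
                    element.text = (element.text or '') + unicode(child)
                else:
                    last.tail = (last.tail or '') + unicode(child)
            else:
                last = soup_to_etree(child)
                element.append(last)
        return element

    soup = BeautifulSoup(html)       # first tree: the BS one
    root = soup_to_etree(soup.html)  # both trees are alive right here
    del soup                         # only now can the BS tree go away

Right before the top-level soup_to_etree() call returns, both trees are
in memory at the same time, and that's the peak you'd want to measure.

Stefan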