Hi Ian,

Ian Bicking wrote:
> For the curious, I've attached some benchmarks. These are preliminary;
> I'm putting together the numbers for my HTML talk at PyCon.
Those /are/ pretty impressive numbers. Go get some lxml ads up at PyCon. :)
> One thing that I'd like to test is the memory use for documents. To do
> this, I'm parsing about 4.5MB of documents, keeping them in memory, and
> looking at the VSZ/RSS sizes reported by ps before and after. I don't
> think this is the right/best way to do it. For instance, transient
> memory use by some parsers makes Python grab a bunch of memory, but it
> might be free after parsing and usable for other things. Also, I don't
> know whether VSZ/RSS is a valid measure at all; I get the impression it
> isn't. And the increases I'm seeing for lxml seem too small: the
> process should grow by at least 4.5MB, right? lxml can't be that much
> more efficient than the serialized form of these files.
:) Didn't you see the code snippet in lxml's parser that sneaks all
documents into dark memory?

I noticed that you calculate the initial size /after/ parsing in the
--serialize case. If I move it before the parsing, I get reasonable
numbers for lxml: +17MB for 2.5MB of documents on a 32-bit machine. I
don't mind having a bit of setup-time memory in those numbers, as the
absolute numbers are dominated by the document size. They depend very
much on your specific documents anyway (the ratio of text to tags, for
example), so if two libraries are close here, either of them might win
for a specific input. And if they are far apart, well, then it's obvious
enough which one is better. A megabyte more or less doesn't tell you
anything.
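To make that concrete, here's roughly what I mean (just a quick sketch,
not code from your benchmark: it reads VmRSS from /proc, so Linux only,
and parse() and files stand in for whatever parser and input set is
under test):

    import os

    def rss_kb():
        # Resident set size as reported in /proc (Linux); this is the
        # same RSS number that ps and top show, in kB.
        for line in open('/proc/self/status'):
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

    baseline = rss_kb()                # measure *before* parsing
    trees = [parse(f) for f in files]  # keep all trees alive
    print(rss_kb() - baseline)         # growth while the trees exist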
> Another clear indication that we're measuring transient stuff is that
> when using the BeautifulSoup or html5 parser to build an lxml document,
> the memory use increases substantially. So any ideas on how to test
> memory would be much appreciated.
That's somewhat hard to do across libraries. For example, the way the
ElementSoup parser (i.e. BS on top of lxml) works is: it parses the
document with BS and then recursively translates the tree into an lxml
tree. So you temporarily use about twice the memory. To catch the
maximum amount of memory used, you'd have to intercept the tree builder
at the very end, before the BS tree is released, and measure there. I'd
just run it a couple of times and watch top while it's running; that
way you can figure out something close to the maximum yourself.

On the other hand, I don't know if temporary memory is of that much
value for a comparison. If it takes more space while parsing, so what?
You'll likely keep the document tree in memory much longer than the
parsing takes, so that's the dominating factor.
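To illustrate the two-trees situation, here is a stripped-down version
of what such a converter does (just a sketch against the BeautifulSoup
3 API, not the actual ElementSoup code; it glosses over comments and
encoding, and assumes the input markup is in a string called html):

    from BeautifulSoup import BeautifulSoup, NavigableString
    from lxml import etree

    def soup_to_etree(tag):
        # Copy one BS tag into a new lxml element, attributes included.
        element = etree.Element(tag.name, dict(tag.attrs))
        last = None
        for child in tag.contents:
            if isinstance(child, NavigableString):
                # Text maps to .text/.tail in the ElementTree model.
                if last is None:
                    element.text = (element.text or '') + unicode(child)
                else:
                    last.tail = (last.tail or '') + unicode(child)
            else:
                last = soup_to_etree(child)
                element.append(last)
        return element

    soup = BeautifulSoup(html)       # first tree: the BS one
    root = soup_to_etree(soup.html)  # both trees are alive right here
    del soup                         # only now can the BS tree go away

Right before the top-level soup_to_etree() call returns, both trees are
in memory at the same time, and that's the peak you'd want to measure.

Stefan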