Stefan Behnel wrote:
I noticed that you calculate the initial size /after/ parsing in the --serialize case. If I move it before that, I get reasonable numbers for lxml: +17M for 2.5MB of documents on a 32bit machine.
I didn't intend to include the --serialize option, but I must have done so. Though I don't know why they weren't *all* messed up then? Anyway, I get 25MB, which seems quite reasonable. Here are the revised numbers:

                         VSZ /    RSS
      lxml          :  25908 /  26232
      bs            :  82508 /  82168
      html5_cet     :  54616 /  54760
      html5_et      :  64688 /  64964
      html5_lxml    :  49056 /  49124
      html5_minidom : 194352 / 192936
      html5_simple  :  99772 /  98016
      lxml_bs       : 104916 / 104856
      htmlparser    :   4440 /   4448

I also tried allocating random strings until the size increased, to see if there was lots of allocated-but-free memory (the unused amount is an estimate, as I'm unsure of the exact internal representation of a list of strings; a rough sketch of the probe is below). The results were peculiar:

                         VSZ /    RSS
      lxml          :  26952 /  26211  (unused: 5)
      bs            :  83408 /  82156  (unused: 0)
      html5_cet     :  55640 /  54745  (unused: 19)
      html5_et      :  65712 /  64946  (unused: 14)
      html5_lxml    :  50072 /  48986  (unused: 134)
      html5_minidom : 195372 / 192914  (unused: 14)
      html5_simple  :  99772 /  97999  (unused: 17)
      lxml_bs       : 104644 /  73037  (unused: 31783)
      htmlparser    :   4448 /   4433  (unused: 19)

I guess I'm not surprised that lxml_bs (lxml.html.ElementSoup) has lots of free memory left over at the end. I am surprised that the others don't; at the least, html5_lxml should be similar, I'd think (though if you take the unused memory into account, html5_lxml and lxml_bs are similar).

I don't actually know if BS is better than lxml at parsing... anything. I haven't looked hard (yet, at least). The example on the ElementSoup page parses *slightly* better with BS, but lxml parses it very similarly to how html5lib parses it, which I'd consider the better standard; html5lib has the advantage of being a kind of standard. If I had a good collection of crappy HTML, it would probably be an interesting test to see how differently html5lib, BS, and lxml parse it. I'm not sure where to find a good collection like that. Maybe html5lib's tests, I guess.
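A rough sketch of that kind of probe (not the exact script; it reads /proc so it is Linux-only, and the per-string overhead used to turn the count into kilobytes is just a guess):

    import random
    import string

    def rss_kb():
        """Resident set size of the current process in KiB (from /proc)."""
        with open('/proc/self/status') as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])

    def estimate_unused_kb(strlen=100, overhead=50):
        # Keep allocating distinct random strings until the RSS grows;
        # whatever fit before the growth approximates memory Python had
        # already allocated but was not using.
        baseline = rss_kb()
        filler = []
        while rss_kb() == baseline:
            filler.append(''.join(random.choice(string.ascii_letters)
                                  for _ in range(strlen)))
        # payload plus a guessed per-object / list-slot overhead, in KiB
        return len(filler) * (strlen + overhead) // 1024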
Another clear indication that we're measuring transient stuff is that memory increases substantially when using the BeautifulSoup or html5 parser to build an lxml document. So any ideas on how to test memory would be much appreciated.
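For context, a rough sketch of measuring the three lxml-tree-producing paths side by side; 'page.html' is just a stand-in input, and the parser entry points (html5lib.parse with the lxml tree builder, lxml.html.soupparser) are written from memory and may not match the versions used for the numbers above:

    import lxml.html

    def rss_kb():                                    # Linux-only, as above
        with open('/proc/self/status') as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])

    def growth(label, parse, path):
        # Report how much the resident size grew while building the tree.
        before = rss_kb()
        tree = parse(path)                           # keep the tree alive
        print('%-12s +%d KiB RSS' % (label, rss_kb() - before))
        return tree

    trees = []

    # lxml's own HTML parser
    trees.append(growth('lxml', lxml.html.parse, 'page.html'))

    # html5lib building an lxml tree; treebuilder='lxml' is the modern
    # spelling and may differ from the html5lib API available back then.
    import html5lib
    trees.append(growth('html5_lxml',
                        lambda p: html5lib.parse(open(p, 'rb'),
                                                 treebuilder='lxml'),
                        'page.html'))

    # BeautifulSoup parsed first, then converted to an lxml tree
    # (lxml.html.ElementSoup / lxml.html.soupparser, depending on version).
    from lxml.html import soupparser
    trees.append(growth('lxml_bs',
                        lambda p: soupparser.parse(open(p)),
                        'page.html'))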
Somewhat hard to do across libraries. For example, the way the ElementSoup parser (i.e. BS on lxml) works is: parse the document with BS, and then recursively translate the tree into an lxml tree. So you temporarily use about twice the memory. You'd have to intercept the tree builder process at the end (before releasing the BS tree) and measure there in order to get the maximum amount of memory used. I'd run it a couple of times and just watch top while it's running. That way, you can figure out something close to the maximum yourself.
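A rough way to automate the "watch top" approach (Linux-only, polling /proc; spikes shorter than the polling interval can still be missed):

    import threading
    import time

    def rss_kb():                                    # as in the earlier sketch
        with open('/proc/self/status') as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])

    def peak_rss_during(work, interval=0.05):
        """Run work() and return (result, peak RSS in KiB seen while it ran)."""
        peak = [rss_kb()]
        done = threading.Event()

        def poll():
            # Sample the RSS in the background and keep the maximum.
            while not done.is_set():
                peak[0] = max(peak[0], rss_kb())
                time.sleep(interval)

        poller = threading.Thread(target=poll)
        poller.start()
        try:
            result = work()
        finally:
            done.set()
            poller.join()
        return result, peak[0]

    # e.g. (entry point hedged, 'page.html' is a stand-in):
    #   from lxml.html import soupparser
    #   tree, peak = peak_rss_during(lambda: soupparser.parse(open('page.html')))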
I'm pretty sure what you end up with afterwards is the maximum use, as Python doesn't release memory back to the operating system after it has allocated it. (Or at least Python 2.4 doesn't.) So instead you have a pool of memory that Python isn't using, but the OS doesn't know that. I guess the assumption is that if Python never needs to use it again, at least the OS can move it to virtual memory.
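A quick experiment that shows the effect (treat it as a sketch; behaviour depends on the Python version and allocator, and newer CPython releases do hand some memory back to the OS):

    def vsz_rss_kb():
        # Current VSZ and RSS in KiB, read from /proc (Linux-only).
        vsz = rss = None
        with open('/proc/self/status') as f:
            for line in f:
                if line.startswith('VmSize:'):
                    vsz = int(line.split()[1])
                elif line.startswith('VmRSS:'):
                    rss = int(line.split()[1])
        return vsz, rss

    def report(label):
        vsz, rss = vsz_rss_kb()
        print('%-10s VSZ=%d KiB  RSS=%d KiB' % (label, vsz, rss))

    report('baseline')
    data = [str(i) * 20 for i in range(500000)]   # ~500k distinct small strings
    report('allocated')
    del data
    report('after del')   # often barely lower: freed memory stays in Python's pools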
On the other hand, I don't know if temporary memory is of that much value for a comparison. If it takes more space while parsing - so what? You'll likely keep the document tree in memory much longer than the parsing takes, so that's the dominating factor.
Right, I'm more interested in the memory the finished document takes. Intermediate memory use shows up in the performance numbers anyway. Though I don't know if all that memory use might also lead to fragmentation, slowing down later allocations? This is beyond my understanding of Python performance.

Ian