[lxml-dev] Some benchmarks
For the curious, I've attached some benchmarks. These are preliminary, I'm putting together the numbers for my HTML talk at PyCon. One thing that I'd like to test is the memory use for documents. To do this I'm parsing about 4.5Mb of documents and keeping them in memory, and looking at the VSZ/RSS sizes reported by ps before and after. I don't think this is the right/best way to do this. For instance, transient memory use by some parsers makes Python grab a bunch of memory, but it might be free after parsing, and usable for other things. Also, I don't know if VSZ/RSS is valid at all. I get the impression it isn't that valid. And the increases I'm seeing for lxml don't seem to be sufficient; at least the process should grow by 4.5Mb, right? lxml can't be that much more efficient than the serialized form of these files. Another clear indication that we're measuring transient stuff is that when using the BeautifulSoup or html5 parser with an lxml document the memory increases substantially. So any ideas on how to test memory would be much appreciated. (Maybe I could look at ps, and then start creating Python objects until the memory use increases, so that I know I've used up any extra allocated memory?) I've also attached the script, though you'll need to grab your own HTML files. html_lxml is broken; I patched it locally to work (http://code.google.com/p/html5lib/issues/detail?id=65). Ian Parsing 355 files, 4524Kb (ripped from python.org) lxml = lxml.html bs = BeautifulSoup html5_cet = html5 parser with cElementTree model html5_et = html5 parser with ElementTree model html5_lxml = html5 parser with lxml.html model html5_minidom = html5 parser with minidom model html5_simple = html5 parser with internal simple_tree model lxml_bs = BeautifulSoup parser with lxml model htmlparser = HTMLParser, with no parser actions, document string is its own model python tester.py --no-gc lxml : 0.5156 sec ( 100% of lxml) bs : 10.3816 sec (2013% of lxml) html5_cet : 29.5829 sec (5737% of lxml) html5_et : 30.2433 sec (5865% of lxml) html5_lxml : 31.7533 sec (6158% of lxml) html5_minidom : 34.2963 sec (6651% of lxml) html5_simple : 28.7421 sec (5574% of lxml) lxml_bs : 12.2269 sec (2371% of lxml) htmlparser : 3.0968 sec ( 600% of lxml) python tester.py --no-gc --serialize lxml : 0.2704 sec ( 100% of lxml) bs : 1.8265 sec ( 675% of lxml) html5_cet : 1.5960 sec ( 590% of lxml) html5_et : 1.7677 sec ( 653% of lxml) html5_lxml : 0.2755 sec ( 101% of lxml) html5_minidom : 3.4696 sec (1283% of lxml) html5_simple : 1.4929 sec ( 552% of lxml) lxml_bs : 0.2834 sec ( 104% of lxml) VSZ/RSS increase: lxml: 1168 / 120 bs: 82508 / 82176 html5_cet: 54620 / 54756 html5_et: 64688 / 64960 html5_lxml: 49076 / 49124 html5_minidom: 194304 / 192928 html5_simple: 98608 / 98004 lxml_bs: 104920 / 104852 htmlparser: 5412 / 4456 Note: htmlparser keeps all the strings of the documents in memory.
Hi Ian, Ian Bicking wrote:
For the curious, I've attached some benchmarks. These are preliminary; I'm putting together the numbers for my HTML talk at PyCon.
Those /are/ pretty impressive numbers. Go, get some lxml ads up on PyCon. :)
One thing that I'd like to test is the memory use for documents. To do this I'm parsing about 4.5MB of documents, keeping them in memory, and looking at the VSZ/RSS sizes reported by ps before and after. I don't think this is the right/best way to do this. For instance, transient memory use by some parsers makes Python grab a bunch of memory, but it might be free after parsing and usable for other things. Also, I don't know if VSZ/RSS is valid at all; I get the impression it isn't that valid. And the increases I'm seeing for lxml don't seem to be sufficient: the process should grow by at least 4.5MB, right? lxml can't be that much more efficient than the serialized form of these files.
:) Didn't you see the code snippet in lxml's parser that sneaks all documents into dark memory?

I noticed that you calculate the initial size /after/ parsing in the --serialize case. If I move it before that, I get reasonable numbers for lxml: +17MB for 2.5MB of documents on a 32-bit machine.

I don't mind having a bit of setup-time memory in those numbers, as the absolute numbers are dominated by the document size. They very much depend on your specific documents anyway (the ratio of text to tags, for example). So if two libraries are close here, either of them might win for a specific input. And if they are far apart, well, then it's obvious enough which one is better. A meg more or less makes no difference.
Another clear indication that we're measuring transient stuff is that when using the BeautifulSoup or html5 parser with an lxml document, the memory increases substantially. So any ideas on how to test memory would be much appreciated.
Somewhat hard to do across libraries. For example, the way the ElementSoup parser (i.e. BS on lxml) works is: parse the document with BS, then recursively translate the tree into an lxml tree. So you temporarily use about twice the memory. You'd have to intercept the tree builder process at the end (before releasing the BS tree) and measure there in order to get the maximum amount of memory used. (A rough sketch of this two-tree translation follows below.)

I'd run it a couple of times and just watch top while it's running. That way, you can figure out something close to the maximum yourself.

On the other hand, I don't know if temporary memory is of that much value for a comparison. If it takes more space while parsing - so what? You'll likely keep the document tree in memory much longer than the parsing takes, so that's the dominating factor.

Stefan
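To make the two-tree cost concrete, here is a minimal sketch of that recursive translation, assuming the current bs4 API; this illustrates the approach, it is not the actual ElementSoup code (`soup_to_lxml` is a made-up helper, and comments and other special string nodes are treated as plain text for brevity):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag
from lxml import etree

def soup_to_lxml(tag):
    # Multi-valued attributes (e.g. class) come back as lists in bs4.
    attrib = {k: ' '.join(v) if isinstance(v, list) else v
              for k, v in tag.attrs.items()}
    elem = etree.Element(tag.name, attrib)
    last = None
    for child in tag.children:
        if isinstance(child, NavigableString):
            # Text before the first child element becomes .text;
            # text after a child element becomes that element's .tail.
            if last is None:
                elem.text = (elem.text or '') + str(child)
            else:
                last.tail = (last.tail or '') + str(child)
        elif isinstance(child, Tag):
            last = soup_to_lxml(child)
            elem.append(last)
    return elem

soup = BeautifulSoup('<p>Hello <b>big</b> world!</p>', 'html.parser')
root = soup_to_lxml(soup.find('p'))
# At this point both the soup tree and the lxml tree are alive in memory,
# which is why peak usage is roughly double the final document size.
print(etree.tostring(root))
```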
Stefan Behnel wrote:
I noticed that you calculate the initial size /after/ parsing in the --serialize case. If I move it before that, I get reasonable numbers for lxml: +17MB for 2.5MB of documents on a 32-bit machine.
I didn't intend to include the --serialize option, but must have done so. Though I don't know why they weren't *all* messed up then? Anyway, I get 25MB, which seems quite reasonable. Here are the revised numbers:

VSZ / RSS (kB)
lxml          :  25908 /  26232
bs            :  82508 /  82168
html5_cet     :  54616 /  54760
html5_et      :  64688 /  64964
html5_lxml    :  49056 /  49124
html5_minidom : 194352 / 192936
html5_simple  :  99772 /  98016
lxml_bs       : 104916 / 104856
htmlparser    :   4440 /   4448

I also tried allocating random strings until the size increased, to see if there was lots of allocated but free memory (the unused amount is an estimate, as I'm unsure what the exact internal representation of a list of strings is). The results were peculiar:

VSZ / RSS (used), in kB
lxml          :  26952 /  26211 (unused: 5)
bs            :  83408 /  82156 (unused: 0)
html5_cet     :  55640 /  54745 (unused: 19)
html5_et      :  65712 /  64946 (unused: 14)
html5_lxml    :  50072 /  48986 (unused: 134)
html5_minidom : 195372 / 192914 (unused: 14)
html5_simple  :  99772 /  97999 (unused: 17)
lxml_bs       : 104644 /  73037 (unused: 31783)
htmlparser    :   4448 /   4433 (unused: 19)

I guess I'm not surprised that lxml_bs (lxml.html.ElementSoup) has lots of free memory left over at the end. I am surprised that the others don't; at least html5_lxml should be similar, I'd think (though I guess if you take the unused memory into account, html5_lxml and lxml_bs are similar).

I don't actually know if BS is better than lxml in parsing... anything. I haven't looked hard (yet, at least). The example on the ElementSoup page parses *slightly* better with BS, but lxml parses it very similarly to how html5lib parses it, which I'd consider the better standard. html5lib has the advantage of being a kind of standard.

If I had a good collection of crappy HTML, that would probably be an interesting test to see how differently html5lib, BS, and lxml parse it. I'm not sure where to find a good collection like that. Maybe html5lib's tests, I guess.
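As a reference point, the probing idea could look something like this minimal sketch, assuming Linux procfs; `get_vsz` and `probe_free_pool` are made-up names, and the chunked allocation only roughly mirrors what the actual test script did:

```python
import os

def get_vsz():
    """Current VSZ of this process in kB, read from procfs (Linux-specific)."""
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmSize:'):
                return int(line.split()[1])

def probe_free_pool(chunk_kb=64):
    """Allocate fresh data until the process size grows.

    If VSZ stays flat while we hoard new strings, those allocations are
    being served from memory the process already owns but isn't using,
    so the hoard size approximates that free pool.
    """
    baseline = get_vsz()
    hoard = []
    while get_vsz() <= baseline:
        hoard.append(os.urandom(chunk_kb * 1024))  # fresh, non-shareable data
    return len(hoard) * chunk_kb  # rough size of the free pool, in kB

print('unused: %d kB (estimate)' % probe_free_pool())
```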
Another clear indication that we're measuring transient stuff is that when using the BeautifulSoup or html5 parser with an lxml document, the memory increases substantially. So any ideas on how to test memory would be much appreciated.
Somewhat hard to do across libraries. For example, the way the ElementSoup parser (i.e. BS on lxml) works is: parse the document with BS, then recursively translate the tree into an lxml tree. So you temporarily use about twice the memory. You'd have to intercept the tree builder process at the end (before releasing the BS tree) and measure there in order to get the maximum amount of memory used. I'd run it a couple of times and just watch top while it's running. That way, you can figure out something close to the maximum yourself.
I'm pretty sure what you end up with afterwards is the maximum use, as Python doesn't release memory back to the operating system once it has allocated it. (Or at least Python 2.4 doesn't.) So instead you have a pool of memory that Python isn't using, but the OS doesn't know that. I guess the assumption is that if Python never needs to use it again, at least the OS can move it to virtual memory.
On the other hand, I don't know if temporary memory is of that much value for a comparison. If it takes more space while parsing - so what? You'll likely keep the document tree in memory much longer than the parsing takes, so that's the dominating factor.
Right, I'm more interested in the memory the finished document takes. Intermediate memory use shows up in the performance numbers anyway. Though could all that memory use also lead to fragmentation, slowing down later allocations? This is beyond my understanding of Python performance. Ian
Hi, Ian Bicking wrote:
I get 25MB, which seems quite reasonable. Here are the revised numbers:
VSZ / RSS (kB)
lxml          :  25908 /  26232
bs            :  82508 /  82168
html5_cet     :  54616 /  54760
html5_et      :  64688 /  64964
html5_lxml    :  49056 /  49124
html5_minidom : 194352 / 192936
html5_simple  :  99772 /  98016
lxml_bs       : 104916 / 104856
htmlparser    :   4440 /   4448
Still pretty good for lxml. That actually surprises me: cET is more memory-friendly by itself (due to its simpler tree model), so it must be html5lib that takes its bite here.
I also tried allocating random strings until the size increased, to see if there was lots of allocated but free memory (the unused amount is an estimate, as I'm unsure what the exact internal representation of a list of strings is). The results were peculiar:
VSZ / RSS (used), in kB
lxml          :  26952 /  26211 (unused: 5)
bs            :  83408 /  82156 (unused: 0)
html5_cet     :  55640 /  54745 (unused: 19)
html5_et      :  65712 /  64946 (unused: 14)
html5_lxml    :  50072 /  48986 (unused: 134)
html5_minidom : 195372 / 192914 (unused: 14)
html5_simple  :  99772 /  97999 (unused: 17)
lxml_bs       : 104644 /  73037 (unused: 31783)
htmlparser    :   4448 /   4433 (unused: 19)
I guess I'm not surprised that lxml_bs (lxml.html.ElementSoup) has lots of free memory left over at the end. I am surprised that the others don't; at least html5_lxml should be similar, I'd think (though I guess if you take the unused memory into account, html5_lxml and lxml_bs are similar).
That's a somewhat unfair comparison though. lxml (read: libxml2) doesn't use Python's memory management, so memory that is freed by the parser is really freed to the OS, not just left as a growing interpreter heap. I think that's the main reason why html5_lxml ends up below html5_cet in your test. (Please correct me :)
I don't actually know if BS is better than lxml in parsing... anything.
When I tried it on the generated libxml2 HTML documentation (2.5 MB), BS crashed with an encoding error, while lxml worked just fine. But you might argue that libxml2 should be able to parse its own documentation. ;)
I haven't looked hard (yet, at least). The example on the ElementSoup page parses *slightly* better with BS, but lxml parses it very similarly to how html5lib parses it, which I'd consider the better standard. html5lib has the advantage of being a kind of standard.
If I had a good collection of crappy HTML, that would probably be an interesting test to see how differently html5lib, BS, and lxml parse it. I'm not sure where to find a good collection like that. Maybe html5lib's tests, I guess.
There seem to be a fair number of HTML browser compliance test suites on the web, but at first glance I didn't find any test suites for broken HTML.
ElementSoup parser (i.e. BS on lxml) works is: parse the document with BS, then recursively translate the tree into an lxml tree. So you temporarily use about twice the memory. You'd have to intercept the tree builder process at the end (before releasing the BS tree) and measure there in order to get the maximum amount of memory used. I'd run it a couple of times and just watch top while it's running. That way, you can figure out something close to the maximum yourself.
I'm pretty sure what you end up with afterwards is the maximum use, as Python doesn't release memory back to the operating system once it has allocated it. (Or at least Python 2.4 doesn't.) So instead you have a pool of memory that Python isn't using, but the OS doesn't know that. I guess the assumption is that if Python never needs to use it again, at least the OS can move it to virtual memory.
Again, an unfair advantage for lxml. What about running a shell script in parallel to the parser tests that dumps the program's current RAM usage to a file as fast as it can? Then run the file through "sort -n -r | head -1" to get the peak and use that? (A Python sketch of the same sampling idea follows below.)
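A minimal in-process version of that sampling idea, assuming Linux procfs; `PeakSampler` and `parse_all_documents` are made-up names, and note that on platforms with `resource.getrusage`, the `ru_maxrss` field reports the peak RSS directly without any sampling:

```python
import threading
import time

def rss_kb():
    """Current RSS of this process in kB, read from procfs (Linux-specific)."""
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

class PeakSampler(threading.Thread):
    """Poll this process's RSS in the background and remember the peak."""

    def __init__(self, interval=0.01):
        super().__init__(daemon=True)
        self.interval = interval
        self.peak = 0
        self._running = True

    def run(self):
        while self._running:
            self.peak = max(self.peak, rss_kb())
            time.sleep(self.interval)

    def stop(self):
        self._running = False

# Hypothetical usage around a single parser run:
# sampler = PeakSampler()
# sampler.start()
# parse_all_documents()        # stand-in for the benchmark's parse loop
# sampler.stop(); sampler.join()
# print('peak RSS during parse: %d kB' % sampler.peak)
```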
On the other hand, I don't know if temporary memory is of that much value for a comparison. If it takes more space while parsing - so what? You'll likely keep the document tree in memory much longer than the parsing takes, so that's the dominating factor.
Right, I'm more interested in the memory the finished document takes. Intermediate memory use shows up in the performance numbers anyway. Though could all that memory use also lead to fragmentation, slowing down later allocations? This is beyond my understanding of Python performance.
My guess is that there is enough memory overhead involved in a dynamic language like Python to keep the impact of memory fragmentation on the parser performance rather low in comparison. But that's just a guess. Stefan
On Mon, 10 Mar 2008 21:15:58 +0100 Stefan Behnel <stefan_ml@behnel.de> wrote:
I also tried allocating random strings until the size increased, to see if there was lots of allocated but free memory (the unused amount is an estimate, as I'm unsure what the exact internal representation of a list of strings is). The results were peculiar:
VSZ / RSS (used), in kB
lxml          :  26952 /  26211 (unused: 5)
bs            :  83408 /  82156 (unused: 0)
html5_cet     :  55640 /  54745 (unused: 19)
html5_et      :  65712 /  64946 (unused: 14)
html5_lxml    :  50072 /  48986 (unused: 134)
html5_minidom : 195372 / 192914 (unused: 14)
html5_simple  :  99772 /  97999 (unused: 17)
lxml_bs       : 104644 /  73037 (unused: 31783)
htmlparser    :   4448 /   4433 (unused: 19)
I guess I'm not surprised that lxml_bs (lxml.html.ElementSoup) has lots of free memory left over at the end. I am surprised that the others don't; at least html5_lxml should be similar, I'd think (though I guess if you take the unused memory into account, html5_lxml and lxml_bs are similar).
That's a somewhat unfair comparison though. lxml (read: libxml2) doesn't use Python's memory management, so memory that is freed by the parser is really freed to the OS, not just left as a growing interpreter heap.
Not necessarily. libxml2 uses the C library's malloc/free. Historically, on Unix systems the C library's malloc/free don't return memory to the OS, but keep it in an internal heap. Systems that are not Unix tend to do otherwise, creating some confusion for people moving from those systems to Unix.
I haven't looked hard (yet, at least). The example on the ElementSoup page parses *slightly* better with BS, but lxml parses it very similarly to how html5lib parses it, which I'd consider the better standard. html5lib has the advantage of being a kind of standard.
If I had a good collection of crappy HTML, that would probably be an interesting test to see how differently html5lib, BS, and lxml parse it. I'm not sure where to find a good collection like that. Maybe html5lib's tests, I guess.
There seem to be a fair number of HTML browser compliance test suites on the web, but at first glance I didn't find any test suites for broken HTML.
I think Google has a nice collection of broken HTML. :-)

<mike
--
Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.
Hi, Mike Meyer wrote:
Not necessarily. libxml2 uses the C library's malloc/free. Historically, on Unix systems the C library's malloc/free don't return memory to the OS, but keep it in an internal heap. Systems that are not Unix tend to do otherwise, creating some confusion for people moving from those systems to Unix.
I tend to consider libc a part of the OS. But technically you are right and it even makes a difference here.
I haven't looked hard (yet, at least). The example on the ElementSoup page parses *slightly* better with BS, but lxml parses it very similarly to how html5lib parses it, which I'd consider the better standard. html5lib has the advantage of being a kind of standard.
If I had a good collection of crappy HTML, that would probably be an interesting test to see how differently html5lib, BS, and lxml parse it. I'm not sure where to find a good collection like that. Maybe html5lib's tests, I guess.

There seem to be a fair number of HTML browser compliance test suites on the web, but at first glance I didn't find any test suites for broken HTML.
I think Google has a nice collection of broken HTML. :-)
Hmmm, do you want us to ask them? Or maybe ask their cache instead? I just don't know how to write a Google search query for broken HTML pages... :) Anyway, I'm not sure they actually keep the broken HTML pages around. I would expect them to send them through a sanitizer before doing anything else with them (including local caching). Stefan