For the curious, I've attached some benchmarks. These are preliminary; I'm putting together the numbers for my HTML talk at PyCon.

One thing I'd like to test is the memory use for documents. To do this I'm parsing about 4.5Mb of documents, keeping them all in memory, and looking at the VSZ/RSS sizes reported by ps before and after. I don't think this is the right/best way to do it. For instance, transient memory use by some parsers makes Python grab a bunch of memory, but that memory might be freed after parsing and usable for other things. Also, I don't know if VSZ/RSS is valid at all; I get the impression it isn't that valid. And the increases I'm seeing for lxml don't seem sufficient: the process should grow by at least 4.5Mb, right? lxml can't be that much more efficient than the serialized form of these files. Another clear indication that we're measuring transient stuff is that when using the BeautifulSoup or html5 parser with an lxml document, the memory increases substantially.

So any ideas on how to test memory would be much appreciated. (Maybe I could look at ps, then start creating Python objects until the memory use increases, so that I know I've used up any extra allocated memory?)

I've also attached the script, though you'll need to grab your own HTML files. html_lxml is broken; I patched it locally to work (http://code.google.com/p/html5lib/issues/detail?id=65).
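One alternative to scraping ps output, as a rough sketch: the stdlib resource module reports the process's peak RSS directly, which also partly sidesteps the transient-allocation problem since ru_maxrss is a high-water mark rather than a snapshot. The parse step below is a stand-in, not the actual tester.py code, and note ru_maxrss's units differ by platform (kilobytes on Linux, bytes on Mac):

```python
import resource
import sys

def peak_rss_kb():
    """Peak resident set size of this process, normalized to Kb.

    ru_maxrss is reported in kilobytes on Linux but in bytes on
    Mac OS X, so normalize (assuming one of those two platforms).
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        rss //= 1024
    return rss

before = peak_rss_kb()

# Stand-in for the real work: parse the documents and keep the
# resulting trees alive so they count against resident memory.
docs = ["<html><body>%d</body></html>" % i for i in range(1000)]

after = peak_rss_kb()
print("peak RSS grew by %d Kb" % (after - before))
```

This still won't distinguish memory the allocator is holding onto from memory the parsed trees actually need, but it at least avoids forking ps and parsing its output.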
  Ian

Parsing 355 files, 4524Kb (ripped from python.org)

lxml          = lxml.html
bs            = BeautifulSoup
html5_cet     = html5 parser with cElementTree model
html5_et      = html5 parser with ElementTree model
html5_lxml    = html5 parser with lxml.html model
html5_minidom = html5 parser with minidom model
html5_simple  = html5 parser with internal simple_tree model
lxml_bs       = BeautifulSoup parser with lxml model
htmlparser    = HTMLParser, with no parser actions; the document string is its own model

python tester.py --no-gc

lxml          :  0.5156 sec ( 100% of lxml)
bs            : 10.3816 sec (2013% of lxml)
html5_cet     : 29.5829 sec (5737% of lxml)
html5_et      : 30.2433 sec (5865% of lxml)
html5_lxml    : 31.7533 sec (6158% of lxml)
html5_minidom : 34.2963 sec (6651% of lxml)
html5_simple  : 28.7421 sec (5574% of lxml)
lxml_bs       : 12.2269 sec (2371% of lxml)
htmlparser    :  3.0968 sec ( 600% of lxml)

python tester.py --no-gc --serialize

lxml          : 0.2704 sec ( 100% of lxml)
bs            : 1.8265 sec ( 675% of lxml)
html5_cet     : 1.5960 sec ( 590% of lxml)
html5_et      : 1.7677 sec ( 653% of lxml)
html5_lxml    : 0.2755 sec ( 101% of lxml)
html5_minidom : 3.4696 sec (1283% of lxml)
html5_simple  : 1.4929 sec ( 552% of lxml)
lxml_bs       : 0.2834 sec ( 104% of lxml)

VSZ/RSS increase:

lxml          :   1168 /    120
bs            :  82508 /  82176
html5_cet    :  54620 /  54756
html5_et     :  64688 /  64960
html5_lxml   :  49076 /  49124
html5_minidom: 194304 / 192928
html5_simple :  98608 /  98004
lxml_bs      : 104920 / 104852
htmlparser   :   5412 /   4456

Note: htmlparser keeps all the strings of the documents in memory.