
Martin Mueller, 26.05.2014 16:18:
On 5/26/14, 8:02, "Stefan Behnel" wrote:
Martin Mueller, 25.05.2014 22:49:
Using the option "huge_tree=True" on the parser works, but has performance issues that grow progressively worse.
The parser's default limit seems to be about 1.2 million IDs. Up to that point it processes a play with ~20,000 words and their associated IDs at a rate of about one play per second. By the time the program has worked through its 9.7 million words, checking the IDs takes 8 seconds per play, so performance degrades by almost an order of magnitude. If the running time grew linearly, the program would take about 500 seconds. In fact, it took 2400 seconds.
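For reference, the arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch: the numbers are the ones quoted in the message, the variable names are made up for illustration, and it assumes every play has roughly 20,000 words.

```python
# Figures quoted in the message above; variable names are illustrative only.
total_words = 9_700_000         # words processed in the whole run
words_per_play = 20_000         # approximate size of one play (assumed uniform)
initial_seconds_per_play = 1.0  # rate observed before the slowdown sets in
observed_seconds = 2400         # actual total running time reported

# Number of plays in the corpus under the uniform-size assumption.
plays = total_words / words_per_play

# Expected running time if every play took as long as the first ones.
linear_estimate = plays * initial_seconds_per_play  # 485 s, i.e. "about 500"

# How much slower the real run was than the linear estimate (just under 5x).
overall_slowdown = observed_seconds / linear_estimate

print(plays, linear_estimate, round(overall_slowdown, 2))
```

So the per-play cost at the end is about 8x the initial cost, while the run as a whole comes out a little under 5x slower than the linear estimate.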
Yes, I noticed that, too. I'm currently coding up a new parser option, working title "collect_ids=False", that would allow you to disable the ID hash table building. When I do that, performance jumps from ~52 seconds per million IDs to 2 seconds on my side for an extreme test case.
However, it's not ready for a release yet. I get test failures in other areas with this change, so it needs a bit more work. I can publish a 3.4 alpha when it's ready; it might take a couple of days, though.
That will be a great feature! I look forward to it, and I'll be happy to help test it.
Here's an implementation:

https://github.com/lxml/lxml/commit/35316b052af48921657813bb68563fe4a301d1b8

I attached a little test program that I used for benchmarking. It stress tests the XML ID handling by parsing lots of elements with different IDs and discarding them right after parsing. The new implementation performs 5x better with the normal parser and about 50x better with the new collect_ids=False option.

Given how rare the usage of the XML ID hash table should be in real code, this makes me wonder if the option should not be switched off by default, however backwards incompatible that is.

Can you test it from the latest github version?

BTW, the lxml homepage has a Paypal link to allow for sponsorship of lxml's development, just in case this wasn't generally known. :)

Stefan
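
For anyone who wants to try the option without the attachment, here is a rough timing sketch. It is not the test program mentioned above. It assumes an lxml build that accepts the `collect_ids` keyword on `etree.XMLParser` (older builds would raise a `TypeError`), and the helper names `make_document` and `time_parse` are made up for illustration.

```python
import io
import time

from lxml import etree


def make_document(n_ids):
    """Build an XML document with n_ids elements, each with a unique xml:id."""
    parts = ["<root>"]
    parts.extend('<w xml:id="w%d">x</w>' % i for i in range(n_ids))
    parts.append("</root>")
    return "".join(parts).encode("utf-8")


def time_parse(data, **parser_options):
    """Parse data once with the given parser options; return (seconds, root)."""
    parser = etree.XMLParser(**parser_options)
    start = time.perf_counter()
    root = etree.parse(io.BytesIO(data), parser).getroot()
    return time.perf_counter() - start, root


if __name__ == "__main__":
    data = make_document(200_000)
    # Default parser: IDs are collected into the document's ID hash table.
    with_ids, _ = time_parse(data)
    # Assumed option name from this thread: skip building the ID hash table.
    without_ids, _ = time_parse(data, collect_ids=False)
    print("default parser:    %.3f s" % with_ids)
    print("collect_ids=False: %.3f s" % without_ids)
```

A single parse of one document will not reproduce the progressive slowdown described earlier in the thread, which builds up over many documents. To approximate that, call `time_parse` in a loop over many generated documents and record the time per iteration.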