
Martin Mueller, 25.05.2014 22:49:
Using the option "huge_tree=True" on the parser works, but has performance issues that grow progressively worse.
The parser's default limit seems to be about 1.2 million IDs. Up to that point it processes a play (~20,000 words and their associated IDs) at a rate of about one play per second. By the time the program has worked through all 9.7 million words, checking the IDs takes 8 seconds per play: performance degrades by almost an order of magnitude. If the time grew linearly, the program would take about 500 seconds. In fact, it took 2400 seconds.
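For reference, the back-of-the-envelope numbers above can be reproduced in a few lines. This is only a sketch: the constants are the approximate figures quoted in the message, and the variable names are made up for illustration.

```python
# Approximate figures quoted in the message above (not measured here).
TOTAL_WORDS = 9_700_000          # size of the whole corpus
WORDS_PER_PLAY = 20_000          # rough size of one play
INITIAL_SECONDS_PER_PLAY = 1.0   # rate while below ~1.2 million IDs
OBSERVED_TOTAL_SECONDS = 2400    # what the run actually took

# Number of play-sized chunks in the corpus.
plays = TOTAL_WORDS / WORDS_PER_PLAY

# Expected run time if every play cost the same as the first ones.
linear_estimate = plays * INITIAL_SECONDS_PER_PLAY

# How much slower the real run was than the linear estimate.
slowdown = OBSERVED_TOTAL_SECONDS / linear_estimate

print(f"plays: {plays:.0f}")
print(f"linear estimate: {linear_estimate:.0f} s")
print(f"observed / linear: {slowdown:.2f}x")
```

The linear estimate comes out slightly under the "about 500 seconds" quoted above, and the whole run is roughly five times slower than that estimate, even though the last plays are about eight times slower than the first.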
Yes, I noticed that, too. I'm currently coding up a new parser option, working title "collect_ids=False", that would allow you to disable building the ID hash table. When I do that, processing time drops from ~52 seconds per million IDs to about 2 seconds on my side for an extreme test case.
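A minimal sketch of how the two settings might be compared from user code. It assumes the option ends up being accepted by `etree.XMLParser` under the working title `collect_ids`, which this thread does not guarantee; the `count_words` helper and the sample document are hypothetical. The code falls back to `None` if lxml is not installed or the installed version rejects the keyword.

```python
try:
    from lxml import etree  # third-party package, not in the standard library
except ImportError:
    etree = None

# Tiny stand-in for a play: every word element carries an xml:id.
SAMPLE = b'<play><w xml:id="w1">To</w><w xml:id="w2">be</w></play>'


def count_words(xml_bytes, **parser_options):
    """Parse xml_bytes with the given XMLParser options and count <w> elements.

    Returns None when lxml is not available.
    """
    if etree is None:
        return None
    parser = etree.XMLParser(**parser_options)
    root = etree.fromstring(xml_bytes, parser)
    return sum(1 for _ in root.iter("w"))


# Current workaround from the thread: lift the parser limits.
with_ids = count_words(SAMPLE, huge_tree=True)

# Assumed future option (working title): skip building the ID hash table.
try:
    without_ids = count_words(SAMPLE, huge_tree=True, collect_ids=False)
except TypeError:
    # The installed lxml does not know the keyword.
    without_ids = None

print(with_ids, without_ids)
```

The parsed tree should contain the same elements either way; what would differ is only whether the ID lookup table is built, so anything that relies on looking elements up by ID would presumably need that table left enabled.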
Can't we just purge this table when parsing finishes? And could you name the relevant code locations/symbols alongside your findings, to save others the time of digging through the code?
However, it's not ready for a release yet. I get test failures in other areas with this change, so it needs a bit more work. I can publish a 3.4 alpha when it's ready, might take a couple of days, though.
Stefan
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml

--
Best regards,
Ivan mailto:vano@mail.mipt.ru