Martin Mueller, 25.05.2014 22:49:
Using the option "huge_tree=True" on the parser works, but has performance issues that grow progressively worse.
The parser's default limit seems to be about 1.2 million IDs. Up to that point it processes a play with ~20,000 words and associated IDs at a rate of one play per second. By the time the program has worked through its 9.7 million words, the checking of IDs takes 8 seconds per play: performance degrades by almost an order of magnitude. If processing time scaled linearly, the program should take about 500 seconds. In fact, it took 2400 seconds.
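For reference, the linear estimate can be reproduced from the figures quoted here. This is only a back-of-the-envelope sketch; the variable names are made up for illustration, and the numbers are the rounded ones from the message.

```python
# Figures taken from the message (all approximate).
total_words = 9_700_000       # words in the whole corpus
words_per_play = 20_000       # typical play size
seconds_per_play_fast = 1.0   # rate observed before the slowdown
observed_seconds = 2400       # actual total run time

# Number of plays and the run time expected if the rate stayed constant.
plays = total_words / words_per_play
linear_estimate = plays * seconds_per_play_fast

# Overall slowdown relative to the linear estimate.
slowdown = observed_seconds / linear_estimate

print(f"plays: {plays:.0f}")
print(f"linear estimate: {linear_estimate:.0f} s")
print(f"overall slowdown: {slowdown:.1f}x")
```

So the run as a whole is roughly five times slower than the linear estimate, while the last plays are about eight times slower than the first.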
Yes, I noticed that, too. I'm currently coding up a new parser option, working title "collect_ids=False", that would allow you to disable the ID hash table building. When I do that, performance jumps from ~52 seconds per million IDs to 2 seconds on my side for an extreme test case.

However, it's not ready for a release yet. I get test failures in other areas with this change, so it needs a bit more work. I can publish a 3.4 alpha when it's ready, might take a couple of days, though.

Stefan
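A minimal sketch of how the proposed option might be used, assuming it ships under its working title as a `collect_ids` keyword argument of `etree.XMLParser` (the name is not final according to the message above). The helper `make_parser` and the sample document are hypothetical; the fallback covers lxml versions that reject the keyword.

```python
from io import BytesIO

from lxml import etree


def make_parser(collect_ids=False):
    """Build a parser for very large documents.

    Assumption: the new option is exposed as the keyword argument
    `collect_ids`. If the installed lxml does not accept it, fall back
    to a parser that only sets huge_tree.
    """
    try:
        return etree.XMLParser(huge_tree=True, collect_ids=collect_ids)
    except TypeError:
        # Unknown keyword argument: this lxml has no such option.
        return etree.XMLParser(huge_tree=True)


# Tiny stand-in for a play with one ID per word (made-up data).
xml = b'<play><w xml:id="w1">To</w><w xml:id="w2">be</w></play>'

parser = make_parser(collect_ids=False)
root = etree.parse(BytesIO(xml), parser).getroot()
word_count = len(root.findall("w"))
print(root.tag, word_count)
```

The elements and their `xml:id` attributes are still parsed as usual; only the building of the ID hash table is meant to be skipped.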