
Ivan Pozdeev, 26.05.2014 19:20:
Martin Mueller, 25.05.2014 22:49:
Using the option "huge_tree=True" on the parser works, but has performance issues that grow progressively worse.
The parser's default limit seems to be about 1.2 million IDs. Up to that point it processes a play with ~20,000 words and associated IDs at the rate of one play per second. By the time the program has worked through its 9.7 million words, the checking of IDs takes 8 seconds per play: performance degrades by almost an order of magnitude. If progress were linear, the program should take about 500 seconds. In fact, it took 2400 seconds.
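For reference, the figures above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the variable names are made up for illustration, and the inputs are the rounded numbers quoted in the message, not new measurements.

```python
# Rounded figures quoted in the message above (not new measurements).
words_total = 9_700_000      # words in the whole corpus
words_per_play = 20_000      # approximate words (and IDs) per play
seconds_per_play = 1.0       # initial processing rate: one play per second
observed_seconds = 2400      # total run time actually observed

# Number of plays implied by those figures.
plays = words_total / words_per_play

# Run time if every play took as long as the first ones.
linear_estimate = plays * seconds_per_play

# Overall slowdown relative to the linear estimate.
slowdown = observed_seconds / linear_estimate

print(f"plays: {plays:.0f}")
print(f"linear estimate: {linear_estimate:.0f} s")
print(f"observed/linear: {slowdown:.2f}x")
```

So the "about 500 seconds" estimate corresponds to roughly 485 plays at one second each, and the whole run came out close to five times slower than that, even though the per-play cost at the end was eight times the initial one.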
Yes, I noticed that, too. I'm currently coding up a new parser option, working title "collect_ids=False", that would allow you to disable the ID hash table building. When I do that, performance jumps from ~52 seconds per million IDs to 2 seconds on my side for an extreme test case.
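A minimal sketch of how such an option might be used from Python, assuming it is available under its working title "collect_ids" in the installed lxml (a version without it would raise a TypeError when the parser is created). The helper name count_words and the sample document are hypothetical and only serve as illustration.

```python
from lxml import etree

# Tiny stand-in for a play in which every word carries an xml:id.
SAMPLE = b"""<play>
  <w xml:id="w1">To</w>
  <w xml:id="w2">be</w>
  <w xml:id="w3">or</w>
</play>"""


def count_words(xml_bytes, collect_ids=False):
    """Parse xml_bytes and return the number of <w> elements.

    Assumption: XMLParser accepts a collect_ids keyword (the working
    title from this thread). With collect_ids=False the parser is asked
    not to build the ID hash table while parsing.
    """
    parser = etree.XMLParser(huge_tree=True, collect_ids=collect_ids)
    root = etree.fromstring(xml_bytes, parser)
    return sum(1 for _ in root.iter("w"))


print(count_words(SAMPLE))
```

The parsed tree itself should be the same either way. The trade-off is that anything relying on the ID table is presumably unavailable when it is switched off, so this only makes sense for code that does not look up elements by ID.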
Can't we just purge this table on finish?
No. The hash table is local to a parser run, so it gets discarded anyway. The problem is that the ID strings are interned.
And could you possibly name the relevant locations/symbols alongside your findings, to save others the time of groveling through the code?
Sure. Take a look at xmlAddID() in valid.c and its usages in SAX2.c in libxml2's sources. The idea is to set the XML_SKIP_IDS flag.

Stefan