Re: [lxml] Lxml aborts with an odd error message

May 26, 2014


Ivan Pozdeev, 26.05.2014 19:20:
...
...
Martin Mueller, 25.05.2014 22:49:
...
Using the option "huge_tree=True" on the parser works, but has performance
issues that grow progressively worse.
The default limit of the parser seems to be about 1.2 million IDs. Up to
that point it processes a play with ~20,000 words and associated IDs at
the rate of one play per second. By the time the program has worked through
its 9.7 million words, the checking of IDs takes 8 seconds per play:
performance degrades by almost an order of magnitude. If progress were
linear, the program would take about 500 seconds. In fact, it took 2400
seconds.
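
For reference, the linear estimate quoted above can be checked with a few
lines of Python. The figures are the ones given in the message (9.7 million
words, ~20,000 words per play, one second per play, 2400 seconds observed);
the variable names are only illustrative.

```python
# Figures taken from the message above; names are illustrative only.
total_words = 9_700_000
words_per_play = 20_000
seconds_per_play_at_start = 1
observed_seconds = 2400

# Number of plays in the corpus, assuming ~20,000 words each.
plays = total_words // words_per_play

# Expected run time if every play took as long as the first ones.
linear_seconds = plays * seconds_per_play_at_start

# Overall slowdown of the whole run relative to the linear estimate.
overall_slowdown = observed_seconds / linear_seconds

print(plays, linear_seconds, round(overall_slowdown, 1))
```

So the "about 500 seconds" is roughly 485 plays at one second each, and the
whole run ends up about five times slower than that estimate, while the
per-play cost at the end (8 seconds) is eight times the initial one.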
...
Yes, I noticed that, too. I'm currently coding up a new parser option,
working title "collect_ids=False", that would allow you to disable the ID
hash table building. When I do that, parsing time drops from ~52 seconds per
million IDs to 2 seconds on my side for an extreme test case.
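
A minimal sketch of how such an option would be used, assuming an lxml
release in which it is available under the working title above,
"collect_ids" (a build without it will reject the keyword with a TypeError).
The helper names, the element names and the sample size are made up for
illustration; only etree.XMLParser, huge_tree and collect_ids are lxml's own.

```python
from lxml import etree

# Clark-notation name of the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def build_sample(n_words):
    """Serialise a toy 'play' in which every word carries an xml:id."""
    root = etree.Element("play")
    for i in range(n_words):
        word = etree.SubElement(root, "w")
        word.set(XML_ID, "w%d" % i)
        word.text = "word"
    return etree.tostring(root)


def parse_without_id_table(data):
    """Parse with the ID hash table disabled (assumes collect_ids exists)."""
    parser = etree.XMLParser(huge_tree=True, collect_ids=False)
    return etree.fromstring(data, parser)


root = parse_without_id_table(build_sample(1000))
print(len(root), root[0].get(XML_ID))
```

The xml:id attributes stay readable on the elements; only the parser-side
ID hash table is skipped.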
Can't we just purge this table on finish?
No. The hash table is local to a parser run, so it gets discarded anyway.
The problem is that the ID strings are interned.
...
And could you possibly name some relevant locations/symbols alongside your
findings, to save others the time of groveling through the code?
Sure. Take a look at xmlAddID() in valid.c and its usages in SAX2.c in
libxml2's sources. The idea is to set the XML_SKIP_IDS flag.

Stefan

Stefan Behnel