Re: [lxml] Lxml aborts with an odd error message

May 26, 2014


      ...
Martin Mueller, 25.05.2014 22:49:
...
Using the option "huge_tree=True" on the parser works, but has performance
issues that grow progressively worse.
The parser's default limit seems to be about 1.2 million IDs. Up to
that point it processes a play with ~20,000 words and their associated IDs at
a rate of one play per second. By the time the program has worked through its 9.7
million words, checking the IDs takes 8 seconds per play: performance
degrades by almost an order of magnitude. If the running time scaled linearly, the
program would take about 500 seconds. In fact, it took 2400 seconds.
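
For readers checking the figures above, here is a small back-of-the-envelope sketch. It uses only the numbers quoted in the message; the variable names are made up for illustration, and the one-play-per-second baseline is the rate reported for the early part of the run.

```python
# Figures quoted in the message above.
TOTAL_WORDS = 9_700_000        # words in the whole corpus
WORDS_PER_PLAY = 20_000        # approximate words (and IDs) per play
BASELINE_SECONDS_PER_PLAY = 1  # rate observed before the slowdown
OBSERVED_SECONDS = 2400        # actual total running time

# Number of plays implied by the corpus size.
plays = TOTAL_WORDS / WORDS_PER_PLAY

# Running time if every play took as long as the first ones.
linear_seconds = plays * BASELINE_SECONDS_PER_PLAY

# Overall slowdown of the whole run relative to linear scaling.
overall_slowdown = OBSERVED_SECONDS / linear_seconds

print(f"plays: {plays:.0f}")
print(f"linear estimate: {linear_seconds:.0f} s")
print(f"overall slowdown: {overall_slowdown:.2f}x")
```

The linear estimate comes out slightly under the "about 500 seconds" mentioned, and the whole run is roughly five times slower than that, even though the per-play cost at the end is eight times the initial one.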
...
Yes, I noticed that, too. I'm currently coding up a new parser option,
working title "collect_ids=False", that would allow you to disable the ID
hash table building. When I do that, performance jumps from ~52 seconds per
million IDs to 2 seconds on my side for an extreme test case.
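
A minimal sketch of how such an option might be used, assuming it ships under the working title "collect_ids" as a keyword argument of etree.XMLParser. The helper name make_parser and the sample document are hypothetical, and the fallback covers lxml versions that do not know the option.

```python
from lxml import etree

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def make_parser():
    """Build a parser for very large documents without the ID hash table.

    Assumes the option is exposed as ``collect_ids``; older lxml
    versions reject the unknown keyword with a TypeError.
    """
    try:
        return etree.XMLParser(huge_tree=True, collect_ids=False)
    except TypeError:
        # Option not available: fall back to huge_tree only.
        return etree.XMLParser(huge_tree=True)


# Hypothetical sample: a tiny "play" whose words carry xml:id attributes.
sample = b'<play><w xml:id="w1">To</w><w xml:id="w2">be</w></play>'
root = etree.fromstring(sample, make_parser())

# The attributes are still present on the elements; only the
# document-wide ID lookup table is skipped.
ids = [w.get(XML_ID) for w in root]
print(ids)
```

Code that relies on ID lookups through the document's ID table would presumably need to keep the default behaviour.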
Can't we just purge this table on finish?

And could you possibly name some relevant locations/symbols alongside your findings,
to save others the time of groveling through the code?
...
However, it's not ready for a release yet. I get test failures in other
areas with this change, so it needs a bit more work. I can publish a 3.4
alpha when it's ready, might take a couple of days, though.
...
Stefan
...
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
-- 
Best regards,
 Ivan                            mailto:vano@mail.mipt.ru


Ivan Pozdeev