
Using the option "huge_tree=True" on the parser works, but it has performance issues that grow progressively worse. The default limit for the parser seems to be about 1.2 million IDs. Up to that point it processes a play with ~20,000 words and associated IDs at a rate of about one play per second. By the time the program has worked through its 9.7 million words, the ID checking takes 8 seconds per play: performance degrades by almost an order of magnitude. If progress were linear, the program should take about 500 seconds. In fact, it took 2400 seconds.

I'm not sure how many folks out there are likely to encounter this problem, but it's a serious limit in lxml for the work I'm doing with a corpus of eventually 70,000 texts (~3 billion words) that are tokenized and have unique IDs so that you can rummage around in them and figure out what is where. So it would be great if there were a workaround. A dumb workaround would consist of a script that periodically shuts down and restarts Python and writes out its findings to text files in 'append' mode. That's what I've been doing manually, and it scales up to a point.

With many thanks for your diagnosis of the problem.

MM

Martin Mueller
Professor emeritus of English and Classics
Northwestern University

On 5/25/14, 7:35, "Stefan Behnel" <stefan_ml@behnel.de> wrote:
Martin Mueller, 25.05.2014 03:34:
I have run into an odd problem with the current version of lxml running on Python 3.4 on a six-year-old Mac Pro laptop with 8 GB of memory.
I want to loop through ~500 TEI-encoded plays, where each word token has an xml:id, like this:
<w lemma="act" n="1-b-0140" ana="#vvn" reg="acted" xml:id="A07064-000200">acted</w>
where the ID is composed from a text ID (A07064) and a word counter.
The basic program goes like this:
plays = os.walk(sourcePlayDirectory)
for directory in plays:
    for item in directory[2]:
        filename = directory[0] + '/' + item
        tree = etree.parse(filename, parser)
        for element in tree.iter(tei + 'w'):
            {some conditions}
This works, but after not quite a minute (about 50 plays, and probably after looking at about a million <w> elements), the program stops and produces an error message like this:
    tree = etree.parse(filename, parser)
  File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69970)
  File "parser.pxi", line 1749, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102081)
  File "parser.pxi", line 1775, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:102345)
  File "parser.pxi", line 1679, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:101380)
  File "parser.pxi", line 1110, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:96832)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476)
  File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91772)
lxml.etree.XMLSyntaxError: ID A07897-308750 already defined, line 59347, column 40
Now there is nothing wrong with ID A07897-308750. It hasn't been used previously, and if you start the program again with, say, the play before the one that raised the exception, it will sail right through text A07897 and continue for not quite a minute before producing the same error message with a different ID.
So it's not the ID.
Agreed. Since you're parsing each file separately, it should be enough if IDs are unique inside of each file.
I don't know what is happening, but I suspect that something inside lxml or Python hits a limit, and when that limit is hit, you get the syntax error message with the ID at the point at which the program "had enough" of whatever it had enough of.
The Activity Monitor on my Mac shows nothing extraordinary: Python memory use at the point of failure is about 150MB, but I have run similar programs on an older Mac with an earlier version of lxml without any trouble.
150MB is way too small to indicate any kind of memory problem, so I looked through the ID handling in libxml2 and found that there really are a couple of limits.
lxml uses a global libxml2 dict for storing names, i.e. tag names, attribute names and (tada!) also ID names. This avoids lots of memory allocations and copying, so it's totally worth it. This dict, being global, is never reset or replaced, which is normally a feature rather than a problem, because the number of distinct names tends to be very low in almost all applications, so the dict will quickly contain all names used by the application's data and happily reuse them from there.

However, if you parse a lot of documents that contain globally unique IDs, those IDs uselessly add up in the dict and are never collected. And recent versions of libxml2 set a conservative limit on the size of the dict to prevent attacks through malicious input. From what you describe (and from what I just tested on my side), it's likely that you are running into this limit. Surprisingly enough, I couldn't reproduce it when parsing a single large document; only starting a couple of new documents made it fail for me.
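For anyone who wants to see the effect in isolation, here is a rough sketch of that kind of reproduction (the document shape and ID scheme are made up, and the exact point of failure will depend on the libxml2 version in use):

    from lxml import etree

    # Parse a series of small documents whose xml:id values are globally
    # unique, so the shared name dict keeps accumulating entries.
    for doc_num in range(200):
        words = ''.join(
            '<w xml:id="A%05d-%06d">x</w>' % (doc_num, i)
            for i in range(20000)
        )
        etree.fromstring('<doc>%s</doc>' % words)
        # Once enough unique IDs have piled up, parsing fails with
        # something like: XMLSyntaxError: ID ... already defined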
The upside is that you can disable the dict size limitation by configuring the parser with the option "huge_tree=True". A quick test suggests that this helps.
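In code, that looks roughly like this, reusing the directory walk from your original message; the only change is the parser option:

    import os
    from lxml import etree

    # huge_tree=True disables libxml2's hardcoded parser limits,
    # including the dict size limit discussed above.
    parser = etree.XMLParser(huge_tree=True)

    for directory in os.walk(sourcePlayDirectory):
        for item in directory[2]:
            filename = directory[0] + '/' + item
            tree = etree.parse(filename, parser)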
I've started looking into ways to work around this behaviour in libxml2. The ID hash table (which maps IDs to nodes) is created on the fly using the document's dict, so always creating it ahead of time (even if it's not used) and giving it its own dict might work. Adds a bit to the document creation time, though, which can hurt in a couple of places... Definitely something that needs a bit of experimentation.
Anyway, the "huge_tree" option should get you unstuck for now.
Stefan
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml