
Using the option "huge_tree=True" on the parser works, but it has performance issues that grow progressively worse. The default limit for the parser seems to be about 1.2 million IDs. Up to that point it processes a play with ~20,000 words and associated IDs at a rate of about one play per second. By the time the program has worked through its 9.7 million words, the ID checking takes 8 seconds per play: performance degrades by almost an order of magnitude. If progress were linear, the program should take about 500 seconds. In fact, it took 2400 seconds.

I'm not sure how many folks out there are likely to encounter this problem, but it's a serious limit in lxml for the work I'm doing with a corpus of eventually 70,000 texts (~3 billion words) that are tokenized and have unique IDs so that you can rummage around in them and figure out what is where. So it would be great if there were a workaround. A dumb workaround would consist of a script that periodically shuts down and restarts Python and writes out its findings to text files in 'append' mode. That's what I've been doing manually, and it scales up to a point.

With many thanks for your diagnosis of the problem.

MM

Martin Mueller
Professor emeritus of English and Classics
Northwestern University

On 5/25/14, 7:35, "Stefan Behnel" <stefan_ml@behnel.de> wrote:
Martin Mueller, 25.05.2014 03:34:
I have run into an odd problem with the current version of lxml running on Python 3.4 on a six-year-old Mac Pro laptop with 8 GB of memory.
I want to loop through ~500 TEI-encoded plays, where each word token has an xml:id, like this:
<w lemma="act" n="1-b-0140" ana="#vvn" reg="acted" xml:id="A07064-000200">acted</w>
where the ID is composed from a text ID (A07064) and a word counter.
The basic program goes like this:
plays = os.walk(sourcePlayDirectory)
for directory in plays:
    for item in directory[2]:
        filename = directory[0] + '/' + item
        tree = etree.parse(filename, parser)
        for element in tree.iter(tei + 'w'):
            {some conditions}
This works, but after not quite a minute (about 50 plays, and probably after looking at about a million <w> elements), the program stops and produces an error message like this:
    tree = etree.parse(filename, parser)
  File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69970)
  File "parser.pxi", line 1749, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102081)
  File "parser.pxi", line 1775, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:102345)
  File "parser.pxi", line 1679, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:101380)
  File "parser.pxi", line 1110, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:96832)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476)
  File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91772)
lxml.etree.XMLSyntaxError: ID A07897-308750 already defined, line 59347, column 40
Now there is nothing wrong with ID A07897-308750. It hasn't been used previously, and if you start the program again with, say, the play before the one that raised the exception, it will sail right through text A07897 and continue for not quite a minute before producing the same error message with a different ID.
So it's not the ID.
Agreed. Since you're parsing each file separately, it should be enough if IDs are unique inside of each file.
I don't know what is happening, but I suspect that something inside lxml or Python hits a limit, and when that limit is hit, you get the syntax error message with the ID at the point at which the program "had enough" of whatever it had enough of.
The Activity Monitor on my Mac shows nothing extraordinary: Python memory use at the point of failure is about 150MB, but I have run similar programs on an older Mac with an earlier version of lxml without any trouble.
150MB is way too small to indicate any kind of memory problem, so I looked through the ID handling in libxml2 and found that there really are a couple of limits.
lxml uses a global libxml2 dict for storing names, i.e. tag names, attribute names and (tada!) also ID names. This avoids lots of memory allocations and copying, so it's totally worth it. This dict, being global, is never reset or replaced, which is normally a feature rather than a problem, because the number of distinct names tends to be very low in almost all applications, so the dict will quickly contain all names used by the application's data and happily reuse them from there.

However, if you parse a lot of documents that contain globally unique IDs, those IDs uselessly add up in the dict and are never collected. And recent versions of libxml2 set a conservative limit on the size of the dict to prevent attacks through malicious input. From what you describe (and from what I just tested on my side), it's likely that you are running into this limit. Surprisingly enough, I couldn't reproduce it when parsing a single large document; only starting a couple of new documents made it fail for me.
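For anyone who wants to see the effect in isolation, here is a rough sketch of that kind of reproduction (the document shape and ID scheme are made up, and the exact point of failure will depend on the libxml2 version in use):

    from lxml import etree

    # Parse a series of small documents whose xml:id values are globally
    # unique, so the shared name dict keeps accumulating entries.
    for doc_num in range(200):
        words = ''.join(
            '<w xml:id="A%05d-%06d">x</w>' % (doc_num, i)
            for i in range(20000)
        )
        etree.fromstring('<doc>%s</doc>' % words)
        # Once enough unique IDs have piled up, parsing fails with
        # something like: XMLSyntaxError: ID ... already defined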
The upside is that you can disable the dict size limitation by configuring the parser with the option "huge_tree=True". A quick test suggests that this helps.
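In code, that looks roughly like this, reusing the directory walk from your original message; the only change is the parser option:

    import os
    from lxml import etree

    # huge_tree=True disables libxml2's hardcoded parser limits,
    # including the dict size limit discussed above.
    parser = etree.XMLParser(huge_tree=True)

    for directory in os.walk(sourcePlayDirectory):
        for item in directory[2]:
            filename = directory[0] + '/' + item
            tree = etree.parse(filename, parser)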
I've started looking into ways to work around this behaviour in libxml2. The ID hash table (which maps IDs to nodes) is created on the fly using the document's dict, so always creating it ahead of time (even if it's not used) and giving it its own dict might work. Adds a bit to the document creation time, though, which can hurt in a couple of places... Definitely something that needs a bit of experimentation.
Anyway, the "huge_tree" option should get you unstuck for now.
Stefan
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml