Lxml aborts with an odd error message

I have run into an odd problem with the current version of lxml running on Python 3.4 on a six-year-old Mac Pro laptop with 8GB of memory. I want to loop through ~500 TEI-encoded plays, where each word token has an xml:id, like this:

    <w lemma="act" n="1-b-0140" ana="#vvn" reg="acted" xml:id="A07064-000200">acted</w>

where the ID is composed from a text id (A07064) and a word counter. The basic program goes like this:

    plays = os.walk(sourcePlayDirectory)
    for directory in plays:
        for item in directory[2]:
            filename = directory[0] + '/' + item
            tree = etree.parse(filename, parser)
            for element in tree.iter(tei + 'w'):
                {some conditions}

This works, but after not quite a minute--about 50 plays and probably about a million <w> elements--the program stops and produces an error message like this:

    tree = etree.parse(filename, parser)
      File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69970)
      File "parser.pxi", line 1749, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102081)
      File "parser.pxi", line 1775, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:102345)
      File "parser.pxi", line 1679, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:101380)
      File "parser.pxi", line 1110, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:96832)
      File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290)
      File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476)
      File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91772)
    lxml.etree.XMLSyntaxError: ID A07897-308750 already defined, line 59347, column 40

Now there is nothing wrong with ID A07897-308750. It hasn't been used previously, and if you start the program again with, say, the play previous to the one that raised the exception, it will sail right through text A07897 and continue for not quite a minute before producing the same error message with a different ID. So it's not the ID. I don't know what is happening, but I suspect that something inside lxml or Python hits a limit, and when that limit is hit, you get the syntax error message with the ID at the point at which the program "had enough" of whatever it had enough of. The Activity Monitor on my Mac shows nothing extraordinary: Python memory use at the point of failure is about 150MB, and I have run similar programs on an older Mac with an earlier version of lxml without any trouble.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
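For illustration, a minimal runnable version of the loop sketched above; the parser setup, the TEI namespace constant and the directory path are assumptions, not part of the original message:

    import os
    from lxml import etree

    sourcePlayDirectory = 'plays'              # placeholder path
    parser = etree.XMLParser()                 # default parser settings
    tei = '{http://www.tei-c.org/ns/1.0}'      # assumed TEI P5 namespace

    for dirpath, dirnames, filenames in os.walk(sourcePlayDirectory):
        for item in filenames:
            filename = os.path.join(dirpath, item)
            tree = etree.parse(filename, parser)
            for element in tree.iter(tei + 'w'):
                # the xml:id attribute, e.g. 'A07064-000200'
                word_id = element.get('{http://www.w3.org/XML/1998/namespace}id')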

1) You run everything through one parser, so it appears to accumulate the IDs it has seen so far. Are you sure your IDs are unique across all files?
2) Is there any chance it encounters the same file twice (hardlinks/symlinks/restarts from a dirty state)?
> I have run into an odd problem with the current version of lxml running on Python3.4 on a six-year old Mac Pro laptop with 8GB of memory.
> I want to loop through ~500 TEI encoded plays, where each word token has an xml:id, like this:
> <w lemma="act" n="1-b-0140" ana="#vvn" reg="acted" xml:id="A07064-000200">acted</w>
> where the ID is composed from a text id (A07064) and a wordcounter
> The basic program goes like
> plays = os.walk(sourcePlayDirectory)
--
Best regards,
Ivan <vano@mail.mipt.ru>

Yes, I am sure that the IDs are unique across the corpus. I don't think that the same file is encountered twice. If that were the case, I would expect the error to occur at the first ID of the file that is encountered twice. But it happens somewhere in the middle, so we would have to posit that the program happily encounters some number of IDs twice but then suddenly complains.

On 5/24/14, 19:59, "Ivan Pozdeev" <vano@mail.mipt.ru> wrote:

Martin Mueller
Professor emeritus of English and Classics
Northwestern University

Martin Mueller, 25.05.2014 03:34:
Agreed. Since you're parsing each file separately, it should be enough if IDs are unique inside of each file.
150MB is way too small to indicate any kind of memory problem, so I looked through the ID handling in libxml2 and found that there really are a couple of limits.

lxml uses a global libxml2 dict for storing names, i.e. tag names, attribute names and (tada!) also ID names. This avoids lots of memory allocations and copying, so it's totally worth it. This dict, being global, is never reset or replaced, which is normally a feature rather than a problem, because the number of distinct names tends to be very low in almost all applications, so the dict will quickly contain all names used by the application's data and happily reuse them from there.

However, if you parse a lot of documents that contain globally unique IDs, they will uselessly add up in the dict and never be collected. And recent versions of libxml2 set a conservative limit on the size of the dict to prevent malicious input attacks. From what you describe (and from what I just tested on my side), it's likely that you are running into this limit. Surprisingly enough, I couldn't reproduce it when parsing a single large document; only starting a couple of new documents made it fail for me.

The upside is that you can disable the dict size limitation by configuring the parser with the option "huge_tree=True". A quick test suggests that this helps.

I've started looking into ways to work around this behaviour in libxml2. The ID hash table (which maps IDs to nodes) is created on the fly using the document's dict, so always creating it ahead of time (even if it's not used) and giving it its own dict might work. That adds a bit to the document creation time, though, which can hurt in a couple of places... Definitely something that needs a bit of experimentation.

Anyway, the "huge_tree" option should get you unstuck for now.

Stefan
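For illustration, the workaround described above amounts to a one-line change to the parser setup; a sketch, assuming the loop from the original message stays otherwise unchanged:

    from lxml import etree

    # huge_tree=True lifts libxml2's hard-coded security limits, including
    # the dict size limit that triggers the spurious "ID ... already defined"
    # error; the parser can then be reused across all files as before.
    parser = etree.XMLParser(huge_tree=True)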

Using the option "huge_tree=True" on the parser works, but has performance issues that grow progressively worse. The parser's default limit seems to be about 1.2 million IDs. Up to that point the program processes a play with ~20,000 words and associated IDs at the rate of one per second. By the time the program has worked through its 9.7 million words, the checking of IDs takes 8 seconds per play: performance degrades by almost an order of magnitude. If progress were linear, the program should take about 500 seconds. In fact, it took 2400 seconds.

I'm not sure how many folks out there are likely to encounter this problem. But it's a serious limit in libxml2 for the work I'm doing with a corpus of eventually 70,000 texts (~3 billion words) that are tokenized and have unique IDs so that you can rummage around in them and figure out what is where. So it would be great if there were a workaround.

A dumb workaround would consist in a script that just shuts off and restarts Python periodically and writes out its finds to text files in 'append' mode. That's what I've been doing manually, and it scales up to a point.

With many thanks for your diagnosis of the problem.

MM

On 5/25/14, 7:35, "Stefan Behnel" <stefan_ml@behnel.de> wrote:

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
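A sketch of the "dumb workaround" described above: a small driver that hands the files to a worker script in batches, each batch in a fresh Python process, so the interpreter (and with it libxml2's global dict) is reset periodically. The worker script name, batch size and directory path are placeholders:

    import os
    import subprocess
    import sys

    sourcePlayDirectory = 'plays'   # placeholder path
    BATCH_SIZE = 50                 # roughly how many plays fit before the limit bites

    # collect the play files up front
    files = []
    for dirpath, dirnames, filenames in os.walk(sourcePlayDirectory):
        files.extend(os.path.join(dirpath, name) for name in filenames)

    # one fresh interpreter per batch; process_plays.py (hypothetical) would
    # parse the files it is given and append its findings to an output file
    for start in range(0, len(files), BATCH_SIZE):
        batch = files[start:start + BATCH_SIZE]
        subprocess.check_call([sys.executable, 'process_plays.py'] + batch)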

Martin Mueller, 25.05.2014 22:49:
Yes, I noticed that, too. I'm currently coding up a new parser option, working title "collect_ids=False", that would allow you to disable the ID hash table building. When I do that, performance jumps from ~52 seconds per million IDs to 2 seconds on my side for an extreme test case.

However, it's not ready for a release yet. I get test failures in other areas with this change, so it needs a bit more work. I can publish a 3.4 alpha when it's ready; might take a couple of days, though.

Stefan
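Assuming the option ships under the working title given above, usage would presumably be another one-line change to the parser setup:

    from lxml import etree

    # collect_ids=False (as described above) would disable building the
    # ID->node hash table, which is the source of the slowdown reported here.
    parser = etree.XMLParser(collect_ids=False)

The trade-off, presumably, is that lookups which depend on that table, such as XPath's id() function, would no longer find elements by their xml:id.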

Ivan Pozdeev, 26.05.2014 19:20:
No. The hash table is local to a parser run, so it gets discarded anyway. The problem is that the ID strings are interned.
> And could you possibly name some relevant locations/symbols besides your findings to save others' time groveling through code?
Sure. Take a look at xmlAddID() in valid.c and its usages in SAX2.c in libxml2's sources. The idea is to set the XML_SKIP_IDS flag.

Stefan

participants (3):
- Ivan Pozdeev
- Martin Mueller
- Stefan Behnel