Re: [lxml] Lxml aborts with an odd error message

Ps to my earlier email. Is there a way in which I can make lxml "forget" everything about a file as soon as it is done with it? I have a dim sense that iterparse has something to do with that, but I don't know how to tell the program, "when you're done with file A, wipe out everything in memory before moving on to file B." Some command of that kind should do the trick, because the program has no trouble processing individually any of the files it complains about when they are encountered in the aggregate.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University

Yes, I am sure that the IDs are unique across the corpus. I don't think that the same file is encountered twice. If that were the case, I would expect the error to occur at the first ID of the file that is encountered twice. But it happens somewhere in the middle, so we would have to posit that the program happily encounters some number of IDs twice but then suddenly complains.

On 5/24/14, 19:59, "Ivan Pozdeev" <vano@mail.mipt.ru> wrote:
1) You run everything through one parser, so it appears to accumulate IDs it has seen so far. Are you sure your IDs are unique across all files?
2) Are there any chances it encounters the same file twice (hardlinks/symlinks/restarts from a dirty state)?
I have run into an odd problem with the current version of lxml running on Python 3.4 on a six-year-old Mac Pro laptop with 8GB of memory.

I want to loop through ~500 TEI-encoded plays, where each word token has an xml:id, like this:

<w lemma="act" n="1-b-0140" ana="#vvn" reg="acted" xml:id="A07064-000200">acted</w>

where the ID is composed from a text ID (A07064) and a word counter.

The basic program goes like this:
plays = os.walk(sourcePlayDirectory)
for directory in plays:
    for item in directory[2]:
        filename = directory[0] + '/' + item
        tree = etree.parse(filename, parser)
        for element in tree.iter(tei + 'w'):
            # {some conditions}
This works, but after not quite a minute--about 50 plays and probably about a million <w> elements--the program stops and produces an error message like this:
    tree = etree.parse(filename, parser)
  File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69970)
  File "parser.pxi", line 1749, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102081)
  File "parser.pxi", line 1775, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:102345)
  File "parser.pxi", line 1679, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:101380)
  File "parser.pxi", line 1110, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:96832)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476)
  File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91772)
lxml.etree.XMLSyntaxError: ID A07897-308750 already defined, line 59347, column 40
Now there is nothing wrong with ID A07897-308750. It hasn't been used previously, and if you start the program again with, say, the play previous to the one that raised the exception, it will sail right through text A07897 and continue for not quite a minute but produce the same error message with a different ID.
So it's not the ID. I don't know what is happening, but I suspect that something inside lxml or Python hits a limit, and when that limit is hit, you get the syntax error message with the ID at the point at which the program "had enough" of whatever it was accumulating.
The Activity Monitor on my Mac shows nothing extraordinary: Python memory use at the point of failure is about 150MB, and I have run similar programs on an older Mac with an earlier version of lxml without any trouble.
Martin Mueller Professor emeritus of English and Classics Northwestern University
_________________________________________________________________ Mailing list for the lxml Python XML toolkit - http://lxml.de/ lxml@lxml.de https://mailman-mail5.webfaction.com/listinfo/lxml
-- Best regards, Ivan mailto:vano@mail.mipt.ru

Ps to my earlier email. Is there a way in which I can make lxml "forget" everything about a file as soon as it is done with it. I have a dim sense that iterparse has something to do with that, but I don't know how to tell the program, " when you're done with file A, wipe out everything in memory before moving on to file B." Some command of that kind should do the trick, because the program has no trouble processing individually any of the files it complains about when they are encountered in the aggregate.
Create another parser object?
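[For readers of the archive: Ivan's suggestion can be sketched as follows. This is a minimal illustration under assumptions, not Martin's actual script; the function name and the directory layout are made up for the example.]

```python
import os
from lxml import etree

def parse_all(source_dir):
    """Parse every file under source_dir, creating a fresh parser
    for each file so that no parser-level state can possibly carry
    over from one document to the next."""
    trees = {}
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for item in filenames:
            filename = os.path.join(dirpath, item)
            parser = etree.XMLParser()  # new parser object per file
            trees[item] = etree.parse(filename, parser)
    return trees
```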

I'm not a programmer, and "Create another parser object" is too terse for me to understand. I removed "parser" from the command so that the parsing line now reads

    tree = etree.parse(filename)

But it made no difference: the program aborted at exactly the same point. Then I started the program again, but this time started with the second file. The program aborted in the middle of one file later.

In layman's terms, I'm using some outer loop (os.walk) to loop through a set of files and bring up each file for separate processing. The separate processing happens when the next file is opened by

    with open(filename) as fh:

and the processing begins with

    tree = etree.parse(filename)

If I just look at the code, it suggests that tree = etree.parse(filename) is a fresh start and has nothing to do with what happened to the previous file or will happen to the next file. But this is clearly not the case: something accumulates inside the program, and whenever there is enough of that (whatever it is), the program grinds to a halt.

I could use a workaround and process each file separately--pretty tedious for 500 files. I could also break the 500 files into ten batches of fifty files, which would be safe and is tolerable. But there ought to be a way of telling the program: when you're done with file A, clear out everything and start afresh with file B. How do I say this in Python?

Martin Mueller
Professor emeritus of English and Classics
Northwestern University

On 5/24/14, 21:57, "Ivan Pozdeev" <vano@mail.mipt.ru> wrote:
Create another parser object?
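[For readers of the archive: the "clear out everything, then start afresh" idea Martin asks about is conventionally spelled with iterparse, which streams a file and lets you free each element as soon as it has been handled. The sketch below is an illustration under assumptions: the TEI namespace URI is the standard one, and the per-word work is a placeholder, not Martin's actual conditions.]

```python
from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"  # assumed TEI namespace

def count_words(filename):
    """Stream one file, freeing each <w> element after use so
    memory stays flat even for very large documents."""
    count = 0
    for _event, elem in etree.iterparse(filename, tag=TEI + "w"):
        count += 1       # placeholder for the real per-word work
        elem.clear()     # drop the element's text and children
        # also delete already-processed siblings still referenced
        # by the parent, so libxml2 can actually release them
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return count
```

With the tag filter, only the end events of <w> elements are delivered; clearing each element and deleting its earlier siblings is the standard lxml pattern for keeping the in-memory tree small.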

Ivan Pozdeev, 25.05.2014 05:57:
Create another parser object?
That shouldn't change anything. A parser doesn't keep state across runs. It's mainly just a wrapper around a specific configuration and a lock that prevents concurrent usage. Parser state is freshly created on each run.

Stefan
participants (3)
- Ivan Pozdeev
- Martin Mueller
- Stefan Behnel