
I have run into an odd problem with the current version of lxml running on Python 3.4 on a six-year-old Mac Pro laptop with 8 GB of memory. I want to loop through ~500 TEI-encoded plays, where each word token has an xml:id, like this:

    <w lemma="act" n="1-b-0140" ana="#vvn" reg="acted" xml:id="A07064-000200">acted</w>

where the ID is composed of a text ID (A07064) and a word counter.

The basic program goes like this:

    plays = os.walk(sourcePlayDirectory)
    for directory in plays:
        for item in directory[2]:
            filename = directory[0] + '/' + item
            tree = etree.parse(filename, parser)
            for element in tree.iter(tei + 'w'):
                {some conditions}

This works, but after not quite a minute (about 50 plays, and probably about a million <w> elements examined) the program stops and produces an error message like this:

        tree = etree.parse(filename, parser)
      File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69970)
      File "parser.pxi", line 1749, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102081)
      File "parser.pxi", line 1775, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:102345)
      File "parser.pxi", line 1679, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:101380)
      File "parser.pxi", line 1110, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:96832)
      File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290)
      File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476)
      File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91772)
    lxml.etree.XMLSyntaxError: ID A07897-308750 already defined, line 59347, column 40

Now there is nothing wrong with ID A07897-308750. It hasn't been used previously, and if you start the program again with, say, the play before the one that raised the exception, it will sail right through text A07897 and continue for not quite a minute, but then produce the same error message with a different ID.
So it's not the ID. I don't know what is happening, but I suspect that something inside lxml or Python hits a limit, and when that limit is hit, you get the syntax error message with the ID at the point at which the program "had enough" of whatever it had enough of.

The Activity Monitor on my Mac shows nothing extraordinary: Python memory use at the point of failure is about 150 MB. I have also run similar programs on an older Mac with an earlier version of lxml without any trouble.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
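
For anyone trying to reproduce this, the loop described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the original program: the names iter_xml_files, count_words, TEI_NS and W_TAG are hypothetical, the namespace is assumed to be the standard TEI one, and the per-word conditions are replaced by a simple count. It records parse failures per file instead of stopping at the first one, and it exposes lxml's collect_ids parser option so that a run with ID collection switched off can be compared against a default run. Whether that makes the error go away is something to test, not a confirmed fix.

```python
import os

from lxml import etree

# Assumption: the plays use the standard TEI namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"
W_TAG = "{%s}w" % TEI_NS


def iter_xml_files(root_dir):
    """Yield the path of every .xml file below root_dir, in a stable order."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if name.endswith(".xml"):
                yield os.path.join(dirpath, name)


def count_words(root_dir, collect_ids=True):
    """Parse each file and count its <w> elements.

    Returns (counts, failures): counts maps path -> number of <w> elements,
    failures maps path -> parser error message.
    """
    counts = {}
    failures = {}
    for path in iter_xml_files(root_dir):
        # A new parser object per file, so the parser itself is not reused.
        # collect_ids=False asks lxml not to build the xml:id hash table.
        parser = etree.XMLParser(collect_ids=collect_ids)
        try:
            tree = etree.parse(path, parser)
        except etree.XMLSyntaxError as err:
            # Record the failure and carry on with the next play.
            failures[path] = str(err)
            continue
        counts[path] = sum(1 for _ in tree.iter(W_TAG))
    return counts, failures
```

Calling count_words(sourcePlayDirectory) and then count_words(sourcePlayDirectory, collect_ids=False) and comparing the two failures dictionaries would show whether the "already defined" errors depend on ID collection at all.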