Re: [lxml] Lxml aborts with an odd error message

Ps to my earlier email. Is there a way in which I can make lxml "forget" everything about a file as soon as it is done with it? I have a dim sense that iterparse has something to do with that, but I don't know how to tell the program, "when you're done with file A, wipe out everything in memory before moving on to file B." Some command of that kind should do the trick, because the program has no trouble processing individually any of the files it complains about when they are encountered in the aggregate.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University

Yes, I am sure that the IDs are unique across the corpus. I don't think that the same file is encountered twice. If that were the case, I would expect the error to occur at the first ID of the file that is encountered twice. But it happens somewhere in the middle, so we would have to posit that the program happily encounters some number of IDs twice but then suddenly complains.

On 5/24/14, 19:59, "Ivan Pozdeev" <vano@mail.mipt.ru> wrote:
1) You run everything through one parser, so it appears to accumulate IDs it has seen so far. Are you sure your IDs are unique across all files?
2) Are there any chances it encounters the same file twice (hardlinks/symlinks/restarts from a dirty state)?
I have run into an odd problem with the current version of lxml running on Python 3.4 on a six-year-old Mac Pro laptop with 8GB of memory.

I want to loop through ~500 TEI-encoded plays, where each word token has an xml:id, like this:

<w lemma="act" n="1-b-0140" ana="#vvn" reg="acted" xml:id="A07064-000200">acted</w>

where the ID is composed from a text ID (A07064) and a word counter.

The basic program goes like this:
plays = os.walk(sourcePlayDirectory)
for directory in plays:
    for item in directory[2]:
        filename = directory[0] + '/' + item
        tree = etree.parse(filename, parser)
        for element in tree.iter(tei + 'w'):
            # {some conditions}
This works, but after not quite a minute--about 50 plays and probably about a million <w> elements--the program stops and produces an error message like this:
    tree = etree.parse(filename, parser)
  File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69970)
  File "parser.pxi", line 1749, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102081)
  File "parser.pxi", line 1775, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:102345)
  File "parser.pxi", line 1679, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:101380)
  File "parser.pxi", line 1110, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:96832)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476)
  File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91772)
lxml.etree.XMLSyntaxError: ID A07897-308750 already defined, line 59347, column 40
Now there is nothing wrong with ID A07897-308750. It hasn't been used previously, and if you start the program again with, say, the play previous to the one that raised the exception, it will sail right through text A07897 and continue for not quite a minute but produce the same error message with a different ID.
So it's not the ID. I don't know what is happening, but I suspect that something inside lxml or Python hits a limit, and when that limit is hit, you get the syntax error message with the ID at the point at which the program "had enough" of whatever it was accumulating.
The Activity Monitor on my Mac shows nothing extraordinary: Python memory use at the point of failure is about 150MB, and I have run similar programs on an older Mac with an earlier version of lxml without any trouble.
Martin Mueller Professor emeritus of English and Classics Northwestern University
_________________________________________________________________ Mailing list for the lxml Python XML toolkit - http://lxml.de/ lxml@lxml.de https://mailman-mail5.webfaction.com/listinfo/lxml
-- Best regards, Ivan mailto:vano@mail.mipt.ru

Ps to my earlier email. Is there a way in which I can make lxml "forget" everything about a file as soon as it is done with it. I have a dim sense that iterparse has something to do with that, but I don't know how to tell the program, " when you're done with file A, wipe out everything in memory before moving on to file B." Some command of that kind should do the trick, because the program has no trouble processing individually any of the files it complains about when they are encountered in the aggregate.
Create another parser object?
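[For readers of the archive: Ivan's suggestion can be sketched as follows. This is a minimal illustration under assumptions, not Martin's actual script; the function name and the directory layout are made up for the example.]

```python
import os
from lxml import etree

def parse_all(source_dir):
    """Parse every file under source_dir, creating a fresh parser
    for each file so that no parser-level state can possibly carry
    over from one document to the next."""
    trees = {}
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for item in filenames:
            filename = os.path.join(dirpath, item)
            parser = etree.XMLParser()  # new parser object per file
            trees[item] = etree.parse(filename, parser)
    return trees
```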

I'm not a programmer, and "Create another parser object" is too terse for me to understand. I removed "parser" from the command so that the parsing line now reads

    tree = etree.parse(filename)

But it made no difference: the program aborted at exactly the same point. Then I started the program again, but this time started with the second file. The program aborted in the middle of one file later.

In layman's terms, I'm using some outer loop (os.walk) to loop through a set of files and bring up each file for separate processing. The separate processing happens when the next file is opened by

    with open(filename) as fh:

and the processing begins with

    tree = etree.parse(filename)

If I just look at the code, it suggests that tree = etree.parse(filename) is a fresh start and has nothing to do with what happened to the previous file or will happen to the next file. But this is clearly not the case: something accumulates inside the program, and whenever there is enough of that (whatever it is), the program grinds to a halt.

I could use a workaround and process each file separately--pretty tedious for 500 files. I could also break the 500 files into ten batches of fifty files, which would be safe and is tolerable. But there ought to be a way of telling the program: when you're done with file A, clear out everything and start afresh with file B. How do I say this in Python?

Martin Mueller
Professor emeritus of English and Classics
Northwestern University

On 5/24/14, 21:57, "Ivan Pozdeev" <vano@mail.mipt.ru> wrote:
Create another parser object?
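[For readers of the archive: the "clear out everything, then start afresh" idea Martin asks about is conventionally spelled with iterparse, which streams a file and lets you free each element as soon as it has been handled. The sketch below is an illustration under assumptions: the TEI namespace URI is the standard one, and the per-word work is a placeholder, not Martin's actual conditions.]

```python
from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"  # assumed TEI namespace

def count_words(filename):
    """Stream one file, freeing each <w> element after use so
    memory stays flat even for very large documents."""
    count = 0
    for _event, elem in etree.iterparse(filename, tag=TEI + "w"):
        count += 1       # placeholder for the real per-word work
        elem.clear()     # drop the element's text and children
        # also delete already-processed siblings still referenced
        # by the parent, so libxml2 can actually release them
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return count
```

With the tag filter, only the end events of <w> elements are delivered; clearing each element and deleting its earlier siblings is the standard lxml pattern for keeping the in-memory tree small.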

Ivan Pozdeev, 25.05.2014 05:57:
Create another parser object?
That shouldn't change anything. A parser doesn't keep state across runs. It's mainly just a wrapper around a specific configuration and a lock that prevents concurrent usage. Parser state is freshly created on each run.

Stefan
participants (3)
- Ivan Pozdeev
- Martin Mueller
- Stefan Behnel