Re: [lxml] Lxml aborts with an odd error message

PS to my earlier email. Is there a way in which I can make lxml "forget" everything about a file as soon as it is done with it? I have a dim sense that iterparse has something to do with that, but I don't know how to tell the program, "when you're done with file A, wipe out everything in memory before moving on to file B." Some command of that kind should do the trick, because the program has no trouble processing individually any of the files it complains about when they are encountered in the aggregate.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University

Yes, I am sure that the IDs are unique across the corpus. I don't think that the same file is encountered twice. If that were the case, I would expect the error to occur at the first ID of the file that is encountered twice. But it happens somewhere in the middle, so we would have to posit that the program happily encounters some number of IDs twice, but then suddenly complains.

On 5/24/14, 19:59, "Ivan Pozdeev" <vano@mail.mipt.ru> wrote:
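For reference, the iterparse idiom alluded to above looks roughly like this. It is a minimal sketch: handle() is a hypothetical stand-in for whatever per-element work the script does, while the lxml calls themselves are standard.

    from lxml import etree

    def process_file(filename):
        # Stream the file instead of building the whole tree at once.
        for event, elem in etree.iterparse(filename, events=("end",)):
            handle(elem)  # hypothetical per-element work
            # Discard the element's contents once it has been handled...
            elem.clear()
            # ...and drop references to already-processed siblings,
            # so the partial tree behind us can be garbage-collected.
            while elem.getprevious() is not None:
                del elem.getparent()[0]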

I'm not a programmer, and "Create another parser object" is too terse for me to understand. I removed "parser" from the command so that the parsing line now reads

    tree = etree.parse(filename)

But it made no difference: the program aborted at exactly the same point. Then I started the program again, but this time started with the second file. The program aborted one file later, again in the middle of a file.

In layman's terms, I'm using an outer loop (os.walk) to loop through a set of files and bring up each file for separate processing. The separate processing happens when the next file is opened by

    with open(filename) as fh:

and the processing begins with

    tree = etree.parse(filename)

If I just look at the code, it suggests that tree = etree.parse(filename) is a fresh start and has nothing to do with what happened to the previous file or will happen to the next file. But this is clearly not the case: something accumulates inside the program, and whenever there is enough of that (whatever it is), the program grinds to a halt.

I could use a workaround and process each file separately--pretty tedious for 500 files. I could also break the 500 files into ten batches of fifty files, which would be safe and is tolerable. But there ought to be a way of telling the program: when you're done with File A, clear out everything, and start afresh with File B. How do I say this in Python?

Martin Mueller
Professor emeritus of English and Classics
Northwestern University

On 5/24/14, 21:57, "Ivan Pozdeev" <vano@mail.mipt.ru> wrote:
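For what it's worth, the usual way to spell "clear out everything, and start afresh" in Python is to give each file its own function scope, so that the tree and everything hanging off it become unreachable as soon as the call returns. A minimal sketch of the loop described above, in which the "corpus" directory and the ".xml" filter are assumed details:

    import os
    from lxml import etree

    def process(filename):
        # Everything created here is local to the call: once it
        # returns, the tree and its elements can be reclaimed.
        tree = etree.parse(filename)
        for elem in tree.iter():
            pass  # per-element work goes here

    for dirpath, dirnames, filenames in os.walk("corpus"):
        for name in filenames:
            if name.endswith(".xml"):
                process(os.path.join(dirpath, name))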

Ivan Pozdeev, 25.05.2014 05:57:
That shouldn't change anything. A parser doesn't keep state across runs. It's mainly just a wrapper around a specific configuration and a lock that prevents concurrent usage. Parser state is freshly created on each run.

Stefan
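Concretely, that means a single configured parser object can be handed to every parse() call without anything carrying over between documents. A minimal sketch, in which the filenames and the remove_blank_text option are merely illustrative:

    from lxml import etree

    # The parser object holds configuration, not document state.
    parser = etree.XMLParser(remove_blank_text=True)

    tree_a = etree.parse("a.xml", parser)
    tree_b = etree.parse("b.xml", parser)  # fresh parser state each call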

participants (3)

- Ivan Pozdeev
- Martin Mueller
- Stefan Behnel