Mailman 3 problem with removing elements: bug or feature - lxml - The Python XML Toolkit

newer
How to iterate through nodes of...

problem with removing elements: bug or feature

Martin Mueller

July 22, 2014

4:34 p.m.

I have run into the following problem with removing an element, and I can't figure out whether it's a bug or whether I'm missing something. I want to make a change to an element, depending on the next element, and delete that next element. The change and deletion occur in a function. The main code program like this: if element.getnext() is not None \ and element.getnext().get('rend') == 'superscript': element = processSuperscripts(element, tree, changelog) The function first spells out the change and then orders the deletion; parent = element.getnext().getparent() parent.remove(element.getnext()) This code works the first time, but it doesn't work on successive occasions. I have used a work around where I add a flag to the next element and delete it in a separate loop. Thus my function ends element.getnext().set('n', 'DELETE') and the main program adds the loop: for element in tree.iter(tei + 'w'): if element.get('n') == 'DELETE': parent = element.getparent() parent.remove(element) This works: all instances of the first element are properly changed and all instances of the second element are properly deleted. Is there a way of making the deletion work inside the function without a second "clean-up" loop? Martin Mueller Professor emeritus of English and Classics Northwestern University

Show replies by date

Stefan Behnel

July 2014

7:16 p.m.

Martin Mueller, 22.07.2014 18:34:

...

I have run into the following problem with removing an element, and I can't figure out whether it's a bug or whether I'm missing something.

I want to make a change to an element, depending on the next element, and delete that next element. The change and deletion occur in a function. The main code program like this:

if element.getnext() is not None \ and element.getnext().get('rend') == 'superscript': element = processSuperscripts(element, tree, changelog)

The function first spells out the change and then orders the deletion;

parent = element.getnext().getparent() parent.remove(element.getnext())

This code works the first time, but it doesn't work on successive occasions.

Is that done in a tree iteration loop? The behaviour when replacing or removing the next elements in a tree during iteration is undefined, as the implementation may already have visited it, so the next iteration may start from the removed element (and terminate there) rather than continuing with the rest of the tree structure.

...

I have used a work around where I add a flag to the next element and delete it in a separate loop. Thus my function ends

element.getnext().set('n', 'DELETE')

and the main program adds the loop:

for element in tree.iter(tei + 'w'): if element.get('n') == 'DELETE': parent = element.getparent() parent.remove(element)

This works: all instances of the first element are properly changed and all instances of the second element are properly deleted.

I'd rather collect all elements that you want to delete in a list and then delete them one after the other after finding all of them. That's substantially more efficient than repeating the entire tree iteration. Note that your code above doesn't even fix the problem, as it still modifies the current node during iteration.

...

Is there a way of making the deletion work inside the function without a second "clean-up" loop?

You could also remember only the (single) next element to delete, and then remove it in the next loop iteration, i.e. to_delete = None for el in tree.iter(tei+'w'): if to_delete is not None: remove_from_parent(to_delete) to_delete = None if should_delete(el): to_delete = el if to_delete is not None: remove_from_parent(to_delete) to_delete = None That way, you only keep a reference to one obsolete element at any time, not to all of them. And the element that you remove always safely lives in the part of the document that you already passed over. Stefan

Martin Mueller

5:01 a.m.

Thank you for this very full reply, which gives me a lot to chew on, and I'm not sure I understand all of it. I have an immediate follow-up question. Yes, the code happens as part of a tree iteration loop. In the context of my current project, which has some 500 texts, the second loop may be inefficient, but the additional time cost is trivial. You seem to question whether it really works. The first iteration does something for the current element, and depending on context marks a previous or following element for deletion, marking it with a DELETE value for an n attribute. The second iteration stops at every element where n="delete" and removes it. Leaving aside the inefficiency of this routine, is it wrong or unstable in some sense? What does it mean to "collect" the elements to be deleted in a "list"? Is that list some appendix to the document or a separate document created on the fly. I don't understand this conceptually. If it's a separate list to which the elements are moved, they are in fact removed from the tree, and deleting the item s in the list doesn't really matter. But if the list is a list of elements that still remain in the tree, if you use the list to find you'd still have to cycle through the tree. If your answer to my first question is "this is a dumb and inefficient way of doing it, but it works" I don't really want to impose on your time to learn how to do it more elegantly, unless you feel like teaching me. MM Martin Mueller Professor emeritus of English and Classics Northwestern University On 7/27/14, 13:16, "Stefan Behnel" <stefan_ml@behnel.de> wrote:

...

Martin Mueller, 22.07.2014 18:34:

...
I have run into the following problem with removing an element, and I can't figure out whether it's a bug or whether I'm missing something.

I want to make a change to an element, depending on the next element, and delete that next element. The change and deletion occur in a function. The main code program like this:

if element.getnext() is not None \ and element.getnext().get('rend') == 'superscript': element = processSuperscripts(element, tree, changelog)

The function first spells out the change and then orders the deletion;

parent = element.getnext().getparent() parent.remove(element.getnext())

This code works the first time, but it doesn't work on successive occasions.

Is that done in a tree iteration loop? The behaviour when replacing or removing the next elements in a tree during iteration is undefined, as the implementation may already have visited it, so the next iteration may start from the removed element (and terminate there) rather than continuing with the rest of the tree structure.

...
I have used a work around where I add a flag to the next element and delete it in a separate loop. Thus my function ends

element.getnext().set('n', 'DELETE')

and the main program adds the loop:

for element in tree.iter(tei + 'w'): if element.get('n') == 'DELETE': parent = element.getparent() parent.remove(element)

This works: all instances of the first element are properly changed and all instances of the second element are properly deleted.

I'd rather collect all elements that you want to delete in a list and then delete them one after the other after finding all of them. That's substantially more efficient than repeating the entire tree iteration.

Note that your code above doesn't even fix the problem, as it still modifies the current node during iteration.

...
Is there a way of making the deletion work inside the function without a second "clean-up" loop?

You could also remember only the (single) next element to delete, and then remove it in the next loop iteration, i.e.

to_delete = None for el in tree.iter(tei+'w'): if to_delete is not None: remove_from_parent(to_delete) to_delete = None if should_delete(el): to_delete = el

if to_delete is not None: remove_from_parent(to_delete) to_delete = None

That way, you only keep a reference to one obsolete element at any time, not to all of them. And the element that you remove always safely lives in the part of the document that you already passed over.

Stefan

_________________________________________________________________ Mailing list for the lxml Python XML toolkit - http://lxml.de/ lxml@lxml.de https://mailman-mail5.webfaction.com/listinfo/lxml

Stefan Behnel

6:16 a.m.

Martin Mueller schrieb am 31.07.2014 um 07:01:

...

On 7/27/14, 13:16, "Stefan Behnel" wrote:

...
Martin Mueller, 22.07.2014 18:34:

...
I have run into the following problem with removing an element, and I can't figure out whether it's a bug or whether I'm missing something.

I want to make a change to an element, depending on the next element, and delete that next element. The change and deletion occur in a function. The main code program like this:

if element.getnext() is not None \ and element.getnext().get('rend') == 'superscript': element = processSuperscripts(element, tree, changelog)

The function first spells out the change and then orders the deletion;

parent = element.getnext().getparent() parent.remove(element.getnext())

This code works the first time, but it doesn't work on successive occasions.

Is that done in a tree iteration loop? The behaviour when replacing or removing the next elements in a tree during iteration is undefined, as the implementation may already have visited it, so the next iteration may start from the removed element (and terminate there) rather than continuing with the rest of the tree structure.

...
I have used a work around where I add a flag to the next element and delete it in a separate loop. Thus my function ends

element.getnext().set('n', 'DELETE')

and the main program adds the loop:

for element in tree.iter(tei + 'w'): if element.get('n') == 'DELETE': parent = element.getparent() parent.remove(element)

This works: all instances of the first element are properly changed and all instances of the second element are properly deleted.

I'd rather collect all elements that you want to delete in a list and then delete them one after the other after finding all of them. That's substantially more efficient than repeating the entire tree iteration.

Note that your code above doesn't even fix the problem, as it still modifies the current node during iteration.

...
Is there a way of making the deletion work inside the function without a second "clean-up" loop?

You could also remember only the (single) next element to delete, and then remove it in the next loop iteration, i.e.

to_delete = None for el in tree.iter(tei+'w'): if to_delete is not None: remove_from_parent(to_delete) to_delete = None if should_delete(el): to_delete = el

if to_delete is not None: remove_from_parent(to_delete) to_delete = None

That way, you only keep a reference to one obsolete element at any time, not to all of them. And the element that you remove always safely lives in the part of the document that you already passed over.

Thank you for this very full reply, which gives me a lot to chew on, and I'm not sure I understand all of it.

I have an immediate follow-up question. Yes, the code happens as part of a tree iteration loop. In the context of my current project, which has some 500 texts, the second loop may be inefficient, but the additional time cost is trivial. You seem to question whether it really works. The first iteration does something for the current element, and depending on context marks a previous or following element for deletion, marking it with a DELETE value for an n attribute. The second iteration stops at every element where n="delete" and removes it.

Leaving aside the inefficiency of this routine, is it wrong or unstable in some sense?

It's a bit more inefficient than necessary and risks not working on elements that happen to have an n="DELETE" attribute set already on input (however unlikely that is). The main problem, however, is that moving around or deleting the current or next element during tree iteration is *undefined behaviour*. It may work, but it may also stop the iteration or make it continue at the place where the element was moved to. Or it may try to eat your cat. Undefined. The reason is that lxml needs to hold on to either the current or the next position in the tree when it returns the next iteration item and gives control back to your code. If your code then happens to change or remove the element at that position, lxml will loose track of that real position and continue at some other place when you ask the iteration to continue. BTW, the meaning of "undefined" in this case is "I do not want to give any guarantees about the behaviour in order to avoid restrictions on future changes", as well as "It's actually hard to always stick to exactly the same guarantees in all cases, so I simply won't guarantee them". Python does that, too. Adding or removing entries from a dict while you're iterating over it leads to undefined behaviour, as it depends on the order of the hash keys as well as the size of the dict what the next iteration item is, and both can change when adding/removing keys.

...

What does it mean to "collect" the elements to be deleted in a "list"?

Guess I wasn't clear enough here. Let's formalise it in code. :) to_delete = [] for el in tree.iter(tei+'w'): if should_delete(el): to_delete.append(el) for el in to_delete: remove_from_parent(el)

...

Is that list some appendix to the document or a separate document created on the fly. I don't understand this conceptually. If it's a separate list to which the elements are moved, they are in fact removed from the tree, and deleting the item s in the list doesn't really matter. But if the list is a list of elements that still remain in the tree, if you use the list to find you'd still have to cycle through the tree.

No, because each element knows its position in the tree and its parent. So you can simply say def remove_from_parent(el): parent = el.getparent() parent.remove(el) without any tree iteration. (Note that iteration is the problem here, not general traversal through direct next/parent/children references.)

...

If your answer to my first question is "this is a dumb and inefficient way of doing it, but it works" I don't really want to impose on your time to learn how to do it more elegantly, unless you feel like teaching me.

I actually see one of the purposes of this mailing list in coming up with best practices and communicating them. If you call that teaching, that's fine with me, but even for me it's a mixture of teaching and learning. Stefan

Holger Joukl

7 a.m.

...

I actually see one of the purposes of this mailing list in coming up with best practices and communicating them. If you call that teaching, that's fine with me, but even for me it's a mixture of teaching and learning.

Stefan

That's the spirit! :) Great lib, great maintenance from Stefan, really helpful (and interesting, and fun) mailing list. lxml.objectify has been at the core of our product for years now and I couldn't be happier with it. Holger Landesbank Baden-Wuerttemberg Anstalt des oeffentlichen Rechts Hauptsitze: Stuttgart, Karlsruhe, Mannheim, Mainz HRA 12704 Amtsgericht Stuttgart

3859

Age (days ago)

3868

Last active (days ago)

List overview

Download

4 comments

3 participants

participants (3)

Holger Joukl
Martin Mueller
Stefan Behnel

problem with removing elements: bug or feature

Martin Mueller

Stefan Behnel

Martin Mueller

Stefan Behnel

Holger Joukl

tags

participants (3)