I have a "tail" problem with the following XML fragment

    <lg>
      <l>WHo doth desire the trump of fame, to sound vnto the Skies,</l>
      <l><pb/>Or els who seekes the holy place, where mighty Ioue he lies,</l>
      <l>He must not by deceitfull mind, nor yet by puissant strength,</l>
    </lg>

I want to turn the <pb> tag from child to previous sibling of <l> and use this code

    for element in tree.iter():
        if element.tag == 'pb':
            parent = element.getparent()
            grandparent = parent.getparent()
            position = grandparent.index(parent)
            parent.text = element.tail
            grandparent.insert(position , (element))

Some of it works but it fails to get rid of the pb tail and produces this output:

    <lg>
      <l>WHo doth desire the trump of same, to sound vnto the skies,</l>
      <pb/>Or els who seekes the holy place, where mighty Ioue he lies,
      <l>Or els who seekes the holy place, where mighty Ioue he lies,</l>
      <l>He must not by deceitfull mind, nor yet by puissant strength,</l>
    </lg>

If there is a with_tail=False solution, I don't know where to stick the command.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
On Tue, Feb 24, 2015 at 06:25:35PM +0000, Martin Mueller wrote:
I have a "tail" problem with the following XML fragment
    <lg>
      <l>WHo doth desire the trump of fame, to sound vnto the Skies,</l>
      <l><pb/>Or els who seekes the holy place, where mighty Ioue he lies,</l>
      <l>He must not by deceitfull mind, nor yet by puissant strength,</l>
    </lg>
I want to turn the <pb> tag from child to previous sibling of <l>
Do I understand it correctly that you want to convert the above into

    <lg>
      <l>WHo doth desire the trump of fame, to sound vnto the Skies,</l>
      <pb/>
      <l>Or els who seekes the holy place, where mighty Ioue he lies,</l>
      <l>He must not by deceitfull mind, nor yet by puissant strength,</l>
    </lg>

?
and use this code
    for element in tree.iter():
        if element.tag == 'pb':
            parent = element.getparent()
            grandparent = parent.getparent()
            position = grandparent.index(parent)
            parent.text = element.tail
            grandparent.insert(position , (element))
Modifying the data structure while you're iterating over it is often dangerous. I'm not 100% sure it is so in this case, but I wouldn't do it anyway, just in case -- so I'd collect the iteration results into a list before I'd attempt to modify the tree.

Also, tree.iter() can do the tag-name filtering for you:

    for element in list(tree.iter('pb')):
        parent = element.getparent()          # <l>
        grandparent = parent.getparent()      # <lg>
        position = grandparent.index(parent)
        parent.text = element.tail
        element.tail = '\n'                   # <-- the thing you're missing
        grandparent.insert(position, element)
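The loop above can be tried as a complete script. This is a minimal sketch, assuming lxml is installed; the function name hoist_page_breaks and the SOURCE constant are made up for illustration, and the tail is cleared with None rather than '\n' so that no stray whitespace is left behind.

```python
from lxml import etree

# Sample input taken from the original question.
SOURCE = (
    "<lg>"
    "<l>WHo doth desire the trump of fame, to sound vnto the Skies,</l>"
    "<l><pb/>Or els who seekes the holy place, where mighty Ioue he lies,</l>"
    "<l>He must not by deceitfull mind, nor yet by puissant strength,</l>"
    "</lg>"
)


def hoist_page_breaks(root):
    """Move every <pb> so that it becomes the previous sibling of its parent."""
    # Snapshot the matches first, so the tree is not modified mid-iteration.
    for pb in list(root.iter("pb")):
        parent = pb.getparent()                      # <l>
        grandparent = parent.getparent() if parent is not None else None
        if grandparent is None:
            continue  # <pb> is the root or a child of the root: leave it alone
        parent.text = pb.tail   # the verse text now belongs to <l>
        pb.tail = None          # otherwise the text would move along with <pb>
        # In lxml, inserting an element that is already in the tree moves it.
        grandparent.insert(grandparent.index(parent), pb)
    return root


root = hoist_page_breaks(etree.fromstring(SOURCE))
print(etree.tostring(root, encoding="unicode"))
```

Like the loop it is based on, this overwrites any text that stood in <l> before the <pb>, so it only suits input where <pb> comes first in the line.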
Some of it works but it fails to get rid of the pb tail and produces this output:
    <lg>
      <l>WHo doth desire the trump of same, to sound vnto the skies,</l>
      <pb/>Or els who seekes the holy place, where mighty Ioue he lies,
      <l>Or els who seekes the holy place, where mighty Ioue he lies,</l>
      <l>He must not by deceitfull mind, nor yet by puissant strength,</l>
    </lg>
If there is a with_tail=False solution, I don't know where to stick the command.
You can clear the tail by hand by doing

    element.tail = '\n'  # or '', or None

HTH,
Marius Gedminas
--
If you are smart enough to know that you're not smart enough to be an Engineer, then you're in Business.
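To make the text/tail split concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml, on the assumption that both follow the same ElementTree text/tail model; the one-line sample is made up. As for with_tail=False: that is a keyword argument of lxml's tostring(), so it only affects what gets serialised and does not change the tree itself.

```python
import xml.etree.ElementTree as ET

line = ET.fromstring("<l>before<pb/>after</l>")
pb = line[0]

# Text before the first child is stored on the parent element.
text_before = line.text
print(repr(text_before))   # 'before'

# Text after a child, up to the next tag, is stored on that child as its tail.
tail_before = pb.tail
print(repr(tail_before))   # 'after'

# Clearing the tail detaches the trailing text from <pb>.
pb.tail = None
print(repr(pb.tail))       # None
```

Because the tail is an attribute of the child element, it travels with that element whenever the element is moved, which is why it has to be reassigned or cleared by hand.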
Thank you for your advice, which was very helpful. I should have thought of it myself, but I'm still trying to wrap my mind around the slightly weird ways of 'tail' in python xml processing. I was a little alarmed about your warnings about the dangers of "modifying the data structure while iterating over it."
Modifying the data structure while you're iterating over it is often dangerous. I'm not 100% sure it is so in this case, but I wouldn't do it anyway, just in case -- so I'd collect the iteration results into a list before I'd attempt to modify the tree.
But in my case I don't see how one should "collect the iteration results into a list." Given many cases like

    <lg>
      <l>WHo doth desire the trump of fame, to sound vnto the Skies,</l>
      <l><pb/>Or els who seekes the holy place, where mighty Ioue he lies,</l>
      <l>He must not by deceitfull mind, nor yet by puissant strength,</l>
    </lg>

it is the case that whenever <pb/> is a first child of <l> I want to turn it into its previous sibling. I could flag the pb's with a temporary attribute and go over them in a second pass, which I've often done. But what does that buy in this case or why is it "dangerous" to make the change one by one as you go along?

From a perspective of well-formedness or schema validation in TEI documents, "<l/><pb/><l/>" will always be as valid as <l/><l><pb/></l>, though having both patterns in the same collection of documents is a nuisance.
On .02.2015 at 05:35, Martin Mueller <martinmueller@northwestern.edu> wrote:
I was a little alarmed about your warnings about the dangers of "modifying the data structure while iterating over it."
It's a typical gotcha in Python. Consider the following example:

    l = list(range(10))
    for i in l:
        print(l.pop())

    9
    8
    7
    6
    5

This applies to any mutable sequence: do not manipulate the *sequence* while iterating over it.

Charlie
--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226
On Fri, Feb 27, 2015 at 04:35:06AM +0000, Martin Mueller wrote:
Thank you for your advice, which was very helpful. I should have thought of it myself, but I'm still trying to wrap my mind around the slightly weird ways of 'tail' in python xml processing.
I was a little alarmed about your warnings about the dangers of "modifying the data structure while iterating over it."
Are you familiar with the problem? Here's an example, using Python lists:

    a_list = [1, 2, 3, 4, 5]
    for item in a_list:
        a_list.remove(item)
    print(a_list)

Try running it to see what happens. You'd expect a_list to be empty at the end of the loop, right?

The risk of modifying a data structure while you're iterating over it is that you may skip elements, or process some elements more than once, or get an exception.
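For reference, here is a short sketch of what that loop does, next to two variations: one that iterates over a snapshot copy, and one that shows the "get an exception" case with a dict. The names b_list, a_dict and error are made up for illustration, and the dict behaviour shown is that of CPython.

```python
# 1. Removing from a list while iterating over it skips items.
a_list = [1, 2, 3, 4, 5]
for item in a_list:
    a_list.remove(item)
print(a_list)  # [2, 4]

# 2. Iterating over a snapshot copy visits every item.
b_list = [1, 2, 3, 4, 5]
for item in list(b_list):
    b_list.remove(item)
print(b_list)  # []

# 3. Some containers refuse outright: a dict raises RuntimeError
#    when its size changes during iteration.
a_dict = {"a": 1, "b": 2}
error = None
try:
    for key in a_dict:
        del a_dict[key]
except RuntimeError as exc:
    error = exc
print(type(error).__name__)  # RuntimeError
```

In the first loop the list shrinks by one on every pass while the iterator's position still advances by one, so every second item is never visited.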
Modifying the data structure while you're iterating over it is often dangerous. I'm not 100% sure it is so in this case, but I wouldn't do it anyway, just in case -- so I'd collect the iteration results into a list before I'd attempt to modify the tree.
But in my case I don't see how one should "collect the iteration results into a list."
I did that in my example. Compare

    for node in tree.iter('tagname'):   # <-- iterating over the tree directly
        # modifying tree might be dangerous

versus

    for node in list(tree.iter('tagname')):   # <-- iterating over a list
        # modifying tree is fine here

which is just a shorter spelling for

    interesting_nodes = list(tree.iter('tagname'))
    for node in interesting_nodes:
        # modifying tree is fine here

which is also a shorter spelling for

    interesting_nodes = []
    for node in tree.iter('tagname'):
        # modifying tree might be dangerous, but we're not doing that
        interesting_nodes.append(node)
    # now we're done iterating over the tree and it's safe to modify it
    for node in interesting_nodes:
        # modifying tree is fine here
Given many cases like
    <lg>
      <l>WHo doth desire the trump of fame, to sound vnto the Skies,</l>
      <l><pb/>Or els who seekes the holy place, where mighty Ioue he lies,</l>
      <l>He must not by deceitfull mind, nor yet by puissant strength,</l>
    </lg>
it is the case that whenever <pb/> is a first child of <l> I want to turn it into its previous sibling. I could flag the pb's with a temporary attribute and go over them in a second pass, which I've often done.
There are simpler ways, like the one I outlined above.
But what does that buy in this case or why is it "dangerous" to make the change one by one as you go along?
I gave an example of the possible danger at the beginning of this email. Now I don't know if this danger applies to lxml. I tried to look it up in the documentation and failed to find anything relevant. I then tried a small experiment and couldn't get my code to misbehave, so perhaps lxml's iteration can safely cope with modifications. (Or perhaps my code example was just too simple to trigger a possible error condition, I don't know!)
From a perspective of well-formedness or schema validation in TEI documents, "<l/><pb/><l/>" will always be as valid as <l/><l><pb/></l>, though having both patterns in the same collection of documents is a nuisance.
Regards,
Marius Gedminas
--
The usual "drop and roll" advice for those who have lit themselves on fire, is contraindicated if you're standing fifteen feet up a ladder with a nice bed of wood shavings at the base.
-- John Schilling shares his experience
Marius Gedminas wrote on 27.02.2015 at 09:34:
The risk of modifying a data structure while you're iterating over it is that you may skip elements, or process some elements more than once, or get an exception. [...] Now I don't know if this danger applies to lxml. I tried to look it up in the documentation and failed to find anything relevant. I then tried a small experiment and couldn't get my code to misbehave, so perhaps lxml's iteration can safely cope with modifications. (Or perhaps my code example was just too simple to trigger a possible error condition,
Most likely so. lxml currently looks one match ahead. This has the advantage that tree modifications during iteration work in many cases. It has the disadvantage that in a large document where an element only appears once, the whole document is searched despite already having found the only match. Given that this is very fast in lxml, it usually doesn't matter that much, but it's certainly visible in some extreme cases.

Don't rely on this, though. It might change at some point to, say, only look one element ahead, instead of one element that actually matches the current search. That would reduce the search overhead in the "one element only" case.

Generally speaking, it's safe to modify parts of the tree that no longer need to be touched by the traversal (such as siblings that were already traversed or attributes of the current element), but the behaviour when modifying tree content that lies ahead or above (ancestors) is undefined.

If unsure, follow one of the examples that you (Marius) gave in your email. Structural tree modifications are best done outside of the iteration loop.

Stefan
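Keeping structural changes outside the iteration loop, combined with Martin's condition that only a <pb/> that is the first child of <l> should move, could look roughly like the sketch below. It assumes lxml is installed; the helper names leading_page_breaks and hoist and the sample lines are made up for illustration.

```python
from lxml import etree

# Made-up sample: the second <l> opens with <pb/>, the third has one mid-line.
SOURCE = (
    "<lg>"
    "<l>first line</l>"
    "<l><pb/>second line</l>"
    "<l>third <pb/>line</l>"
    "</lg>"
)


def leading_page_breaks(root):
    """Pass 1 (read-only): collect each <pb> that is the first thing in an <l>."""
    found = []
    for pb in root.iter("pb"):
        line = pb.getparent()
        if line is None or line.tag != "l" or line.getparent() is None:
            continue
        first_child = pb.getprevious() is None
        no_text_before = not (line.text or "").strip()
        if first_child and no_text_before:
            found.append(pb)
    return found


def hoist(pb):
    """Pass 2: make one collected <pb> the previous sibling of its <l>."""
    line = pb.getparent()
    group = line.getparent()
    line.text = pb.tail   # the verse text stays with <l>
    pb.tail = None        # so that it does not travel with <pb>
    group.insert(group.index(line), pb)  # lxml moves the existing element


root = etree.fromstring(SOURCE)
for pb in leading_page_breaks(root):  # iteration has finished before any move
    hoist(pb)
print(etree.tostring(root, encoding="unicode"))
```

The mid-line <pb/> in the third <l> is left where it is, because moving it would separate it from the text that precedes it.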
participants (4)
- Charlie Clark
- Marius Gedminas
- Martin Mueller
- Stefan Behnel