a question about odd behaviour of 'break' in an lxml loop with iterchildren and itersiblings
I am puzzled by the following behaviour of lxml. I want to group element children with the same tag. Here is a schematic representation of my text:

    <sp xml:id="sp-0645">
      <speaker xml:id="spk-0645"/>
      <ab xml:id="ftln-0645" n="2.1.1"/>
      <ab xml:id="ftln-0646" n="2.1.2"/>
      <ab xml:id="ftln-0647" n="2.1.3"/>
      <ab xml:id="ftln-0648" n="2.1.4"/>
      <stage/>
      <ab xml:id="ftln-0703" n="2.1.59"/>
      <ab xml:id="ftln-0704" n="2.1.60"/>
      <stage/>
      <ab xml:id="ftln-0705" n="2.1.61"/>
    </sp>

I want to end up with something like:

    <sp xml:id="sp-0645">
      <speaker xml:id="spk-0645"/>
      <p>
        <lb xml:id="ftln-0645" n="2.1.1"/>
        <lb xml:id="ftln-0646" n="2.1.2"/>
        <lb xml:id="ftln-0647" n="2.1.3"/>
        <lb xml:id="ftln-0648" n="2.1.4"/>
      </p>
      <stage/>
      <p>
        <lb xml:id="ftln-0703" n="2.1.59"/>
        <lb xml:id="ftln-0704" n="2.1.60"/>
      </p>
      <stage/>
      <p>
        <lb xml:id="ftln-0705" n="2.1.61"/>
      </p>
    </sp>

I try to do this through a combination of iterchildren() and itersiblings() and use the following script:

    from lxml import etree

    speech = 'speech.xml'
    tree = etree.parse(speech)

    for sp in tree.iter('sp'):
        for child in sp.iterchildren():
            if child.tag != 'ab':
                print('child', child.tag)
                # walk the following siblings until the run of <ab> ends
                for sibling in child.itersiblings():
                    if sibling.tag == 'ab':
                        print('sibling', sibling.tag)
                    else:
                        break

The output of this script suggests that the two loops work properly:

    child speaker
    sibling ab
    sibling ab
    sibling ab
    sibling ab
    child stage
    sibling ab
    sibling ab
    child stage
    sibling ab

Now I complicate things by creating a paragraph element with each child element. The printout shows that each of the paragraphs exists and that after the break statement the program returns to the top child loop:

    child speaker
    sibling ab
    sibling ab
    sibling ab
    sibling ab
    else <Element p at 0x1014cb548>
    child stage
    sibling ab
    sibling ab
    else <Element p at 0x1014cb508>
    child stage
    sibling ab
    elif <Element p at 0x1014cb608>

But if I now add code to the script that fills the created paragraphs with ab elements, the program does this correctly for the first iteration but then it exits. Here is the code and the result:

    from lxml import etree

    speech = 'speech.xml'
    tree = etree.parse(speech)

    for sp in tree.iter('sp'):
        for child in sp.iterchildren():
            if child.tag != 'ab':
                paragraph = etree.Element('p')
                print('child', child.tag)
                for sibling in child.itersiblings():
                    if sibling.tag == 'ab' and sibling.getnext() is not None:
                        print('sibling', sibling.tag)
                        # append() moves the <ab> out of <sp> into the new <p>
                        paragraph.append(sibling)
                    elif sibling.tag == 'ab' and sibling.getnext() is None:
                        print('sibling', sibling.tag)
                        paragraph.append(sibling)
                        print('elif', etree.tostring(paragraph, encoding='unicode', pretty_print=True))
                    else:
                        print('else', etree.tostring(paragraph, encoding='unicode', pretty_print=True))
                        break

    child speaker
    sibling ab
    sibling ab
    sibling ab
    sibling ab
    else <p><ab xml:id="ftln-0645" n="2.1.1"/>
    <ab xml:id="ftln-0646" n="2.1.2"/>
    <ab xml:id="ftln-0647" n="2.1.3"/>
    <ab xml:id="ftln-0648" n="2.1.4"/>
    </p>

Why does the program execute only the first iteration of the loop and exit completely after the 'break', when it doesn't do that in the structurally identical but simpler versions?
Dear Martin,

I think you are hitting one of the frequent lxml stumbling blocks: changing the tree during iteration. Since you move child elements to the new paragraph, they break child iteration.

One solution would be to break your code into a “collect phase” in which you identify the elements you want to move, and a “transform phase” in which you do the actual transformation.

Here’s an example of that approach that seems to work for your case:

    for sp in tree.iter('sp'):
        # collect phase
        newstruct = []
        for child in sp.iterchildren():
            if child.tag != 'ab':
                newgroup = []
                newstruct.append((child, newgroup))
                for sibling in child.itersiblings():
                    if sibling.tag == 'ab':
                        newgroup.append(sibling)
                    else:
                        break
        # transform phase
        for child, group in newstruct:
            paragraph = etree.Element('p')
            paragraph.extend(group)
            child.addnext(paragraph)

    print(etree.tostring(tree, encoding='unicode', pretty_print=True))

    <sp xml:id="sp-0645">
      <speaker xml:id="spk-0645"/>
      <p>
        <ab xml:id="ftln-0645" n="2.1.1"/>
        <ab xml:id="ftln-0646" n="2.1.2"/>
        <ab xml:id="ftln-0647" n="2.1.3"/>
        <ab xml:id="ftln-0648" n="2.1.4"/>
      </p>
      <stage/>
      <p>
        <ab xml:id="ftln-0703" n="2.1.59"/>
        <ab xml:id="ftln-0704" n="2.1.60"/>
      </p>
      <stage/>
      <p>
        <ab xml:id="ftln-0705" n="2.1.61"/>
      </p>
    </sp>

Hope that helps.

Frederik

(In reply to Martin Mueller's message of 15.01.2016, 16:31.)
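To make the point about moved elements concrete, here is a minimal sketch of what append() does to an element that already sits in a tree. The inline sample is a shortened, hypothetical stand-in for speech.xml. It only shows the documented move semantics; how exactly the iterchildren() iterator reacts afterwards is not spelled out in this thread, so reading the early exit as "the outer loop follows the moved element into its new, shorter neighbourhood" is an assumption rather than a confirmed description of lxml internals.

```python
from lxml import etree

# Hypothetical, shortened stand-in for the <sp> element in speech.xml.
sp = etree.fromstring(
    '<sp><speaker/><ab n="2.1.1"/><ab n="2.1.2"/><stage/></sp>'
)
speaker, ab1, ab2, stage = sp  # unpack the four children

paragraph = etree.Element('p')
paragraph.append(ab1)  # in lxml this moves ab1 out of <sp>; it is not copied

print([child.tag for child in sp])  # ab1 is no longer a child of <sp>
print(ab1.getparent().tag)          # its parent is now the new <p>
print(ab1.getnext())                # inside <p> it has no following sibling yet
print(speaker.getnext().get('n'))   # <speaker> is now followed by the second <ab>
```

Once an element has been moved like this, any loop that still expects it to be a child of <sp> is looking at a tree that has changed underneath it, which is why separating collecting from moving is the safer pattern.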
_________________________________________________________________ Mailing list for the lxml Python XML toolkit - http://lxml.de/ lxml@lxml.de https://mailman-mail5.webfaction.com/listinfo/lxml
Thank you for this solution. Charlie Clark had also told me that this is what Michael Kay would call (in reference to namespaces) the "elephant trap." I hope I won't make this mistake again. MM
(In reply to the message from Frederik Elwert <frederik.elwert@web.de> of Jan 15, 2016, 10:58 AM.)
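One habit that may help stay out of this trap is to loop over a snapshot of the children, list(parent), instead of a live iterator, and to modify the tree only from inside that loop. The sketch below is a variation on the collect/transform idea from this thread, not the code posted in it. The function name group_runs, its parameters, and the inline sample are hypothetical, and the fallback to the standard library is an assumption made only so the example runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library this thread is about
except ImportError:
    # Assumption for portability only: the calls used below also exist in the stdlib.
    import xml.etree.ElementTree as etree

# Hypothetical, shortened stand-in for speech.xml.
SAMPLE = (
    '<sp><speaker/>'
    '<ab n="2.1.1"/><ab n="2.1.2"/>'
    '<stage/>'
    '<ab n="2.1.59"/>'
    '<stage/>'
    '<ab n="2.1.61"/></sp>'
)


def group_runs(parent, tag='ab', wrapper='p'):
    """Wrap each run of consecutive `tag` children of `parent` in a new `wrapper` element."""
    current = None  # wrapper of the run currently being filled, if any
    # list(parent) is a snapshot taken before any change, so the loop
    # never iterates over the tree while it is being modified.
    for child in list(parent):
        if child.tag != tag:
            current = None  # any other element ends the current run
            continue
        if current is None:
            current = etree.Element(wrapper)
            # place the wrapper at the position where the run starts
            parent.insert(list(parent).index(child), current)
        parent.remove(child)   # detach explicitly, then re-attach inside the wrapper
        current.append(child)
    return parent


sp = etree.fromstring(SAMPLE)
group_runs(sp)
print([child.tag for child in sp])
```

For the sample above this should leave <sp> with the children speaker, p, stage, p, stage, p. The snapshot costs one extra list per parent, which seems a fair price for not having to reason about what a live iterator does after elements have been moved.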
participants (2)
- Frederik Elwert
- Martin Mueller