I have been struggling with what looks like a very simple problem but it
is beyond my feeble powers. The longer XML fragment at the end of this
message is a very elaborate but flat representation of the first two lines
from Shakespeare's Comedy of Errors, which in a conventional hierarchical
encoding look like this:
<sp><speaker>EGEON</speaker>
<l> Proceed, Solinus, to procure my fall,</l>
<l> And by the doom of death end woes and all</l>
</sp>
In the flat encoding the word, punctuation, and space tokens are not
hierarchically gathered into <l> elements. Instead all such tokens are
children of an <ab> element, and an empty milestone element marks the
hierarchical ordering into chunks of verse or prose. Don't ask me why this
somewhat counterintuitive encoding was chosen in the first place, but my
goal is to "unflatten" this very flat structure, which would involve (not
necessarily in this order):
1. creating an appropriate container element (<l> or <ab>) for all the
tokens between one milestone and the next (or renaming the milestone
element)
2. inserting all tokens between one milestone and the next into that
element in their current document order
3. deleting the surrounding <ab> element
The end product of these transformations would still keep the <w>, <pc>,
and <c> tokens, but would have the hierarchical structure shown in the
short fragment above.
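One possible reading of these three steps, as a rough sketch rather than a tested solution: walk the children of the flat <ab>, open a new container at every <milestone>, and put each following token into it. The function name unflatten is made up for illustration. The sketch assumes the elements are not in a namespace (as in the fragment in this message), that ana="#verse" means <l> and anything else means <ab>, and that only xml:id and n need to be carried over from the milestone (corresp is dropped, since the container makes it redundant). An <lb/> simply stays where it stands in document order. Only calls shared by lxml.etree and the standard library's ElementTree are used.

```python
try:
    from lxml import etree  # the library the question is about
except ImportError:
    # Fallback: the stdlib module supports the subset of the API used here.
    import xml.etree.ElementTree as etree

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def unflatten(sp):
    """Replace each flat <ab> child of sp with one <l> or <ab> per milestone."""
    for ab in [child for child in sp if child.tag == "ab"]:
        new_children = []  # elements that will take the place of the flat <ab>
        container = None   # the <l> or <ab> currently being filled
        for child in list(ab):  # iterate over a copy: appending moves nodes in lxml
            if child.tag == "milestone":
                # Step 1: a milestone opens a new container element.
                # Assumption: "#verse" maps to <l>, everything else to <ab>.
                tag = "l" if child.get("ana") == "#verse" else "ab"
                container = etree.Element(tag)
                for name in (XML_ID, "n"):
                    if child.get(name) is not None:
                        container.set(name, child.get(name))
                new_children.append(container)
            elif container is None:
                # Anything before the first milestone is kept outside containers.
                new_children.append(child)
            else:
                # Step 2: tokens go into the open container in document order.
                container.append(child)
        # Step 3: drop the surrounding <ab> and put the new elements in its place.
        index = list(sp).index(ab)
        sp.remove(ab)
        for offset, node in enumerate(new_children):
            sp.insert(index + offset, node)
    return sp


# Shortened version of the flat fragment, for demonstration only.
flat = (
    '<sp><speaker><w>EGEON</w></speaker><ab xml:id="ab-0001">'
    '<lb xml:id="lb-00009"/>'
    '<milestone unit="ftln" xml:id="ftln-0001" n="1.1.1" ana="#verse"/>'
    '<w>Proceed</w><pc>,</pc><c> </c><w>Solinus</w>'
    '<milestone unit="ftln" xml:id="ftln-0002" n="1.1.2" ana="#verse"/>'
    '<w>And</w><c> </c><w>by</w>'
    '</ab></sp>'
)
sp = unflatten(etree.fromstring(flat))
print(etree.tostring(sp, encoding="unicode"))
```

If the real files are in the TEI namespace, the tag comparisons and the new element names would need the namespace URI in Clark notation, for example "{http://www.tei-c.org/ns/1.0}milestone".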
I see the general problem of "from linear to hierarchical," but I don't
see any examples in the lxml documentation that allow me to get from here
to there. It's probably a very simple thing that I should know about, but
I don't, and I'll be grateful for any help.
<sp xml:id="sp-0001" who="#Egeon_Err">
<speaker xml:id="spk-0001">
<w xml:id="w0000410">EGEON</w>
</speaker>
<ab xml:id="ab-0001">
<lb xml:id="lb-00009"/>
<milestone unit="ftln" xml:id="ftln-0001" n="1.1.1" ana="#verse"
corresp="#w0000420 #p0000430 #c0000440 #w0000450 #p0000460 #c0000470
#w0000480 #c0000490 #w0000500 #c0000510 #w0000520 #c0000530 #w0000540
#p0000550"/>
<w xml:id="w0000420" n="1.1.1">Proceed</w>
<pc xml:id="p0000430" n="1.1.1">,</pc>
<c xml:id="c0000440" n="1.1.1"> </c>
<w xml:id="w0000450" n="1.1.1">Solinus</w>
<pc xml:id="p0000460" n="1.1.1">,</pc>
<c xml:id="c0000470" n="1.1.1"> </c>
<w xml:id="w0000480" n="1.1.1">to</w>
<c xml:id="c0000490" n="1.1.1"> </c>
<w xml:id="w0000500" n="1.1.1">procure</w>
<c xml:id="c0000510" n="1.1.1"> </c>
<w xml:id="w0000520" n="1.1.1">my</w>
<c xml:id="c0000530" n="1.1.1"> </c>
<w xml:id="w0000540" n="1.1.1">fall</w>
<pc xml:id="p0000550" n="1.1.1">,</pc>
<lb xml:id="lb-00010"/>
<milestone unit="ftln" xml:id="ftln-0002" n="1.1.2" ana="#verse"
corresp="#w0000560 #c0000570 #w0000580 #c0000590 #w0000600 #c0000610
#w0000620 #c0000630 #w0000640 #c0000650 #w0000660 #c0000670 #w0000680
#c0000690 #w0000700 #c0000710 #w0000720 #c0000730 #w0000740 #p0000750"/>
<w xml:id="w0000560" n="1.1.2">And</w>
<c xml:id="c0000570" n="1.1.2"> </c>
<w xml:id="w0000580" n="1.1.2">by</w>
<c xml:id="c0000590" n="1.1.2"> </c>
<w xml:id="w0000600" n="1.1.2">the</w>
<c xml:id="c0000610" n="1.1.2"> </c>
<w xml:id="w0000620" n="1.1.2">doom</w>
<c xml:id="c0000630" n="1.1.2"> </c>
<w xml:id="w0000640" n="1.1.2">of</w>
<c xml:id="c0000650" n="1.1.2"> </c>
<w xml:id="w0000660" n="1.1.2">death</w>
<c xml:id="c0000670" n="1.1.2"> </c>
<w xml:id="w0000680" n="1.1.2">end</w>
<c xml:id="c0000690" n="1.1.2"> </c>
<w xml:id="w0000700" n="1.1.2">woes</w>
<c xml:id="c0000710" n="1.1.2"> </c>
<w xml:id="w0000720" n="1.1.2">and</w>
<c xml:id="c0000730" n="1.1.2"> </c>
<w xml:id="w0000740" n="1.1.2">all</w>
<pc xml:id="p0000750" n="1.1.2">.</pc>
</ab>
</sp>
Martin Mueller
Professor emeritus of English and Classics
Northwestern University