problem with going from linear to hierarchical structure

I have been struggling with what looks like a very simple problem, but it is beyond my feeble powers. The longer XML fragment below is a very elaborate but flat representation of the first two lines from Shakespeare's Comedy of Errors:

    <sp><speaker>EGEON</speaker>
    <l> Proceed, Solinus, to procure my fall,</l>
    <l> And by the doom of death end woes and all</l>
    </sp>

In this encoding the word, punctuation, and space tokens are not hierarchically gathered into <l> elements. Instead all such tokens are children of an <ab> element, and an empty milestone element marks the hierarchical ordering into chunks of verse or prose. Don't ask me why this somewhat counterintuitive encoding was chosen in the first place, but my goal is to "unflatten" this very flat structure, which would involve (not necessarily in that order):

1. creating an appropriate container element (<l> or <ab>) for all the tokens between one milestone and the next (or renaming the milestone element)
2. inserting all tokens between one milestone and the next into that element in their current document order
3. deleting the surrounding <ab> element

The end product of these transformations would still keep the <w>, <pc>, and <c> tokens, but would have the hierarchical structure that I listed above. I see the general problem of "from linear to hierarchical," but I don't see any examples in the lxml documentation that allow me to get from here to there. It's probably a very simple thing that I should know about, but I don't, and I'll be grateful for any help.
<sp xml:id="sp-0001" who="#Egeon_Err">
  <speaker xml:id="spk-0001"> <w xml:id="w0000410">EGEON</w> </speaker>
  <ab xml:id="ab-0001">
    <lb xml:id="lb-00009"/>
    <milestone unit="ftln" xml:id="ftln-0001" n="1.1.1" ana="#verse" corresp="#w0000420 #p0000430 #c0000440 #w0000450 #p0000460 #c0000470 #w0000480 #c0000490 #w0000500 #c0000510 #w0000520 #c0000530 #w0000540 #p0000550"/>
    <w xml:id="w0000420" n="1.1.1">Proceed</w>
    <pc xml:id="p0000430" n="1.1.1">,</pc>
    <c xml:id="c0000440" n="1.1.1"> </c>
    <w xml:id="w0000450" n="1.1.1">Solinus</w>
    <pc xml:id="p0000460" n="1.1.1">,</pc>
    <c xml:id="c0000470" n="1.1.1"> </c>
    <w xml:id="w0000480" n="1.1.1">to</w>
    <c xml:id="c0000490" n="1.1.1"> </c>
    <w xml:id="w0000500" n="1.1.1">procure</w>
    <c xml:id="c0000510" n="1.1.1"> </c>
    <w xml:id="w0000520" n="1.1.1">my</w>
    <c xml:id="c0000530" n="1.1.1"> </c>
    <w xml:id="w0000540" n="1.1.1">fall</w>
    <pc xml:id="p0000550" n="1.1.1">,</pc>
    <lb xml:id="lb-00010"/>
    <milestone unit="ftln" xml:id="ftln-0002" n="1.1.2" ana="#verse" corresp="#w0000560 #c0000570 #w0000580 #c0000590 #w0000600 #c0000610 #w0000620 #c0000630 #w0000640 #c0000650 #w0000660 #c0000670 #w0000680 #c0000690 #w0000700 #c0000710 #w0000720 #c0000730 #w0000740 #p0000750"/>
    <w xml:id="w0000560" n="1.1.2">And</w>
    <c xml:id="c0000570" n="1.1.2"> </c>
    <w xml:id="w0000580" n="1.1.2">by</w>
    <c xml:id="c0000590" n="1.1.2"> </c>
    <w xml:id="w0000600" n="1.1.2">the</w>
    <c xml:id="c0000610" n="1.1.2"> </c>
    <w xml:id="w0000620" n="1.1.2">doom</w>
    <c xml:id="c0000630" n="1.1.2"> </c>
    <w xml:id="w0000640" n="1.1.2">of</w>
    <c xml:id="c0000650" n="1.1.2"> </c>
    <w xml:id="w0000660" n="1.1.2">death</w>
    <c xml:id="c0000670" n="1.1.2"> </c>
    <w xml:id="w0000680" n="1.1.2">end</w>
    <c xml:id="c0000690" n="1.1.2"> </c>
    <w xml:id="w0000700" n="1.1.2">woes</w>
    <c xml:id="c0000710" n="1.1.2"> </c>
    <w xml:id="w0000720" n="1.1.2">and</w>
    <c xml:id="c0000730" n="1.1.2"> </c>
    <w xml:id="w0000740" n="1.1.2">all</w>
    <pc xml:id="p0000750" n="1.1.2">.</pc>
  </ab>
</sp>
Martin Mueller
Professor emeritus of English and Classics
Northwestern University

On .01.2015 at 18:14, Martin Mueller <martinmueller@northwestern.edu> wrote:
The end product of these transformations would still keep the <w>, <pc>, and <c>tokens, but would have the hierarchical structure that I listed above. I see the general problem of "from linear to hierarchical," but I don't see any examples in the lxml documentation that allows me to get from here to there. It's probably a very simple thing that I should know about, but I don't, and I'll be grateful for any help.
*I* think that flattening invariably involves a new tree. What you might do is create a Python class for the structure and have the parsing and serialising code take care of the different trees you want. You *might* be able to do things with the objectify module, but it might be best to get something working first and then improve it.

Charlie
-- 
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226
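One way to read the "new tree" suggestion, as a minimal sketch rather than a definitive implementation: walk the children of <ab> once and copy them into freshly built <l> elements, leaving the parsed input untouched. The function name `unflatten_sp` and the shortened sample fragment are hypothetical. The sketch also assumes the fragment carries no namespace, and that <lb> elements can be dropped, which may not suit every edition.

```python
from copy import deepcopy

try:
    from lxml import etree
except ImportError:  # the calls used below also exist in the standard library
    import xml.etree.ElementTree as etree

# Shortened, hypothetical sample in the same flat shape as the fragment above.
FLAT = (
    '<sp><speaker><w>EGEON</w></speaker>'
    '<ab><lb/><milestone n="1.1.1"/><w>Proceed</w><pc>,</pc>'
    '<lb/><milestone n="1.1.2"/><w>And</w><c> </c><w>by</w></ab></sp>'
)


def unflatten_sp(sp):
    """Return a new <sp> in which each milestone-delimited run is an <l>."""
    new_sp = etree.Element('sp', dict(sp.attrib))
    for child in sp:
        if child.tag != 'ab':
            new_sp.append(deepcopy(child))  # e.g. <speaker>, copied as-is
            continue
        line = None  # no line is open until the first milestone is seen
        for token in child:
            if token.tag == 'milestone':
                # A milestone opens a new line; carry its line number over.
                line = etree.SubElement(new_sp, 'l', n=token.get('n', ''))
            elif line is not None and token.tag != 'lb':
                # Assumption: <lb> is dropped, all other tokens are kept.
                line.append(deepcopy(token))
    return new_sp


new_sp = unflatten_sp(etree.fromstring(FLAT))
print(etree.tostring(new_sp).decode())
```

Because the tokens are copied, the original tree can still be serialised in its flat form afterwards, at the cost of holding both trees in memory.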

On 02.01.2015 at 18:14, Martin Mueller wrote:
The end product of these transformations would still keep the <w>, <pc>, and <c>tokens, but would have the hierarchical structure that I listed above. I see the general problem of "from linear to hierarchical," but I don't see any examples in the lxml documentation that allows me to get from here to there. It's probably a very simple thing that I should know about, but I don't, and I'll be grateful for any help.
I can imagine two ways. Either make use of the corresp attribute on the milestones to collect the desired elements:

    from lxml import etree

    # tree is the parsed document.
    for sp in tree.xpath('//sp'):
        ab = sp.xpath('ab')[0]  # Assumes only one ab per sp.
        for milestone in ab.xpath('milestone'):
            line = etree.SubElement(sp, 'l')
            corr_ids = [id_.lstrip('#')
                        for id_ in milestone.get('corresp').split()]
            for id_ in corr_ids:
                # Look up the element whose ID matches.
                elem = sp.xpath('id($cid)', cid=id_)[0]
                line.append(elem)  # Moves elem out of ab into the new l.
        sp.remove(ab)

Or iterate over the elements until the next milestone is reached:

    from copy import deepcopy

    from lxml import etree

    for sp in tree.xpath('//sp'):
        ab = sp.xpath('ab')[0]  # Assumes only one ab per sp.
        for milestone in ab.xpath('milestone'):
            line = etree.SubElement(sp, 'l')
            elem = milestone.getnext()
            while True:
                if elem is None or elem.tag == 'milestone':
                    break
                line.append(deepcopy(elem))  # Can't append directly,
                                             # would break getnext().
                elem = elem.getnext()
        sp.remove(ab)

There might be more elegant ways of achieving this, but it basically works. See <http://nbviewer.ipython.org/gist/frederik-elwert/fef31d94b3ef4589a983> for a working example with output.

Best,
Frederik

-- 
Dr. Frederik Elwert
Post-doctoral researcher
Project manager SeNeReKo
Centre for Religious Studies
Ruhr-University Bochum
Universitätsstr. 150
D-44780 Bochum
Room FNO 01/180
Tel. +49-(0)234 - 32 24794
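The remark in the second variant ("Can't append directly, would break getnext()") follows from how lxml handles append(): an element that already sits in a tree is moved, not copied. The small sketch below illustrates this on a hypothetical, shortened fragment, and shows one possible alternative to deepcopy(), namely looking up the next sibling before the move. The variable name `following` is made up for the illustration.

```python
from lxml import etree

# Hypothetical, shortened fragment in the same flat shape as the thread's example.
ab = etree.fromstring(
    '<ab><milestone n="1.1.1"/><w>Proceed</w><pc>,</pc>'
    '<milestone n="1.1.2"/><w>And</w></ab>'
)
first = ab[0]               # the first milestone
line = etree.Element('l')   # container for the tokens of the first line

elem = first.getnext()
while elem is not None and elem.tag != 'milestone':
    following = elem.getnext()   # look ahead while elem is still inside <ab>
    line.append(elem)            # moves elem: it is now the last child of <l>
    assert elem.getnext() is None  # so asking only now would end the loop early
    elem = following

print(etree.tostring(line).decode())
print(etree.tostring(ab).decode())
```

Moving avoids duplicating the tokens, whereas the deepcopy() variant leaves <ab> intact until it is removed at the end. Either choice seems workable here.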

What a lovely New Year's Present! Many, many thanks. I had been wandering around in the vicinity of that solution but hadn't got close enough by myself.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University

On 1/2/15, 14:03, "Frederik Elwert" <frederik.elwert@web.de> wrote:
There might be more elegant ways of achieving this, but it basically works. See <http://nbviewer.ipython.org/gist/frederik-elwert/fef31d94b3ef4589a983> for a working example with output.

Frederik Elwert wrote on 02.01.2015 at 22:03:
I can imagine two ways. Either make use of the corresp attribute on the milestones to collect the desired elements:

    for sp in tree.xpath('//sp'):
        ab = sp.xpath('ab')[0]  # Assumes only one ab per sp.
        for milestone in ab.xpath('milestone'):
            line = etree.SubElement(sp, 'l')
            corr_ids = [id_.lstrip('#')
                        for id_ in milestone.get('corresp').split()]
            for id_ in corr_ids:
                elem = sp.xpath('id($cid)', cid=id_)[0]
                line.append(elem)
        sp.remove(ab)

Or iterate over the elements until the next milestone is reached:

    for sp in tree.xpath('//sp'):
        ab = sp.xpath('ab')[0]  # Assumes only one ab per sp.
        for milestone in ab.xpath('milestone'):
            line = etree.SubElement(sp, 'l')
            elem = milestone.getnext()
            while True:
                if elem is None or elem.tag == 'milestone':
                    break
                line.append(deepcopy(elem))  # Can't append directly,
                                             # would break getnext().
                elem = elem.getnext()
        sp.remove(ab)
One comment: .iter() and .iterfind() are generally more efficient than .xpath() when looking for tag names, in terms of both memory and speed, so I would use

    tree.iter('sp')    # same as tree.iterfind('.//sp')
    sp.find('ab')      # returns first match
    ab.iterfind('milestone')

Oh, and there is an .itersiblings() method which can be used in the second version. It even accepts a specific tag name (or list of tags) to iterate on, so you could say

    for elem in milestone.itersiblings('milestone'):

or, alternatively, reuse the list of 'milestone' elements that you are already iterating over.

Stefan
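Put together, these suggestions might look like the sketch below. It is one possible reading, not code from the thread: the function name `unflatten` and the shortened sample are hypothetical, and the fragment is assumed to carry no namespace. The results of .iter() and .iterfind() are turned into lists first, as a precaution, because the loop body moves nodes around in the tree that is being walked.

```python
from lxml import etree


def unflatten(tree):
    """Regroup the tokens of every <ab> into <l> elements, in place."""
    for sp in list(tree.iter('sp')):
        ab = sp.find('ab')  # first <ab> child, or None
        if ab is None:
            continue
        # Materialise the milestones before any token is moved out of <ab>.
        for milestone in list(ab.iterfind('milestone')):
            line = etree.SubElement(sp, 'l')
            tokens = []
            for elem in milestone.itersiblings():  # following siblings only
                if elem.tag == 'milestone':
                    break
                tokens.append(elem)
            line.extend(tokens)  # moves the collected tokens into <l>
        sp.remove(ab)  # only the empty milestones are left in <ab>


# Hypothetical, shortened sample in the same flat shape as the thread's example.
doc = etree.fromstring(
    '<sp><speaker><w>EGEON</w></speaker><ab>'
    '<milestone n="1.1.1"/><w>Proceed</w><pc>,</pc>'
    '<milestone n="1.1.2"/><w>And</w><c> </c><w>by</w>'
    '</ab></sp>'
)
unflatten(doc)
print(etree.tostring(doc).decode())
```

Collecting the siblings into a list before calling extend() keeps the iteration and the restructuring apart, so no deepcopy() is needed.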
participants (4)
- Charlie Clark
- Frederik Elwert
- Martin Mueller
- Stefan Behnel