a simple question about a tricky problem

I use lxml to work with a large collection (66,000) of TEI-encoded texts that are linguistically annotated. Each token is wrapped in a <w> or <pc> element with a unique ID and various attributes. I can march through the texts at the lowest level of <w> and <pc> elements without paying any attention to the discursive structure of higher elements. I just do

    for w in tree.iter(tei + 'w', tei + 'pc'):
        if x: do this
        if y: do that

But now I want to create a concordance in which tokens meeting some condition are pulled out and surrounded with seven words on either side. I do this with itersiblings(), but that is a tricky operation. The next <w> token may not be a sibling but a child of a higher-level sibling. Remembering that “elements are lists”, you have patterns like

    [a, b, c, [d, e, f], g, h, i, [k, l, m, n]]

Getting from ‘c’ to ‘d’ is one thing, getting from ‘f’ to ‘g’ is another. In a large archive of sometimes quite weird encodings, the details become very hairy very fast. Is there some “Gordian knot” solution, or does one just figure out this obstacle race one detail at a time?

There are “soft” tags that do not break the continuity of a sentence (hi), hard tags that mark an end beyond which you don’t want to go anyhow (p), and “jump tags” (note) where your “next sibling” is the first <w> after the <note> element, which may be quite long.

I am old enough to have grown up with Winnie the Pooh and feel like a “Bear of Very Little Brain” when confronted with these problems. I’ll be grateful for any advice, including a confirmation that it’s just the way it is.

Martin Mueller
Professor of English and Classics emeritus

Martin Mueller wrote at 2023-6-8 04:02 +0000:
Apparently, the sequence of `w` and `pc` elements (in document order) is essential. You already have a solution to determine this sequence. If you have any element, you can determine its parent (`getparent()` in lxml) and therefore, recursively, the path from the root to the element. If you have elements `e1` and `e2`, you can then determine their deepest common ancestor. Maybe that helps you to solve your problem.
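One way to read the suggestion above, as a rough sketch rather than a tested recipe for the whole archive: walk the tokens in document order, and for each token look up its ancestors to decide which "context" it belongs to. Tokens sharing the same innermost hard or jump ancestor form one flat list, and the seven-word window is then a plain slice. The tag sets `HARD_TAGS` and `JUMP_TAGS` and the helper names `context_key` and `concordance` are assumptions made up for this illustration; the real lists of soft, hard and jump tags would have to come from the corpus.

```python
from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"
TOKEN_TAGS = (TEI + "w", TEI + "pc")
HARD_TAGS = {TEI + "p"}      # assumption: boundaries a window must not cross
JUMP_TAGS = {TEI + "note"}   # assumption: content kept out of the main flow


def context_key(token):
    """Return the innermost hard or jump ancestor of a token, or None."""
    for ancestor in token.iterancestors():
        if ancestor.tag in HARD_TAGS or ancestor.tag in JUMP_TAGS:
            return ancestor
    return None


def concordance(root, matches, width=7):
    """Yield (left, hit, right) token lists for every matching token."""
    # Group tokens by context; document order is preserved inside each group.
    groups = {}
    for token in root.iter(*TOKEN_TAGS):
        groups.setdefault(context_key(token), []).append(token)
    for tokens in groups.values():
        for i, token in enumerate(tokens):
            if matches(token):
                yield tokens[max(0, i - width):i], token, tokens[i + 1:i + 1 + width]


SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<p><w>a</w><w>b</w><hi><w>c</w><w>d</w></hi>'
    '<note><w>x</w><w>y</w></note><w>e</w><pc>.</pc></p>'
    '<p><w>f</w></p>'
    '</body></text></TEI>'
)
root = etree.fromstring(SAMPLE)
for left, hit, right in concordance(root, lambda t: t.text == "d", width=2):
    print([t.text for t in left], hit.text, [t.text for t in right])
# prints ['b', 'c'] d ['e', '.']
```

In this sketch soft tags such as hi need no special handling, because they are simply not in either set: 'd' inside hi and 'e' after the note land in the same list, while 'x' and 'y' inside the note form a list of their own.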

On Thu, 2023-06-08 at 04:02 +0000, Martin Mueller wrote:
I would approach this by first transforming each document into a simpler structure, using XSLT. If you do not care about anything other than tei:p, tei:w, and tei:pc elements, and you want every tei:w and tei:pc to be a direct child of its tei:p, then your transform can find all tei:p (and any other containing elements you might have), output them, and then output all descendant tei:w and tei:pc as their children. Something like:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0">

      <xsl:template match="/">
        <doc>
          <xsl:apply-templates select="//tei:p"/>
        </doc>
      </xsl:template>

      <xsl:template match="tei:p">
        <p>
          <xsl:apply-templates select=".//tei:w | .//tei:pc"/>
        </p>
      </xsl:template>

      <xsl:template match="tei:pc | tei:w">
        <xsl:copy>
          <!-- Whatever handling of attributes and children and
               content you want. -->
        </xsl:copy>
      </xsl:template>

    </xsl:stylesheet>

Following that, you can very easily find the preceding and following siblings without crossing boundaries.

Jamie
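For what it is worth, a transform along these lines can be run from lxml itself through `etree.XSLT`, after which `itersiblings()` works on the flattened result. The following is only a sketch under some assumptions: the tokens are taken to be tei:w and tei:pc as in the original question, the token template copies attributes and text with `xsl:copy-of` and `xsl:value-of` (one possible choice for the "whatever handling" step), and the names `FLATTEN` and `window` are invented for the example.

```python
from itertools import islice

from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"

# Stylesheet modelled on the one sketched above; the token template is an assumption.
FLATTEN = etree.XSLT(etree.fromstring(b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei">
  <xsl:template match="/">
    <doc><xsl:apply-templates select="//tei:p"/></doc>
  </xsl:template>
  <xsl:template match="tei:p">
    <p><xsl:apply-templates select=".//tei:w | .//tei:pc"/></p>
  </xsl:template>
  <xsl:template match="tei:w | tei:pc">
    <xsl:copy><xsl:copy-of select="@*"/><xsl:value-of select="."/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>
"""))


def window(token, width=7):
    """Return (left, hit, right) texts using only siblings of a flattened token."""
    # preceding=True yields the nearest sibling first, so reverse the slice.
    left = [t.text for t in islice(token.itersiblings(preceding=True), width)][::-1]
    right = [t.text for t in islice(token.itersiblings(), width)]
    return left, token.text, right


SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<p><w>a</w><w>b</w><hi><w>c</w><w>d</w></hi><w>e</w><pc>.</pc></p>'
    '<p><w>f</w></p>'
    '</body></text></TEI>'
)
flat = FLATTEN(etree.fromstring(SAMPLE)).getroot()
hits = [window(t, width=2) for t in flat.iter(TEI + "w", TEI + "pc") if t.text == "d"]
print(hits)
# prints [(['b', 'c'], 'd', ['e', '.'])]
```

Two caveats apply to this sketch. Tokens inside a note would be pulled into the surrounding paragraph, so a predicate such as `[not(ancestor::tei:note)]` on the select would be needed to jump over them. And a tei:p nested inside another tei:p (for example within a note) would have its tokens emitted twice.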

On 8 Jun 2023, at 9:10, Jamie Norrish wrote:
lxml will also simply let you pass a list of tags into iterparse (via its `tag` argument), so you can do this directly while iterating. See https://lxml.de/parsing.html#iterparse-and-iterwalk

Charlie Clark
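As a hedged illustration of the iterparse route: the `tag` argument accepts a sequence of tags, so a single pass can see only the token elements plus the hard boundary, and a small sliding window supplies the context. The function name `stream_concordance`, the choice of tei:p as the only hard boundary, and the decision to report token text only are assumptions of this sketch. Jump tags such as note are not handled here, and a paragraph nested inside another would cut the outer window short.

```python
import io
from collections import deque

from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"
W, PC, P = TEI + "w", TEI + "pc", TEI + "p"


def stream_concordance(source, matches, width=7):
    """Yield (left, hit, right) token texts from one pass over `source`."""
    before = deque(maxlen=width)   # the last `width` tokens seen
    pending = []                   # hits still collecting right-hand context
    for _, el in etree.iterparse(source, events=("end",), tag=(W, PC, P)):
        if el.tag == P:
            # Hard boundary: emit unfinished hits and start afresh.
            for left, hit, right in pending:
                yield left, hit, right
            pending, before = [], deque(maxlen=width)
            el.clear()             # release the finished paragraph's content
            continue
        text = el.text
        for entry in pending:
            entry[2].append(text)
        done = [e for e in pending if len(e[2]) == width]
        pending = [e for e in pending if len(e[2]) < width]
        for left, hit, right in done:
            yield left, hit, right
        if matches(el):
            pending.append((list(before), text, []))
        before.append(text)


SAMPLE = (
    b'<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    b'<p><w>a</w><w>b</w><hi><w>c</w><w>d</w></hi><w>e</w><pc>.</pc></p>'
    b'<p><w>f</w></p>'
    b'</body></text></TEI>'
)
for row in stream_concordance(io.BytesIO(SAMPLE), lambda t: t.text in ("d", "f"), width=2):
    print(row)
# prints (['b', 'c'], 'd', ['e', '.'])
# prints ([], 'f', [])
```

The appeal of this variant is that nothing larger than one paragraph has to stay in memory, which may matter across 66,000 files; the price is that all boundary logic has to be expressed in terms of start and end events.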

participants (4): Charlie Clark, Dieter Maurer, Jamie Norrish, Martin Mueller