Mailman 3 a simple question about a tricky problem - lxml - The Python XML Toolkit

7 Jun 2023

I use lxml to work with a large collection of TEI-encoded texts (66,000) that are linguistically annotated. Each token is wrapped in a <w> or <pc> element with a unique ID and various attributes. I can march through the texts at the lowest level of <w> and <pc> elements without paying any attention to the discursive structure of higher elements. I just do

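            # tei is assumed to hold the TEI namespace in Clark notation,
            # '{http://www.tei-c.org/ns/1.0}'
            for w in tree.iter(tei + 'w', tei + 'pc'):
                if x:
                    ...  # do this
                if y:
                    ...  # do that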

But now I want to create a concordance in which tokens meeting some condition are pulled out and surrounded with seven words on either side. I do this with itersiblings(), but that is a tricky operation. The next <w> token may not be a sibling but a child of a higher-level sibling. Remembering that “elements are lists”, you have patterns like

            [a, b, c, [d, e, f], g, h, i, [k, l, m, n]]

Getting from ‘c’ to ‘d’ is one thing, getting from ‘f’ to ‘g’ is another. In a large archive of sometimes quite weird encodings, the details become very hairy very fast. Is there some “Gordian knot” solution, or does one just work through this obstacle race one detail at a time? There are “soft” tags that do not break the continuity of a sentence (<hi>), “hard” tags that mark an end beyond which you don’t want to go anyhow (<p>), and “jump” tags (<note>), where your “next sibling” is the first <w> after the <note> element, which may be quite long.
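
One possible way to sidestep the sibling bookkeeping, offered as a minimal sketch rather than a definitive answer: flatten each "hard" block into a list of its tokens in document order, descending into "soft" tags and skipping "jump" subtrees, then take the context window by list index. The names TOKENS, HARD, JUMP, tokens_in and concordance are hypothetical, the assignment of <p> and <note> to the hard and jump sets is only the example from the paragraph above, and the window counts <w> and <pc> tokens alike. The stdlib fallback is there only so the sketch runs where lxml is not installed.

```python
try:
    from lxml import etree                    # the toolkit discussed here
except ImportError:                           # fallback: same API subset
    import xml.etree.ElementTree as etree

TEI = "{http://www.tei-c.org/ns/1.0}"
TOKENS = {TEI + "w", TEI + "pc"}
HARD = {TEI + "p"}       # assumption: a context never crosses these
JUMP = {TEI + "note"}    # assumption: the main text jumps over these


def tokens_in(block):
    """Yield the tokens of one hard block in document order."""
    for child in block:
        if not isinstance(child.tag, str):    # comment or processing instruction
            continue
        if child.tag in TOKENS:
            yield child
        elif child.tag in JUMP or child.tag in HARD:
            continue                          # skip the whole subtree
        else:
            yield from tokens_in(child)       # soft tag such as <hi>: descend


def concordance(root, matches, width=7):
    """Yield (left, hit, right) for each token for which matches() is true."""
    for block in root.iter():
        if block.tag not in HARD:
            continue
        toks = list(tokens_in(block))
        for i, tok in enumerate(toks):
            if matches(tok):
                yield toks[max(0, i - width):i], tok, toks[i + 1:i + 1 + width]


SAMPLE = """<text xmlns="http://www.tei-c.org/ns/1.0"><p>
<w>a</w><w>b</w><w>c</w><hi><w>d</w><w>e</w><w>f</w></hi><w>g</w>
<note><w>x</w><w>y</w></note><w>h</w><w>i</w></p>
<p><w>k</w><w>l</w></p></text>"""

root = etree.fromstring(SAMPLE)
lines = [
    " ".join(t.text for t in left) + " [" + hit.text + "] "
    + " ".join(t.text for t in right)
    for left, hit, right in concordance(root, lambda t: t.text in ("f", "g"), width=2)
]
for line in lines:
    print(line)
```

With the sample above this prints "d e [f] g h" and "e f [g] h i": the step from ‘f’ to ‘g’ leaves the <hi> element, and the step from ‘g’ to ‘h’ jumps over the <note>. Tokens inside a jump element are picked up only if they sit in a hard block of their own (a <p> inside the <note>), and text outside any hard block is ignored, so the three tag sets would need tuning against the real encodings.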

I am old enough to have grown up with Winnie the Pooh and feel like a “Bear of Very Little Brain” when confronted with these problems. I’ll be grateful for any advice, including a confirmation that it’s just the way it is.

Martin Mueller
Professor of English and Classics emeritus
