[lxml] Re: a simple question about a tricky problem

8 Jun 2023

Martin Mueller wrote at 2023-6-8 04:02 +0000:
...
I use lxml to work with a large collection of TEI-encoded texts (66,000) that are linguistically annotated. Each token is wrapped in a <w> or <pc> element with a unique ID and various attributes. I can march through the texts at the lowest level of <w> and <pc> elements without paying any attention to the discursive structure of higher elements. I just do
for w in tree.iter(tei + 'w', tei + 'pc'):
    if x:
        do this
    if y:
        do that
But now I want to create a concordance in which tokens meeting some condition are pulled out and surrounded with seven words on either side. I do this with itersiblings(), but that is a tricky operation. The next <w> token may not be a sibling but a child of a higher-level sibling. Remembering that "elements are lists" you have patterns like
[a, b, c, [d, e, f], g, h, i, [k, l, m, n]]
Apparently, what matters is the sequence of `w` and `pc` elements
in document order. Your `tree.iter()` loop already gives you this sequence.
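
One way to use that sequence is to collect it into a flat list once and take slices around each hit. This is only a minimal sketch, not a tested solution for the real corpus: the names `concordance`, `matches` and `width` are made up for illustration, the TEI namespace constant is assumed to be what `tei` holds in the loop above, and the window counts `pc` tokens as well as `w` tokens.

```python
from lxml import etree

# Assumed value of the `tei` prefix used above (Clark notation).
TEI = "{http://www.tei-c.org/ns/1.0}"


def concordance(root, matches, width=7):
    """Yield (left, hit, right) for every token for which matches(token) is true.

    `matches` and `width` are hypothetical names for this sketch.
    """
    # iter() walks the tree in document order, however deeply the tokens
    # are nested, so neighbours in this list are neighbours in the text.
    tokens = list(root.iter(TEI + "w", TEI + "pc"))
    for i, tok in enumerate(tokens):
        if matches(tok):
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            yield left, tok, right


# Tiny made-up sample: "c" and "d" are nested one level deeper than the rest.
sample = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">'
    "<w>a</w><w>b</w>"
    "<hi><w>c</w><w>d</w></hi>"
    "<w>e</w><pc>.</pc>"
    "</p>"
)
root = etree.fromstring(sample)
for left, hit, right in concordance(root, lambda t: t.text == "c", width=2):
    print([t.text for t in left], hit.text, [t.text for t in right])
# prints: ['a', 'b'] c ['d', 'e']
```

Because the list is built once per text, no sibling navigation is needed at all; the cost is holding references to all tokens of one text in memory at a time.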

Given any element, you can determine its parent (in lxml, via `getparent()`)
and therefore, recursively, the path from the root to the element.
Given two elements `e1` and `e2`, you can then determine
their deepest common ancestor. Maybe that helps you solve your problem.
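
A rough sketch of that idea, assuming lxml elements (which provide `getparent()`); the helper names `path_from_root` and `deepest_common_ancestor` are invented here and are not part of lxml:

```python
from lxml import etree


def path_from_root(el):
    """Return [root, ..., parent, el] by following getparent() upward."""
    path = [el]
    parent = el.getparent()
    while parent is not None:
        path.append(parent)
        parent = parent.getparent()
    path.reverse()
    return path


def deepest_common_ancestor(e1, e2):
    """Return the last element shared by both root-to-element paths.

    Returns None if the elements do not belong to the same tree.
    """
    common = None
    for a, b in zip(path_from_root(e1), path_from_root(e2)):
        if a is not b:
            break
        common = a
    return common


# Made-up sample: "b" and "c" share <hi>; "a" and "c" only share <p>.
root = etree.fromstring("<p><w>a</w><hi><w>b</w><w>c</w></hi><w>d</w></p>")
a, b, c, d = root.iter("w")
print(deepest_common_ancestor(b, c).tag)  # prints: hi
print(deepest_common_ancestor(a, c).tag)  # prints: p
```

Note that if one element is an ancestor of the other, this sketch returns that element itself.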

Dieter Maurer