New subject: [lxml-dev] xpath on text nodes

May 12, 2009

Hi,

Jamie Norrish wrote:
...
I've included at the end of this message an example of the XML I'm
operating over, where the aim is to get a rough number of characters of
textual content preceding and following a name or rs element. Given the
highly multiform nature of the markup, I *think* that the simplest way
of going about this is to go from text node to text node, forward and
back, accumulating the text as it goes, and stopping once a certain
amount has been reached.

The way I'm currently doing this is by simply selecting a certain number
of text nodes preceding and following the name or rs element
(name_node.xpath('following::text()[position()<15]'), for example), and
iterating through those and stopping when the right amount of text has
been accumulated. Obviously this has the problem that too many or too
few text nodes (in the XPath result sense) may be selected, which is
either inefficient or leads to too little context.

Selecting an ancestor and then splitting the textual content of that
isn't, I think, a better option, given the nature of the XML I'm dealing
with. A name/rs element may be at almost any level of the tree, and its
textual content may well be repeated multiple times within any given
chunk.

Ok, I now see where you are coming from. Something like the above XPath
expression or the respective lxml.etree API code would have been my first
attempt, too. I actually doubt that you can do much better in this case.
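
For illustration, the approach described above could be sketched roughly
as follows. The helper name text_after() and its parameters are made up
for this example; only xpath() and the following:: axis come from the
thread. The node count is passed in as an XPath variable.

```python
from lxml import etree


def text_after(node, limit=50, max_nodes=15):
    """Return up to `limit` characters of text following `node`.

    Hypothetical helper: it over-selects text nodes with XPath and stops
    accumulating once enough characters have been collected.
    """
    pieces, total = [], 0
    # Keyword arguments to xpath() become XPath variables ($n here).
    for text in node.xpath('following::text()[position() < $n]', n=max_nodes):
        pieces.append(text)
        total += len(text)
        if total >= limit:
            break
    return ''.join(pieces)[:limit]


doc = etree.fromstring(
    '<p>Love to <name>Peter</name>, hoping he is <lb/>well.</p>')
print(repr(text_after(doc.find('name'), limit=20)))
```

If the selected text nodes do not hold `limit` characters between them,
the result is simply shorter, which is the "too little context" case
mentioned above.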

It's actually a more general problem. Imagine you select a text node that
has a certain length and contains the found text multiple times. How would
you find a good context here? Is it the context of the first occurrence,
which may include a lot of preceding text but not the last occurrence
within the text node itself (if it is long enough) - or is it the last
occurrence that is interesting here, with all the text that follows the
matching text node?

So the underlying problem is independent of the API you use. It's rather
that substrings do not map neatly onto the granularity of text nodes.
...
I totally understand that it's problematic to change lxml to have a
different model for text, and I'm either going to continue with my
current method, or else use a modified form of my ideal solution, which
is to get the parent element of the text, and then use XPath again to
get the appropriate next text node in the sequence from that. This is a
little more cumbersome than I'd like, obviously, since the expression
changes not just by the direction of the context (preceding or
following) but also whether the current text is the text or tail of the
element. I'd have to run some tests to see whether the extra processing
slowed things down too much - this process is one that operates over
(often) thousands of name elements within each of over a thousand
documents.
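
A rough sketch of that "modified" idea, assuming the text values come
from lxml's XPath string results ("smart strings"), which carry
getparent(), is_text and is_tail. The function name next_text() and the
two expressions are illustrative guesses, not taken from the original
code.

```python
from lxml import etree


def next_text(text):
    """Return the text node after `text` in document order, or None.

    `text` is assumed to be a string returned by an lxml XPath call.
    """
    owner = text.getparent()
    if text.is_tail:
        # For tail text, getparent() is the element the tail belongs to.
        # Its following:: axis starts with that tail, so take the 2nd node.
        expr = 'following::text()[2]'
    else:
        # For element text, the node itself is the first descendant text
        # node, so the next one is the 2nd in the combined set.
        expr = '(descendant::text() | following::text())[2]'
    result = owner.xpath(expr)
    return result[0] if result else None


root = etree.fromstring('<p>one<b>two</b>three<i/>four</p>')
texts = root.xpath('//text()')
print([next_text(t) for t in texts])
```

The preceding direction would need its own pair of expressions, which is
the extra bookkeeping described above.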
Maybe you should try the same thing without XPath, just using the API.
XPath is fast when you are very selective or when you grab the aggregated
text content of an element. It's less great when you do things iteratively.
The API based algorithm may not even be that complex as you can use tree
iteration and stuff. (Did I mention that readability counts? :)
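
One possible shape of such an API-based version, as a sketch only: it
walks forward from the element using getparent(), itersiblings() and
itertext(), and stops as soon as enough text has been collected. The
function names are made up for the example.

```python
from lxml import etree


def iter_text_after(el):
    """Lazily yield the text chunks that follow `el` in document order."""
    while el is not None:
        if el.tail:
            yield el.tail
        for sibling in el.itersiblings():
            # itertext() covers the sibling's subtree but not its own tail.
            yield from sibling.itertext()
            if sibling.tail:
                yield sibling.tail
        # Siblings exhausted: continue with what follows the parent.
        el = el.getparent()


def context_after(el, limit=50):
    """Accumulate following text until `limit` characters are reached."""
    pieces, total = [], 0
    for chunk in iter_text_after(el):
        pieces.append(chunk)
        total += len(chunk)
        if total >= limit:
            break
    return ''.join(pieces)[:limit]


doc = etree.fromstring(
    '<div><p>Love to <name>Peter</name>, hoping <lb/>he is well.</p>'
    '<p>Next <b>para</b>.</p></div>')
name = doc.find('.//name')
print(repr(context_after(name, limit=8)))
```

Because the generator is lazy, no more of the tree is visited than is
needed to reach the limit.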
...
(The point of getting this context is to give people some idea of who a
name element might be referring to, for when it is being keyed to an
entity in our authority control system. So the markup doesn't matter
particularly, but the textual content does.)

Stefan Behnel wrote:
...
I still do not have a clear idea of what you consider "text context"
actually. Does that take the tree structure into account (e.g. only within
a certain parent element), or is it just any text content that precedes the
XPath result in reverse document order, wherever it occurs in the tree?

Just any, though there are some cases where the markup could be used to
usefully limit the context (so, for example, the name may occur within a
bibliographic entry in a list of citations, and it's unlikely that any
textual content from before or after that entry will be relevant). That's
typically going to be the exception, however; even staying within a
paragraph element is not necessarily helpful (named things are often
introduced at the end of a paragraph and given more context in the
following paragraph, for example).

This sounds like your algorithm is already more complex than a simple "any
text node preceding the one that matches". That convinces me that an API
based solution will be a lot more flexible than anything you could scratch
out of XPath. It would allow you to special case certain tag types, for
example, or to notice when you cross parent boundaries.
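
As a sketch of what that flexibility might look like: the variant below
skips the text of some elements and stops when it would leave an
enclosing boundary element. The tag sets are assumptions chosen for
illustration (abbr, orig and del appear in the sample further down; bibl
is a guess at how a bibliographic entry might be marked up), and real
documents may need namespace-qualified tag names.

```python
from lxml import etree

SKIP = frozenset({'abbr', 'orig', 'del'})  # assumed: text to leave out
BOUNDARY = frozenset({'bibl'})             # assumed: do not leave these


def subtree_text(el, skip=SKIP):
    """Yield the text inside `el`, ignoring skipped elements."""
    # Comments and processing instructions have a non-string tag.
    if not isinstance(el.tag, str) or el.tag in skip:
        return
    if el.text:
        yield el.text
    for child in el:
        yield from subtree_text(child, skip)
        if child.tail:  # the tail lies outside the child, so keep it
            yield child.tail


def iter_text_after(el, skip=SKIP, boundary=BOUNDARY):
    """Yield text following `el`, stopping at a boundary ancestor."""
    while el is not None and el.tag not in boundary:
        if el.tail:
            yield el.tail
        for sibling in el.itersiblings():
            yield from subtree_text(sibling, skip)
            if sibling.tail:
                yield sibling.tail
        el = el.getparent()


doc = etree.fromstring(
    '<list><bibl>A. Author, <name>Cook</name>, <choice><abbr>ed.</abbr>'
    '<expan>editor</expan></choice> 1955.</bibl>'
    '<bibl>Other entry</bibl></list>')
name = doc.find('.//name')
print(repr(''.join(iter_text_after(name))))
```

Here the abbreviation is dropped in favour of its expansion, and the
text of the second entry is never reached.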
...
Here's the example of a small piece of a document, in case it helps.

I'll leave it in the reply, just in case others have ideas, too.
...
But really, I'm happy enough with the way lxml works (it's great software -
thank you and everyone else who has made it what it is!). Not being
familiar with its inner workings I didn't know whether it would be
feasible or practical to add XPath to text results. Now I know, and I'll
continue on without complaint.

:)

Stefan
...
<lb/>give my love to everybody including <name
key="name-110011" type="person">Peter</name>, hoping he is
          <lb/>finding his way around the house better now, &amp; that
this
    <lb/>
    <pb xml:id="n12" n="12" corresp="#JCB-001l"/>
    finds you as it leaves me, in the best of health &amp; very
    <lb/>much in love with you.
  </p>
        <closer>
          <salute><choice><abbr>Yr</abbr><expan>Your</expan></choice>
<choice><abbr>affect.</abbr><expan>affectionate</expan></choice> son
    </salute>
          <lb/>
          <signed>
            <name key="name-207379" type="person">J.C. Ulysses
Beaglehole</name>
          </signed>
          <seg type="postscript">P.S. You might tell yourself, <name
key="name-110417" type="person">Auntie</name> &amp; <name
key="name-034628" type="person">Christine</name>, that
      <lb/>I have struck nobody yet with so swish a
<choice><orig>dressing-<lb/>gown</orig><reg>dressing-gown</reg></choice>
as mine.
      <lb/>I had now better get on to some other letters
      <lb/>of thanks, greeting, business, etc.</seg>
          <signed><name key="name-207379"
type="person">J.</name></signed>
          <seg type="postscript">P.P.S. You might send me the date of
Auntie <unclear>Sis'</unclear>
      <lb/>birthday. I hope Auntie's had a fitting celebration.</seg>
          <signed><name key="name-207379" type="person">J.</name>
      <lb/>
    </signed>
          <seg type="postscript">P.P.P.S. I have been writing all the
morning &amp; it is now
      <lb/>¼ to 1. If you pass the letter round it will save
      <lb/>much exhaustion to my dexter hand.</seg>
          <salute>
      <choice><abbr>Yrs</abbr><expan>Yours</expan></choice>
      <del>finally</del> penultimately,
      <lb/>
    </salute>
          <signed><name key="name-207379"
type="person">J.C.B.</name></signed>
