Re: [lxml-dev] xpath on text nodes
Hi,
Jamie Norrish wrote:
I've included at the end of this message an example of the XML I'm operating over, where the aim is to get a rough number of characters of textual content preceding and following a name or rs element. Given the highly multiform nature of the markup, I *think* that the simplest way of going about this is to go from text node to text node, forward and back, accumulating the text as it goes, and stopping once a certain amount has been reached.
The way I'm currently doing this is by simply selecting a certain number of text nodes preceding and following the name or rs element (name_node.xpath('following::text()[position()<15]'), for example), and iterating through those and stopping when the right amount of text has been accumulated. Obviously this has the problem that too many or too few text nodes (in the XPath result sense) may be selected, which is either inefficient or leads to too little context.
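For readers following along, here is a minimal sketch of the approach described above. It requires lxml; the helper name `following_context`, the `max_nodes` parameter and the `SAMPLE` snippet are illustrative assumptions, not code from the original message.

```python
from lxml import etree

# Hypothetical sample, loosely modelled on the letter quoted later in the thread.
SAMPLE = (
    "<p>give my love to everybody including "
    "<name key='n1'>Peter</name>, hoping he is <lb/>finding his way "
    "around the house better now</p>"
)


def following_context(node, limit, max_nodes=15):
    """Collect up to `limit` characters from the text nodes after `node`."""
    parts, total = [], 0
    # position() < max_nodes selects at most max_nodes - 1 text nodes,
    # so the guess for max_nodes can still be too small or too large.
    for text in node.xpath("following::text()[position() < $n]", n=max_nodes):
        parts.append(text)
        total += len(text)
        if total >= limit:
            break
    return "".join(parts)[:limit]


root = etree.fromstring(SAMPLE)
name = root.find("name")
print(following_context(name, 20))
```

The fixed node count is the weak point: the XPath engine materialises every selected text node before the loop can stop, and a run of short nodes can still leave the context shorter than `limit`.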
Selecting an ancestor and then splitting the textual content of that isn't, I think, a better option, given the nature of the XML I'm dealing with. A name/rs element may be at almost any level of the tree, and its textual content may well be repeated multiple times within any given chunk.
Ok, I now see where you are coming from. Something like the above XPath expression or the respective lxml.etree API code would have been my first attempt, too. I actually doubt that you can do much better in this case. It's actually a more general problem. Imagine you select a text node that has a certain length and contains the found text multiple times. How would you find a good context here? Is it the context of the first occurrence, which may include a lot of preceding text but not the last occurrence within the text node itself (if it is long enough) - or is it the last occurrence that is interesting here, with all the text that follows the matching text node? So the underlying problem is even independent of the API you use, it's more that substrings do not match nicely with the granularity of a text node.
I totally understand that it's problematic to change lxml to have a different model for text, and I'm either going to continue with my current method, or else use a modified form of my ideal solution, which is to get the parent element of the text, and then use XPath again to get the appropriate next text node in the sequence from that. This is a little more cumbersome than I'd like, obviously, since the expression changes not just by the direction of the context (preceding or following) but also whether the current text is the text or tail of the element. I'd have to run some tests to see whether the extra processing slowed things down too much - this process is one that operates over (often) thousands of name elements within each of over a thousand documents.
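A rough sketch of that "parent element plus a second XPath" idea, assuming lxml's default smart-string results (which provide `getparent()`, `is_text` and `is_tail`). The helper names `next_text` and `walk_forward` are made up for illustration, and the sketch assumes each text or tail is a single text node.

```python
from lxml import etree


def next_text(text_node):
    """Return the text node after `text_node` in document order, or None."""
    parent = text_node.getparent()
    if text_node.is_text:
        # Text inside `parent`: look deeper first, then past its end tag.
        found = (parent.xpath("descendant::text()[2]")
                 or parent.xpath("following::text()[1]"))
    else:
        # Tail of `parent`: the tail is itself the first following text node,
        # so the next one is the second.
        found = parent.xpath("following::text()[2]")
    return found[0] if found else None


def walk_forward(text_node):
    """Yield `text_node` and every later text node, one XPath call per step."""
    while text_node is not None:
        yield text_node
        text_node = next_text(text_node)


doc = etree.fromstring("<p>one<a>two</a>three<b/>four</p>")
first = doc.xpath("//text()")[0]
print([str(t) for t in walk_forward(first)])
```

The backward direction would need its own pair of expressions on the `preceding` axis, which is the extra bookkeeping mentioned above.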
Maybe you should try the same thing without XPath, just using the API. XPath is fast when you are very selective or when you grab the aggregated text content of an element. It's less great when you do things iteratively. The API based algorithm may not even be that complex as you can use tree iteration and stuff. (Did I mention that readability counts? :)
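One possible shape for such an API-based version, as a sketch only: walk the tree once, record every `.text` and `.tail` in document order, and remember where each element's content starts and ends. The names `index_text`, `gather` and `context_around` are hypothetical. The calls used here also exist in the standard library's ElementTree, hence the fallback import.

```python
try:
    from lxml import etree
except ImportError:  # the same calls exist in the standard library
    import xml.etree.ElementTree as etree


def index_text(root):
    """Return (chunks, spans): all text in document order, plus for each
    element the slice of `chunks` that its own content occupies."""
    chunks, spans = [], {}

    def visit(elem):
        start = len(chunks)
        # Comments and processing instructions have a non-string tag in lxml.
        if isinstance(elem.tag, str) and elem.text:
            chunks.append(elem.text)
        for child in elem:
            visit(child)
            if child.tail:
                chunks.append(child.tail)
        spans[elem] = (start, len(chunks))

    visit(root)
    return chunks, spans


def gather(chunks, indices, limit):
    """Collect chunks along `indices` until `limit` characters are reached."""
    parts, size = [], 0
    for i in indices:
        parts.append(chunks[i])
        size += len(chunks[i])
        if size >= limit:
            break
    return parts


def context_around(chunks, spans, target, limit):
    """Return (before, after): up to `limit` characters on each side (limit > 0)."""
    start, end = spans[target]
    before = gather(chunks, range(start - 1, -1, -1), limit)
    after = gather(chunks, range(end, len(chunks)), limit)
    return "".join(reversed(before))[-limit:], "".join(after)[:limit]


doc = etree.fromstring(
    "<p>give my love to <name>Peter</name>, hoping <lb/>he is well</p>")
chunks, spans = index_text(doc)
print(context_around(chunks, spans, doc.find("name"), 12))
```

The index is built once per document, so each of the thousands of name elements costs only a short walk over neighbouring chunks.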
(The point of getting this context is to give people some idea of who a name element might be referring to, for when it is being keyed to an entity in our authority control system. So the markup doesn't matter particularly, but the textual content does.)
Stefan Behnel wrote:
I still do not have a clear idea of what you consider "text context" actually. Does that take the tree structure into account (e.g. only within a certain parent element), or is it just any text content that precedes the XPath result in reverse document order, wherever it occurs in the tree?
Just any, though there are some cases where the markup could be used to usefully limit the context (so, for example, the name may occur within a bibliographic entry in a list of citations, and it's unlikely that any textual content from before or after that entry will be relevant). That's typically going to be the exception, however; even staying within a paragraph element is not necessarily helpful (named things are often introduced at the end of a paragraph and given more context in the following paragraph, for example).
This sounds like your algorithm is already more complex than a simple "any text node preceding the one that matches". That convinces me that an API based solution will be a lot more flexible than anything you could scratch out of XPath. It would allow you to special case certain tag types, for example, or to notice when you cross parent boundaries.
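To illustrate what special casing might look like, here is a sketch of a tree walk that skips some elements and marks others as context boundaries. The tag sets are assumptions chosen to fit the sample markup (`abbr`, `orig` and `del` skipped, `bibl` as a boundary), the names `walk` and `segments` are hypothetical, and a namespaced document would need qualified tag names.

```python
try:
    from lxml import etree
except ImportError:  # the same calls exist in the standard library
    import xml.etree.ElementTree as etree

SKIP = {"abbr", "orig", "del"}   # assumption: keep only expan/reg readings
BOUNDARY = {"bibl"}              # assumption: context must not leave a bibl
STOP = object()                  # marker yielded when a boundary is crossed


def walk(elem):
    """Yield text in document order, with STOP at the edges of boundary tags."""
    if not isinstance(elem.tag, str) or elem.tag in SKIP:
        return  # the caller still yields this element's tail
    fenced = elem.tag in BOUNDARY
    if fenced:
        yield STOP
    if elem.text:
        yield elem.text
    for child in elem:
        yield from walk(child)
        if child.tail:
            yield child.tail
    if fenced:
        yield STOP


def segments(root):
    """Join the walked text into runs separated by boundary markers."""
    out, current = [], []
    for item in walk(root):
        if item is STOP:
            if current:
                out.append("".join(current))
                current = []
        else:
            current.append(item)
    if current:
        out.append("".join(current))
    return out


doc = etree.fromstring(
    "<list><bibl>Smith, <name>J. Beaglehole</name> "
    "<choice><abbr>ed.</abbr><expan>editor</expan></choice></bibl>"
    "<bibl>Jones, another entry</bibl></list>")
print(segments(doc))
```

A context gatherer could stop accumulating as soon as it meets `STOP`, which keeps the text of one citation from leaking into the next.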
Here's the example of a small piece of a document, in case it helps.
I'll leave it in the reply, just in case others have ideas, too.
But really, I'm happy enough with the way lxml works (it's great software - thank you and everyone else who has made it what it is!). Not being familiar with its inner workings I didn't know whether it would be feasible or practical to add XPath to text results. Now I know, and I'll continue on without complaint.
:) Stefan
<lb/>give my love to everybody including <name key="name-110011" type="person">Peter</name>, hoping he is <lb/>finding his way around the house better now, &amp; that this <lb/> <pb xml:id="n12" n="12" corresp="#JCB-001l"/> finds you as it leaves me, in the best of health &amp; very <lb/>much in love with you. </p> <closer> <salute><choice><abbr>Yr</abbr><expan>Your</expan></choice> <choice><abbr>affect.</abbr><expan>affectionate</expan></choice> son </salute> <lb/> <signed> <name key="name-207379" type="person">J.C. Ulysses Beaglehole</name> </signed> <seg type="postscript">P.S. You might tell yourself, <name key="name-110417" type="person">Auntie</name> &amp; <name key="name-034628" type="person">Christine</name>, that <lb/>I have struck nobody yet with so swish a
<choice><orig>dressing-<lb/>gown</orig><reg>dressing-gown</reg></choice> as mine. <lb/>I had now better get on to some other letters <lb/>of thanks, greeting, business, etc.</seg> <signed><name key="name-207379" type="person">J.</name></signed> <seg type="postscript">P.P.S. You might send me the date of Auntie <unclear>Sis'</unclear> <lb/>birthday. I hope Auntie's had a fitting celebration.</seg> <signed><name key="name-207379" type="person">J.</name> <lb/> </signed> <seg type="postscript">P.P.P.S. I have been writing all the morning &amp; it is now <lb/>¼ to 1. If you pass the letter round it will save <lb/>much exhaustion to my dexter hand.</seg> <salute> <choice><abbr>Yrs</abbr><expan>Yours</expan></choice> <del>finally</del> penultimately, <lb/> </salute> <signed><name key="name-207379" type="person">J.C.B.</name></signed>
Stefan Behnel wrote:
This sounds like your algorithm is already more complex than a simple "any text node preceding the one that matches". That convinces me that an API based solution will be a lot more flexible than anything you could scratch out of XPath. It would allow you to special case certain tag types, for example, or to notice when you cross parent boundaries.
Well, the original plan didn't really call for much special casing of particular elements, but now that things are working I'll likely add such cases as I think of them.
I've changed the approach completely, to use XSLT to transform the entire document into something that has the context handled appropriately (and using XPath on text nodes :). It takes two transformations (the second one to handle ordering issues with the preceding context, and to do a little cleanup of whitespace), but it is more than an order of magnitude faster than what I had before. I'm not sure why I didn't go down that route in the first place, but now that I have I'm very happy. And of course it's great that XSLT is so easy to use with lxml.
Oh, I also tried using .getparent() and some logic to get the equivalent of preceding::text()[1] and following::text()[1], but it turned out (not surprisingly, given the complexities of that approach) to be marginally slower than what I had.
Jamie
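The stylesheets themselves were not posted, so the following is only a guess at the general shape: a single XSLT 1.0 transformation, run through lxml, that walks `following::text()[1]` recursively and emits up to `limit` characters after each name. It covers the following context only; the element names `contexts` and `context`, the `limit` parameter and the Python helper `following_contexts` are all assumptions made for this sketch.

```python
from lxml import etree

XSLT_SOURCE = """\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="limit" select="20"/>

  <xsl:template match="/">
    <contexts>
      <xsl:for-each select="//name">
        <context key="{@key}">
          <xsl:call-template name="forward">
            <xsl:with-param name="node" select="following::text()[1]"/>
            <xsl:with-param name="left" select="$limit"/>
          </xsl:call-template>
        </context>
      </xsl:for-each>
    </contexts>
  </xsl:template>

  <!-- Emit text node by text node until `left` characters are used up. -->
  <xsl:template name="forward">
    <xsl:param name="node"/>
    <xsl:param name="left"/>
    <xsl:if test="$node and $left &gt; 0">
      <xsl:value-of select="substring($node, 1, $left)"/>
      <xsl:call-template name="forward">
        <xsl:with-param name="node" select="$node/following::text()[1]"/>
        <xsl:with-param name="left" select="$left - string-length($node)"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
"""

transform = etree.XSLT(etree.fromstring(XSLT_SOURCE))


def following_contexts(doc, limit=20):
    """Map each name's key attribute to the text that follows the name."""
    # Stylesheet parameters are passed to lxml as XPath expressions.
    result = transform(doc, limit=str(limit))
    return {c.get("key"): c.text for c in result.getroot()}


doc = etree.fromstring(
    "<p>give my love to everybody including "
    "<name key='n1'>Peter</name>, hoping he is <lb/>finding his way "
    "around the house better now</p>")
print(following_contexts(doc))
```

A mirrored template on the `preceding` axis would produce its chunks nearest-first, so the preceding context would need reordering in a later step.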
participants (2)
- Jamie Norrish
- Stefan Behnel