Re: [lxml-dev] xpath on text nodes

May 9, 2009

Hi,

Jamie Norrish wrote:
> On Thu, 2009-04-30 at 09:42 +0200, Stefan Behnel wrote:
>> It would be rarely used, I'd say. What sort of interesting XPath queries
>> could you possibly do on a node that doesn't have any children,
>> attributes, tag name or namespace?
> Besides selecting other nodes and values relative to the text? Yes, it
> is possible to use text_result.getparent() and proceed from there - but
> this has the downside of requiring, for some XPath expressions, the code
> to modify the expression based on whether text_result was the text or
> tail of its parent, which is annoying.

Ok, I do see your use case, although I still don't know what your
selections look like in practice. If you want a more predictable XPath
result, maybe it would make sense to select the surrounding element instead
of the plain text content.
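
To make the inconvenience concrete, here is a small sketch of how a text
result reports where it came from. It assumes an lxml version whose XPath
string results provide getparent(), is_text and is_tail; the sample document
and the helper name structural_parent() are made up for illustration.

```python
from lxml import etree

root = etree.XML("<p>Hello <b>bold</b> and tail</p>")
hello, bold, tail = root.xpath("//text()")


def structural_parent(text_result):
    """Return the element that encloses the text in the document.

    For tail text, getparent() is the element the tail is attached to
    (its preceding sibling), so one more step up is needed.
    """
    holder = text_result.getparent()
    return holder.getparent() if text_result.is_tail else holder


print(hello.is_text, hello.getparent().tag)
print(tail.is_tail, tail.getparent().tag)
print(structural_parent(tail).tag)
```

The is_tail branch is exactly the case distinction that a follow-up
expression would otherwise have to make itself.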

As I said, lxml.etree does not have a representation for text nodes. So by
adding an xpath() method to text results, you'd end up with a rather
fragile setup that might crash when you replace the text of a node, just
because an XPath text result is still holding a reference to a now-dead
text node, for example. So it's not just adding a method, it's more like
rethinking concepts inside lxml.etree. I'm pretty sure this use case is not
worth going there - especially since it's nothing that can't be done today,
but rather an inconvenience.
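
As a rough illustration of what "no representation for text nodes" means at
the API level, the following sketch shows that text is only reachable as
string attributes of elements. The tiny document is an assumption made for
the example.

```python
from lxml import etree

root = etree.XML("<a>x<b/>y</a>")
b = root[0]

print(len(root))        # only <b/> counts as a child, the text does not
print(repr(root.text))  # leading text is stored on <a>
print(repr(b.tail))     # trailing text is stored on <b/> as its tail

# An XPath text result is a string value, not a handle on a tree node.
results = root.xpath("text()")
root.text = "replaced"
print(repr(results[0]), repr(root.text))
```

The earlier result keeps its old value after the assignment, since it is a
string and not a live reference into the tree.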
>> Also, XPath queries can return Elements and (special) strings, but
>> also plain numbers and boolean values.
>> So you'd still not have a common interface for all possible result types.
> Well, I'm not really asking for a common interface - only that XPath be
> enabled for the results of an XPath expression for text(). This would
> bring it into line with XSLT behaviour, for one.

Well, XSLT is a different language with a different tree model.
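
For reference, the different result types mentioned in the quoted text can
be seen with a few expressions. This is only a sketch against a made-up
document, not a complete list of what xpath() may return.

```python
from lxml import etree

root = etree.XML("<p>Hello <b>bold</b> tail</p>")

elements = root.xpath("//b")        # list of elements
texts = root.xpath("//b/text()")    # list of string results
count = root.xpath("count(//b)")    # float
has_b = root.xpath("boolean(//b)")  # bool
joined = root.xpath("string(//b)")  # string constructed by XPath

for value in (elements, texts, count, has_b, joined):
    print(type(value).__name__)

# A string built by an XPath function has no origin in the tree.
print(joined.getparent())
```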
> About using iterwalk: this wouldn't seem (on a quick perusal of the
> documentation) to easily allow for me to get the preceding context of
> the text result, unless I picked some arbitrary earlier element as the
> starting point. What am I missing?

I guess I misjudged your use case when you first described it. iterwalk()
will not allow you to access the text context preceding an element, only
the text content of the element itself.
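
A minimal sketch of what iterwalk() hands out, on an invented document: each
step is an (event, element) pair, and the text available at that point is
the element's own.

```python
from lxml import etree

root = etree.XML("<doc><p>one</p><p>two <b>three</b> four</p></doc>")

# Collect the tag and leading text of each element as it is entered.
seen = [
    (element.tag, element.text)
    for event, element in etree.iterwalk(root, events=("start",))
]

for tag, text in seen:
    print(tag, repr(text))
```

Nothing in the pair for <b> points back at "two " or "one"; getting at those
means navigating the tree from the element.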

I still do not have a clear idea of what you actually consider "text
context". Does that take the tree structure into account (e.g. only within
a certain parent element), or is it just any text content that precedes the
XPath result in reverse document order, wherever it occurs in the tree?

What about just stepping up parent by parent until the contained text
content is long enough? Or, if it's too long, split it by the substring
that XPath found, and strip the left and right part...
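
The stepping-up idea could look roughly like the sketch below. The function
name text_context() and the min_chars parameter are hypothetical, and it
assumes the result comes from a text() selection, so that getparent() does
not return None.

```python
from lxml import etree


def text_context(text_result, min_chars):
    """Climb from a text result until the enclosing text is long enough,
    then split that text around the matched substring."""
    element = text_result.getparent()
    if text_result.is_tail:
        # Tail text lies outside the element it is attached to.
        element = element.getparent()
    content = "".join(element.itertext())
    while len(content) < min_chars and element.getparent() is not None:
        element = element.getparent()
        content = "".join(element.itertext())
    # partition() splits at the first occurrence, which is only a guess
    # if the same substring also appears earlier in the collected text.
    before, _, after = content.partition(text_result)
    return before.strip(), after.strip()


root = etree.XML(
    "<doc><sec><p>alpha <b>beta</b> gamma</p><p>delta</p></sec></doc>"
)
hit = root.xpath("//b/text()")[0]
print(text_context(hit, 10))
print(text_context(hit, 20))
```

Whether climbing should stop at some particular ancestor depends on the
answer to the question above about what counts as context.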

Stefan