Mailman 3 [lxml-dev] xpath on text nodes - lxml - The Python XML Toolkit


[lxml-dev] xpath on text nodes

Jamie Norrish

April 29, 2009

1:44 a.m.

The xpath method is currently available only for ElementTree and Element objects. Is it possible for it to be available to text nodes also? My current use case is getting a certain length text context for a particular element node, and I'd like to implement that through a recursive call to a function that returns the content of a supplied text node appended to the content of the next text node in sequence (provided the required length has not been passed). Jamie


Stefan Behnel

April 2009

8:24 a.m.

Hi, Jamie Norrish wrote:

...

The xpath method is currently available only for ElementTree and Element objects. Is it possible for it to be available to text nodes also?

There is no such concept as a text node in lxml.etree.

...

My current use case is getting a certain length text context for a particular element node, and I'd like to implement that through a recursive call to a function that returns the content of a supplied text node appended to the content of the next text node in sequence (provided the required length has not been passed).

That sounds a lot like you should do that in Python by using iterwalk() and collecting .text and .tail attributes of Elements, not by using XPath. Stefan

Jamie Norrish

9:30 p.m.

Hi,

...

There is no such concept as a text node in lxml.etree.

Okay, but the string results of an XPath selecting text nodes in the XML have additional attributes - it just seems a pity that an xpath method isn't one of them.

...

That sounds a lot like you should do that in Python by using iterwalk() and collecting .text and .tail attributes of Elements, not by using XPath.

Well, I like XPath. :) In fact I already have an implementation of the use case that, while slightly suboptimal, is sufficient - it just seemed like one obvious way of doing it better was to use XPath. I shall investigate using iterwalk instead. Thanks! Jamie

Stefan Behnel

12:42 a.m.

Jamie Norrish wrote:

...

...
There is no such concept as a text node in lxml.etree.

Okay, but the string results of an XPath selecting text nodes in the XML have additional attributes - it just seems a pity that an xpath method isn't one of them.

It would be rarely used, I'd say. What sort of interesting XPath queries could you possibly do on a node that doesn't have any children, nor attributes, nor a tag name or namespace? Also, XPath queries can return Elements and (special) strings, but also plain numbers and boolean values. So you'd still not have a common interface for all possible result types.
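
To illustrate the point about result types, here is a small sketch. The sample document and the variable names are made up for this example; only xpath() and the getparent() method of string results are lxml features.

```python
from lxml import etree

# Hypothetical sample document, for illustration only.
root = etree.fromstring("<r><a>x</a><a>y</a></r>")

elements = root.xpath("//a")        # a list of Elements
texts = root.xpath("//a/text()")    # a list of "smart" strings
count = root.xpath("count(//a)")    # a plain float
flag = root.xpath("boolean(//a)")   # a plain bool

# Only the smart strings know where they came from.
first_parent = texts[0].getparent()
```

The float and bool results carry no link back to the tree, so an xpath() method on string results would still leave the result types without a shared interface.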

...

...
That sounds a lot like you should do that in Python by using iterwalk() and collecting .text and .tail attributes of Elements, not by using XPath.

Well, I like XPath. :) In fact I already have an implementation of the use case that, while slightly suboptimal, is sufficient - it just seemed like one obvious way of doing it better was to use XPath. I shall investigate using iterwalk instead.

This should basically be a no-brainer with iterwalk(). You iterate over start and end events and just collect the .text values on start and the .tail values on end. Put them in a list, count the total character length on the way, break when it's long enough and ''.join() the list. Stefan
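
The recipe above could look roughly like the following. This is only a sketch: the function name collect_text and its min_length parameter are invented here, and comments and processing instructions are not considered.

```python
from lxml import etree

def collect_text(element, min_length):
    """Gather text inside `element` in document order until at least
    `min_length` characters have been collected (hypothetical helper)."""
    parts = []
    total = 0
    for event, el in etree.iterwalk(element, events=("start", "end")):
        if event == "start":
            chunk = el.text
        elif el is not element:
            chunk = el.tail
        else:
            chunk = None  # the tail of the starting element lies outside it
        if chunk:
            parts.append(chunk)
            total += len(chunk)
            if total >= min_length:
                break
    return "".join(parts)

# Hypothetical sample document.
sample = etree.fromstring("<a>one <b>two</b> three <c>four</c> five</a>")
short = collect_text(sample, 5)
everything = collect_text(sample, 1000)
```

The result is not truncated to exactly min_length characters; it stops after the first chunk that reaches the limit, so slice the returned string if a hard limit is needed.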

Jamie Norrish

1:07 p.m.

On Thu, 2009-04-30 at 09:42 +0200, Stefan Behnel wrote:

...

It would be rarely used, I'd say. What sort of interesting XPath queries could you possibly do on a node that doesn't have any children, nor attributes, nor a tag name or namespace?

Besides selecting other nodes and values relative to the text? Yes, it is possible to use text_result.getparent() and proceed from there - but this has the downside of requiring, for some XPath expressions, the code to modify the expression based on whether text_result was the text or tail of its parent, which is annoying.

...

Also, XPath queries can return Elements and (special) strings, but also plain numbers and boolean values. So you'd still not have a common interface for all possible result types.

Well, I'm not really asking for a common interface - only that XPath be enabled for the results of an XPath expression for text(). This would bring it into line with XSLT behaviour, for one. However, I accept that it's not going to be used often, and probably isn't worth you implementing for that reason. About using iterwalk: this wouldn't seem (on a quick perusal of the documentation) to easily allow for me to get the preceding context of the text result, unless I picked some arbitrary earlier element as the starting point. What am I missing? Jamie

Stefan Behnel

May 2009

2:30 a.m.

Hi, Jamie Norrish wrote:

...

On Thu, 2009-04-30 at 09:42 +0200, Stefan Behnel wrote:

...
It would be rarely used, I'd say. What sort of interesting XPath queries could you possibly do on a node that doesn't have any children, nor attributes, nor a tag name or namespace?

Besides selecting other nodes and values relative to the text? Yes, it is possible to use text_result.getparent() and proceed from there - but this has the downside of requiring, for some XPath expressions, the code to modify the expression based on whether text_result was the text or tail of its parent, which is annoying.

Ok, I do see your use case, although I still don't know what your selections look like in practice. If you want a more predictable XPath result, maybe it would make sense to select the surrounding element instead of the plain text content. As I said, lxml.etree does not have a representation for text nodes. So by adding an xpath() method to text results, you'd end up with a rather fragile setup that might crash when you replace the text of a node, just because an XPath text result is still holding a reference to a now-dead text node, for example. So it's not just adding a method, it's more like rethinking concepts inside lxml.etree. I'm pretty sure this use case is not worth going there - especially since it's nothing that can't be done today, but rather an inconvenience.
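
For reference, the text/tail inconvenience under discussion can be sketched like this. The helper next_element_after is hypothetical; getparent(), is_text and is_tail are what lxml provides on XPath string results. The expression has to change depending on whether the string was the .text or the .tail of its parent.

```python
from lxml import etree

def next_element_after(text_result):
    """Return the first element following an XPath text result,
    or None if there is none (hypothetical helper)."""
    parent = text_result.getparent()
    if text_result.is_text:
        # The string is parent.text, so the next element is the first child.
        found = parent.xpath("*[1]")
    elif text_result.is_tail:
        # The string is parent.tail, so look at the following siblings.
        found = parent.xpath("following-sibling::*[1]")
    else:
        found = []
    return found[0] if found else None

# Hypothetical sample document.
para = etree.fromstring("<p>Hello <b>bold</b> world<i>x</i></p>")
texts = para.xpath("//text()")
```

Here texts[0] is the .text of p, while texts[2] is the .tail of b, so getparent() returns b for it rather than p.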

...

...
Also, XPath queries can return Elements and (special) strings, but also plain numbers and boolean values. So you'd still not have a common interface for all possible result types.

Well, I'm not really asking for a common interface - only that XPath be enabled for the results of an XPath expression for text(). This would bring it into line with XSLT behaviour, for one.

Well, XSLT is a different language with a different tree model.

...

About using iterwalk: this wouldn't seem (on a quick perusal of the documentation) to easily allow for me to get the preceding context of the text result, unless I picked some arbitrary earlier element as the starting point. What am I missing?

I guess I misjudged your use case when you first described it. iterwalk() will not allow you to access the text context preceding an element, only the text content of the element itself. I still do not have a clear idea of what you consider "text context" actually. Does that take the tree structure into account (e.g. only within a certain parent element), or is it just any text content that precedes the XPath result in reverse document order, wherever it occurs in the tree? What about just stepping up parent by parent until the contained text content is long enough? Or, if it's too long, split it by the substring that XPath found, and strip the left and right part... Stefan
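
The "step up parent by parent, then split" idea might be sketched as follows. The name context_around and the width parameter are made up for this example, and it assumes that "text context" means any surrounding text regardless of tree structure. Note that partition() splits at the first occurrence of the string, which may not be the one XPath actually found if the same text appears earlier in the ancestor.

```python
from lxml import etree

def context_around(text_result, width):
    """Return (before, match, after), with up to `width` characters of
    context on each side of an XPath text result (hypothetical helper)."""
    node = text_result.getparent()
    if text_result.is_tail:
        # A tail is not part of its element's own content, so start one up.
        node = node.getparent()
    wanted = len(text_result) + 2 * width
    content = "".join(node.itertext())
    # Climb towards the root until enough text has been gathered.
    while len(content) < wanted and node.getparent() is not None:
        node = node.getparent()
        content = "".join(node.itertext())
    before, match, after = content.partition(text_result)
    return before[max(0, len(before) - width):], match, after[:width]

# Hypothetical sample document.
doc = etree.fromstring("<doc><p>alpha <b>beta</b> gamma</p><p>delta</p></doc>")
hit = doc.xpath("//b/text()")[0]
tail_hit = doc.xpath("//b/following-sibling::text()")[0]
```

Because the climb stops at the first ancestor with enough text in total, one side may still come out shorter than width when the match sits near the start or end of that ancestor.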
