Mailman 3 Re: [lxml-dev] problem\bug in xpath compare() with text in tail - lxml - The Python XML Toolkit - python.org


Re: [lxml-dev] problem\bug in xpath compare() with text in tail


Stefan Behnel

24 May 2008

11:55 a.m.

Hi,

please keep the list involved.

matan ninio wrote:

Is there some good place to look for information about XPath?

Search for "xpath tutorial"?

Stefan


John W. Shipman

24 May

12:46 p.m.


matan ninio wrote:

+--
| Is there some good place to look for information about XPath?
+--

If I might recommend my modest XSLT reference:

http://www.nmt.edu/tcc/help/pubs/xslt/

It has a section on XPath.

John Shipman (john@nmt.edu), Applications Specialist, NM Tech Computer Center,
Speare 119, Socorro, NM 87801, (505) 835-5950, http://www.nmt.edu/~john
``Let's go outside and commiserate with nature.'' --Dave Farber


Raymond Wiker

1:09 p.m.


On May 24, 2008, at 21:46 , John W. Shipman wrote:

matan ninio wrote:

+-- | Is there some good place to look for information about XPath? +--

If I might recommend my modest XSLT reference:

http://www.nmt.edu/tcc/help/pubs/xslt/

It has a section on XPath.

There's also some good stuff on http://www.zvon.org.


Matan Ninio

1:41 p.m.


Raymond Wiker <rwiker <at> gmail.com> writes:

On May 24, 2008, at 21:46 , John W. Shipman wrote:

...
matan ninio wrote:

+-- | Is there some good place to look for information about XPath? +--

If I might recommend my modest XSLT reference:

http://www.nmt.edu/tcc/help/pubs/xslt/

It has a section on XPath.

There's also some good stuff on http://www.zvon.org.

Thanks for the links. I have already read several of them, including the very nice one at zvon.org mentioned above, but I have yet to find the bit of information I'm missing.

Why does the behavior of "text()" change to exclude tail elements when moving from "//text()" to "//*[contains(text(),'ABC')]"? What does the "text()" function *actually* do? I can see that if an element were to have more than one text value, the meaning of "contains(text()," may be unclear. But then why is the //text() version actually pulling out the tail elements?

This thread is somewhat off-topic. I am new to this list, so I really don't know whether it is considered acceptable to discuss such topics here. If not, I apologize and will take this elsewhere.

Thanks again,
Matan


Raymond Wiker

25 May

3:14 a.m.


On May 24, 2008, at 22:41 , Matan Ninio wrote:

Raymond Wiker <rwiker <at> gmail.com> writes:

...
On May 24, 2008, at 21:46 , John W. Shipman wrote:

...
matan ninio wrote:

+-- | Is there some good place to look for information about XPath? +--

If I might recommend my modest XSLT reference:

http://www.nmt.edu/tcc/help/pubs/xslt/

It has a section on XPath.

There's also some good stuff on http://www.zvon.org.

Thanks for the links. I have already read several of them, including the very nice one at zvon.org mentioned above, but I have yet to find the bit of information I'm missing.

Why does the behavior of "text()" change to exclude tail elements when moving from "//text()" to "//*[contains(text(),'ABC')]"? What does the "text()" function *actually* do? I can see that if an element were to have more than one text value, the meaning of "contains(text()," may be unclear. But then why is the //text() version actually pulling out the tail elements?

text() is a node test that matches XML text nodes; it is not a function that returns the concatenation of the text nodes under a specific element. Thus, //text() returns all text nodes in a tree. If you want all text nodes that contain the string "ABC", the correct test might be something like "//text()[contains(., 'ABC')]".
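As a rough illustration of the expression suggested above, here is a minimal sketch using lxml. The sample document is the one used elsewhere in this thread; the helper name find_text_nodes is made up for the example. It relies on lxml returning XPath text results as "smart" strings that expose getparent(), is_text and is_tail, so each hit can be traced back to the element whose .text or .tail it came from.

```python
from lxml import etree

# Sample document borrowed from the thread: "inbody" is body.text,
# "text" is h5.text, and "tail" is h5.tail.
root = etree.HTML("<html><body>inbody<h5>text</h5>tail</body></html>")


def find_text_nodes(tree, needle):
    """Return (string, owner tag, origin) for every text node containing needle.

    find_text_nodes is a hypothetical helper, not part of lxml.
    """
    # The needle is passed as an XPath variable to avoid quoting problems.
    hits = tree.xpath("//text()[contains(., $needle)]", needle=needle)
    return [
        # For tail text, getparent() is the element the tail belongs to.
        (str(hit), hit.getparent().tag, "tail" if hit.is_tail else "text")
        for hit in hits
    ]


for value, owner, origin in find_text_nodes(root, "tail"):
    print(f"{value!r} found in .{origin} of <{owner}>")
```

Because the predicate is applied to each text node individually, tail text is matched in the same way as leading text.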


Stefan Behnel

3:42 a.m.


Hi,

while XPath might be considered somewhat off-topic for ElementTree, I find your question about text() and .tail very on-topic for lxml. ElementTree does not expose the concept of a "text node" to Python space, so having them appear in XPath is somewhat ugly.

Also, note that the parser may decide to split long text content or content that contains entities into multiple text nodes, so "text()" is not even guaranteed to return a text node that contains the complete ".text" value of a node. That makes it a somewhat fragile concept in XPath.

If you want to test for .text and .tail reliably, it is easiest to do it in Python space. Look at the "siblings" example I gave in my first reply.

Note also that most XPath string functions can work on node content, so for example:

    //*[contains(., 'ABC')]

succeeds for any node where 'ABC' exists in the concatenated string value of the node and its children (but not in the .tail text of the node itself):

    >>> e = et.HTML("<html><body>inbody<h5>text</h5>tail</body></html>")
    >>> e.xpath("//*[contains(., 'text')]")
    [<Element html at b7789374>, <Element body at b77893c4>, <Element h5 at b7789414>]
    >>> e.xpath("//*[contains(., 'tail')]")
    [<Element html at b7789464>, <Element body at b778939c>]

Matan Ninio wrote:

Why does the behavior of "text()" change to exclude tail elements when moving from "//text()" to "//*[contains(text(),'ABC')]"? What does the "text()" function *actually* do?

"//text()" will get you /any/ text node in the tree, regardless of its position. "text()" is a node test that succeeds for all text nodes.

"//*[contains(text(),'ABC')]" will get you the element that has a text node as direct child that contains the string "ABC". However, apparently, this only works for the first text node:

    >>> e = et.HTML("<html><body>inbody<h5>text</h5>tail</body></html>")
    >>> e.xpath("//*[contains(text(), 'tail')]")
    []
    >>> e.xpath("//*[contains(text(), 'inbody')]")
    [<Element body at b7789324>]

Not sure if this is in line with the XPath spec - might be a problem in libxml2. Although:

I can see that if an element were to have more than one text value, the meaning of "contains(text()," may be unclear.

I would accept that as an explanation. :)

Stefan
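For what it is worth, the first-text-node behaviour appears to follow from XPath 1.0 itself: contains() converts its first argument to a string, and a node-set is converted by taking the string value of its first node in document order. The sketch below, which assumes lxml and reuses the sample document from this message, contrasts that with a nested predicate that tests each text child separately, and with a plain Python check of .text and .tail. The helper name elements_mentioning is invented for the illustration.

```python
from lxml import etree

root = etree.HTML("<html><body>inbody<h5>text</h5>tail</body></html>")

# contains(text(), ...) only sees the FIRST text child of each element,
# because the node-set is converted to a string before the comparison.
first_only = root.xpath("//*[contains(text(), 'tail')]")

# A nested predicate applies contains() to every text child in turn,
# so <body> is found via its second text node ("tail").
any_child = root.xpath("//*[text()[contains(., 'tail')]]")


def elements_mentioning(tree, needle):
    """Python-space check: report which of .text / .tail holds needle.

    elements_mentioning is a hypothetical helper, not part of lxml.
    """
    found = []
    for el in tree.iter():
        for where in ("text", "tail"):
            # .text and .tail are None when absent, hence the fallback.
            if needle in (getattr(el, where) or ""):
                found.append((el.tag, where))
    return found


print(first_only)
print([el.tag for el in any_child])
print(elements_mentioning(root, "tail"))
```

Note the difference in attribution: XPath treats the tail text as a child of <body>, while the ElementTree model used by lxml attaches it to <h5> as its .tail.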
