Re: [lxml-dev] Some XPath questions...

Mike Meyer wrote:
In <468579E3.7010802@colorstudy.com>, Ian Bicking <ianb@colorstudy.com> typed:
Thanks, very helpful. I'm guessing it was an oversight that you didn't copy the list...
I wasn't sure which way to go.
Without CC'ing, people won't know you've already answered my questions.
There's some other tricky ones I'm not sure about either, though they seem to be kind of working. Things like div:only-child (when it's a div
    //*[name() = 'div' and last() = 1]
This doesn't seem to be working for me:
    >>> xpath('span:only-child')
    *[name() = 'span' and (last() = 1)]
But testing with <div><span></span></div> in the document, I don't get anything returned.
Seems to work ok for me:
    >>> d = fromstring('<html><div><span></span></div><span /></html>')
    >>> d.xpath('//span')
    [<Element span at 918500>, <Element span at 9182d0>]
    >>> d.xpath('//*[name() = "span" and last() = 1]')
    [<Element span at 918500>]
But I may be missing something. Got a full test case?
Here's a small example:
    >>> from lxml.etree import HTML
    >>> h = '<html><body><div></div><div><span></span></div></body></html>'
    >>> h = HTML(h)
    >>> h.xpath("descendant-or-self::*[name() = 'span' and (last() = 1)]")
    []
    >>> h.xpath("//*[name() = 'span' and (last() = 1)]")
    [<Element span at -4866f944>]
So when I use // it works. Huh. I prefer descendant-or-self, because I find it peculiar to do a search from the root when you've called the method on some particular element (that may not be at the root).
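A note on why the two queries differ: in XPath 1.0, "//" is shorthand for "/descendant-or-self::node()/", so in "//*[...]" the predicate is evaluated on the child axis and last() counts the element children of each parent. In "descendant-or-self::*[...]" the predicate is evaluated over the whole descendant-or-self node set, so last() is the total number of elements and is only 1 for a document with a single element. The sketch below assumes lxml is installed and reuses the document from the example above; the variable names are only for illustration. ".//*" keeps the search relative to the element the method was called on.

```python
from lxml.etree import HTML

doc = HTML('<html><body><div></div><div><span></span></div></body></html>')
pred = "[name() = 'span' and last() = 1]"

# The predicate runs over every element on the axis
# (html, body, div, div, span), so last() is 5 and nothing matches.
on_axis = doc.xpath("descendant-or-self::*" + pred)

# './/' expands to 'descendant-or-self::node()/', so the predicate runs on
# the child axis and last() is the number of element children per parent.
relative = doc.xpath(".//*" + pred)
explicit = doc.xpath("descendant-or-self::node()/child::*" + pred)

print(on_axis)                      # []
print([el.tag for el in relative])  # ['span']
print([el.tag for el in explicit])  # ['span']
```

Note that ".//*" does not include the context element itself, which only matters if the element the search starts from should be a candidate too.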
div:empty (no children, including text, maybe not including whitespace).
Ouch. Let me think about that one.
Yeah, I couldn't figure that one out. I thought this might work:
    >>> xpath('E:empty')
    e[count(./children::*) = 0 and string(.) = '']
But maybe I don't understand how count() works; this isn't a valid XPath expression.
You want "child" not "children". Using normalize-space(.) instead of string(.) will exclude whitespace. This does assume you are ignoring comments and PIs; I believe that's the behavior you want.
Cool, that seems to work right. One query I'm realizing might be really hard (maybe too hard in XPath) is *:first-of-type, *:last-of-type, and *:only-of-type, since they match in a funny sort of way. You can't really do:
    *[count(../*[name() = name()) = 1]
But it's kind of what *:only-of-type means. Or:
    *[count(following-sibling::name()) = 0 and count(previous-sibling::name()) = 0]
You just can't use name() that way. Hmm... well, it's not that important of a query to me, I guess, so maybe I'll just catch it and give an error.
-- Ian Bicking | ianb@colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers

Hi Ian,
if this is supposed to go into lxml.html (or maybe something like lxml.css) please don't call your function "xpath()". That's the XPath evaluation method in etree. Consider calling it "build_xpath()", "css_to_xpath()" or something, depending on the context you provide it in.
Ian Bicking wrote:
Mike Meyer wrote:
In <468579E3.7010802@colorstudy.com>, Ian Bicking <ianb@colorstudy.com> typed:
Thanks, very helpful. I'm guessing it was an oversight that you didn't copy the list... I wasn't sure which way to go.
Without CC'ing, people won't know you've already answered my questions.
And without CC'ing the list, the mail won't get archived, people won't be able to find the discussion later and will keep asking the same questions over and over again. :) Oh, and: people won't even be able to comment on what you (Mike) propose as a solution and you won't be able to learn anything either, in case there's a better solution.
So when I use // it works. Huh. I prefer descendant-or-self, because I find it peculiar to do a search from the root when you've called the method on some particular element (that may not be at the root).
There's also ".//*".
div:empty (no children, including text, maybe not including whitespace).
Ouch. Let me think about that one.
Yeah, I couldn't figure that one out. I thought this might work:
    >>> xpath('E:empty')
    e[count(./children::*) = 0 and string(.) = '']
But maybe I don't understand how count() works; this isn't a valid XPath expression.
You want "child" not "children". Using normalize-space(.) instead of string(.) will exclude whitespace. This does assume you are ignoring comments and PIs; I believe that's the behavior you want.
Cool, that seems to work right.
What about "e[not(*) and not(normalize-space())]" ?
One query I'm realizing might be really hard (maybe too hard in XPath) is *:first-of-type, *:last-of-type, and *:only-of-type, since they match in a funny sort of way. You can't really do:
*[count(../*[name() = name()) = 1]
You need two expressions here, one to find the node and one to compare it to others (note that name() can also take an argument) - but those are really tricky, you're right. They may already be at the limits of what XPath can express.
But it's kind of what *:only-of-type means. Or:
*[count(following-sibling::name()) = 0 and count(previous-sibling::name()) = 0]
You just can't use name() that way. Hmm... well, it's not that important of a query to me, I guess, so maybe I'll just catch it and give an error.
But you can call "name()" with an argument - although it will not apply to every node of a node-set (it will just work on the first entry and ignore the rest in that case).
Stefan
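When the element name is known in advance (E:only-of-type rather than *:only-of-type), the name can simply be written into the expression twice, which avoids having to compare name() against the outer node. A minimal sketch, assuming lxml is installed; the sample document and variable names are made up for illustration and this is not necessarily how a CSS-to-XPath translator would emit it:

```python
from lxml import etree

root = etree.fromstring("<div><p/><span/><p/></div>")

# span:only-of-type -> exactly one span among the parent's children
only_span = root.xpath("//span[count(../span) = 1]")
# p:only-of-type -> no match here, the parent has two p children
only_p = root.xpath("//p[count(../p) = 1]")
# p:first-of-type -> a p with no earlier p sibling
first_p = root.xpath("//p[not(preceding-sibling::p)]")
# p:last-of-type -> a p with no later p sibling
last_p = root.xpath("//p[not(following-sibling::p)]")

print([el.tag for el in only_span])         # ['span']
print(only_p)                               # []
print([root.index(el) for el in first_p])   # [0]
print([root.index(el) for el in last_p])    # [2]
```

The wildcard form remains the hard case, because inside the inner predicate name() refers to the sibling being tested and XPath 1.0 has no variable for the outer node.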

Stefan Behnel wrote:
So when I use // it works. Huh. I prefer descendant-or-self, because I find it peculiar to do a search from the root when you've called the method on some particular element (that may not be at the root).
There's also ".//*".
That seems to be equivalent to //*, i.e., // goes directly to the root regardless of context.
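For reference, in XPath 1.0 the two forms only coincide when the context node is the root element, as h is in the earlier example. Called on an element deeper in the tree, "//" still starts from the document root while ".//" starts from that element. A small sketch, assuming lxml is installed (the document and names are illustrative):

```python
from lxml import etree

root = etree.fromstring(
    "<html><body><div><span/></div><div><span/><span/></div></body></html>"
)
first_div = root.find("body/div")  # the div holding a single span

absolute = first_div.xpath("//span")   # searched from the document root
relative = first_div.xpath(".//span")  # searched from first_div only

print(len(absolute), len(relative))  # 3 1
```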
div:empty (no children, including text, maybe not including whitespace).
Ouch. Let me think about that one.
Yeah, I couldn't figure that one out. I thought this might work:
    >>> xpath('E:empty')
    e[count(./children::*) = 0 and string(.) = '']
But maybe I don't understand how count() works; this isn't a valid XPath expression.
You want "child" not "children". Using normalize-space(.) instead of string(.) will exclude whitespace. This does assume you are ignoring comments and PIs; I believe that's the behavior you want.
Cool, that seems to work right.
What about "e[not(*) and not(normalize-space())]" ?
Yes, that works too.
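To make the two :empty variants concrete, here is a small sketch, assuming lxml is installed. The first expression is the one from the thread with the axis corrected to child:: and string(.) swapped for normalize-space(.); the second is the shorter form. The sample document is invented to cover the empty, whitespace-only, text, child-element and comment-only cases.

```python
from lxml import etree

root = etree.fromstring(
    "<r><e/><e>   </e><e>text</e><e><c/></e><e><!-- note --></e></r>"
)

# Thread's variant with the axis name corrected to child::
counted = root.xpath("e[count(./child::*) = 0 and normalize-space(.) = '']")
# Shorter form: no element children and no non-whitespace text
short = root.xpath("e[not(*) and not(normalize-space())]")

# Positions of the matching <e> elements among the children of <r>
print([root.index(el) for el in counted])  # [0, 1, 4]
print([root.index(el) for el in short])    # [0, 1, 4]
```

Both treat whitespace-only and comment-only elements as empty, which matches the assumption stated earlier about ignoring comments and PIs.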
One query I'm realizing might be really hard (maybe too hard in XPath) is *:first-of-type, *:last-of-type, and *:only-of-type, since they match in a funny sort of way. You can't really do:
*[count(../*[name() = name()) = 1]
You need two expressions here, one to find the node and one to compare it to others (note that name() can also take an argument) - but those are really tricky, you're right. They may already be at the limits of what XPath can express.
I could probably do it by adding a new function, I suppose; css:last-of-type() for instance. It's not that hard to do in Python, after all.
-- Ian Bicking | ianb@colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers
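As a rough illustration of the "do it in Python" route for *:only-of-type, the sketch below walks the tree and counts tag names among each parent's children. The helper name only_of_type is hypothetical, not part of lxml. It uses only the ElementTree API from the standard library so it runs without extra packages, and it should behave the same on lxml.etree elements. Hooking such logic up as an XPath extension function like css:last-of-type() is a separate step that is not shown here.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def only_of_type(root):
    """Return elements that have no sibling with the same tag name.

    Hypothetical helper; the root element itself is never returned
    because only children of some parent are considered.
    """
    matches = []
    for parent in root.iter():
        children = list(parent)
        counts = Counter(child.tag for child in children)
        matches.extend(c for c in children if counts[c.tag] == 1)
    return matches


doc = ET.fromstring("<div><p/><span/><p/><ul><li/></ul></div>")
print([el.tag for el in only_of_type(doc)])  # ['span', 'ul', 'li']
```

The same counting approach extends to first-of-type and last-of-type by remembering the first or last child seen for each tag instead of the count.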
participants (2): Ian Bicking, Stefan Behnel