[lxml-dev] Some XPath questions...

I'm trying to implement CSS selectors by translating them into XPath. There are some CSS expressions that I'm having a hard time with, so maybe someone can tell me how they might work.

Expression: div:first-child -- means a div element when it is the first child of its parent. I.e.:

    <li>
      <div id="a">...</div>
      <div id="b">...</div>
    </li>

It matches the first div and not the second. I thought this could be:

    descendant-or-self::*/div[0]

or...

    descendant-or-self::*/div[position() = 0]

Those two should be equivalent; the second is a bit easier to handle programmatically. But it doesn't work (doesn't match anything).

Another expression: div.foo + div -- means a div element that is the immediately next sibling of a div element with the class .foo. I would translate this to:

    descendant-or-self::div[@class='foo']/following-sibling::div[0]

(The class matching is actually a bit more complex, but it doesn't actually matter to this.) I'm (a) not sure if this is right, because maybe it means the next div after the matching div, even if there's another element in-between, and (b) it doesn't return any results regardless.

Another expression: div:contains('celia') -- means a div where the textual content has the word 'celia' in it, case insensitive. At least, I think it's case insensitive -- the CSS spec is annoyingly vague, but implementations seem to work like this. I translate this to:

    descendant-or-self::div[contains(css:lower-case(string(.)), 'celia')]

I added the lower-case function like:

    def _make_lower_case(context, s):
        return s.lower()

    etree.FunctionNamespace("css")['lower-case'] = _make_lower_case

But XPath gives so few errors that it's hard to tell if it's really working. The XPath expression returns some elements, but not the correct number from what I can tell. Especially since when I had a bug and wasn't lowercasing the second argument (using 'CELIA') it still returned elements.
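A note on the position predicates above: XPath positions are 1-based, so [0] and [position() = 0] can never match. The following is a minimal sketch, assuming lxml is installed. The sample markup and the ids() helper are made up for illustration, and the expressions are one possible translation, not necessarily what lxml.html.css generates.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration.
doc = etree.fromstring(
    '<ul><li>'
    '<div id="a" class="foo"/>'
    '<p id="p"/>'
    '<div id="b" class="foo"/>'
    '<div id="c"/>'
    '</li></ul>'
)


def ids(expr):
    """Return the id attributes of the elements matched by expr."""
    return [el.get("id") for el in doc.xpath(expr)]


# Positions start at 1, so a [0] predicate never matches anything.
zero_based = ids("descendant-or-self::*/div[0]")

# div:first-child: take the first child of any element, then require a div.
first_child = ids("descendant-or-self::*/*[1][self::div]")

# div.foo + div: take the immediately following sibling, then require a div.
adjacent = ids(
    "descendant-or-self::div[@class='foo']"
    "/following-sibling::*[1][self::div]"
)

# following-sibling::div[1] is looser: it skips over the <p> in between.
next_div = ids(
    "descendant-or-self::div[@class='foo']/following-sibling::div[1]"
)
```

Filtering with *[1][self::div] rather than div[1] is what separates "first child that is a div" from "first div among the children"; the latter is closer to div:first-of-type.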
There are some other tricky ones I'm not sure about either, though they seem to be kind of working. Things like div:only-child (when it's a div with no siblings), div:last-child (no next sibling), div:first-child (no previous sibling), div:first-of-type (no preceding siblings that are divs), div:last-of-type (no following siblings that are divs), div:only-of-type (you are probably getting the pattern), div:empty (no children, including text, maybe not including whitespace).

There's also div:nth-child(matcher) and div:nth-of-type(matcher), which select among siblings with patterns like "2" (second sibling), "3n" (every third element), "odd" (odd elements) and some other selections. I kind of see how to deal with this using position(), but I'm not sure how to do either nth-of-type or nth-child (and the ones I do understand I am also vague about).

I've committed the incomplete code in lxml.html.css

--
Ian Bicking | ianb@colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers

Hi Ian,

just to comment on your actual first post in this thread, which I kind of overlooked because of the later discussion. I think this is pretty cool stuff and I'd love to have this in lxml. The html module really seems to be getting somewhere. I think we shouldn't even wait too long with a release so that we get some more feedback on the new APIs. Maybe I should fix lxml's versioning so that we can put out a 2.0alpha1 (and not only alpha, beta, final).

Ian Bicking wrote:
"css" is not the namespace, it's the prefix. You can do this:

    ns = etree.FunctionNamespace("http://my/css/namespace")
    ns.prefix = "css"
    ns['lower-case'] = _make_lower_case

or this:

    ns = etree.FunctionNamespace("http://my/css/namespace")
    ns['lower-case'] = _make_lower_case

    def css_to_xpath(css):
        xpath = build_xpath(css)
        return etree.XPath(xpath, namespaces={'css': "http://my/css/namespace"})

You should consider providing a default namespace map here, and maybe even return compiled XPath objects, i.e. callables. Note that these provide a "path" attribute that returns the original path, so if you have to extend an expression later on, you can still do so by creating a new XPath object. Note that this would also allow you to wrap the function with an additional call to set(), so that or-ed results really become the union and not the sum of all parts.
But XPath gives so few errors that it's hard to tell if it's really working.
Sadly, there doesn't seem to be a simple way to find out that a function was undeclared. Or maybe I'll just have to look back into that... didn't I do that already? :)
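The namespace-based registration described earlier in this message can be sketched as follows, assuming lxml is installed. The sample document is invented for illustration. One thing the sketch makes visible: string(.) concatenates the text of all descendants, so an enclosing div matches as well, which may be one reason a :contains() translation returns more elements than expected.

```python
from lxml import etree

CSS_NS = "http://my/css/namespace"


def _make_lower_case(context, s):
    # lxml passes the evaluation context first, then the XPath arguments.
    return s.lower()


# Register the function under the namespace URI (not under a prefix).
ns = etree.FunctionNamespace(CSS_NS)
ns['lower-case'] = _make_lower_case

expr = "descendant-or-self::div[contains(css:lower-case(string(.)), 'celia')]"

# Compile once; the prefix is bound to the URI through the namespaces map.
find = etree.XPath(expr, namespaces={'css': CSS_NS})

# Hypothetical sample document.
doc = etree.fromstring(
    '<body>'
    '<div id="outer"><div id="a">Hello CELIA</div></div>'
    '<div id="b">Bob</div>'
    '</body>'
)

matched = [el.get("id") for el in find(doc)]
original_path = find.path  # the expression the XPath object was built from
```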
If I understand this correctly, this would be

    nth-of-type: //*/NAME[position() = x]
    nth-child:   //*/*[position() = x]

To deal with things like "2n", try this:

    //*/NAME[(position() mod 2) = 0]
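A minimal sketch of these position-based expressions, assuming lxml is installed and using a made-up document. For nth-child the element name has to be tested after the position, otherwise any element in that position matches; the extra [self::li] step is my addition, not part of the expressions above.

```python
from lxml import etree

# Hypothetical sample: the second child is a <p>, not an <li>.
doc = etree.fromstring(
    '<ul><li id="a"/><p id="b"/><li id="c"/><li id="d"/></ul>'
)


def ids(expr):
    """Return the id attributes of the elements matched by expr."""
    return [el.get("id") for el in doc.xpath(expr)]


# li:nth-of-type(2): position counted among the li siblings only.
nth_of_type = ids("//*/li[position() = 2]")

# li:nth-child(2): position counted among all siblings, then require an li.
nth_child = ids("//*/*[position() = 2][self::li]")

# li:nth-of-type(2n) and li:nth-child(2n), using mod for the step.
even_of_type = ids("//*/li[(position() mod 2) = 0]")
even_child = ids("//*/*[(position() mod 2) = 0][self::li]")
```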
I've committed the incomplete code in lxml.html.css
I skimmed through it a bit and found it really cool. I'm not completely satisfied with the naming, but I now see that the context of the css module makes it clearer what the semantics are. Still, I prefer "css_to_xpath()", and providing a top-level class XPath() makes me think it should return an etree.XPath object, i.e. a compiled path.

One more note:

    def run_xpath(doc, xpath):
        return [el for el in doc.xpath(xpath)
                if isinstance(el, etree.ElementBase)]

Do you mean "etree.iselement(el)" here or are you intentionally restricting this to real-element subclasses of _Element? (i.e. no plain lxml.etree elements, no PIs, no comments)

I actually think this module merits its own top-level placing, not necessarily only as part of lxml.html. It could just as well become "lxml.css", and should thus not rely too much on a specific API from lxml.html.

Stefan
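To illustrate the difference between the two checks, here is a small sketch, assuming lxml is installed. The document and the mixed XPath expression are invented for the example; they deliberately return a text result alongside an element.

```python
from lxml import etree

# Hypothetical document parsed with plain lxml.etree (no custom classes).
doc = etree.fromstring("<a>some text<b/></a>")

# An XPath result list can mix elements with non-element results.
results = doc.xpath("//b | //text()")

# iselement() accepts any lxml element, including plain etree elements.
by_iselement = [r.tag for r in results if etree.iselement(r)]

# ElementBase only covers custom element classes, so plain elements
# produced by the default parser are filtered out as well.
by_elementbase = [r for r in results if isinstance(r, etree.ElementBase)]
```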

Stefan Behnel wrote:
Yeah, I was thinking about writing up a summary of things that need to be done in the html package; there's still some outstanding stuff, but not too much. The clean module needs to be cleaned up (I'm thinking of moving from a function to a class). I'd like to make the usedoctest hack a little more general, as elsewhere I'm now using a similar hack to enable ELLIPSIS, and I'd like them not to conflict. And then some docs, but I guess that's it.
OK, I've switched to this.
Is there any advantage to this, over a more global prefix? I suppose there's a possible collision of css:, but I doubt that will be a problem.
That's handy. I was thinking of creating a CSSXPath subclass or something, that would keep the original CSS selector around, in addition to the translated XPath.
If you use | in the XPath expression it seems to work out that there won't be any duplicates.
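A quick way to check this, sketched under the assumption that lxml is installed and with a made-up document: the XPath union operator yields a node-set, so an element matched by both branches appears once, whereas concatenating two separate result lists keeps the repeat.

```python
from lxml import etree

# Hypothetical sample: <div id="a"> is matched by both branches below.
doc = etree.fromstring('<body><div id="a"/><div id="b"/><p id="c"/></body>')

# A single expression using | returns each node once.
union = [el.get("id") for el in doc.xpath("//div | //*[@id='a']")]

# Two separate queries added together contain the shared element twice.
summed = doc.xpath("//div") + doc.xpath("//*[@id='a']")

# Wrapping in set() removes the repeat, at the cost of document order.
unique = set(summed)
```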
We talked about it previously when I was trying to use match(), and instead of errors got bizarre results. But I don't think it resulted in any improvements on error messages.
I think I already have all this working now... though I wish there was a test case I could use, as I'm not 100% sure that my tests are testing for the correct results.
I was thinking about changing around all the public naming. I'd like for it to be a method on elements, though I'm not sure what to call the method. .css(expr) is a bit funny, as it's not "css", it's just a css selector. .select(expr) doesn't say what kind of selector you are using. Another public function would be like XPath, something that compiles the entire CSS expression. Especially since the CSS parsing is non-trivial (just like the XPath parsing is non-trivial), precompiling will be beneficial.

I'm thinking of also adding a fast path for a couple common kinds of selectors, that translate them more quickly into XPath. E.g., search for r'^\.(\w+)' for class name matches, or '^#(\w+)' for id matches, etc. And there's the question about whether simple CSS selectors should be translated to XPath at all (especially when they aren't precompiled). For people that are familiar with CSS selectors, it seems entirely possible that they will use it for very simple queries, like el.css('div'). If I detect that case and turn it into el.findall('div') then it would be completely reasonable; but if it gets tokenized, parsed, translated to XPath, compiled, then run, then that's going to be pretty inefficient.

Anyway, back to naming -- if there's a method and a function/object to compile expressions, that's all the public interface I think it needs. I don't think translating css to xpath without compiling is particularly important.
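The fast path could look roughly like the sketch below. The function name quick_xpath and the exact XPath strings are hypothetical, not what lxml.html.css does. The patterns are anchored with $ (unlike the ones quoted above) so that compound selectors fall through to the full parser, and the class test uses the usual whitespace-padded contains() idiom.

```python
import re

# Anchored at both ends so that only single simple selectors are accepted.
_CLASS_RE = re.compile(r'^\.(\w+)$')
_ID_RE = re.compile(r'^#(\w+)$')
_TAG_RE = re.compile(r'^(\w+)$')


def quick_xpath(selector):
    """Translate a trivial CSS selector to XPath, or return None.

    None means "not a fast-path case": the caller should fall back to
    the full tokenize/parse/translate pipeline.
    """
    selector = selector.strip()
    match = _CLASS_RE.match(selector)
    if match:
        return (
            "descendant-or-self::*[contains("
            "concat(' ', normalize-space(@class), ' '), ' %s ')]"
            % match.group(1)
        )
    match = _ID_RE.match(selector)
    if match:
        return "descendant-or-self::*[@id='%s']" % match.group(1)
    match = _TAG_RE.match(selector)
    if match:
        return "descendant-or-self::%s" % match.group(1)
    return None
```

Because \w cannot match a quote character, interpolating the captured name into the XPath string literal is safe in this restricted case.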
I wasn't aware of iselement(). I'm not actually sure this is even necessary; I'm not sure if I can ever match non-elements with the expressions at all. I think I put it in there at some point when I wasn't sure. Instead it should probably be an assertion in the tests.
Yes, you can do selections on anything. CSS, it seems, uses | for namespaces, like "atom|title", and it doesn't know anything special about HTML (except for special handling of the class attribute). Right now I'm assuming the XPath picks up the prefixes from elsewhere in the document. CSS uses "@namespace prefix URI", but that's part of a CSS document, and we're only handling selectors. So I just translate "atom|title" to "//atom:title", and assume it'll work.

The CSS syntax does seem handier for a lot of kinds of selections, and after translating them I find the equivalent XPath rather complex in some cases (e.g., li:first-child). So there's some benefit there.

--
Ian Bicking | ianb@colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers
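A sketch of the "atom|title" translation, assuming lxml is installed. The helper translate_type_selector and the sample feed are hypothetical. In lxml the prefix used in the expression is resolved through the namespaces mapping passed at evaluation time, not through prefixes declared in the document, so the caller has to supply that mapping.

```python
from lxml import etree

ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical feed that declares Atom as the default namespace,
# i.e. the document itself never uses an "atom" prefix.
doc = etree.fromstring(
    '<feed xmlns="%s"><title>Example</title></feed>' % ATOM_NS
)


def translate_type_selector(selector):
    """Hypothetical helper: turn 'prefix|name' into '//prefix:name'."""
    return "//" + selector.replace("|", ":")


xpath = translate_type_selector("atom|title")

# The prefix is bound to the URI here, independent of the document.
titles = doc.xpath(xpath, namespaces={"atom": ATOM_NS})
title_text = [el.text for el in titles]
```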

participants (2)
-
Ian Bicking
-
Stefan Behnel