Mailman 3 [lxml-dev] Some XPath questions... - lxml - The Python XML Toolkit

June 29, 2007

I'm trying to implement CSS selectors by translating them into XPath.
There are some CSS expressions that I'm having a hard time with, so maybe
someone can tell me how they might work.

Expression:

div:first-child -- means a div element when it is the first child of its 
parent.  I.e.:

   <li>
     <div id="a">...</div>
     <div id="b">...</div>
   </li>

It matches the first div and not the second.

I thought this could be:

   descendant-or-self::*/div[0]
   or... descendant-or-self::*/div[position() = 0]

Those two should be equivalent; the second is a bit easier to handle 
programmatically.  But it doesn't work (doesn't match anything).
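
A minimal sketch of what seems to be going on here, assuming lxml is installed and using a made-up sample document: XPath positions start at 1, so a test against position 0 can never match. Note also that div[1] selects the first div among the div children, which is not the same as "a div that is the first child".

```python
from lxml import etree

# Made-up sample document: the second <li> starts with a <p>, not a <div>.
doc = etree.fromstring(
    '<ul><li><div id="a">x</div><div id="b">y</div></li>'
    '<li><p id="c">z</p><div id="d">w</div></li></ul>'
)

def ids(expr):
    """Evaluate expr against the sample document and return the matched ids."""
    return [el.get("id") for el in doc.xpath(expr)]

# XPath positions are 1-based, so this never matches anything.
zero_based = ids("descendant-or-self::*/div[position() = 0]")

# First div among the div children of each element (closer to :first-of-type).
first_div = ids("descendant-or-self::*/div[position() = 1]")

# First child of each element, kept only when that child is a div.
first_child = ids("descendant-or-self::*/*[position() = 1][self::div]")
```

The last form keeps the position() test, so it should stay easy to generate programmatically.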

Another expression:

div.foo + div -- means a div element that is the immediately next 
sibling of a div element with the class .foo.  I would translate this to:

   descendant-or-self::div[@class='foo']/following-sibling::div[0]

(The class matching is actually a bit more complex, but it doesn't 
actually matter to this.)  I'm (a) not sure if this is right, because 
maybe it means the next div after the matching div, even if there's 
another element in-between, and (b) it doesn't return any results 
regardless.
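
A small sketch of the difference, again assuming lxml and a made-up sample document. following-sibling::div[1] picks the next div sibling even when another element sits in between, while following-sibling::*[1][self::div] takes the immediately following sibling and keeps it only if it is a div. The [0] form returns nothing because positions are 1-based.

```python
from lxml import etree

# Made-up sample: the second .foo div is followed by a <p>, then a <div>.
doc = etree.fromstring(
    '<body>'
    '<div class="foo" id="a"/><div id="b"/><div id="c"/>'
    '<div class="foo" id="d"/><p id="e"/><div id="f"/>'
    '</body>'
)

def ids(expr):
    """Evaluate expr against the sample document and return the matched ids."""
    return [el.get("id") for el in doc.xpath(expr)]

base = "descendant-or-self::div[@class='foo']"

# Next div sibling, skipping over any non-div elements in between.
next_div = ids(base + "/following-sibling::div[1]")

# Immediately following sibling, kept only when it is a div.
adjacent_div = ids(base + "/following-sibling::*[1][self::div]")
```

Here next_div also picks up the div after the intervening p, which is not what the CSS + combinator means; adjacent_div does not.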

Another expression:

div:contains('celia') -- means a div where the textual content has the 
word 'celia' in it, case insensitive.  At least, I think it's case 
insensitive -- the CSS spec is annoyingly vague, but implementations 
seem to work like this.  I translate this to:

   descendant-or-self::div[contains(css:lower-case(string(.)), 'celia')]

I added the lower-case function like:

   from lxml import etree

   def _make_lower_case(context, s):
       # lxml passes the XPath evaluation context as the first argument
       return s.lower()
   etree.FunctionNamespace("css")['lower-case'] = _make_lower_case

But XPath gives so few errors that it's hard to tell if it's really
working.  The XPath expression returns some elements, but not the
correct number from what I can tell.  In particular, when I had a bug
and wasn't lowercasing the second argument (using 'CELIA'), it still
returned elements.
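
One way to check the function in isolation is sketched below, assuming lxml. The namespace URI and the sample document are made up for illustration; the argument to FunctionNamespace is a namespace URI, and the css prefix is bound to it when the expression is evaluated. Since string(.) includes the text of all descendants, an outer div matches whenever a nested div does, which may account for an unexpected count.

```python
from lxml import etree

# Assumption: this namespace URI is invented for the example.
CSS_NS = "http://example.org/css-functions"

def lower_case(context, s):
    # s is a plain string here because the expression passes string(.)
    return s.lower()

etree.FunctionNamespace(CSS_NS)["lower-case"] = lower_case

doc = etree.fromstring(
    '<body>'
    '<div id="a">Celia <b>Cruz</b></div>'
    '<div id="b">someone else</div>'
    '<div id="outer"><div id="inner">CELIA</div></div>'
    '</body>'
)

def ids(needle):
    """Return ids of divs whose lowercased text contains needle."""
    expr = (
        "descendant-or-self::div"
        "[contains(css:lower-case(string(.)), '%s')]" % needle
    )
    # Bind the css prefix to the function namespace for this evaluation.
    return [el.get("id") for el in doc.xpath(expr, namespaces={"css": CSS_NS})]

lower_hits = ids("celia")
upper_hits = ids("CELIA")
```

If the function is really being called, an uppercase needle should match nothing, because the text being searched has already been lowercased.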

There are some other tricky ones I'm not sure about either, though they
seem to be kind of working.  Things like div:only-child (when it's a div 
with no siblings), div:last-child (no next sibling), div:first-child (no 
previous sibling), div:first-of-type (no preceding siblings that are 
divs), div:last-of-type (no following siblings that are divs), 
div:only-of-type (you are probably getting the pattern), div:empty (no 
children, including text, maybe not including whitespace).  There's also 
div:nth-child(matcher) and div:nth-of-type(matcher), which selects among 
siblings with patterns like "2" (second sibling), "3n" (every third 
element), "odd" (odd elements) and some other selections.  I kind of see 
how to deal with this using position(), but I'm not sure how to do 
either nth-of-type or nth-child (and the ones I do understand I am also 
vague about).
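
A possible way to express these, sketched under assumptions: the predicates below count siblings with preceding-sibling and following-sibling instead of position(), since position() depends on the step it appears in. The PREDICATES table, the nth() helper and the sample document are hypothetical names for illustration, not the code in lxml.html.css, and nth() only handles an+b with a >= 0.

```python
from lxml import etree

doc = etree.fromstring(
    '<ul>'
    '<li><div id="a"/><p id="p1"/><div id="b"/><div id="c"/></li>'
    '<li><div id="d"/></li>'
    '</ul>'
)

# Hypothetical mapping from pseudo-class to an XPath predicate on a div.
PREDICATES = {
    "first-child": "not(preceding-sibling::*)",
    "last-child": "not(following-sibling::*)",
    "only-child": "not(preceding-sibling::*) and not(following-sibling::*)",
    "first-of-type": "not(preceding-sibling::div)",
    "last-of-type": "not(following-sibling::div)",
    "only-of-type": "not(preceding-sibling::div) and not(following-sibling::div)",
    # Assumption: whitespace-only text counts as content here.
    "empty": "not(*) and not(text())",
}

def nth(a, b, of_type=None):
    """Predicate for CSS an+b (a >= 0 only); of_type limits which siblings count."""
    siblings = of_type or "*"
    # 1-based index of the element among the counted siblings.
    pos = "(count(preceding-sibling::%s) + 1)" % siblings
    if a == 0:
        return "%s = %d" % (pos, b)
    return "%s >= %d and (%s - %d) mod %d = 0" % (pos, b, pos, b, a)

def select(predicate):
    """Return ids of div elements that satisfy the given predicate."""
    expr = "descendant-or-self::div[%s]" % predicate
    return [el.get("id") for el in doc.xpath(expr)]

only_child = select(PREDICATES["only-child"])
every_third = select(nth(3, 0))                    # div:nth-child(3n)
odd_children = select(nth(2, 1))                   # div:nth-child(odd)
second_of_type = select(nth(0, 2, of_type="div"))  # div:nth-of-type(2)
```

In this sample, div:nth-child(2) matches nothing because the second child of the first li is a p, while div:nth-of-type(2) still finds the second div.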

I've committed the incomplete code in lxml.html.css

-- 
Ian Bicking | ianb@colorstudy.com | http://blog.ianbicking.org
             | Write code, do good | http://topp.openplans.org/careers
