[lxml-dev] findall()/xpath() differences ?
Hello all,
How compatible are the findall() and xpath() methods? findall() doesn't seem to handle more complicated XPath expressions. Why is there a difference between what they can handle?
I would expect findall() to be the same as xpath(), but searching from the context node and always returning a list of items, making it compatible with ET's.
An example:
Python 2.4.3 (#2, Apr 17 2006, 14:29:19)
[GCC 3.4.4 [FreeBSD] 20050518] on freebsd6
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> root = etree.XML('''<obj class="list">
... <obj class="str" value="xxx"/>
... </obj>''')
>>> print root.xpath('obj[not(@name)]')
[<Element obj at 81daa2c>]
>>> print root.findall('obj[not(@name)]')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "etree.pyx", line 937, in etree._Element.findall
  File "/usr/local/lib/python2.4/site-packages/lxml-1.0.beta-py2.4-freebsd-6.1-RELEASE-i386.egg/lxml/_elementpath.py", line 193, in findall
    return _compile(path).findall(element)
  File "/usr/local/lib/python2.4/site-packages/lxml-1.0.beta-py2.4-freebsd-6.1-RELEASE-i386.egg/lxml/_elementpath.py", line 171, in _compile
    p = Path(path)
  File "/usr/local/lib/python2.4/site-packages/lxml-1.0.beta-py2.4-freebsd-6.1-RELEASE-i386.egg/lxml/_elementpath.py", line 87, in __init__
    raise SyntaxError(
SyntaxError: expected path separator ([)
-- Best regards, Steve mailto:howe@carcass.dhs.org
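Editor's sketch: the difference reported above can be reproduced, and worked around, without the 2006 setup. The following assumes Python 3 and uses the standard library's xml.etree.ElementTree as a stand-in for the ET-compatible find* API (the thread itself used lxml 1.0 beta). The helper names children_without_attr and findall_rejects are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Same document as in the report above.
root = ET.fromstring('<obj class="list"><obj class="str" value="xxx"/></obj>')


def children_without_attr(parent, tag, attr):
    """Portable stand-in for the XPath 'tag[not(@attr)]' (hypothetical helper).

    Uses only a plain tag path, which the ET-style path language accepts,
    and does the negated attribute test in Python.
    """
    return [el for el in parent.findall(tag) if el.get(attr) is None]


def findall_rejects(parent, path):
    """Return True if the ET-style path parser refuses the expression."""
    try:
        parent.findall(path)
    except SyntaxError:
        return True
    return False


matches = children_without_attr(root, "obj", "name")
rejected = findall_rejects(root, "obj[not(@name)]")
```

Code written this way stays within the subset that findall() documents, so it should port between ET-compatible libraries; with lxml, the same query can instead be passed directly to xpath().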
That's a great chance to start the findall vs xpath battle again =) I think lxml should eliminate the .xpath method and implement the .find* methods through libxml2's XPath support.
On Fri, 2006-05-26 at 04:42 -0300, Steve Howe wrote:
Hello all,
How compatible are the findall() and xpath() methods? findall() doesn't seem to handle more complicated XPath expressions. Why is there a difference between what they can handle?
I would expect findall() to be the same as xpath(), but searching from the context node and always returning a list of items, making it compatible with ET's.
Hello Andrey, Friday, May 26, 2006, 4:50:42 AM, you wrote:
That's a great chance to start findall vs xpath battle again =)
I think lxml should eliminate .xpath method and implement .find* methods through libxml2 xpath support.
I really do not want to start any battles. There must be a good reason for the differences; I would just like to know what they are, and I think they should be documented...
Anyway, even if lxml decides for some reason to get rid of the xpath() method, the name should remain as an alias for findall() to keep compatibility with older code.
-- Best regards, Steve mailto:howe@carcass.dhs.org
On Fri, 2006-05-26 at 04:56 -0300, Steve Howe wrote:
Hello Andrey,
Friday, May 26, 2006, 4:50:42 AM, you wrote:
That's a great chance to start findall vs xpath battle again =)
I think lxml should eliminate .xpath method and implement .find* methods through libxml2 xpath support. I really do not want to start any battles. There must be a good reason for the differences, and I just would like to know what they are and I think they should be documented...
Anyway even if for any reasons it decides to get rid of the xpath() method, it should remain as alias for findall() to keep compatibility with older code.
OK, at the moment the .find* methods are served by the same code as in ElementTree, so they behave exactly like ElementTree's. The .xpath method is lxml's own implementation and is in some places inconsistent with ElementTree's .find. AFAIR:
- it uses a different namespace declaration convention,
- it has full XPath support,
- and it behaves differently on absolute paths (I think that's the place where ElementTree is broken).
Hi Steve, Steve Howe wrote:
How compatible are the findall() and xpath() methods? findall() doesn't seem to handle more complicated XPath expressions. Why is there a difference between what they can handle?
I would expect findall() to be the same as xpath(), but searching from the context node and always returning a list of items, making it compatible with ET's.
Well, it /is/ compatible with ET's. That is the main reason why it does not support full XPath expressions. Its expressions follow the documentation from the ElementTree library.
What would be the advantage of not being ET compatible here? Is there anything you can do with findall(), find() and findtext() that you couldn't do with xpath() if you wanted to? Note, BTW, that both are similarly fast for similar expressions. If you wanted more speed, you'd go for pre-parsed XPath expressions anyway.
IMHO, the only two reasons why these three functions are there are
1) they are ET compatible
2) they are simple
We had the discussion pop up a few times if implementing findall() through xpath() would be a good idea. It was generally agreed (and demonstrated in code) that this would too easily break ET compatibility, which was not considered worth it.
Stefan
Hello Stefan, Friday, May 26, 2006, 5:00:08 AM, you wrote:
Well, it /is/ compatible with ET's. That is the main reason why it does not support full XPath expressions. Its expressions follow the documentation from the ElementTree library.
Perhaps *too* compatible... :)
What would be the advantage of not being ET compatible here? Is there anything you can do with findall(), find() and findtext() that you couldn't do with xpath() if you wanted to? Note, BTW, that both are similarly fast for similar expressions. If you wanted more speed, you'd go for pre-parsed XPath expressions anyway.
IMHO, the only two reasons why these three functions are there are
1) they are ET compatible 2) they are simple
We had the discussion pop up a few times if implementing findall() through xpath() would be a good idea. It was generally agreed (and demonstrated in code) that this would too easily break ET compatibility, which was not considered worth it.
OK, the reason is compatibility. Two points:
1) Shouldn't it be clearly documented?
2) Since xpath() supports a superset of the expressions findall() does, isn't the compatibility ensured? Or does findall() support anything xpath() does not? It makes no sense to cripple etree's findall() in order to support only what ET's findall() does.
-- Best regards, Steve mailto:howe@carcass.dhs.org
Steve Howe wrote:
Friday, May 26, 2006, 5:00:08 AM, you wrote:
IMHO, the only two reasons why these three functions are there are
1) they are ET compatible 2) they are simple
We had the discussion pop up a few times if implementing findall() through xpath() would be a good idea. It was generally agreed (and demonstrated in code) that this would too easily break ET compatibility, which was not considered worth it. Ok, reason is compatibility. Two points:
1) Shouldn't it be clearly documented ?
Well, regarding documentation, lxml has (unofficially) always said: "we let Fredrik write the documentation, and only if we must (or want to) do it differently, we document it ourselves." ElementTree's find*() methods are documented, so all we add is "lxml supports full XPath expressions through the xpath() function".
2) Since xpath() supports a superset of the expressions findall() does, isn't the compatibility ensured ?
No, it's not a superset at all. findall() uses '{namespace}tag' notation, which is absolutely invalid in XPath. lxml has an ETXPath class that allows you to do this, but calling that for the general XPath case is just overhead, as we would still be trying to extract namespaces from it instead of passing it straight into libxml2's parser.
It makes no sense to cripple etree's findall() in order to support only what ET's findall() does.
It wouldn't make sense if it wasn't for compatibility. Currently, you can exchange code between lxml, ElementTree and cElementTree with relatively little extra consideration. And I mean in all directions. Making more functions incompatible (without convincing reasons) is just calling for trouble. ("hey, lxml didn't raise an exception on this expression!!")
The reasons for leaving it as is are:
1) it works
2) it is 100% compatible now and trivial to keep compatible
3) it is not trivial to reimplement without breaking compatibility
4) changing it makes things slower, as it requires parsing the expression twice (once in lxml, once in libxml2), and evaluation is not any faster.
The reasons to change it are:
1) it supports different expressions than xpath(), which is documented (although perhaps not clearly so) and is the reason why there is an xpath() method.
Honestly, unless there are good reasons to do it, I'm absolutely +1 for keeping the current state.
Stefan
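Editor's sketch: the namespace point in the message above ('{namespace}tag' notation versus XPath prefixes) can be illustrated as follows. This assumes Python 3; the standard library's xml.etree.ElementTree stands in for the ET-compatible API, the namespace URI and document are made up, and the lxml branch is skipped when lxml is not installed. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/ns"  # hypothetical namespace URI
DOC = '<root xmlns="http://example.com/ns"><item/><item/></root>'


def count_with_clark_notation(doc=DOC):
    """ET-style lookup: the namespace is embedded in the tag as {uri}name."""
    root = ET.fromstring(doc)
    return len(root.findall("{%s}item" % NS))


def count_with_prefix_map(doc=DOC):
    """XPath-style lookup via lxml: a prefix is bound to the URI separately.

    Returns None when lxml is not available, so the sketch still runs.
    """
    try:
        from lxml import etree
    except ImportError:
        return None
    root = etree.fromstring(doc)
    return len(root.xpath("p:item", namespaces={"p": NS}))
```

The two spellings select the same elements, but neither expression is valid input for the other method, which is why one cannot be treated as a superset of the other.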
Stefan Behnel wrote: [snip]
Honestly, unless there are good reasons to do it, I'm absolutely +1 for keeping the current state.
+1 for keeping it the way it works too. I followed the same reasoning when I first made it work the way it does. :) We might indeed want to put a small section in our documentation mentioning why we did this. Might even make for a good start to a FAQ. :) Regards, Martijn
Stefan Behnel wrote:
Honestly, unless there are good reasons to do it, I'm absolutely +1 for keeping the current state.
Agreed, this is a no-brainer. If your application needs to be compatible with ET/cET, use the compatibility API. If it needs full XPath support, then use the native XPath API. I can't even see why we are talking about changing such a simple story.
Tres.
-- Tres Seaver +1 202-558-7113 tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com
Tres Seaver wrote:
Stefan Behnel wrote:
Honestly, unless there are good reasons to do it, I'm absolutely +1 for keeping the current state.
Agreed, this is a no-brainer. If your application needs to be compatible with ET/cET, use the compatibility API. If it needs full XPath support, then use the native XPath API. I can't even see why we are talking about changing such a simple story.
I think it really counts as a FAQ by now; I've seen this come up on the list for at least 3 times. Regards, Martijn
Tres Seaver wrote:
Honestly, unless there are good reasons to do it, I'm absolutely +1 for keeping the current state.
Agreed, this is a no-brainer. If your application needs to be compatible with ET/cET, use the compatibility API. If it needs full XPath support, then use the native XPath API.
and as usual, if someone finds a glaring difference between how findall handles a given xpath pattern and how xpath handles it (Clark notation issues aside), it's probably a bug in ET. </F>
On Fri, 2006-05-26 at 16:39 +0200, Fredrik Lundh wrote:
Tres Seaver wrote:
Honestly, unless there are good reasons to do it, I'm absolutely +1 for keeping the current state.
Agreed, this is a no-brainer. If your application needs to be compatible with ET/cET, use the compatibility API. If it needs full XPath support, then use the native XPath API.
and as usual, if someone finds a glaring difference between how findall handles a given xpath pattern and how xpath handles it (Clark notation issues aside), it's probably a bug in ET.
this is a bug in ET:
In [1]: import elementtree.ElementTree as et
In [2]: a = et.Element('a')
In [3]: b = et.Element('b')
In [4]: a.append(b)
In [5]: tree = et.ElementTree(a)
In [6]: tree.find('/a') # should be <Element a ...>
In [7]: tree.find('/b') # should be None
Out[7]: <Element b at -488ee3f4>
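Editor's sketch: since absolute paths are where the session above shows ET disagreeing with XPath semantics, code meant to run on more than one library can avoid them. The following assumes Python 3 and the standard library's xml.etree.ElementTree (the session above used the separate elementtree package); it checks the root tag explicitly and uses paths relative to the root.

```python
import xml.etree.ElementTree as ET

# Same tree as in the session above: <a><b/></a>
a = ET.Element("a")
b = ET.SubElement(a, "b")
tree = ET.ElementTree(a)

# Ask for the root element directly instead of searching for '/a'.
root_tag = tree.getroot().tag

# Relative paths are evaluated against the root element, so they only
# ever see its descendants.
child = tree.find("./b")    # the <b> child of the root
missing = tree.find("./a")  # the root is not its own child
```

This sidesteps the question of what a leading '/' means, at the cost of handling the root element as a special case.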
Hello Stefan, Friday, May 26, 2006, 5:44:52 AM, you wrote: [...]
Honestly, unless there are good reasons to do it, I'm absolutely +1 for keeping the current state.
Me too, don't worry. I forgot about the way ET handles namespaces and that is incompatible enough so that a compatibility function should be kept. I'll just use xpath() and there is no problem about it.
-- Best regards, Steve mailto:howe@carcass.dhs.org
On Fri, 2006-05-26 at 10:00 +0200, Stefan Behnel wrote:
We had the discussion pop up a few times if implementing findall() through xpath() would be a good idea. It was generally agreed (and demonstrated in code) that this would too easily break ET compatibility, which was not considered worth it.
As far as I remember, nobody just cared enough, and compatibility was broken only in cases where Fredrik was testing the incompleteness of his 'semi-xpath' implementation, for example testing that '//' is an invalid expression, or that there cannot be '[..]' selectors after a node name. In the cases where useful functionality was tested, there were no failures.
participants (6)
- Andrey Tatarinov
- Fredrik Lundh
- Martijn Faassen
- Stefan Behnel
- Steve Howe
- Tres Seaver