Mailman 3 [lxml-dev] Bug in XPath evaluation - lxml - The Python XML Toolkit

[lxml-dev] Bug in XPath evaluation

Torsten Rehn

22 Apr 2007

3:05 p.m.

Hi list,

here's what I have:

poc.xml:

    <?xml version="1.0" encoding="utf-8" ?>
    <myrootnode>
      <myns:mynode xmlns:myns="http.//www.example.com/myns">
        <myns:mysubnode>some text</myns:mysubnode>
      </myns:mynode>
    </myrootnode>

poc.py:

    #!/usr/bin/env python
    from lxml import etree
    DocTree = etree.parse("poc.xml")
    QueryResult = DocTree.xpath("//myns:mynode")

The result (with added version info):

    [gentop][scel@/home/scel/workspace/lxmlbug] > ./poc.py
    lxml.etree: (1, 2, 1, 0)
    libxml used: (2, 6, 27)
    libxml compiled: (2, 6, 27)
    libxslt used: (1, 1, 17)
    libxslt compiled: (1, 1, 17)
    Traceback (most recent call last):
      File "./poc.py", line 9, in ?
        QueryResult = DocTree.xpath("//myns:mynode")
      File "etree.pyx", line 1256, in etree._ElementTree.xpath
      File "xpath.pxi", line 75, in etree._XPathEvaluatorBase.evaluate
      File "xpath.pxi", line 212, in etree.XPathDocumentEvaluator.__call__
      File "xpath.pxi", line 105, in etree._XPathEvaluatorBase._handle_result
      File "xpath.pxi", line 93, in etree._XPathEvaluatorBase._raise_parse_error
    etree.XPathSyntaxError: error in xpath expression

The expression however, is valid (or I'm just insanely stupid). I tested the same query on the same data using http://dmag.upf.edu/contorsion/query.jsp and it worked just as it should.

Strangely, //*[name()='myns:mynode'] works with lxml.

Regards,
Torsten

--
Torsten Rehn <scel@users.sourceforge.net>


Stefan Behnel

23 Apr

1:24 a.m.

New subject: [lxml-dev] Bug in XPath evaluation - not a bug :)

Hi, Torsten Rehn wrote:

...

poc.xml:

    <?xml version="1.0" encoding="utf-8" ?>
    <myrootnode>
      <myns:mynode xmlns:myns="http.//www.example.com/myns">
        <myns:mysubnode>some text</myns:mysubnode>
      </myns:mynode>
    </myrootnode>

poc.py:

    #!/usr/bin/env python
    from lxml import etree
    DocTree = etree.parse("poc.xml")
    QueryResult = DocTree.xpath("//myns:mynode")

You should pass the namespace-prefix mapping to lxml. See the docs on this topic: http://codespeak.net/lxml/dev/xpathxslt.html#xpath

...

The result (with added version info):

    [gentop][scel@/home/scel/workspace/lxmlbug] > ./poc.py
    lxml.etree: (1, 2, 1, 0)
    libxml used: (2, 6, 27)
    libxml compiled: (2, 6, 27)
    libxslt used: (1, 1, 17)
    libxslt compiled: (1, 1, 17)
    Traceback (most recent call last):
      File "./poc.py", line 9, in ?
        QueryResult = DocTree.xpath("//myns:mynode")
      File "etree.pyx", line 1256, in etree._ElementTree.xpath
      File "xpath.pxi", line 75, in etree._XPathEvaluatorBase.evaluate
      File "xpath.pxi", line 212, in etree.XPathDocumentEvaluator.__call__
      File "xpath.pxi", line 105, in etree._XPathEvaluatorBase._handle_result
      File "xpath.pxi", line 93, in etree._XPathEvaluatorBase._raise_parse_error
    etree.XPathSyntaxError: error in xpath expression

As expected. Undefined prefixes are invalid.

Stefan
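For readers following along, the advice to pass a namespace-prefix mapping can be sketched roughly as below. This is only an illustration: the document is inlined instead of read from poc.xml, and the variable names are made up for the example. The `namespaces` keyword is how current lxml releases accept the mapping.

```python
from lxml import etree

# Same content as poc.xml from the original post, inlined so the sketch
# is self-contained (the original script read it from disk).
POC_XML = b"""<?xml version="1.0" encoding="utf-8" ?>
<myrootnode>
  <myns:mynode xmlns:myns="http.//www.example.com/myns">
    <myns:mysubnode>some text</myns:mysubnode>
  </myns:mynode>
</myrootnode>"""

doc_tree = etree.ElementTree(etree.fromstring(POC_XML))

# The prefix in the expression is bound here, by the caller.
# It is not looked up in the document's xmlns declarations.
ns = {"myns": "http.//www.example.com/myns"}
query_result = doc_tree.xpath("//myns:mynode", namespaces=ns)

for node in query_result:
    print(node.tag)
```

The URI string has to match the one declared in the document character for character, including the unusual "http." spelling used in the example file.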

Torsten Rehn

10:54 a.m.

On Mon, 2007-04-23 at 08:24 +0200, Stefan Behnel wrote:

...

You should pass the namespace-prefix mapping to lxml. See the docs on this topic:

http://codespeak.net/lxml/dev/xpathxslt.html#xpath

Ah, looking at the development version's page obviously helps ;)

...

...
etree.XPathSyntaxError: error in xpath expression

As expected. Undefined prefixes are invalid.

But it is valid XPath 1.0, isn't it? I'm just a little confused by the term "XPath Syntax Error". As far as I understand the issue, the problem is not with the syntax but with lxml (or whatever lies beneath) not supporting some of it (which is ok with the W3C recommendation).

I'm making that much of a problem out of it because my app processes XML documents that use namespaces quite extensively. And these namespaces may be different for every XML doc that comes along, so I would have to scan the file for xmlns attributes first (and then call the .xpath() method with the second argument as described on the page you posted), which is kind of ugly in my opinion. In my specific scenario it is a lot harder to get the namespace URI than to get the namespace prefix.

Is there a good reason I am overlooking for why I can use name() in a predicate to find my node without the URI, but cannot use the better looking abbreviated syntax without an explicit predicate?

Stefan Behnel

11:09 a.m.

Hi, Torsten Rehn wrote:

...

On Mon, 2007-04-23 at 08:24 +0200, Stefan Behnel wrote:

...
You should pass the namespace-prefix mapping to lxml. See the docs on this topic:

http://codespeak.net/lxml/dev/xpathxslt.html#xpath

Ah, looking at the development version's page obviously helps ;)

Actually it's reading the documentation which helps: http://codespeak.net/lxml/api.html#xpath It's been in there for at least a year.

...

...
...
etree.XPathSyntaxError: error in xpath expression

As expected. Undefined prefixes are invalid.

But it is valid XPath 1.0, isn't it? I'm just a little confused by the term "XPath Syntax Error". As far as I understand the issue, the problem is not with the syntax but with lxml (or whatever lies beneath) not supporting some of it (which is ok with the W3C recommendation). I'm making that much of a problem out of it because my app processes XML documents that use namespaces quite extensively. And these namespaces may be different for every XML doc that comes along, so I would have to scan the file for xmlns attributes first (and then call the .xpath() method with the second argument as described on the page you posted),

So you're really ignoring the namespace and just looking at the prefix? That's definitely an unusual use case. What's the use in accepting any namespace in an XPath expression as long as the prefix is the same? I mean, honestly, the prefix doesn't tell you anything, right? Stefan

Martijn Faassen

3:06 p.m.

Torsten Rehn wrote:

...

On Mon, 2007-04-23 at 08:24 +0200, Stefan Behnel wrote:

...
You should pass the namespace-prefix mapping to lxml. See the docs on this topic:

http://codespeak.net/lxml/dev/xpathxslt.html#xpath

Ah, looking at the development version's page obviously helps ;)

...
...
etree.XPathSyntaxError: error in xpath expression

As expected. Undefined prefixes are invalid.

But it is valid XPath 1.0, isn't it? I'm just a little confused by the term "XPath Syntax Error". As far as I understand the issue, the problem is not with the syntax but with lxml (or whatever lies beneath) not supporting some of it (which is ok with the W3C recommendation).

I think it is indeed confusing that we call it an XPath Syntax Error. The XPath expression is indeed correct; we just haven't supplied it with enough information. I wonder if there's a way we can detect this specific problem and raise something like an XPathNamespaceError instead?

I think this one bites people quite frequently, as people often forget that the prefixes in XPath are not looked up in the document but are independent of it, just as the prefixes in different documents are independent of each other.

Regards,
Martijn
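The independence of XPath prefixes from document prefixes can be illustrated with a small sketch. The URI `urn:example:myns` and the prefix `x` are invented for this example. Which exception subclass an unbound prefix produces is left open here: the thread shows XPathSyntaxError for lxml 1.2.1, and later releases may report it differently, so the sketch only relies on the shared `XPathError` base class.

```python
from lxml import etree

root = etree.fromstring(
    '<myrootnode><myns:mynode xmlns:myns="urn:example:myns"/></myrootnode>'
)

# "x" never appears in the document. Only the URI it is bound to has to
# match the URI of the element we are looking for.
hits = root.xpath("//x:mynode", namespaces={"x": "urn:example:myns"})

# Without any binding, the prefix in the expression cannot be resolved,
# even though the document itself declares "myns".
try:
    root.xpath("//myns:mynode")
    unbound_prefix_failed = False
except etree.XPathError:
    unbound_prefix_failed = True

print(len(hits), unbound_prefix_failed)
```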

Martijn Faassen

3:13 p.m.

Hey, Stefan Behnel wrote: [Torsten Rehn]

...

...
I'm making that much of a problem out of it because my app processes XML documents that use namespaces quite extensively. And these namespaces may be different for every XML doc that comes along, so I would have to scan the file for xmlns attributes first (and then call the .xpath() method with the second argument as described on the page you posted),

Unfortunately any lxml implementation of this behavior would have to do the same internally, so this is not an easy one to implement.

...

So you're really ignoring the namespace and just looking at the prefix? That's definitely an unusual use case.

Agreed, that is indeed odd. Makes me want to find out more. :) You have documents that use namespaces extensively, but they vary widely in the kinds of namespace URIs they use for the same prefixes? How did you arrive in such a situation?

...

What's the use in accepting any namespace in an XPath expression as long as the prefix is the same? I mean, honestly, the prefix doesn't tell you anything, right?

To make sure Torsten understands, ignoring the prefixes and looking at namespace URIs *is* the proper behavior for XML software. The prefixes are nothing but a shortcut, a temporary name, to refer to the namespace URI. This leads to confusion, and is why the ElementTree API in fact includes the whole namespace URI in the element names instead:

    "{http://mynamespace}foo"

("Clark notation")

ElementTree is rather strict in ignoring the prefixes entirely, which can be a bit frustrating if you are interested in the presentation of the XML document in the end. lxml follows ElementTree but offers various ways to do things with prefixes.

Unfortunately, in XPath the compromise is to use prefixes only to spell out the XPath expression, as using the fully qualified names would not be XPath compatible. Occasionally we've had some discussions about offering an API to do XPath queries using Clark notation.

Regards,
Martijn
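As a rough illustration of the notation described above, the sketch below shows how an element's tag carries the namespace URI in braces while the prefix is kept separately. The URI `urn:example:myns` is made up for the example.

```python
from lxml import etree

root = etree.fromstring(
    '<myrootnode><myns:mynode xmlns:myns="urn:example:myns"/></myrootnode>'
)
node = root[0]

# The tag holds the URI in braces followed by the local name.
tag = node.tag
# The prefix is still available, but it is presentation detail only.
prefix = node.prefix

# QName splits a namespaced name into its URI and local part.
qname = etree.QName(node)

# find() accepts the braces notation directly, so no prefix map is needed.
found = root.find("{urn:example:myns}mynode")

print(tag, prefix, qname.namespace, qname.localname)
```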

Stefan Behnel

24 Apr

1:20 a.m.

Hi Martijn, just a quick note here. Martijn Faassen wrote:

...

fully qualified names would not be XPath compatible. Occasionally we've had some discussions about offering an API to do XPath queries using Clark notation.

    >>> from lxml import etree
    >>> root = etree.Element("{testns}root")
    >>> etree.SubElement(root, "{testns}test")
    <Element {testns}test at b7da3464>

    >>> find = etree.ETXPath("{testns}test")
    >>> find(root)
    [<Element {testns}test at b7da3464>]

I guess that's actually still missing from the docs - it's been in there for a while...

Stefan
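A slightly larger sketch of the same ETXPath idea, applied to a document shaped like the one from the start of the thread. The URI `urn:example:myns` is an assumption made for the example. The point is that the braces notation replaces the prefix map entirely.

```python
from lxml import etree

root = etree.fromstring(
    '<myrootnode><myns:mynode xmlns:myns="urn:example:myns">'
    "<myns:mysubnode>some text</myns:mysubnode>"
    "</myns:mynode></myrootnode>"
)

# ETXPath takes the namespace URI in braces inside the expression itself,
# so no prefix-to-URI mapping has to be passed alongside it.
find_subnodes = etree.ETXPath("//{urn:example:myns}mysubnode")
texts = [element.text for element in find_subnodes(root)]

print(texts)
```

This does not remove the need to know the URI. It only moves it from a separate mapping into the expression.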

Martijn Faassen

7:50 a.m.

Hey, On 4/24/07, Stefan Behnel <stefan_ml@behnel.de> wrote:

...

just a quick note here.

Martijn Faassen wrote:

...
fully qualified names would not be XPath compatible. Occasionally we've had some discussions about offering an API to do XPath queries using Clark notation.

    >>> from lxml import etree
    >>> root = etree.Element("{testns}root")
    >>> etree.SubElement(root, "{testns}test")
    <Element {testns}test at b7da3464>

    >>> find = etree.ETXPath("{testns}test")
    >>> find(root)
    [<Element {testns}test at b7da3464>]

I guess that's actually still missing from the docs - it's been in there for a while...

Yeah. I remember discussions on this, but I didn't remember it getting implemented. Cool!

The docs still need tender loving care from a dedicated volunteer, and that shouldn't be you. Nobody can give the excuse that they don't know Pyrex here either, so we should have masses of volunteers standing up to contribute. :)

Regards,
Martijn

Torsten Rehn

10:19 a.m.

On Mon, 2007-04-23 at 22:13 +0200, Martijn Faassen wrote:

...
So you're really ignoring the namespace and just looking at the prefix? That's definitely an unusual use case.

Agreed, that is indeed odd. Makes me want to find out more. :) You have documents that use namespaces extensively, but they vary widely in the kinds of namespace URIs they use for the same prefixes? How did you arrive in such a situation?

I think we got a slight misunderstanding here. In my situation, each prefix belongs to exactly one namespace. Here's an example of what I'd like to do:

Let's say there is a store that has both a print catalogue and an online shop. For whatever reason (this is a very stupid example) we want some of the items being sold to appear in the print catalogue and some others in the eshop. Here is the XML data that describes the items we sell:

    <itemlist>
      <item>
        <name>TurboItem</name>
        <price>23</price>
      </item>
      <item>
        <name>SuperItem</name>
        <price>42</price>
      </item>
    </itemlist>

Now I want some way to "tag" each item either for print or eshop. But (and here's the twist): without altering the structure of the XML data. That means that I can't add an attribute to each <item> element or "encapsulate" the items like this:

    <thisgoestoprint>
      <item>...</item>
    </thisgoestoprint>
    <thisgoestoeshop>
      <item>...</item>
    </thisgoestoeshop>

However, adding namespace prefixes (and their xmlns definitions) is acceptable. If it had worked the way I intended it to in the beginning, the XPath expression "//print:item" would have returned all items that go into the print catalogue.

Now why do I want to avoid using the namespace URIs in the expression? In what I'm actually up to, there are a lot more options than just print and eshop. It shall be easy for users to handle a larger amount of these "options" and requiring users to write out namespace URIs just isn't convenient. Prefixes, however, are.

The only solution I see right now is to scan the XML data prior to the XPath query in order to map each prefix to its namespace URI. I do understand now that this is such an exotic use case that it wouldn't make much sense to have lxml do these mappings automatically if the second argument of .xpath() is omitted.

The reason I gave this rather lengthy example was to find out if anyone reading this has an idea of an alternative solution for my problem (applying metadata to specific parts of an XML document without making the XPath expressions to address these parts too complex).

Regards,
Torsten
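The "scan the XML data first" approach described above might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the helper name `collect_prefixes`, the `urn:example:...` URIs, and the way the print/eshop items are tagged. It also assumes, as stated in the message, that each prefix belongs to exactly one namespace throughout the document.

```python
from lxml import etree

ITEMS = """
<itemlist xmlns:print="urn:example:print" xmlns:eshop="urn:example:eshop">
  <print:item><name>TurboItem</name><price>23</price></print:item>
  <eshop:item><name>SuperItem</name><price>42</price></eshop:item>
</itemlist>
"""


def collect_prefixes(root):
    """Merge the nsmap of every element into one prefix -> URI dict."""
    prefixes = {}
    for element in root.iter("*"):
        for prefix, uri in element.nsmap.items():
            if prefix is None:
                # A default namespace has no prefix that XPath could use.
                continue
            # The first binding seen in document order wins.
            prefixes.setdefault(prefix, uri)
    return prefixes


root = etree.fromstring(ITEMS)
ns = collect_prefixes(root)
print_items = root.xpath("//print:item/name/text()", namespaces=ns)

print(ns, print_items)
```

If a document rebinds the same prefix to a different URI somewhere deeper in the tree, a flat dictionary like this cannot represent both bindings, so the sketch would silently keep only the first one.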

jholg＠gmx.de

11:18 a.m.

Hi,

...

The only solution I see right now is to scan the XML data prior to the XPath query in order to map each prefix to its namespace-uri. I do understand now that this is such an exotic use case that it wouldn't make much sense to have lxml do these mappings automatically if the second argument of .xpath() is omitted. The reason I gave this rather lengthy example was to find out if anyone reading this has an idea of an alternative solution for my problem (applying metadata to specific parts of an XML document without making the XPath expressions to address these parts too complex).

Maybe you can take advantage of nsmap (don't get confused by the result output, I'm using the lxml.objectify notation)?

    >>> root = etree.fromstring("""
    ... <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    ...     xmlns:py="http://codespeak.net/lxml/objectify/pytype"
    ...     xmlns:other="otherURI"
    ...     xmlns="myURI"
    ...     version="v2.0">
    ... <a attr1="foo" attr2="bar">1</a>
    ... <a py:pytype="float">1.2</a>
    ... <a py:pytype="str">1.2</a>
    ... 1
    ... 2
    ... 2
    ... <c>what</c>
    ... <c>is</c>
    ... <c>this</c>
    ... <c>good</c>
    ... <c>for?</c>
    ... <d/>
    ... <e>2006/08/09 13:19:01.000000+02:00</e>
    ... <other:e>from another namespace</other:e>
    ... <sub1>
    ...   <sub2>
    ...     <sub3>
    ...       <other:x>387.38</other:x>
    ...     </sub3>
    ...   </sub2>
    ... </sub1>
    ... <sub1>
    ...   <sub2>
    ...     <sub3>
    ...       <other:x>387.38</other:x>
    ...     </sub3>
    ...   </sub2>
    ... </sub1>
    ... <sub1>
    ...   <sub2>
    ...     <sub3>
    ...       <other:x>387.38</other:x>
    ...     </sub3>
    ...   </sub2>
    ... </sub1>
    ... </root>
    ... """)
    >>> prefixDict = dict(root.nsmap)
    >>> del prefixDict[None]
    >>> prefixDict[''] = root.nsmap[None]
    >>> print etree.XPath('//other:x', prefixDict)(root)
    [Decimal("387.38"), Decimal("387.38"), Decimal("387.38")]

What's not so nice is that nsmap uses None for the empty prefix whereas XPath seems to expect an empty string in the prefix-URI-dict. Plus I'm not sure if you can simply use the root element nsmap, as I did here.

Holger
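A hedged variant of the nsmap idea for current lxml and Python 3 is sketched below. As far as I know, recent lxml releases reject an empty or None prefix in the mapping, so the default namespace is bound to an arbitrary prefix instead. The prefix `d` is purely hypothetical and would collide if the document already used it. The sample document is a cut-down stand-in for the one in the message.

```python
from lxml import etree

root = etree.fromstring(
    '<root xmlns="myURI" xmlns:other="otherURI">'
    "<sub1><other:x>387.38</other:x></sub1>"
    "<c>what</c>"
    "</root>"
)

# Copy the root element's nsmap, leaving out the None key that stands for
# the default namespace.
ns = {prefix: uri for prefix, uri in root.nsmap.items() if prefix is not None}

# XPath 1.0 has no default namespace for element names, so the default
# namespace gets an explicit (made-up) prefix for use in expressions.
if None in root.nsmap:
    ns["d"] = root.nsmap[None]

find_x = etree.XPath("//other:x", namespaces=ns)
x_values = [element.text for element in find_x(root)]
c_values = root.xpath("//d:c/text()", namespaces=ns)

print(x_values, c_values)
```

This only sees namespaces declared on or above the root element. Declarations that first appear deeper in the tree would need a scan over all elements.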

Torsten Rehn

11:40 a.m.

I'll look into that, but it seems as if it were just what I've been looking for. Thank you :) Torsten
