Baffled by documentation of namespace in XPath

I have always found it difficult to wrap my head around the details of namespaces in XML processing, but I am completely baffled by the discussion and examples of namespaces in the XPath section. Consider the following minimal TEI document:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_bare.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>
    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>
        <fileDesc>
          <titleStmt>
            <title>A minimal TEI document</title>
          </titleStmt>
          <publicationStmt>
            <p>unpublished</p>
          </publicationStmt>
          <sourceDesc>
            <p>born digital</p>
          </sourceDesc>
        </fileDesc>
      </teiHeader>
      <text>
        <front>
          <div><head>Preface</head>
            <p>This is the preface</p></div>
        </front>
        <body>
          <div><head>Chapter 1</head>
            <div><head>Subsection 1.1</head><p>The text of 1.1</p></div>
            <div><head>Subsection 1.2</head><p>The text of 1.2</p></div>
          </div>
        </body>
      </text>
    </TEI>

If I process this with xquery, I need to have a namespace declaration of the type

    declare namespace tei = "http://www.tei-c.org/ns/1.0";

The miniscript

    xquery version "1.0";
    declare namespace tei = "http://www.tei-c.org/ns/1.0";
    let $text := doc('/users/martin/dropbox/learnpython.txt/bareguineapig.xml')
    return $text/tei:TEI//tei:front//tei:p

will return

    <?xml version="1.0" encoding="UTF-8"?>
    <p xmlns="http://www.tei-c.org/ns/1.0">This is the preface</p>

What is the equivalent for this normal case of an XML document? The documentation gives the following XML fragment

    <a:foo xmlns:a="http://codespeak.net/ns/test1"...
           xmlns:b="http://codespeak.net/ns/test2">
      <b:bar>Text</b:bar>
    </a:foo>

I've never seen XML documents that use namespace prefixes in the closing tags, and I can't figure out how to apply the information from the documentation to my case. I know how to process the document without a namespace. For instance

    from lxml import etree

    f = '/users/martin/dropbox/learnpython.txt/bareguineapignonamespace.xml'
    tree = etree.parse(f)
    r = tree.xpath('/TEI/text/front')
    for element in r[0].iter('p'):
        print(element.text)

will print out "This is the preface." But what do I need to do to get the same result when the TEI root element contains the tei namespace?

I also wonder whether there is an error in this code:

    r = doc.xpath('/t:foo/b:bar',
                  namespaces={'t': 'http://codespeak.net/ns/test1',
                              'b': 'http://codespeak.net/ns/test2'})

Shouldn't the 't' be 'a'? It doesn't seem to affect the way the code works, though.

Martin Mueller
Professor of English and Classics
Northwestern University

On 29/11/2012 18:30, Martin Mueller wrote:
Your source XML contains xmlns="http://www.tei-c.org/ns/1.0", which is a *default namespace declaration*. It means that every element name in scope will implicitly have the namespace URI http://www.tei-c.org/ns/1.0. You can see this with:

    print r.tag
    {http://www.tei-c.org/ns/1.0}TEI

lxml uses the {ns-URI}local-name form here because the prefix is not relevant. Only the URI is.

The other syntax, as in <a:foo xmlns:a="http://codespeak.net/ns/test1">, is a *namespace prefix declaration*. It associates a prefix with a namespace URI. In lxml you would have:

    element.tag
    {http://codespeak.net/ns/test1}foo

No, this code example is correct. What matters is that the t prefix in the XPath expression matches the t key in the namespaces dict passed to xpath(). These prefixes do *not* have to be the same as in the source document.

Also, XPath 1.0 does not have default namespaces, so you have to pick a prefix for every namespace URI that you might need in your expression, even if that URI happened to be the default in the source document.

Cheers,
--
Simon Sapin
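
To make the point about prefixes concrete, here is a minimal sketch, assuming lxml is installed. The document is a shortened copy of the TEI sample from the question; the prefix t and all variable names are arbitrary choices made for illustration, not anything lxml prescribes.

```python
from lxml import etree

# Shortened copy of the TEI sample from the question (default namespace kept).
TEI_XML = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>A minimal TEI document</title></titleStmt></fileDesc>
  </teiHeader>
  <text>
    <front><div><head>Preface</head><p>This is the preface</p></div></front>
    <body><div><head>Chapter 1</head><p>The text of chapter 1</p></div></body>
  </text>
</TEI>"""

root = etree.fromstring(TEI_XML)

# "t" is a made-up prefix; only the URI has to match the document.
NS = {"t": "http://www.tei-c.org/ns/1.0"}

# Every step of the path carries the prefix, because XPath 1.0 has no
# default namespace.
front_paragraphs = root.xpath("/t:TEI/t:text/t:front//t:p", namespaces=NS)
front_texts = [p.text for p in front_paragraphs]
print(front_texts)

# The unprefixed path looks for elements in no namespace and finds nothing.
unprefixed = root.xpath("/TEI/text/front//p")
print(unprefixed)
```

Replacing t with any other prefix in both the expression and the dict should give the same result, since the prefix is only a local label for the URI.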

Simon Sapin, 29.11.2012 18:57:
Yep, the key thing here is that the prefixes in the document do not matter at all. Just make up your own.

Alternatively, there are the find*() methods on Elements and ElementTrees that provide a simple XPath subset but use the fully qualified "{namespace}localname" syntax. If you want to avoid the indirection of prefixes, use that. It's sufficient for the examples you presented above. It's also faster in many cases because it can make stronger assumptions about lxml's internal tree configuration, which the generic XPath implementation cannot.

http://lxml.de/tutorial.html#elementpath

Stefan
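
The find*() route can be sketched as follows. This is an illustration under assumptions: the sample document and the idno value A00001 are invented, and the fallback import relies on the standard library's xml.etree.ElementTree offering the same find(), findall() and iter() calls for this simple case.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module behaves the same for these calls.
    import xml.etree.ElementTree as etree

# Namespace URI in "{namespace}localname" form, ready to prepend to tag names.
TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented sample; the idno value is a placeholder.
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><publicationStmt><idno>A00001</idno></publicationStmt></fileDesc>
  </teiHeader>
  <text><body><p>A long body would go here</p></body></text>
</TEI>"""

root = etree.fromstring(SAMPLE)

# find() with a bare tag name checks only the direct children of root.
header = root.find(TEI + "teiHeader")

# iter() called on the header walks the header subtree only.
idnos = [el.text for el in header.iter(TEI + "idno")]
print(idnos)

# The same search written as a single ElementPath expression.
idnos_via_path = [
    el.text for el in root.findall(TEI + "teiHeader//" + TEI + "idno")
]
print(idnos_via_path)
```

Because the search starts from the teiHeader element, it should not need to visit anything under text, however large that part of the document is.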

Thank you for the prompt responses, but they don't quite answer my question. I understand the find*() methods with their James Clark notation, but unless I underestimate the limits of that routine it doesn't do what I want to do. I want to loop through 2,000 or perhaps 40,000 documents extracting data from a teiHeader. A TEI document has this basic structure:

    <TEI>
      <teiHeader>{header elements}</teiHeader>
      <text>{text elements}</text>
    </TEI>

I want to look for <idno>, which I know to occur only in the <teiHeader>. The teiHeader is always quite short; the <text> element may have between 1000 and a million <w> elements. So the code fragment

    r = tree.xpath('/TEI/teiHeader')

would pick the short header element that I can then loop through with the code

    for element in r[0].iter('idno'):
        {do something}

If, on the other hand, I use the simple method

    tei = '{http://www.tei-c.org/ns/1.0}'
    text = etree.parse(someTEIdocumentwithnamespace)
    for element in text.iter(tei+'idno'):
        {do something}

the program must grind through the entire text, which takes a long time.

Given a TEI text with the xmlns attribute removed, the following code works:

    parsedTEI = etree.parse(some TEI text without a namespace attribute)
    teiHeaderfragment = parsedTEI.xpath('/TEI/teiHeader')
    for element in teiHeaderfragment[0].iter('idno'):
        {do something}

So my question is: What do I need to add to this code if the root element isn't <TEI> but contains the more usual notation <TEI xmlns="http://www.tei-c.org/ns/1.0">? Is there a simple answer to that question? I couldn't find it in the current documentation.

The more general version of this question is: How do I restrict my search in a first step to a particular XPath of a document? From some ElementTree examples in the NLTK book, I gather you can do this with nested "for" clauses. But that seems rather inelegant. What I need is code that lets me target a particular XPath in an XML document with a namespace as a first step in the operation.

Is that something that a simple soul like myself can be taught, or am I better off sticking with xquery, which is hard enough but where I sort of know how to deal with questions of this kind?

Martin Mueller
Professor of English and Classics
Northwestern University

On 11/29/12 3:07 PM, "Stefan Behnel" <stefan_ml@behnel.de> wrote:

On 29.11.2012, at 23:39, Martin Mueller <martinmueller@northwestern.edu> wrote:
the program must grind through the entire text, which takes a long time.
For XPath, you must assign the namespace to a prefix, analogous to the xmlns: attribute in XML. Sample XML:
The prefix is just a label for the namespace, used in front of element names. It is independent of the prefixes (or default namespaces) used in the XML document, so the XPath expression does not depend on the xmlns declarations in the XML document.
    xml.xpath('x:teiHeader//x:idno', namespaces={'x': 'my_namespace'})
    [<Element {my_namespace}idno at 0x10258a4b0>]
With another prefix, same result:
    xml.xpath('other:teiHeader//other:idno', namespaces={'other': 'my_namespace'})
    [<Element {my_namespace}idno at 0x10258a4b0>]
If there is always exactly one teiHeader, this works too:
hth, jens
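
Applied to the TEI case from the question, the prefix approach might look like the sketch below. It assumes lxml is installed. The helper name header_idnos, the prefix t, and the sample document with its idno values are all made up for illustration; the idno inside text is there only to show that the search does not reach it.

```python
import io

from lxml import etree

# "t" is an arbitrary prefix bound to the TEI namespace URI.
NS = {"t": "http://www.tei-c.org/ns/1.0"}


def header_idnos(source):
    """Return the text of every idno found inside the teiHeader.

    `source` is a file name or file-like object (hypothetical helper).
    """
    tree = etree.parse(source)
    # The absolute path limits the search to the header subtree.
    hits = tree.xpath("/t:TEI/t:teiHeader//t:idno", namespaces=NS)
    return [el.text for el in hits]


# Invented stand-in for one TEI file on disk.
sample = io.BytesIO(b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><publicationStmt><idno>HDR-1</idno></publicationStmt></fileDesc>
  </teiHeader>
  <text><body><p>See <idno>not-in-header</idno></p></body></text>
</TEI>""")

result = header_idnos(sample)
print(result)
```

For a batch of files the helper could be called once per file name in a loop. Note that etree.parse() still reads each whole file; only the search is restricted to the header.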

participants (4): jens quade, Martin Mueller, Simon Sapin, Stefan Behnel