Hi,

I got the following different results. Could anybody let me know how to always extract the entry element whether "xmlns" exists or not? Thanks.

$ cat main.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

import sys
from lxml import etree
tree = etree.parse(sys.stdin)
print [etree.tostring(e) for e in tree.iterfind('.//entry')]

$ ./main.py <<EOF
<feed xmlns="http://www.w3.org/2005/Atom">
 <entry>
 <id>http://arxiv.org/abs/1712.02316v1</id>
 </entry>
</feed>
EOF
[]
$ ./main.py <<EOF
<feed>
 <entry>
 <id>http://arxiv.org/abs/1712.02316v1</id>
 </entry>
</feed>
EOF
<feed>
['<entry>\n <id>http://arxiv.org/abs/1712.02316v1</id>\n </entry>\n']

-- 
Regards,
Peng
Hi,

On Tue, 17 Apr 2018 23:31:46 -0500, Peng Yu <pengyu.ut@gmail.com> wrote:
> I got the following different results. Could anybody let me know how to always extract the entry element whether "xmlns" exists or not?
First of all, this is the expected result. What you are missing is the so-called "namespace". According to Wikipedia, "XML namespaces are used for providing uniquely named elements and attributes in an XML document." The "xmlns" attribute is a namespace declaration (xmlns = "XML NameSpace"). You need to search explicitly for an element in a specific namespace.

For example, if you have this element:

  <feed xmlns="http://www.w3.org/2005/Atom">

then the element has a local name ("feed") which is "bound" to the Atom namespace "http://www.w3.org/2005/Atom". It's usually written in the so-called "Clark notation":

  {http://www.w3.org/2005/Atom}feed

On the other hand, if you have this element:

  <feed>

then it's bound to no namespace. As such, the Clark notation has an empty namespace part (lxml reports such a tag simply as "feed"):

  {}feed

For XML, these two elements are different: although the local name ("feed") is the same, they belong to different namespaces. If you compare these elements, you have to take the namespace into account. Therefore, they are not the same:

  {http://www.w3.org/2005/Atom}feed != {}feed

Look here for more information: http://lxml.de/tutorial.html#namespaces
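To see the difference between the two elements from Python, here is a minimal sketch that parses both variants and collects the tag names as the parser reports them. The variable names are only illustrative, and the fallback to the standard library's xml.etree.ElementTree is an assumption for systems without lxml; for these particular calls the two behave the same.

```python
# Illustrative sketch: how a namespace declaration changes element tags.
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library if lxml is missing.
    import xml.etree.ElementTree as etree

ATOM = "http://www.w3.org/2005/Atom"

# Same local names, once inside the Atom namespace and once in no namespace.
with_ns = etree.fromstring('<feed xmlns="%s"><entry/></feed>' % ATOM)
without_ns = etree.fromstring('<feed><entry/></feed>')

# Tags of namespaced elements are reported in Clark notation.
tags_with_ns = [el.tag for el in with_ns.iter()]
tags_without_ns = [el.tag for el in without_ns.iter()]

print(tags_with_ns)
print(tags_without_ns)
```

The child "entry" inherits the default namespace from "feed", which is why a search for a plain "entry" does not match it.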
> $ cat main.py
> #!/usr/bin/env python
> # vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:
>
> import sys
> from lxml import etree
> tree = etree.parse(sys.stdin)
> print [etree.tostring(e) for e in tree.iterfind('.//entry')]
>
> $ ./main.py <<EOF
> <feed xmlns="http://www.w3.org/2005/Atom">
>  <entry>
>  <id>http://arxiv.org/abs/1712.02316v1</id>
>  </entry>
> </feed>
> EOF
> []
Back to your code. You basically say in your XPath expression: "Give me all entry elements which are bound to no namespace". Of course, this will give you the expected result: zero elements! ;)

You have to use a prefix and a namespace. The prefix is just an "abbreviation" for the longer namespace. If you use this notation, you will get the expected result:

  [etree.tostring(e) for e in tree.iterfind('.//a:entry', namespaces={'a': 'http://www.w3.org/2005/Atom'})]

  ['<entry xmlns="http://www.w3.org/2005/Atom">...</entry>\n']

You can use whatever prefix you like. Usually, you use a few characters or a very short name. Easier to type. ;)

Of course, you can extend your XPath expression to get entry elements in no namespace or in the Atom namespace. The union operator "|" is not part of the path subset that iterfind() understands, so use xpath() for this:

  [etree.tostring(e) for e in tree.xpath('.//a:entry|.//entry', namespaces={'a': 'http://www.w3.org/2005/Atom'})]

Hope that helps. :)

-- 
Regards,
Thomas Schraitle
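For reference, the "Atom namespace or no namespace" lookup can also be sketched with two plain iterfind() calls, which avoids the need for a union expression. This is only a sketch: the names find_entries and ATOM_NS are made up for illustration, and the fallback to the standard library's xml.etree.ElementTree is an assumption for systems without lxml.

```python
# Illustrative sketch: find <entry> whether or not the feed declares xmlns.
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library if lxml is missing.
    import xml.etree.ElementTree as etree

# The prefix "a" is arbitrary; only the namespace URI matters.
ATOM_NS = {"a": "http://www.w3.org/2005/Atom"}


def find_entries(root):
    """Return entry elements bound to the Atom namespace or to no namespace."""
    namespaced = list(root.iterfind(".//a:entry", namespaces=ATOM_NS))
    plain = list(root.iterfind(".//entry"))
    # Atom entries come first, then un-namespaced ones (not document order).
    return namespaced + plain


atom_feed = etree.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<entry><id>http://arxiv.org/abs/1712.02316v1</id></entry>'
    '</feed>'
)
plain_feed = etree.fromstring(
    '<feed><entry><id>http://arxiv.org/abs/1712.02316v1</id></entry></feed>'
)

for feed in (atom_feed, plain_feed):
    print([entry.tag for entry in find_entries(feed)])
```

If a document mixes both kinds of entries and their order matters, the xpath() union is the better fit, since it returns the results in document order.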