xmlns:schemaLocation is not a valid URI
Hi,

I'm trying to make this work:

    import urlparse
    import urllib2
    from StringIO import StringIO
    from lxml import etree

    url = 'https://www.geoportal.lt/inspire-geoportal/csw'
    data = '''\
    <?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
    <csw:GetRecords
        xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
        xmlns:ogc="http://www.opengis.net/ogc"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        outputSchema="http://www.isotc211.org/2005/gmd"
        outputFormat="application/xml"
        version="2.0.2"
        service="CSW"
        resultType="results"
        maxRecords="10"
        xsi:schemaLocation="http://www.opengis.net/cat/csw/2.0.2 http://schemas.opengis.net/csw/2.0.2/CSW-discovery.xsd">
      <csw:Query typeNames="csw:Record">
        <csw:ElementSetName>brief</csw:ElementSetName>
        <ogc:SortBy>
          <ogc:SortProperty>
            <ogc:PropertyName>dc:identifier</ogc:PropertyName>
            <ogc:SortOrder>ASC</ogc:SortOrder>
          </ogc:SortProperty>
        </ogc:SortBy>
      </csw:Query>
    </csw:GetRecords>'''

    u = urlparse.urlsplit(url)
    r = urllib2.Request(url, data)
    r.add_header('User-Agent', 'OWSLib (https://geopython.github.io/OWSLib)')
    r.add_header('Content-type', 'text/xml')
    r.add_header('Content-length', '%d' % len(data))
    r.add_header('Accept', 'text/xml')
    r.add_header('Accept-Language', 'en-US')
    r.add_header('Accept-Encoding', 'gzip,deflate')
    r.add_header('Host', u.netloc)
    resp = urllib2.urlopen(r)
    content = resp.read()

    etree.parse(StringIO(content))

But I get the following error:

    Traceback (most recent call last):
      /home/sirex/other/opendata.gov.lt/tests/giscentrasharvester.py|68| in test_lxml_redirects
        etree.parse(StringIO(content))
      src/lxml/lxml.etree.pyx|3442| in lxml.etree.parse (src/lxml/lxml.etree.c:81716)
      src/lxml/parser.pxi|1828| in lxml.etree._parseDocument (src/lxml/lxml.etree.c:118859)
      src/lxml/parser.pxi|1848| in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:119128)
      src/lxml/parser.pxi|1736| in lxml.etree._parseDoc (src/lxml/lxml.etree.c:117808)
      src/lxml/parser.pxi|1102| in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:112052)
      src/lxml/parser.pxi|595| in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105896)
      src/lxml/parser.pxi|706| in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:107604)
      src/lxml/parser.pxi|635| in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:106458)
    XMLSyntaxError: xmlns:schemaLocation: 'http://www.isotc211.org/2005/gmd http://schemas.opengis.net/iso/19139/20060504/gmd/gmd.xsd' is not a valid URI, line 5, column 260 (line 5)

Attaching the content file.

-- 
Mantas aka sirex
http://t.me/sirexo
After more experimentation I found that if I add "/" to the end of "http://www.isotc211.org/2005/gmd", parsing works.

"http://www.isotc211.org/2005/gmd" returns HTTP/1.1 301 Moved Permanently, so I guess lxml does not support redirects?

On 2017-08-21, Mantas wrote:
After more digging I found that the XML is invalid.

For example, this will work:

    from StringIO import StringIO
    from lxml import etree

    etree.parse(StringIO('''\
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns:gmd="http://www.isotc211.org/2005/gmd"
          xsi:schemaLocation="http://www.isotc211.org/2005/gmd http://schemas.opengis.net/iso/19139/20060504/gmd/gmd.xsd">
    </root>
    '''))

But if I change xsi:schemaLocation to xmlns:schemaLocation, then lxml fails:

    etree.parse(StringIO('''\
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <root xmlns:gmd="http://www.isotc211.org/2005/gmd"
          xmlns:schemaLocation="http://www.isotc211.org/2005/gmd http://schemas.opengis.net/iso/19139/20060504/gmd/gmd.xsd">
    </root>
    '''))

    XMLSyntaxError: xmlns:schemaLocation: 'http://www.isotc211.org/2005/gmd http://schemas.opengis.net/iso/19139/20060504/gmd/gmd.xsd' is not a valid URI, line 3, column 120 (line 3)

I'm not an XML expert, but as I understand it, an xmlns:schemaLocation attribute holding two URLs separated by a space is invalid XML? The xmlns: prefix is reserved for namespace declarations, so the parser treats the whole value as a single namespace URI.

In that case, how can I fix that? Maybe the lxml parser has an option to ignore namespaces completely?

On 2017-08-21, sirex wrote:
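Two possible workarounds for the question above, as a minimal sketch rather than a confirmed fix. The first renames the bogus attribute back to xsi:schemaLocation before parsing; it assumes the xsi prefix is already declared on the same element or an ancestor, and the helper name fix_schema_location is made up for this example. The second relies on lxml's XMLParser(recover=True) option, which asks libxml2 to keep going after errors; whether it covers this particular namespace error in a given lxml/libxml2 version is an assumption worth checking against the real response. The sketch is written for Python 3 and works on bytes, since lxml rejects unicode strings that carry an encoding declaration.

```python
import re

# Matches the attribute name only when it starts a new attribute.
_BOGUS_ATTR = re.compile(rb'(\s)xmlns:schemaLocation(\s*=)')


def fix_schema_location(xml_bytes):
    """Rename xmlns:schemaLocation to xsi:schemaLocation (hypothetical helper).

    Plain byte substitution applied before parsing. Assumes the xsi
    prefix is declared somewhere in scope; it is not added here.
    """
    return _BOGUS_ATTR.sub(rb'\1xsi:schemaLocation\2', xml_bytes)


def parse_lenient(xml_bytes):
    """Parse with lxml in recovery mode instead of raising on errors.

    Assumption: recover=True lets the parser get past the
    "is not a valid URI" namespace error seen in this thread.
    """
    from lxml import etree  # third-party; imported lazily
    parser = etree.XMLParser(recover=True)
    return etree.fromstring(xml_bytes, parser)
```

With the response body from the first message in `content`, either `etree.fromstring(fix_schema_location(content))` or `parse_lenient(content)` would be the call to try. The rename keeps strict parsing for everything else in the document, while recovery mode may also hide unrelated errors.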
Participants (2): Mantas, sirex