Re: Cannot <include> from a network location
Just a thought: Might this be proxy- or https-related? Does it work if you locally serve the xs:included schema with http?
I *think* libxml2 respects http_proxy but I don’t know anything about https support.
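The local-serving check suggested above could be set up with nothing but the standard library. This is only a sketch: `serve_directory` and `base_url` are hypothetical helper names, and it assumes a copy of the included schema has already been saved into the directory being served.

```python
import functools
import http.server
import threading


def serve_directory(directory, port=0):
    """Serve `directory` over plain HTTP on localhost in a background thread.

    port=0 asks the OS for a free port. Returns the running server object.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory)
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # Daemon thread: it will not keep the interpreter alive on exit.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def base_url(server):
    """Build the http:// base URL for a server started by serve_directory."""
    host, port = server.server_address[:2]
    return f"http://{host}:{port}/"
```

Pointing schemaLocation at base_url(server) + "xlink.xsd" would then exercise the plain-http path with no proxy or TLS involved; call server.shutdown() when done.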
Just found that for an example adapted from https://bugs.launchpad.net/lxml/+bug/1234114/comments/3 (xs:include instead of xs:import):

##############
# test_schema.py
XSD = b"""<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns="lxmltest"
    targetNamespace="http://www.w3.org/1999/xlink"
    elementFormDefault="qualified"
    attributeFormDefault="unqualified">
<xs:include schemaLocation="http://www.loc.gov/standards/xlink/xlink.xsd" />
</xs:schema>
"""
from lxml import etree
parser = etree.XMLParser(load_dtd=True, no_network=True, huge_tree=True, resolve_entities=True)
tree = etree.fromstring(XSD, parser=parser)
schema = etree.XMLSchema(tree)
print(schema)
##############

I can successfully run this if the xs:include location is http but not https (the xlink.xsd is available both with https and http URLs). If I change it to https I get

Traceback (most recent call last):
  File "test_schema.py", line 16, in <module>
    schema = etree.XMLSchema(tree)
  File "src/lxml/xmlschema.pxi", line 88, in lxml.etree.XMLSchema.__init__
lxml.etree.XMLSchemaParseError: Element '{http://www.w3.org/2001/XMLSchema}include': Failed to load the document 'https://www.loc.gov/standards/xlink/xlink.xsd' for inclusion., line 9

As I'm behind a proxy: I also just found out that while curl happily accepts

http_proxy=my.proxy.address.net:8080

lxml (libxml2) only works if this is set to

http_proxy=http://my.proxy.address.net:8080

i.e. with an explicit <scheme>://

Cheers,
H.
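For anyone hitting the proxy quirk described above, a small helper along these lines could add the missing scheme before lxml is used. It is a sketch only: `with_scheme` and `normalize_proxy_env` are made-up names, and I am assuming (not having checked when libxml2 actually reads the variable) that it has to run before the first network fetch.

```python
import os


def with_scheme(proxy, default_scheme="http"):
    """Return `proxy` with an explicit scheme, adding one if it is missing."""
    proxy = proxy.strip()
    if not proxy or "://" in proxy:
        return proxy
    return f"{default_scheme}://{proxy}"


def normalize_proxy_env(environ=os.environ, names=("http_proxy", "HTTP_PROXY")):
    """Rewrite the named proxy variables in `environ` so they carry a scheme.

    Variables that are unset or empty are left alone.
    """
    for name in names:
        value = environ.get(name)
        if value:
            environ[name] = with_scheme(value)
```

Calling normalize_proxy_env() at the top of the script, before importing or using lxml, leaves a value that already has a scheme untouched, so it should be safe to run unconditionally.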
Thanks Holger

First, I am not behind any proxy - CURL to http: and https: both give the schema

Second, your test_schema.py using http: also works for me; however, if https://www.loc.gov... is used then I get an error
python test_schema.py
Traceback (most recent call last):
  File "G:\lxml-test\test_schema.py", line 14, in <module>
    schema = etree.XMLSchema(tree)
  File "src\lxml\xmlschema.pxi", line 88, in lxml.etree.XMLSchema.__init__
lxml.etree.XMLSchemaParseError: Element '{http://www.w3.org/2001/XMLSchema}include': Failed to load the document 'https://www.loc.gov/standards/xlink/xlink.xsd' for inclusion., line 8
Third, 'https://www.loc.gov/standards/xlink/xlink.xsd' does exist - CURL retrieves it OK
curl --head https://www.loc.gov/standards/xlink/xlink.xsd
HTTP/2 200
date: Fri, 25 Jun 2021 13:44:27 GMT
content-type: text/xml
content-length: 3180
last-modified: Thu, 23 Aug 2007 19:02:01 GMT
etag: "119c982-c6c-867afc40"
accept-ranges: bytes
cf-cache-status: DYNAMIC
cf-request-id: 0ae5033c37000054ab1ab16000000001
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 664ea1738c7e54ab-MAN

curl --head http://www.loc.gov/standards/xlink/xlink.xsd
HTTP/1.1 200 OK
Date: Fri, 25 Jun 2021 13:45:16 GMT
Content-Type: text/xml
Content-Length: 3180
Connection: keep-alive
Last-Modified: Thu, 23 Aug 2007 19:02:01 GMT
ETag: "119c982-c6c-43862867afc40"
Accept-Ranges: bytes
X-Frame-Options: deny
Set-Cookie: HttpOnly
CF-Cache-Status: DYNAMIC
cf-request-id: 0ae503f9cd0000000a56ab5000000001
Server: cloudflare
CF-RAY: 664ea2a2ed37000a-MAN
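The same reachability check can be done from Python, which takes curl's own TLS and proxy handling out of the comparison. A minimal sketch using only the standard library; `head` is a hypothetical helper name, and note that urllib has its own proxy handling, so this says nothing about what libxml2 does internally.

```python
import urllib.request


def head(url, timeout=10.0):
    """Send a HEAD request and return (status, content_type).

    urllib raises urllib.error.HTTPError for 4xx/5xx responses and
    urllib.error.URLError if the connection itself fails.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.headers.get("Content-Type")
```

Calling head() once with the http URL and once with the https URL of xlink.xsd would show whether both are reachable from the Python process itself.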
Paul

-----Original Message-----
From: Holger.Joukl@LBBW.de <Holger.Joukl@LBBW.de>
Sent: 25 June 2021 13:31
To: lxml@lxml.de
Subject: [lxml] Re: Cannot <include> from a network location
_______________________________________________
lxml - The Python XML Toolkit mailing list -- lxml@python.org
To unsubscribe send an email to lxml-leave@python.org
https://mail.python.org/mailman3/lists/lxml.python.org/
Member address: paul_higgs@hotmail.com
Thanks for the hint regarding parsers. After spending a few hours trying to understand what special tricks I needed to put in a resolver, I realized that there were none. The resolver just needs to fetch the data (I would have expected the parser to do this itself).

import io

import requests
from lxml import etree

schema4 = io.StringIO('''<schema xmlns="http://www.w3.org/2001/XMLSchema"
    xmlns:patch="urn:paulhiggs:my-patch"
    targetNamespace="urn:paulhiggs:my-patch"
    elementFormDefault="qualified"
    attributeFormDefault="unqualified">
  <include schemaLocation="https://www.iana.org/assignments/xml-registry/schema/patch-ops.xsd"/>
  <element name="Patch" type="patch:PatchType"/>
  <complexType name="PatchType">
    <choice minOccurs="1" maxOccurs="unbounded">
      <element name="add" type="patch:add"/>
      <element name="remove" type="patch:remove"/>
      <element name="replace" type="patch:replace"/>
    </choice>
    <attribute name="paulsAttrib" type="string" use="required"/>
  </complexType>
</schema>''')


class PrefixResolver(etree.Resolver):
    # https://lxml.de/resolvers.html
    def __init__(self, prefix):
        self.prefix = prefix.lower()

    def resolve(self, url, pubid, context):
        # Fetch matching URLs ourselves; falling through (returning None)
        # leaves any other URL to the next resolver or the default loader.
        if url.lower().startswith(self.prefix):
            res = requests.get(url, allow_redirects=True)
            return self.resolve_string(res.text, context)


parser = etree.XMLParser(load_dtd=True, no_network=False, huge_tree=True, resolve_entities=True)
parser.resolvers.add(PrefixResolver("https"))
parser.resolvers.add(PrefixResolver("http"))
my_schema = etree.XMLSchema(etree.parse(schema4, parser))

This now works OK for an include of HTTP and HTTPS!
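Since the resolver above only has to fetch the data, the fetching part can be factored out and given a cache, so a schema that is included several times is downloaded once. The following is a rough sketch, not part of lxml: `SchemaFetcher` is a hypothetical name, it uses urllib instead of requests to stay within the standard library, and the injectable `opener` argument exists only so it can be tried without network access.

```python
import urllib.request


class SchemaFetcher:
    """Fetch documents by URL and keep each response body in memory."""

    def __init__(self, opener=None, timeout=10.0):
        # Any callable with the shape of urllib.request.urlopen will do.
        self._open = opener or urllib.request.urlopen
        self._timeout = timeout
        self._cache = {}

    def fetch(self, url):
        """Return the body for `url` as bytes, downloading it at most once."""
        if url not in self._cache:
            with self._open(url, timeout=self._timeout) as response:
                self._cache[url] = response.read()
        return self._cache[url]
```

Inside resolve() the call would then become something like self.resolve_string(fetcher.fetch(url), context), with one fetcher instance shared by all resolvers. As far as I know resolve_string accepts bytes as well as str, but treat that as an assumption to verify against the lxml documentation.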
I need to look into the workings of libxml2 to see whether loading for INCLUDE and IMPORT is somehow handled differently. I have never had a problem with an HTTP or HTTPS IMPORT.

Paul

-----Original Message-----
From: Paul Higgs <paul_higgs@hotmail.com>
Sent: 25 June 2021 14:51
To: Holger.Joukl@LBBW.de; lxml@lxml.de
Subject: [lxml] Re: Cannot <include> from a network location
participants (2)
- Holger.Joukl@LBBW.de
- Paul Higgs