lxml doesn't seem to resolve relative, imported xsd files, when using https.
With Google Chrome marking http as "not secure" in the browser, I need to set lxml to use https. The URL of the main xsd is https://www.pharmac.govt.nz/2006/07/Schedule.xsd. This is redirected to https://www.pharmac.govt.nz/wwwtrs/pub/2006/07/Schedule.xsd. The main xsd file imports further xsd files (relative to the main xsd file).

Below are a few of my findings and discoveries. Any fixes, solutions or help greatly appreciated.

OS, software, and versions:

    ubuntu = 16.04 LTS
    python = 2.7
    lxml = 4.2.1

xsd: The main schema imports multiple xsd files. Link to the main xsd and imported xsd resources: https://www.pharmac.govt.nz/wwwtrs/pub/2006/07/

Example 1 - http - the below works:

    from lxml import etree
    schema = 'http://www.pharmac.govt.nz/2006/07/Schedule.xsd'
    schemadoc = etree.XMLSchema(file = schema)

Example 2 - https - produces the below error:

    from lxml import etree
    schema = 'https://www.pharmac.govt.nz/2006/07/Schedule.xsd'
    schemadoc = etree.XMLSchema(file = schema)

    Traceback (most recent call last):
      File "xmltest.py", line 3, in <module>
        schemadoc = etree.XMLSchema(file = schema)
      File "src/lxml/xmlschema.pxi", line 86, in lxml.etree.XMLSchema.__init__
    lxml.etree.XMLSchemaParseError: Failed to locate the main schema resource at 'https://www.pharmac.govt.nz/2006/07/Schedule.xsd'.

** Yet calling the url "https://www.pharmac.govt.nz/2006/07/Schedule.xsd" loads the schema.
Example 3 - http - the below works:

    from lxml import etree
    import urllib2
    schema = 'http://www.pharmac.govt.nz/2006/07/Schedule.xsd'
    schema_src_file = urllib2.urlopen(schema)
    schema_doc = etree.parse(schema_src_file)
    schema = etree.XMLSchema(schema_doc)

Example 4 - https - produces the below error:

    from lxml import etree
    import urllib2
    schema = 'https://www.pharmac.govt.nz/2006/07/Schedule.xsd'
    schema_src_file = urllib2.urlopen(schema)
    schema_doc = etree.parse(schema_src_file)
    schema = etree.XMLSchema(schema_doc)

    Traceback (most recent call last):
      File "xmltest.py", line 6, in <module>
        schema = etree.XMLSchema(schema_doc)
      File "src/lxml/xmlschema.pxi", line 86, in lxml.etree.XMLSchema.__init__
    lxml.etree.XMLSchemaParseError: Element '{http://www.w3.org/2001/XMLSchema}element', attribute 'ref': The QName value '{http://www.w3.org/1999/xhtml}div' does not resolve to a(n) element declaration., line 336

Example 5 - https - works (if I first save all the supporting xsd files locally)

** The below suggests the main xsd file can be found online via https; it just doesn't seem to load the supporting xsd files via https.

    from lxml import etree
    import urllib2
    schema = 'https://www.pharmac.govt.nz/2006/07/Schedule.xsd'
    schema_src_file = urllib2.urlopen(schema)
    xml_data = schema_src_file.read()
    schema_doc = etree.parse(xml_data)
    schema = etree.XMLSchema(schema_doc)

Example 6 - works up to the assertValid

** Big THANKS to Roger Duthie for providing the below "class HTTPSResolver" code.

To get this to work I had to change the "schemaLocation" from relative to the full path:

    From: schemaLocation="mathml2.xsd"
    To:   schemaLocation="https://wwwdev.pharmac.govt.nz/2006/07/mathml2/mathml2.xsd"

assertValid fails with: "Element '{https://wwwdev.pharmac.govt.nz/2006/07/Schedule#}Schedule No matching global declaration available for the validation root". Which is curious as it suggests lxml is validating the IRI??
    from lxml import etree
    from urlparse import urlparse
    import requests

    class HTTPSResolver(etree.Resolver):
        __name__ = 'HTTPSResolver'

        def resolve(self, url, id_, context):
            url_components = urlparse(url)
            scheme = url_components.scheme
            # the resolver will only check for redirects if http/s is present
            if scheme == 'http' or scheme == 'https':
                head_response = requests.head(url, allow_redirects=True)
                new_request_url = head_response.url
                if len(head_response.history) != 0:
                    # recursively, resolve the ultimate redirection target
                    return self.resolve(new_request_url, id_, context)
                else:
                    if scheme == 'http':
                        # libxml2 can handle this resource
                        return self.resolve_filename(new_request_url, context)
                    elif scheme == 'https':
                        # libxml2 cannot resolve this resource, so do the work
                        get_response = requests.get(new_request_url)
                        return self.resolve_string(get_response.content, context, base_url=new_request_url)
                    else:
                        raise Exception("[%s] Something odd has happened - scheme should be http or https" % __name__)
            else:
                # treat resource as a plain old file
                return self.resolve_filename(url, context)

    schema = 'https://www.pharmac.govt.nz/2006/07/Schedule.xsd'
    parser = etree.XMLParser(load_dtd=True)
    resolver = HTTPSResolver(schema)
    parser.resolvers.add(resolver)
    schemadoc = etree.XMLSchema(etree.parse(schema, parser=parser))

    --- === Code to Build the XML document === ---

    schemadoc.assertValid(Schedule)
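The manual schemaLocation change described in Example 6 can be sketched in code. This is only an illustration, not the poster's method: it uses Python 3 and the standard library (the thread itself is on Python 2.7), the helper name `absolutize_schema_locations` and the cut-down `SAMPLE` schema are hypothetical, and the base URL is assumed to be the redirect target quoted above. Whether the imported files really live at the resulting URLs is not verified here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XS = "http://www.w3.org/2001/XMLSchema"


def absolutize_schema_locations(schema_xml, base_url):
    """Parse a schema and make every relative schemaLocation absolute.

    Returns the parsed root element and a dict mapping each original
    schemaLocation value to the absolute URL that replaced it.
    """
    root = ET.fromstring(schema_xml)
    rewritten = {}
    for tag in ("import", "include", "redefine"):
        for node in root.iter("{%s}%s" % (XS, tag)):
            location = node.get("schemaLocation")
            if location is None:
                continue
            # urljoin leaves already-absolute locations unchanged
            absolute = urljoin(base_url, location)
            node.set("schemaLocation", absolute)
            rewritten[location] = absolute
    return root, rewritten


# Hypothetical stand-in for the real Schedule.xsd, reduced to one import.
SAMPLE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:import namespace="http://www.w3.org/1998/Math/MathML"
             schemaLocation="mathml2.xsd"/>
</xs:schema>"""

# Assumed base: the URL the main schema is redirected to.
BASE = "https://www.pharmac.govt.nz/wwwtrs/pub/2006/07/Schedule.xsd"

root, rewritten = absolutize_schema_locations(SAMPLE, BASE)
print(rewritten["mathml2.xsd"])
# https://www.pharmac.govt.nz/wwwtrs/pub/2006/07/mathml2.xsd
```

The rewritten tree could then be serialised and handed to a schema parser, so that no relative lookup is left for the underlying library to perform over https.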
Hi Anrik,

The failure in the last case might be due to the change in the namespace used by the generated XML, from 'http' to 'https'. The schedule.xsd still has the namespace as 'http'. You can see that the element that is failing is {https://[...]}Schedule:

assertValid fails with: "Element '{https://wwwdev.pharmac.govt.nz/2006/07/Schedule#}Schedule No matching global declaration available for the validation root".
From the schedule.xsd file:
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               xmlns="http://www.pharmac.govt.nz/2006/07/Schedule#"
               xmlns:math="http://www.w3.org/1998/Math/MathML"
               xmlns:html="http://www.w3.org/1999/xhtml"
               xmlns:xlink="http://www.w3.org/1999/xlink"
               xmlns:nzmt="nzmt.org.nz"
               targetNamespace="http://www.pharmac.govt.nz/2006/07/Schedule#"
               elementFormDefault="qualified">

________________________________
From: Anrik Drenth <anrikd@hotmail.com>
Sent: 11 July 2018 01:14:54
To: lxml@lxml.de
Subject: lxml doesn't seem to resolve relative, imported xsd files, when using https.

________________________________
This message (and any attachments) is for the recipient only. NERC is subject to the Freedom of Information Act 2000 and the contents of this email and any reply you make may be disclosed by NERC unless it is exempt from release under the Act. Any material supplied to NERC may be stored in an electronic records management system.
________________________________
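The point about 'http' versus 'https' can be illustrated with a small check. XML namespace names are compared as literal strings, so the two spellings name different namespaces even though they look like the same address. The sketch below is an assumption-laden illustration in Python 3 with the standard library only: `root_matches_schema` and `split_qname` are hypothetical helpers, `SCHEMA` is a cut-down stand-in that keeps only the targetNamespace from the excerpt above plus an assumed global `Schedule` element, and the check mimics just the "validation root" lookup, not full schema validation.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"


def split_qname(tag):
    """Split an ElementTree tag '{namespace}local' into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


def root_matches_schema(instance_xml, schema_xml):
    """True if the instance root is a global element in the schema's namespace."""
    schema_root = ET.fromstring(schema_xml)
    target_ns = schema_root.get("targetNamespace", "")
    # Global element declarations are direct children of xs:schema.
    declared = {el.get("name") for el in schema_root.findall("{%s}element" % XS)}
    root_ns, root_local = split_qname(ET.fromstring(instance_xml).tag)
    # Plain string comparison: 'http://...' and 'https://...' are not equal.
    return root_ns == target_ns and root_local in declared


# Hypothetical reduced schema; only targetNamespace is taken from the thread.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.pharmac.govt.nz/2006/07/Schedule#"
    elementFormDefault="qualified">
  <xs:element name="Schedule"/>
</xs:schema>"""

HTTP_DOC = '<Schedule xmlns="http://www.pharmac.govt.nz/2006/07/Schedule#"/>'
HTTPS_DOC = '<Schedule xmlns="https://www.pharmac.govt.nz/2006/07/Schedule#"/>'

print(root_matches_schema(HTTP_DOC, SCHEMA))   # True
print(root_matches_schema(HTTPS_DOC, SCHEMA))  # False
```

Under these assumptions, fetching the schema over https and keeping the 'http' namespace in the generated XML are independent choices: the first is a transport detail, the second is an identifier that has to match the schema's targetNamespace exactly.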
...pressed enter too early.

The excerpt from the schedule.xsd shows that the namespace is still 'http://wwwdev.pharmac.govt.nz/2006/07/Schedule#'. Does changing back the namespace in the generated XML to 'http' allow it to work?

Rg.

________________________________
From: Duthie, Roger J.A.
Sent: 11 July 2018 17:56:02
To: lxml@lxml.de
Subject: Re: lxml doesn't seem to resolve relative, imported xsd files, when using https.
participants (2)
- Anrik Drenth
- Duthie, Roger J.A.