lxml doesn't seem to resolve relative, imported xsd files when using https - lxml - The Python XML Toolkit

July 11, 2018

With Google Chrome marking http as "not secure" in the browser, I need to set lxml to use https.
The URL of the main xsd is https://www.pharmac.govt.nz/2006/07/Schedule.xsd.

This is redirected to https://www.pharmac.govt.nz/wwwtrs/pub/2006/07/Schedule.xsd.
The main xsd file imports further xsd files (relative to the main xsd file).
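
To make the relative-import point concrete, here is a minimal sketch of how a relative schemaLocation resolves against a base URL, using the standard library's urljoin. The two base URLs are the ones quoted above; the file name imported.xsd is hypothetical and only stands in for one of the imported files. It shows that the result depends on whether the requested URL or the redirect target is used as the base.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7, as used in this post

requested = 'https://www.pharmac.govt.nz/2006/07/Schedule.xsd'
redirected = 'https://www.pharmac.govt.nz/wwwtrs/pub/2006/07/Schedule.xsd'
relative_import = 'imported.xsd'  # hypothetical name, for illustration only

# A relative schemaLocation is resolved against the URL of the importing file.
from_requested = urljoin(requested, relative_import)
from_redirected = urljoin(redirected, relative_import)

print(from_requested)
print(from_redirected)
```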

Below are a few of my findings. Any fixes, solutions or help would be greatly appreciated.

OS, software, and versions
ubuntu = 16.04 LTS
python = 2.7
lxml = 4.2.1
xsd: The main schema imports multiple xsd files. The main xsd and the imported xsd resources are at: https://www.pharmac.govt.nz/wwwtrs/pub/2006/07/

Example 1 - http - the below works:
  from lxml import etree
  schema = 'http://www.pharmac.govt.nz/2006/07/Schedule.xsd'
  schemadoc = etree.XMLSchema(file = schema)

Example 2 - https - produces the below error:
  from lxml import etree
  schema = 'https://www.pharmac.govt.nz/2006/07/Schedule.xsd'
  schemadoc = etree.XMLSchema(file = schema)

  Traceback (most recent call last):
    File "xmltest.py", line 3, in <module>
      schemadoc = etree.XMLSchema(file = schema)
    File "src/lxml/xmlschema.pxi", line 86, in lxml.etree.XMLSchema.__init__
  lxml.etree.XMLSchemaParseError: Failed to locate the main schema resource at 'https://www.pharmac.govt.nz/2006/07/Schedule.xsd'.

  ** Yet requesting the URL "https://www.pharmac.govt.nz/2006/07/Schedule.xsd" directly does load the schema.

Example 3 - http - the below works:
  from lxml import etree
  import urllib2
  schema = 'http://www.pharmac.govt.nz/2006/07/Schedule.xsd'
  schema_src_file = urllib2.urlopen(schema)
  schema_doc = etree.parse(schema_src_file)
  schema = etree.XMLSchema(schema_doc)

Example 4 - https - produces the below error:
  from lxml import etree
  import urllib2
  schema = 'https://www.pharmac.govt.nz/2006/07/Schedule.xsd'
  schema_src_file = urllib2.urlopen(schema)
  schema_doc = etree.parse(schema_src_file)
  schema = etree.XMLSchema(schema_doc)

  Traceback (most recent call last):
    File "xmltest.py", line 6, in <module>
      schema = etree.XMLSchema(schema_doc)
    File "src/lxml/xmlschema.pxi", line 86, in lxml.etree.XMLSchema.__init__
  lxml.etree.XMLSchemaParseError: Element '{http://www.w3.org/2001/XMLSchema}element', attribute 'ref': The QName value '{http://www.w3.org/1999/xhtml}div' does not resolve to a(n) element declaration., line 336
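
The QName error in Example 4 is consistent with an imported schema not having been loaded: the main schema refers to an element that only the imported file declares. Below is a small sketch for listing the schemaLocation values a schema document asks for, so each one can be checked by hand. The inline schema and its file names are invented for illustration and are not the real Schedule.xsd; the ElementTree fallback is only there so the snippet also runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # same calls work for this sketch

XSD_NS = 'http://www.w3.org/2001/XMLSchema'

# Made-up schema standing in for a main xsd that pulls in other files.
SAMPLE_XSD = b'''<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="urn:example:main">
  <xs:import namespace="http://www.w3.org/1999/xhtml" schemaLocation="xhtml.xsd"/>
  <xs:include schemaLocation="common/types.xsd"/>
  <xs:element name="root" type="xs:string"/>
</xs:schema>'''


def referenced_schemas(schema_root):
    """Return the schemaLocation of every top-level xs:import / xs:include."""
    wanted = ('{%s}import' % XSD_NS, '{%s}include' % XSD_NS)
    locations = []
    for child in schema_root:
        if child.tag in wanted and child.get('schemaLocation'):
            locations.append(child.get('schemaLocation'))
    return locations


locations = referenced_schemas(etree.fromstring(SAMPLE_XSD))
print(locations)
```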

Example 5 - https - works (if I first save all the supporting xsd files locally)
  ** The below suggests the main xsd file can be found online via https; it just doesn't seem to load
     the supporting xsd files via https.

  from lxml import etree
  import urllib2
  schema = 'https://www.pharmac.govt.nz/2006/07/Schedule.xsd'
  schema_src_file = urllib2.urlopen(schema)
  xml_data = schema_src_file.read()
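  # xml_data holds the document text, so parse it as a string
  # (etree.parse expects a file name, URL or file object)
  schema_doc = etree.fromstring(xml_data)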
  schema = etree.XMLSchema(schema_doc)

Example 6 - works up to the assertValid
  ** Big THANKS to Roger Duthie for providing the below "class HTTPSResolver" code.
     To get this to work I had to change the "schemaLocation" from relative to the full path:
     From: schemaLocation="mathml2.xsd"
     To:   schemaLocation="https://wwwdev.pharmac.govt.nz/2006/07/mathml2/mathml2.xsd"

     assertValid fails with: "Element '{https://wwwdev.pharmac.govt.nz/2006/07/Schedule#}Schedule':
     No matching global declaration available for the validation root."

     Which is curious, as it suggests lxml is validating the IRI?

  from lxml import etree
  from urlparse import urlparse
  import requests

  class HTTPSResolver(etree.Resolver):

      __name__ = 'HTTPSResolver'

      def resolve(self, url, id_, context):
          url_components = urlparse(url)
          scheme = url_components.scheme
          # the resolver will only check for redirects if http/s is present
          if scheme == 'http' or scheme == 'https':
              head_response = requests.head(url, allow_redirects=True)
              new_request_url = head_response.url
              if len(head_response.history) != 0:
                  # recursively, resolve the ultimate redirection target
                  return self.resolve(new_request_url, id_, context)
              else:
                  if scheme == 'http':
                      # libxml2 can handle this resource
                      return self.resolve_filename(new_request_url, context)
                  elif scheme == 'https':
                      # libxml2 cannot resolve this resource, so do the work
                      get_response = requests.get(new_request_url)
                      return self.resolve_string(get_response.content, context,
                                                  base_url=new_request_url)
                  else:
                      raise Exception("[%s] Something odd has happened - "
                                      "scheme should be http or https" %
                                      self.__name__)
          else:
              # treat resource as a plain old file
              return self.resolve_filename(url, context)

  schema = 'https://www.pharmac.govt.nz/2006/07/Schedule.xsd'
  parser = etree.XMLParser(load_dtd=True)
  resolver = HTTPSResolver()  # the resolver does not use any constructor arguments
  parser.resolvers.add(resolver)
  schemadoc = etree.XMLSchema(etree.parse(schema, parser=parser))

  --- === Code to Build the XML document === ---

  schemadoc.assertValid(Schedule)
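
On the assertValid message in Example 6: XML namespace names are compared as plain strings, so switching a namespace from http to https produces a different namespace even though it looks like the same address. If the schema's targetNamespace still used the http form while the document was built with the https form (an assumption, not something checked here), the message above would be the expected outcome. A minimal sketch with a made-up example.org namespace and an inline one-element schema:

```python
from lxml import etree

# Hypothetical schema: its targetNamespace uses the http form.
schema_root = etree.XML(b'''<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://example.org/Schedule#">
  <xs:element name="Schedule" type="xs:string"/>
</xs:schema>''')
schema = etree.XMLSchema(schema_root)

# Same element name, but the namespace differs by one character (http vs https).
http_doc = etree.XML(b'<Schedule xmlns="http://example.org/Schedule#">x</Schedule>')
https_doc = etree.XML(b'<Schedule xmlns="https://example.org/Schedule#">x</Schedule>')

http_ok = schema.validate(http_doc)
https_ok = schema.validate(https_doc)
# error_log holds the messages from the most recent validation run
messages = [entry.message for entry in schema.error_log]

print(http_ok, https_ok)
for message in messages:
    print(message)
```

If this is what is happening, the namespace in the generated document has to match the schema's targetNamespace exactly, independent of the URL the xsd file is downloaded from.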


Anrik Drenth

Duthie, Roger J.A.

