New subject: WG: lxml can't load external DTD from HTTPS internetaddress

April 24, 2020

Hi,

I have a problem with a DTD that is stored on a server and can be retrieved from the following URL:
https://smile.htw-berlin.de/TMP_dtd/XML_DTDValidation.dtd <-- website 1

Unfortunately the website is served over HTTPS only. Since it is provided by the university, I can't change that.
I have already been told that I can't get an area that is served over plain HTTP.

Error: warning: failed to load external entity "https://smile.htw-berlin.de/TMP_dtd/XML_DTDValidation.dtd"

The same DTD is also available at the following address:
http://mum.mxndmrtz.de/wp-content/uploads/2020/02/XML_DTDValidation.txt <-- website 2
There it has a different file suffix, because otherwise I could not upload the file to that server.
Loading it from there works fine.

I'm unsure whether libxml2, and thus lxml, actually supports https.

For direct primary resource access that shouldn't pose any problem at all, as you can just use urllib2/3 or requests
to retrieve it and hand a file-like object to etree.parse() instead of the URL, something like

etree.parse(io.BytesIO(requests.get("https://www.w3schools.com/xml/note.xml").content)) # or etree.fromstring without BytesIO
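
The same file-like trick can be applied to the DTD itself when validation is done explicitly instead of through the DOCTYPE. The following is only a sketch: `fetch` and `validate_with_remote_dtd` are names made up for illustration, and it uses `urllib.request` from the standard library in place of requests.

```python
import io
import urllib.request

from lxml import etree


def fetch(url):
    """Download a resource with Python's own HTTP(S) client (illustrative helper)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def validate_with_remote_dtd(xml_bytes, dtd_url, fetch=fetch):
    """Validate xml_bytes against a DTD that Python downloads itself.

    Returns (is_valid, error_log). The `fetch` parameter is an assumption of
    this sketch so that the download can be swapped out, e.g. for testing.
    """
    # Build the DTD object from a file-like object rather than from a URL.
    dtd = etree.DTD(io.BytesIO(fetch(dtd_url)))
    # The default parser does not load the DTD named in the DOCTYPE,
    # so libxml2 never tries to open the https URL on its own.
    root = etree.fromstring(xml_bytes, etree.XMLParser(load_dtd=False))
    return dtd.validate(root), dtd.error_log
```

Because the DOCTYPE is not loaded during parsing here, entities or default attributes that are declared only in the external DTD are not applied to the parsed tree.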

However, it might get more complicated for included/imported resources and DTDs, as these
dependency URLs will probably just be fed to libxml2's http client again.

If that's the case and libxml2 does indeed not support https, you probably
need to
- use an XML Catalog to "localize" those DTDs, i.e. download them and make the catalog pick up the local versions, or
- use some kind of custom resolver and mirror/proxy the https resources with http "shims"
  (never done that myself)
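
For the catalog option, a catalog file that maps the remote identifiers to a downloaded copy could be generated roughly as follows. This is a sketch under assumptions: `build_catalog` and the local path are invented for illustration, and it presumes the libxml2 behind lxml was built with catalog support.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace of OASIS XML Catalogs, the format libxml2 reads.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def build_catalog(system_id, public_id, local_dtd):
    """Return catalog XML that redirects a remote DTD to a local file."""
    ET.register_namespace("", CATALOG_NS)
    catalog = ET.Element(f"{{{CATALOG_NS}}}catalog")
    uri = Path(local_dtd).resolve().as_uri()  # file:// URI of the local copy
    # Map both the public identifier and the system identifier (the URL).
    ET.SubElement(catalog, f"{{{CATALOG_NS}}}public", publicId=public_id, uri=uri)
    ET.SubElement(catalog, f"{{{CATALOG_NS}}}system", systemId=system_id, uri=uri)
    return ET.tostring(catalog, encoding="unicode")


catalog_xml = build_catalog(
    "https://smile.htw-berlin.de/TMP_dtd/XML_DTDValidation.dtd",
    "-//JR//HTW Berlin book//EN",
    "XML_DTDValidation.dtd",  # hypothetical local copy of the DTD
)
```

The result would be saved to a file, and libxml2 pointed at it through the XML_CATALOG_FILES environment variable. Setting that variable before the Python process starts is the safe choice, since libxml2 reads its catalogs when they are first initialized.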

For the resolver approach, this is probably the place to start looking: https://lxml.de/resolvers.html
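
A custom resolver along those lines might look like the sketch below. Instead of proxying through http shims, it lets Python download the https resource and hands the bytes to the parser. `HttpsResolver`, `fetch_https` and `make_parser` are illustrative names, not part of lxml; `etree.Resolver`, `resolve_string()` and `parser.resolvers.add()` are the documented lxml hooks.

```python
import urllib.request

from lxml import etree


def fetch_https(url):
    """Download a resource with Python's own HTTPS client (illustrative helper)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


class HttpsResolver(etree.Resolver):
    """Resolve https:// system URLs in Python instead of in libxml2."""

    def __init__(self, fetch=fetch_https):
        super().__init__()
        self._fetch = fetch  # injectable so the download can be replaced

    def resolve(self, system_url, public_id, context):
        if system_url and system_url.startswith("https://"):
            # Hand the downloaded bytes to the parser as the external entity.
            return self.resolve_string(self._fetch(system_url), context)
        return None  # let lxml/libxml2 handle everything else as usual


def make_parser(fetch=fetch_https):
    """Create a validating parser that knows how to load https DTDs."""
    parser = etree.XMLParser(load_dtd=True, dtd_validation=True)
    parser.resolvers.add(HttpsResolver(fetch))
    return parser
```

With this in place, something like etree.parse("book.xml", make_parser()) should validate against the DTD named in the DOCTYPE and raise etree.XMLSyntaxError if the document is invalid; "book.xml" is only a placeholder file name.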

When I try to validate an example XML file against the DTD using lxml, it only works with the DTD that is on website 2.
As an example, here are the two DOCTYPE declarations that reference the respective DTD.

<!DOCTYPE book PUBLIC "-//JR//HTW Berlin book//EN" "https://smile.htw-berlin.de/TMP_dtd/XML_DTDValidation.dtd">
<!DOCTYPE book PUBLIC "-//JR//HTW Berlin book//EN" "http://mum.mxndmrtz.de/wp-content/uploads/2020/02/XML_DTDValidation.txt">

One other thing that comes to mind when something works over http but not https is a certificate validation problem - but I suspect
that's not the problem here.

Holger

