Re: [lxml] WG: lxml can't load external DTD from HTTPS internetaddress
> Hi, I have a problem with a DTD that is stored on a server and can be retrieved from the following path.
> https://smile.htw-berlin.de/TMP_dtd/XML_DTDValidation.dtd <-- web page 1
> Unfortunately the website runs under the HTTPS protocol. Since it is provided by the university, I can't change that. I have already been informed that I can't get an area that only runs with the HTTP protocol.
> Error: warning: failed to load external entity "https://smile.htw-berlin.de/TMP_dtd/XML_DTDValidation.dtd"
> At the following Internet address:
> http://mum.mxndmrtz.de/wp-content/uploads/2020/02/XML_DTDValidation.txt <-- web page 2
> the same DTD is stored, but with a different file suffix, since otherwise I could not upload the file to that server. This one works fine.

I'm unsure if libxml2 actually supports https, and thus lxml(?).

For direct primary resource access that shouldn't pose any problem at all, as you can just use urllib2/3 or requests to retrieve it and hand a file-like object to etree.parse() instead of the URL, something like

etree.parse(io.BytesIO(requests.get("https://www.w3schools.com/xml/note.xml").content)) # or etree.fromstring without BytesIO

However, it might get more complicated for included/imported resources and DTDs, as these dependency URLs will probably just be fed to the libxml2 http client again. If that's the case and libxml2 does indeed not support https, you probably need to
- use an XML Catalog to "localize" those DTDs, i.e. download them and make the catalog pick up the local versions, or
- use some kind of custom resolver and mirror/proxy the https resources with http "shims" (never done that myself)

Probably that's the place to start looking: https://lxml.de/resolvers.html

> When I try to check an example XML with lxml using the DTD, it only works with the DTD that is on web page 2. As an example, the two references to the respective DTD are shown.
> <!DOCTYPE book PUBLIC "-//JR//HTW Berlin book//EN" "https://smile.htw-berlin.de/TMP_dtd/XML_DTDValidation.dtd">
> <!DOCTYPE book PUBLIC "-//JR//HTW Berlin book//EN" "http://mum.mxndmrtz.de/wp-content/uploads/2020/02/XML_DTDValidation.txt">

One other thing that comes to mind when something works over http but not https is certificate validation problems, but I suspect that's not the problem here.

Holger
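
The custom-resolver idea mentioned above could look roughly like the following. This is only a sketch, not something taken from the thread: the class name HTTPSResolver and the helpers fetch_https and make_validating_parser are made up for illustration, and it assumes that Python's own urllib can reach the server. It uses lxml's documented etree.Resolver interface (resolve() and resolve_string()) and fetches the DTD in Python, so libxml2's built-in network client is not involved.

```python
import urllib.request

from lxml import etree


def fetch_https(url, timeout=10):
    """Download a resource with Python's own HTTPS client (not libxml2's)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


class HTTPSResolver(etree.Resolver):
    """Hypothetical resolver that serves https:// URLs from Python."""

    def __init__(self, fetch=fetch_https):
        super().__init__()
        self._fetch = fetch   # injectable, so it can be replaced in tests
        self._cache = {}      # avoid downloading the same DTD repeatedly

    def resolve(self, system_url, public_id, context):
        if not system_url or not system_url.startswith("https://"):
            return None  # fall back to lxml's default resolution
        if system_url not in self._cache:
            self._cache[system_url] = self._fetch(system_url)
        return self.resolve_string(self._cache[system_url], context)


def make_validating_parser(fetch=fetch_https):
    """Build a parser that loads the DTD and validates against it."""
    parser = etree.XMLParser(load_dtd=True, dtd_validation=True)
    parser.resolvers.add(HTTPSResolver(fetch))
    return parser
```

With such a parser, etree.parse("book.xml", make_validating_parser()) would request the DTD named in the DOCTYPE through urllib. Note that this still accesses the network on every fresh process; a locally stored copy avoids that.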
Holger.Joukl@LBBW.de wrote on 24.04.20 at 16:10:
> etree.parse(io.BytesIO(requests.get("https://www.w3schools.com/xml/note.xml").content)) # or etree.fromstring without BytesIO

Preferably the latter for strings, yes. parse() is for reading from file(-like) objects. If the data is already in memory, it's much more efficient to read it from there directly.

> use an XML Catalog to "localize" those DTDs

Absolutely, although the documentation of catalogues could be improved. They are the recommended approach, though.

Generally speaking, parsing an XML document should not access remote resources (other than the document itself, if it comes from a network). Going through network access is very slow compared to local files, much more likely to fail or get rate limited, and poses security risks.

Stefan
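
To illustrate the two points above (parse in-memory data with fromstring, and keep DTD access local), here is a minimal sketch. It is not an XML catalog setup; it is a simpler stand-in that validates against a DTD copy which is assumed to have been downloaded once beforehand. The DTD content in LOCAL_DTD and the function name validate_with_local_dtd are hypothetical and only serve the example.

```python
import io

from lxml import etree

# Stand-in for a DTD that was downloaded once and stored locally
# (hypothetical content, not the DTD from the thread).
LOCAL_DTD = "<!ELEMENT book (title)>\n<!ELEMENT title (#PCDATA)>"


def validate_with_local_dtd(xml_bytes, dtd_text=LOCAL_DTD):
    """Parse in-memory XML without touching the network, then validate it."""
    # The DOCTYPE's URL is never fetched: DTD loading is off, network is off.
    parser = etree.XMLParser(load_dtd=False, no_network=True)
    root = etree.fromstring(xml_bytes, parser)  # data is already in memory
    dtd = etree.DTD(io.StringIO(dtd_text))      # local copy of the DTD
    is_valid = dtd.validate(root)
    return is_valid, dtd.error_log
```

The DOCTYPE in the document can keep pointing at the https URL; it is simply ignored at parse time, and dtd.error_log lists the reasons when validation fails.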
participants (2)
- Holger.Joukl@LBBW.de
- Stefan Behnel