(cc-ing the mailing list)
That's probably a bit unintuitive from the snippet I gave. It works like this: Since XML_PARSE_NONET is also part of _XML_DEFAULT_PARSE_OPTIONS, the XOR logic here switches it off again when no_network has been explicitly set to False.
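To make the XOR step concrete, here is a minimal pure-Python sketch of the logic described above. It does not use lxml itself: the flag values mirror libxml2's `xmlParserOption` enum, while `DEFAULT_PARSE_OPTIONS` is a simplified stand-in for `_XML_DEFAULT_PARSE_OPTIONS` (the real default contains more flags) and `effective_options` is a hypothetical helper name.

```python
# Bit values as defined in libxml2's xmlParserOption enum.
XML_PARSE_NONET = 1 << 11    # forbid network access
XML_PARSE_NOCDATA = 1 << 14  # merge CDATA sections into text nodes

# Assumption: simplified stand-in for lxml's _XML_DEFAULT_PARSE_OPTIONS.
# The only property that matters here is that XML_PARSE_NONET is part of it.
DEFAULT_PARSE_OPTIONS = XML_PARSE_NOCDATA | XML_PARSE_NONET


def effective_options(no_network=True, parse_options=DEFAULT_PARSE_OPTIONS):
    """Return the option bits after applying the no_network switch."""
    if not no_network:
        # XOR toggles the bit: since NONET is set in the defaults,
        # this switches it off again.
        parse_options ^= XML_PARSE_NONET
    return parse_options


# Default: the NONET bit stays set, so network access is forbidden.
print(bool(effective_options(no_network=True) & XML_PARSE_NONET))
# Explicit no_network=False: the NONET bit is cleared.
print(bool(effective_options(no_network=False) & XML_PARSE_NONET))
```

Note that XOR only clears the bit because it is already set in the defaults; applied to an option set without XML_PARSE_NONET, the same operation would switch it on.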
It's been a long time since I experimented with includes/imports in XML Schemas in lxml. Can't really remember the workings but in my case it was rather the other way round, i.e. no (external/remote) network access and wanting to load included/imported schemas from a local catalog.

Have you tried running with XML_CATALOG_FILES set to an empty value to suppress default catalog settings? E.g.

    XML_CATALOG_FILES= python myprog.py

I found this: https://bugs.launchpad.net/lxml/+bug/1234114

So it seems like it is indeed not "simply" possible to suppress XMLSchema network access (but maybe through catalog setup: "[...] and external imports should always be covered by catalogues (otherwise, that's a configuration problem on the user side)[...]", see the issue conversation).

Another thought might be custom URI resolvers but I don't know how they tie into XML Schema handling (https://lxml.de/resolvers.html#uri-resolvers).

Holger
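The empty-XML_CATALOG_FILES suggestion can also be applied from a Python launcher script. The following is a rough sketch using only the standard library; the helper names `env_without_default_catalogs` and `run_with_empty_catalog` are made up for illustration, and whether an empty value really suppresses the default catalog on a given system is exactly what would need to be tried out.

```python
import os
import subprocess
import sys


def env_without_default_catalogs(base_env=None):
    """Copy an environment and set XML_CATALOG_FILES to an empty value."""
    env = dict(os.environ if base_env is None else base_env)
    env["XML_CATALOG_FILES"] = ""
    return env


def run_with_empty_catalog(argv):
    """Run a command in a child process that sees the empty catalog setting."""
    return subprocess.run(argv, env=env_without_default_catalogs(), check=False)


if __name__ == "__main__":
    # Equivalent in spirit to: XML_CATALOG_FILES= python myprog.py
    # "myprog.py" is the placeholder script name from the message above.
    result = run_with_empty_catalog([sys.executable, "myprog.py"])
    sys.exit(result.returncode)
```

A child process is used here on the assumption that the variable should already be in place when the program starts, just as with the shell one-liner.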
I think that launchpad ticket is what I need to understand the issue better! Great :) I will look into it over the weekend.

I did try setting other XML catalogs. And I did manage to set up a catalog and local files for my use case such that nothing is downloaded from the Internet. So that's not my mission right now.

But in my first version I _thought_ I had managed to change everything so nothing was downloaded. In fact two files were downloaded from w3c.com. There was no noticeable delay so everything seemed fine. Until, at one point, a bunch of files were validated in succession. After around 20-60 successful validations, the rest would fail.

w3c has some kind of filter/firewall. If you download the resources in rapid succession (e.g. roughly 1 per second, for 10-40 seconds) it will start rejecting requests. It only takes 5-20 seconds for the firewall to forgive you and let you download again. This meant that I got some random / intermittent failures.

That's why I want to _know_ that I have disabled networking, so that any incorrectly set up catalog will give an error now, and not later.

The above happened with xmllint. With lxml I can load the schema once and use it for validating hundreds of XML files, so I can easily circumvent the w3c filter. But in any case, I would like to set up my lxml code such that any attempt to download resources will result in an error now, and not when that resource is one day unavailable :)

Thanks for helping out!

On Mon, Mar 4, 2024 at 10:47 PM <Holger.Joukl@lbbw.de> wrote:
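For reference, the "load the schema once, validate many files" pattern mentioned in this message might look roughly like the sketch below. It assumes lxml is installed; `load_schema` and `validate_all` are hypothetical helper names, and the sketch does nothing to block network access during schema loading, which is the open question of this thread.

```python
from lxml import etree


def load_schema(xsd_path):
    """Parse and compile the schema a single time."""
    return etree.XMLSchema(etree.parse(xsd_path))


def validate_all(schema, xml_paths):
    """Validate each document against the already compiled schema.

    Returns a dict mapping each path to True (valid) or False (invalid).
    """
    results = {}
    for path in xml_paths:
        doc = etree.parse(path)
        results[path] = schema.validate(doc)
    return results
```

Because imported and included schemas are resolved when the schema is compiled, any download would happen once per `load_schema` call rather than once per validated document.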
participants (2)
- Holger.Joukl@LBBW.de
- Thomas Larsen Wessel