(cc-ing the mailing list)
> Thanks for all the feedback :) For now, I will stick to just one part of your feedback.
>
> Consider your example:
>
> if not no_network:
>     parse_options = parse_options ^ xmlparser.XML_PARSE_NONET
>
> Won't that always "negate" the XML_PARSE_NONET bit? If 0, it will change to 1. If 1 it will change to 0. Right? So using --no_network will always pick the opposite of the default. Am I wrong?
That's probably a bit unintuitive from the snippet I gave. It works like this: since XML_PARSE_NONET is also part of
_XML_DEFAULT_PARSE_OPTIONS, the XOR here switches it off again when no_network has been explicitly set to False.
With the default no_network=True the branch is skipped and the bit stays set.
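To make the XOR behaviour concrete, here is a minimal standalone sketch. The constant values and the two-flag DEFAULT_PARSE_OPTIONS are illustrative stand-ins, not lxml's actual definitions, and resolve_parse_options is a hypothetical name; only the if/XOR step mirrors the snippet above.

```python
# Illustrative stand-ins; the real constants are defined by libxml2 / lxml.
XML_PARSE_NOENT = 1 << 1
XML_PARSE_NONET = 1 << 11

# Hypothetical subset of the defaults: the point is that NONET is already set.
DEFAULT_PARSE_OPTIONS = XML_PARSE_NOENT | XML_PARSE_NONET


def resolve_parse_options(no_network=True):
    """Start from the defaults and clear NONET only if networking is allowed."""
    parse_options = DEFAULT_PARSE_OPTIONS
    if not no_network:
        # The bit is known to be set at this point, so XOR clears it.
        parse_options = parse_options ^ XML_PARSE_NONET
    return parse_options


nonet_when_default = bool(resolve_parse_options() & XML_PARSE_NONET)
nonet_when_disabled = bool(resolve_parse_options(no_network=False) & XML_PARSE_NONET)
```

The XOR only acts as "switch off" because the bit is guaranteed to be set in the defaults; if it were not, the same line would switch it on, which is the case the question was worried about.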
> And regarding the default behavior: when validating against an XML schema, the default is to download the XSDs that are imported, at least on my system
> (Ubuntu 22.04 with libxml 2.9.12 installed via APT). I tried with xmllint, xmlstarlet and lxml. Perhaps the default applies to
> something other than downloading XSDs? I guess
> there can be references/entities in the target XML document, and those references will not be downloaded. Could that be it?
>
> Maybe the parser used for parsing XML Schemas is set up to ignore the "normal default" and, in the case of lxml, also to ignore the options set with no_network.
It's been a long time since I experimented with includes/imports in XML Schemas in lxml. I can't really remember
the workings, but in my case it was rather the other way round, i.e. no (external/remote) network access and wanting
to load included/imported schemas from a local catalog.
Have you tried running with XML_CATALOG_FILES set to an empty value to suppress default catalog settings?
E.g. XML_CATALOG_FILES= python myprog.py
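If the program is started from another Python script rather than from a shell, the same idea can be sketched as below. The helper names are hypothetical, and whether an empty XML_CATALOG_FILES actually changes the behaviour observed here is exactly what the experiment is meant to find out.

```python
import os
import subprocess
import sys


def env_without_default_catalogs(base_env=None):
    """Return a copy of the environment with XML_CATALOG_FILES set to empty."""
    env = dict(os.environ if base_env is None else base_env)
    env["XML_CATALOG_FILES"] = ""
    return env


def run_without_default_catalogs(script):
    """Rough equivalent of `XML_CATALOG_FILES= python myprog.py`."""
    return subprocess.run(
        [sys.executable, script],
        env=env_without_default_catalogs(),
        check=False,
    )
```

Calling run_without_default_catalogs("myprog.py") starts a fresh interpreter, so the variable is in place before libxml2 is loaded in the child process.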
I found this: https://bugs.launchpad.net/lxml/+bug/1234114
So it seems like it is indeed not "simply" possible to suppress XMLSchema network access (but maybe through catalog setup:
"[...] and external imports should always be covered by catalogues (otherwise, that's a configuration problem on the user side)[...]",
see the issue conversation).
Another thought might be custom URI resolvers but I don't know how they tie into XML Schema handling
(https://lxml.de/resolvers.html#uri-resolvers).
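As a starting point for experimenting with resolvers, here is a small sketch modelled on the DTD example from the lxml resolver documentation. RecordingResolver is a hypothetical helper that records each URL it is asked for and serves a local stand-in instead. It demonstrates the mechanism for a DTD lookup only; whether xs:import/xs:include lookups in XMLSchema reach such a resolver is the open question and is not shown here.

```python
from io import StringIO

from lxml import etree


class RecordingResolver(etree.Resolver):
    """Hypothetical helper: remembers every URL the parser asks to resolve."""

    def __init__(self):
        super().__init__()
        self.requested = []

    def resolve(self, url, public_id, context):
        self.requested.append(url)
        # Serve a local stand-in instead of letting libxml2 fetch the URL.
        return self.resolve_string("<!ELEMENT doc EMPTY>", context)


resolver = RecordingResolver()
parser = etree.XMLParser(load_dtd=True, no_network=True)
parser.resolvers.add(resolver)

# The DTD name is made up; the resolver answers for it, so no file is needed.
xml = '<!DOCTYPE doc SYSTEM "MissingDTD.dtd"><doc/>'
tree = etree.parse(StringIO(xml), parser)
```

Parsing the XSD with a parser set up like this and then inspecting resolver.requested after calling etree.XMLSchema would at least show whether the imports pass through the resolver at all.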
Holger
Landesbank Baden-Wuerttemberg
Institution under public law
Head offices: Stuttgart, Karlsruhe, Mannheim, Mainz
HRA 12704
Amtsgericht Stuttgart
HRA 4356, HRA 104 440
Amtsgericht Mannheim
HRA 40687
Amtsgericht Mainz
LBBW processes your personal data in accordance with the requirements of the GDPR.
Information is available at https://www.lbbw.de/datenschutz.
Hi,
> I know how to set up the parser to not download entities. But I have not found a way to stop XMLCatalog from downloading other xsd's than the root xsd.
>
> from lxml import etree
> parser = etree.XMLParser(no_network=True)
> xsddoc = etree.parse('schemas/ler/2.0_ler.xsd',parser=parser)
> xsd = etree.XMLSchema(xsddoc)
>
> The above code will recursively download the XSD's imported in 2.0_ler.xsd.
>
> I played around with xmllint and I believe that if XML_PARSE_NONET is True, it will not download those. But how do I set that option for the context in which XMLSchema runs?
Hm, from a quick glance at the code, XML_PARSE_NONET *is* set through the no_network parser __init__ option:
if not no_network:
    parse_options = parse_options ^ xmlparser.XML_PARSE_NONET
(https://github.com/lxml/lxml/blob/3ccc7d583e325ceb0ebdf8fc295bbb7fc8cd404d/…)
And it defaults to True, too.
That said, maybe you could use a custom XML catalog setup (https://lxml.de/resolvers.html#xml-catalogs, see also the link to the libxml2
catalog setup info there) to prevent unwanted network lookups?
Might even be that some default catalog handling is taking place on your machine and causing the behavior
you observe(?), see https://gitlab.gnome.org/GNOME/libxml2/-/wikis/Catalog-support#how-to-tune-….
E.g. could the imported documents have already been cached, i.e. they're not even loaded from remote? This is
one thing XML catalogs can provide.
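For illustration, a catalog that maps remote schema URLs to local copies could be generated roughly like this. The element names and namespace follow the OASIS XML Catalogs format; build_catalog, the example URL and the local path are hypothetical. The sketch emits both a system and a uri entry per mapping because which entry type gets consulted for XSD imports is an assumption I have not verified.

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def build_catalog(mappings):
    """Build an OASIS XML catalog mapping remote URLs to local URIs."""
    # Serialize the catalog namespace as the default namespace.
    ET.register_namespace("", CATALOG_NS)
    root = ET.Element(f"{{{CATALOG_NS}}}catalog")
    for remote_url, local_uri in mappings.items():
        # <system> matches system identifiers, <uri> matches plain URI references.
        ET.SubElement(root, f"{{{CATALOG_NS}}}system",
                      {"systemId": remote_url, "uri": local_uri})
        ET.SubElement(root, f"{{{CATALOG_NS}}}uri",
                      {"name": remote_url, "uri": local_uri})
    return ET.tostring(root, encoding="unicode")


# Hypothetical mapping: remote import location -> local copy.
catalog_xml = build_catalog({
    "http://example.org/schemas/common.xsd": "file:///opt/schemas/common.xsd",
})
```

Writing catalog_xml to a file and pointing XML_CATALOG_FILES at that file before starting the program would be the way to try it out.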
Best regards,
Holger
I know how to set up the parser to not download entities. But I have not
found a way to stop XMLCatalog from downloading other xsd's than the root
xsd.
from lxml import etree
parser = etree.XMLParser(no_network=True)
xsddoc = etree.parse('schemas/ler/2.0_ler.xsd',parser=parser)
xsd = etree.XMLSchema(xsddoc)
The above code will recursively download the XSD's imported in 2.0_ler.xsd.
I played around with xmllint and I believe that if XML_PARSE_NONET is True,
it will not download those. But how do I set that option for the context in
which XMLSchema runs?
Sincerely, Thomas