Re: [lxml] Default parser with/without no_network

Hello again,

An update on my attempts to work out the behaviour of the parsers/resolvers. This thread (Question 2) involved the validation of XML against a schematron and the apparent inability of the parser to resolve remote resources.

It seemed that my [real world] example was throwing up a few issues. Going back to my example code and output: https://gist.github.com/rogie-bas/e157501d12cff8ac0aafc0cc120ce8dc

(The link in the previous email may have been broken because the full stop was added to the URL; the above will hopefully work without modification!)

The first test logging (base resolver, no_network=True) shows that the validation fails with an XSLTApplyError, due to a resource not being resolvable (..../gmxCodeLists.xml). Confusingly, it doesn't seem to be the first one on the list of resources that the code has tried to resolve. However, the behaviour is different if no_network=False: the resource that the code chokes on is a different one (.../SDN:L11::D08). I may have a tentative reason for this (based on empirical evidence).

I broke my schematron file down into single patterns, each of which I judged to target a part of the XML file containing various remote resources. This way I could try resolving the various resources with the various parsers, with no_network = {True | False}. The gmxCodeLists.xml resource is not resolved with the base resolver if no_network=True, but is when no_network=False. The resolution of the other resource (SDN:L11::D08) never happens. When using the resolve_filename() method rather than the base resolver, the error is different: an OSError is thrown for these resources, as if the resource was deemed to be a file system resource instead of a remote resource; however, gmxCodeLists.xml is resolved correctly. The resolver I created using the Requests package doesn't have a problem with this resource.
I've now got a hunch that this is perhaps because the URI for this problematic resource is a redirect (checking with curl -IL shows that it is a 302/301/200 chain). I can't currently rule out that there is something else about this resource that makes it unresolvable, but when I change the references to this resource to point at the redirect, the validation completes. My next task is to create some simple external resources with redirects.

Is any of this behaviour expected?

- no_network has no effect apart from with the base resolver
- redirects are unresolvable by the base resolver and resolve_filename base resolution
- resolution seemingly going ahead (successive calls to the resolver methods without error - see example test #1 in the linked logging), but really it's not

Any insights greatly appreciated.

Roger

________________________________
This message (and any attachments) is for the recipient only. NERC is subject to the Freedom of Information Act 2000 and the contents of this email and any reply you make may be disclosed by NERC unless it is exempt from release under the Act. Any material supplied to NERC may be stored in an electronic records management system.
________________________________

Hi!

Duthie, Roger J.A. wrote on 25.06.2018 at 13:38:
> Hello again,
> An update on my attempts to work out the behaviour of the parsers/resolvers. This thread (Question 2) involved the validation of XML against a schematron and the apparent inability of the parser to resolve remote resources.
> It seemed that my [real world] example was throwing up a few issues. Going back to my example code and output: https://gist.github.com/rogie-bas/e157501d12cff8ac0aafc0cc120ce8dc
> (The link in the previous email may have been broken because the full stop was added to the URL; the above will hopefully work without modification!)
> The first test logging (base resolver, no_network=True) shows that the validation fails with an XSLTApplyError, due to a resource not being resolvable (..../gmxCodeLists.xml). Confusingly, it doesn't seem to be the first one on the list of resources that the code has tried to resolve. However, the behaviour is different if no_network=False: the resource that the code chokes on is a different one (.../SDN:L11::D08). I may have a tentative reason for this (based on empirical evidence).
> I broke my schematron file down into single patterns, each of which I judged to target a part of the XML file containing various remote resources. This way I could try resolving the various resources with the various parsers, with no_network = {True | False}. The gmxCodeLists.xml resource is not resolved with the base resolver if no_network=True, but is when no_network=False. The resolution of the other resource (SDN:L11::D08) never happens. When using the resolve_filename() method rather than the base resolver, the error is different: an OSError is thrown for these resources, as if the resource was deemed to be a file system resource instead of a remote resource; however, gmxCodeLists.xml is resolved correctly. The resolver I created using the Requests package doesn't have a problem with this resource.
I find it difficult to follow this description, but at least the gist cases 1-4 looked reasonable to me. Whether an OSError is the best exception for a remote URL, ... well, I guess it's not, but I'm not sure it's worth fixing (as it would probably also break someone's existing code).
> I've now got a hunch that this is perhaps because the URI for this problematic resource is a redirect (checking with curl -IL, shows that it is a 302/301/200 chain). I can't currently rule out that there is something else about this resource that makes it unresolvable, but when I change the references to this resource, to point at the redirect, the validation completes. My next task is to create some simple external resources with redirects.
> Is any of this behaviour expected?
> - no_network has no effect apart from with the base resolver
Basically, when you define your own custom resolver, it's up to that resolver to decide whether it should allow network access or not. If your resolver passes the lookup on to the default resolver, either by resolving to a filename/URL or by not resolving at all, then the option applies again. Furthermore, "no_network" is a parser option. The fact that it also impacts the document loading in isoschematron (i.e. XSLT) is more of a (reasonable) side-effect because that uses the parser of the input document to load related documents. ("Input document" here is the actual input/validated document for XSLT/isoschematron for lookups related to the input document, and the schema document for lookups related to the schema, e.g. includes etc.)
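As a rough illustration of the point about custom resolvers, here is a minimal sketch of a resolver that makes its own decision about network access. The class name PolicyResolver, the local_copies mapping and the example.invalid URL are invented for this example; only etree.Resolver, resolve(), resolve_string(), resolve_empty() and the parser options come from lxml. Known URLs are answered from memory, other http(s) URLs are refused by resolving them to an empty document, and anything else returns None so that the lookup falls through to the default resolver, where no_network is taken into account again.

```python
from io import BytesIO

from lxml import etree


class PolicyResolver(etree.Resolver):
    """Hypothetical resolver that applies its own network policy."""

    def __init__(self, local_copies, allow_network=False):
        super().__init__()
        self.local_copies = local_copies  # assumed mapping: URL -> content
        self.allow_network = allow_network
        self.seen = []                    # every URL the parser asked about

    def resolve(self, system_url, public_id, context):
        self.seen.append(system_url)
        if system_url in self.local_copies:
            # Answered from memory: the parser's no_network is not consulted.
            return self.resolve_string(self.local_copies[system_url], context)
        is_remote = bool(system_url) and system_url.startswith(
            ("http://", "https://"))
        if is_remote and not self.allow_network:
            # This resolver's own policy: refuse with an empty document.
            return self.resolve_empty(context)
        # Not resolving at all hands the lookup to the default resolver.
        return None


DTD_URL = "http://example.invalid/demo.dtd"  # made-up URL, never fetched
resolver = PolicyResolver({DTD_URL: '<!ENTITY greeting "hello">'})

parser = etree.XMLParser(load_dtd=True, no_network=True)
parser.resolvers.add(resolver)

xml = (b'<!DOCTYPE doc SYSTEM "http://example.invalid/demo.dtd">'
       b'<doc>&greeting;</doc>')
root = etree.parse(BytesIO(xml), parser).getroot()

print(resolver.seen)  # the URLs that reached the custom resolver
print(root.text)      # text of <doc> after entity replacement
```

Recording the requested URLs in a list, as above, is one way to see which lookups reach the custom resolver at all and which ones never do.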
> - redirects are unresolvable by the base resolver and resolve_filename base resolution
I don't think I ever tested redirects, but the input layer of libxml2 should handle them. I would expect redirects to work for both the default resolver and after resolving through resolve_filename(). If they don't then that's a bug. OTOH, if custom resolvers are in place then the question is whether each redirect should first call back and ask them. I guess it would be nice if it did, but I'm not sure if this can be done (reasonably easily) with libxml2.
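For reproducing the redirect case without external hosting, a throwaway local server may be enough. The sketch below uses only the Python standard library to serve a 302 -> 301 -> 200 chain like the one reported by curl -IL; the handler name, the paths and the response body are all made up for illustration. It then follows the chain with urllib as a sanity check that the server behaves as intended.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<codelist/>"  # placeholder payload for the final resource


class RedirectChainHandler(BaseHTTPRequestHandler):
    """Serves /start -> 302 -> /middle -> 301 -> /final.xml -> 200."""

    REDIRECTS = {"/start": (302, "/middle"), "/middle": (301, "/final.xml")}

    def _respond(self, include_body):
        if self.path in self.REDIRECTS:
            status, location = self.REDIRECTS[self.path]
            self.send_response(status)
            self.send_header("Location", location)
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == "/final.xml":
            self.send_response(200)
            self.send_header("Content-Type", "application/xml")
            self.send_header("Content-Length", str(len(BODY)))
            self.end_headers()
            if include_body:
                self.wfile.write(BODY)
        else:
            self.send_error(404)

    def do_GET(self):
        self._respond(include_body=True)

    def do_HEAD(self):  # lets "curl -IL" style checks work as well
        self._respond(include_body=False)

    def log_message(self, format, *args):
        pass  # keep the test output quiet


def fetch_following_redirects(url):
    """Follow redirects with urllib, ignoring any proxy settings."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return response.geturl(), response.status, response.read()


server = HTTPServer(("127.0.0.1", 0), RedirectChainHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

final_url, status, body = fetch_following_redirects(base + "/start")

server.shutdown()
server.server_close()
```

While the server is running, the same base + "/start" URL could be handed to a parser created with no_network=False, with and without custom resolvers, to see how the redirect is handled there. What that shows is the open question here, so the sketch makes no claim about the result.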
> - resolution seemingly going ahead (successive calls to the resolver methods without error - see example test #1 in the linked logging), but really it's not
At least the duplicate calls to the resolver seem ... unexpected. I don't see why that would happen. I sense some room for improvements here... :)

Stefan

Hi Stefan & the LXML list,

Thanks for the feedback on these issues. Once I have more information, I'm happy to help improve the documentation to explain these quirks and gotchas to new users (as long as they're not bugs). For example:

Stefan wrote:
> Basically, when you define your own custom resolver, it's up to that resolver to decide whether it should allow network access or not.
^^ This would be useful to have in the docs! More generally, though, what's going on with the resolvers is not clear from the documentation as it stands. I've already tried to dive into the Cython and (libxml/libxslt) C code, and I don't know if I would have much time to help debug this stuff. Is there any developers' documentation explaining how the package is plumbed?

Update on the testing:
----------------------------
I've been testing the redirects with some files hosted on Amazon's S3. So far, with an XMLSchema object at least (i.e. as opposed to an isoschematron object), a validator can be created, i.e. etree.XMLSchema(schema_etree) can resolve all the resources. The test resources feature 301 and 302 redirects.

I now suspect that SSL might be to blame, as the [real world] example I had previously had a redirect to an https URL. I need to do some tests on this, but setting up an SSL certificate is a little more work. Is it already known that libxml2 cannot handle https?

Roger

_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
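On the https question: as far as I know, libxml2's built-in HTTP client only speaks plain HTTP, so a redirect that ends at an https URL would not be loadable through it. One workaround, in the spirit of the Requests-based resolver mentioned earlier in the thread, is to do the download in Python and hand the bytes to the parser. The sketch below is only an illustration: PythonFetchResolver, fetch_with_urllib and the example.invalid URL are made-up names, and the demonstration uses a stub instead of a real download so that it runs offline.

```python
from io import BytesIO
import urllib.request

from lxml import etree


def fetch_with_urllib(url):
    """Download url in Python; urllib follows redirects and supports https."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


class PythonFetchResolver(etree.Resolver):
    """Hypothetical resolver that loads http(s) resources in Python."""

    def __init__(self, fetch=fetch_with_urllib):
        super().__init__()
        self.fetch = fetch  # injectable, so tests can avoid the network

    def resolve(self, system_url, public_id, context):
        if system_url and system_url.startswith(("http://", "https://")):
            data = self.fetch(system_url)
            # base_url keeps relative references inside the resource working.
            return self.resolve_string(data, context, base_url=system_url)
        return None  # everything else goes to the default resolver


# Offline demonstration with a stub in place of the real download.
requested = []

def fake_fetch(url):
    requested.append(url)
    return b'<!ENTITY source "fetched in Python">'


parser = etree.XMLParser(load_dtd=True)
parser.resolvers.add(PythonFetchResolver(fetch=fake_fetch))

xml = (b'<!DOCTYPE doc SYSTEM "https://example.invalid/demo.dtd">'
       b'<doc>&source;</doc>')
root = etree.parse(BytesIO(xml), parser).getroot()
print(requested, root.text)
```

Because such a resolver answers the lookup itself, the parser's no_network option does not restrict it; any limit on which hosts may be contacted would have to be enforced inside resolve().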
participants (2)
- Duthie, Roger J.A.
- Stefan Behnel