insert() makes _ElementTree forget?
Dear LXML.de mailing list,

This is my first post to the list. I have been using LXML for a few months now and have been learning the idiosyncrasies of the package over this time. I'm now at the point where I can question the behaviour of the package. Hopefully, someone can explain what is going on in my code and whether it is known and anticipated behaviour.

The situation I'd like to present occurs when parsing an XSD, then modifying it and creating an XML Schema validator using LXML. It seems that if I parse an XSD - i.e. parse([...XSD...]) - with relative links - in an <xs:include> element, in this case - then create a validator from the resultant _ElementTree, all is well (cf. Example 1, below); however, if I modify the _ElementTree - by inserting an <xs:import> element, in this case - the validator creation fails (cf. Example 2). If the XSD is modified to have an absolute path specified in the <xs:include> element, the validator creation succeeds (cf. Example 3).

So, I can get round the problem by making further modifications, but why should I have to? (In fact, I'm not sure that this modification would be so trivial in the general case.) Additionally, specifying a base_url at the parse() stage seems to have no effect.

It seems like the _ElementTree object forgets its origin when the tree is modified. Is this expected? I would like to be able to understand the processes LXML is undertaking. Perhaps there is another way to parse, which is better suited.

I look forward to hearing sage words of wisdom.

Thanks in advance,
Roger

--------------
Code examples:
--------------

This example shows that the XSD is fine if a validator is simply created without modifying the _ElementTree:

# Example 1 : this code is fine
from lxml import etree

# create initial xsd etree
gco_etree = etree.parse('http://standards.iso.org/ittf/PubliclyAvailableStandards/ISO_19139_Schemas/g...')

# attempt to create a validator - THIS IS FINE!
gco_val = etree.XMLSchema(gco_etree)

This second one shows that my intended method fails if I add an additional _Element to the _ElementTree:

# Example 2: this code will fail
from lxml import etree

# create initial xsd etree
gco_etree = etree.parse('http://standards.iso.org/ittf/PubliclyAvailableStandards/ISO_19139_Schemas/g...')

# create the SDN extension import element
xs_namespace = 'http://www.w3.org/2001/XMLSchema'
xs_nsmap = {'xs':xs_namespace}
import_tag = '{%s}import' % xs_namespace
sdn_import_attrib = {'namespace':'http://www.seadatanet.org','schemaLocation':'http://schemas.seadatanet.org/Standards-Software/Metadata-formats/SDN2_CDI_ISO19139_10.0.1.xsd'}
sdn_import_element = etree.Element(import_tag, attrib=sdn_import_attrib, nsmap=xs_nsmap)

# inject the SDN import element into the ISO schema
gco_etree.getroot().insert(0,sdn_import_element)

# attempt to create a validator - THIS WILL FAIL
gco_val = etree.XMLSchema(gco_etree)

This third one shows that it is the relative link in the include element that causes the issue:

# Example 3 : this code is also fine
from lxml import etree

# create initial xsd etree
gco_etree = etree.parse('http://standards.iso.org/ittf/PubliclyAvailableStandards/ISO_19139_Schemas/g...')

# create the SDN extension import element
xs_namespace = 'http://www.w3.org/2001/XMLSchema'
xs_nsmap = {'xs':xs_namespace}
import_tag = '{%s}import' % xs_namespace
sdn_import_attrib = {'namespace':'http://www.seadatanet.org','schemaLocation':'http://schemas.seadatanet.org/Standards-Software/Metadata-formats/SDN2_CDI_ISO19139_10.0.1.xsd'}
sdn_import_element = etree.Element(import_tag, attrib=sdn_import_attrib, nsmap=xs_nsmap)

# remove the old include element
bad_include = gco_etree.getroot().find('xs:include',namespaces=xs_nsmap)[0]
gco_etree.getroot().remove(bad_include)

# create absolute include element
include_tag = '{%s}include' % xs_namespace
gcoBasicTypes_include_attrib = {'schemaLocation':'http://www.isotc211.org/2005/gco/basicTypes.xsd'}
gcoBasicTypes_include_element = etree.Element(include_tag, attrib=gcoBasicTypes_include_attrib, nsmap=xs_nsmap)

# inject the new include
gco_etree.getroot().insert(0,gcoBasicTypes_include_element)

# inject the SDN import element into the ISO schema
gco_etree.getroot().insert(0,sdn_import_element)

# attempt to create a validator - THIS IS ALSO FINE
gco_val = etree.XMLSchema(gco_etree)

-----------
System Info
-----------

Python : sys.version_info(major=3, minor=4, micro=5, releaselevel='final', serial=0)
lxml.etree : (3, 8, 0, 0)
libxml used : (2, 9, 3)
libxml compiled : (2, 9, 3)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)

________________________________
This message (and any attachments) is for the recipient only. NERC is subject to the Freedom of Information Act 2000 and the contents of this email and any reply you make may be disclosed by NERC unless it is exempt from release under the Act. Any material supplied to NERC may be stored in an electronic records management system.
________________________________
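As a rough illustration of the general-case rewrite mentioned in the message above, one option might be to resolve every relative schemaLocation against a known base URL before building the validator. The sketch below is only that, a sketch: it uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without network access, on the assumption that the same iter()/get()/set() calls behave alike on an lxml tree. The helper name absolutize_schema_locations, the sample schema, and the example.org URLs are all hypothetical. With lxml, the base URL could presumably be taken from gco_etree.docinfo.URL, but that has not been checked against the failing case in this thread.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)
from urllib.parse import urljoin

XS = "http://www.w3.org/2001/XMLSchema"
# Schema elements that may carry a schemaLocation attribute.
LINK_TAGS = ("include", "import", "redefine")


def absolutize_schema_locations(root, base_url):
    """Rewrite relative schemaLocation attributes in place.

    Returns the number of attributes that were changed.
    """
    changed = 0
    for local_name in LINK_TAGS:
        for element in root.iter("{%s}%s" % (XS, local_name)):
            location = element.get("schemaLocation")
            if location is None:
                continue
            # urljoin leaves an already-absolute URL untouched.
            absolute = urljoin(base_url, location)
            if absolute != location:
                element.set("schemaLocation", absolute)
                changed += 1
    return changed


# Hypothetical schema and base URL, for illustration only.
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="basicTypes.xsd"/>
  <xs:import namespace="http://example.org/other"
             schemaLocation="http://example.org/other/other.xsd"/>
</xs:schema>"""

sample_root = ET.fromstring(SAMPLE_XSD)
changed_count = absolutize_schema_locations(
    sample_root, "http://example.org/schemas/gco/gco.xsd"
)
include_location = sample_root.find("{%s}include" % XS).get("schemaLocation")
import_location = sample_root.find("{%s}import" % XS).get("schemaLocation")
```

Only the relative include is rewritten; the import, which already carries an absolute URL, is left alone. Whether this is enough to make XMLSchema() succeed on the modified tree is an open question here.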
P.S. small error in Example 3:

bad_include = gco_etree.getroot().find('xs:include',namespaces=xs_nsmap)

No index required, as there is only one _Element returned and bad_include is a new variable.

________________________________
From: lxml <lxml-bounces@lxml.de> on behalf of Duthie, Roger J.A. <rogie@bas.ac.uk>
Sent: 08 November 2017 14:31:58
To: lxml@lxml.de
Subject: [lxml] insert() makes _ElementTree forget?
...and the problem evolves...

I found that:

1. if I substitute http://www.isotc211.org/2005/gco/basicTypes.xsd into the _ElementTree, the validator creation works;
2. if I substitute http://standards.iso.org/ittf/PubliclyAvailableStandards/ISO_19139_Schemas/g... into the _ElementTree, the validator creation fails.

So, what's the difference? The first one is based on the URL given in the namespace; the second is another repository for the standard. I'm not sure, currently, what the difference between these repositories is; I'll look into that. It seems the first one is version="0.1" and generated on 01-26-2005, and the second is version="2012-07-13".

However, the important difference, for the sake of understanding the problem that LXML is presenting me with, is that the include links within these two versions of the XSD have different forms:

1. this one has links that look like: "../gco/gcoBase.xsd"
2. this one has links that look like: "gcoBase.xsd"

They both resolve to a valid resource, but it seems that LXML has an issue with the latter.

Is this a problem with not specifying the correct type of Resolver? I'm going to have a look into that possibility next.

Any suggestions are welcome.

Cheers,
Roger
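To make the point about the two link forms concrete, here is a small sketch of how each would resolve against a base URL under standard relative-URL rules, using urllib.parse.urljoin from the Python standard library. The first base is the isotc211.org URL quoted in the message; the second repository's URL is truncated in the thread, so a hypothetical example.org base stands in for it. This says nothing definite about what libxml2 does internally; it only shows that both forms are resolvable when a base is known, and that the bare form stays relative when the base is missing.

```python
from urllib.parse import urljoin

# Base URL of the first repository, as quoted in the message above.
isotc_base = "http://www.isotc211.org/2005/gco/basicTypes.xsd"
# Hypothetical base standing in for the second repository (URL truncated in the thread).
other_base = "http://example.org/schemas/gco/basicTypes.xsd"

# Form 1: link that steps up to the parent directory and back down.
resolved_parent_form = urljoin(isotc_base, "../gco/gcoBase.xsd")

# Form 2: bare sibling file name.
resolved_sibling_form = urljoin(other_base, "gcoBase.xsd")

# With no base at all, the bare name cannot be turned into an absolute URL.
resolved_without_base = urljoin("", "gcoBase.xsd")

print(resolved_parent_form)
print(resolved_sibling_form)
print(resolved_without_base)
```

If the modified tree really does lose its base URL, the last case would be consistent with the failure seen in Example 2, though that remains a guess rather than a confirmed diagnosis.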