lxml 6.0.0: XMLSchemaParseError: Invalid argument
Hi,

I’m upgrading a project from lxml 5.4.0 to the newly released lxml 6.0.0 and encountering an unexpected XMLSchemaParseError. I’ve distilled the problem into a minimal, self-contained example and uploaded it as a GitHub gist:

https://gist.github.com/AustinMatherne/533a4b6a31a63e11bfd8c09c03c05183

* The same XML and XSD files parse and schema validate cleanly with lxml 5.4.0.
* With lxml 6.0.0, calling XMLSchema() raises an XMLSchemaParseError with no obvious culprit.

Is this a bug in libxml, lxml, or am I doing something unsupported with the API?

Thank you for your time and help,
Austin
Hi,

Austin Matherne wrote on 01.07.25 at 04:01:
Is this a bug in libxml, lxml, or am I doing something unsupported with the API?
So, I added a print(system_url) to your resolver. Where the working version downloads a whole pack of schema files transitively, the failing version only gives the following output:

"""
READ http://www.w3.org/2001/xml.xsd
READ http://www.xbrl.org/2013/inlineXBRL/xhtml-inlinexbrl-1_1-modules.xsd
Traceback (most recent call last):
  File "/home/stefan/source/Python/lxml/lxml-hg/TEST/schema_error_ml_20250701/lxml.test.py", line 45, in <module>
    schema = etree.XMLSchema(schema_tree)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/lxml/xmlschema.pxi", line 90, in lxml.etree.XMLSchema.__init__
lxml.etree.XMLSchemaParseError: Invalid argument, line 1, column 37
"""

First of all, I highly recommend setting up XML catalogues on your system to avoid downloading the schemas over and over again. There is really a lot of useless network back and forth, server usage, waiting time etc. going on here that can be avoided entirely by installing local copies of the schemas. libxml2 will search the usual system directories automatically when asked to use a schema and thus avoid any network traffic.

Then, it seems to fail immediately at the first included schema file, at a suspicious position of 37 characters, which is right after the XML declaration. That hints more at something going wrong in libxml2 than in lxml, but it is so obviously broken that it seems unlikely to have gone undetected in libxml2 releases. I recommend bringing this to the attention of the libxml2 developers.

Stefan
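As a rough illustration of the catalogue suggestion above, here is a minimal sketch that generates an OASIS XML catalog mapping schema URLs to local copies. The function name build_catalog and the local paths are made up for this example. Which entry type libxml2 consults for schema includes is not stated in this thread, so the sketch simply emits both a system and a uri entry per schema. libxml2 can be pointed at such a file through the XML_CATALOG_FILES environment variable.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace of the OASIS XML Catalogs format.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def build_catalog(mapping):
    """Return an XML catalog (as a string) mapping remote URLs to local files.

    `mapping` is a dict of {remote_url: local_path}; the paths are whatever
    local copies you have installed (hypothetical in this sketch).
    """
    # Serialize the catalog namespace as the default namespace.
    ET.register_namespace("", CATALOG_NS)
    root = ET.Element(f"{{{CATALOG_NS}}}catalog")
    for url, local_path in sorted(mapping.items()):
        file_uri = Path(local_path).resolve().as_uri()
        # Assumption: emit both entry types so that lookups by system
        # identifier and lookups by URI can both be answered locally.
        ET.SubElement(root, f"{{{CATALOG_NS}}}system",
                      {"systemId": url, "uri": file_uri})
        ET.SubElement(root, f"{{{CATALOG_NS}}}uri",
                      {"name": url, "uri": file_uri})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical local copy of one of the schemas from the output above.
    print(build_catalog({"http://www.w3.org/2001/xml.xsd": "schemas/xml.xsd"}))
```

To use it, one would write the returned string to a file and set XML_CATALOG_FILES to that file's path before the process starts parsing, so that libxml2 picks it up when it loads its catalogs.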
Stefan Behnel wrote on 06.07.25 at 08:38:
Then, it seems to fail immediately at the first included schema file, at a suspicious position of 37 characters, which is right after the XML declaration. That hints more at something going wrong in libxml2 than lxml but is so surprisingly obviously not working that it's unlikely to go undetected in libxml2 releases. I recommend bringing this to the attention of the libxml2 developers.
Actually, it *was* something that lxml can resolve on its own side. libxml2 got a new API for passing data from resolvers into the parser and lxml didn't use that yet but had to resort to some manual setup that apparently no longer works in libxml2 2.14+.

There is a test for this, so I'm not sure why it didn't fail when switching to libxml2 2.14, but in any case, I pushed a fix to the 6.0 branch that resolves it on my side:

https://github.com/lxml/lxml/commit/2aae3a9625fcb858f83715a81b4d7182d2529a09

I'll release a bug fix version soon.

Stefan
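For anyone wanting to check whether an installation matches the combination reported in this thread, a small helper along these lines might do. It is only a heuristic sketch: the function name is made up, and it covers just the reported case (lxml 6.0.0 on libxml2 2.14 or later), not other combinations that nobody here has tested. The version tuples are the kind that lxml exposes as etree.LXML_VERSION and etree.LIBXML_VERSION.

```python
def matches_reported_setup(lxml_version, libxml_version):
    """Return True for the combination reported in this thread.

    Heuristic only: lxml 6.0.0 running on libxml2 2.14 or later.
    Other combinations are not covered by the reports above.
    """
    is_lxml_600 = tuple(lxml_version[:3]) == (6, 0, 0)
    is_libxml_214_plus = tuple(libxml_version[:2]) >= (2, 14)
    return is_lxml_600 and is_libxml_214_plus


# Typical use, assuming lxml is installed:
#
#     from lxml import etree
#     if matches_reported_setup(etree.LXML_VERSION, etree.LIBXML_VERSION):
#         print("custom resolvers may hit the XMLSchemaParseError above")
```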
Hi all,

Thanks Stefan! I also looked into this, and it appears that index.buf or index->buf (C) is not getting set on the lxml side or the libxml2 side. It looks like the call on line 487 of parser.pxi, c_input = xmlparser.xmlNewInputStream(c_context), calls a function in libxml2's parserInternals.c that has been deprecated for at least 11 months. So Stefan's fix probably just updates lxml to use the updated libxml2 API, which *does* set buf.

For those who want more details: It was probably deprecated because of the new functions starting with xmlNewInputFrom. The containing function in lxml, _local_resolver, is passed into libxml2's xmlSetExternalEntityLoader in _register_document_loader. xmlSetExternalEntityLoader itself replaces xmlDefaultExternalEntityLoader with a custom callback. For reference, xmlDefaultExternalEntityLoader *does* in fact set the input->buf. If you follow a few function calls down to xmlNewInputFromUrl, there is a call to xmlParserInputBufferCreateUrl, which creates the buffer. However, the calls in lxml and the deprecated function leave the buffer as NULL.

Best,
Abe

On Sun, Jul 6, 2025 at 3:59 AM Stefan Behnel via lxml - The Python XML Toolkit <lxml@python.org> wrote:
I'll release a bug fix version soon.
Stefan
_______________________________________________ lxml - The Python XML Toolkit mailing list -- lxml@python.org To unsubscribe send an email to lxml-leave@python.org https://mail.python.org/mailman3//lists/lxml.python.org Member address: abepolk@gmail.com
My bad, I meant input.buf and input->buf, not index->buf.

Best,
Abe

On Sun, Jul 6, 2025 at 9:11 AM Abraham Polk <abepolk@gmail.com> wrote:
Hi all,
Thanks Stefan! I also looked into this, and it appears that index.buf or index->buf (C) is not getting set on the lxml side or the libxml2 side. It looks like the call on line 487 of parser.pxi, c_input = xmlparser.xmlNewInputStream(c_context), calls a deprecated (since at least 11 months ago) function in libxml2's parserInternals.c. So Stefan's fix probably just updates lxml to use the updated libxml2 API, which *does* set buf.
Thank you for digging into this! I’ll test within our application (the open-source Arelle project) as soon as there’s a new build.

Regarding catalogs: the application does use them for standard, known schemas, but it also processes user-provided XML (XBRL packages, which have their own spec that includes XML catalogs). Some of these user-provided packages include catalogs, but not all of them require it, so we have to handle both cases.

Thanks again,
Austin

On Sun, Jul 6, 2025 at 9:16 AM, Abraham Polk via lxml - The Python XML Toolkit <lxml@python.org> wrote:
My bad, I meant input.buf and input->buf, not index->buf.
Best, Abe
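Austin's point about having to handle packages both with and without catalogs could be sketched roughly as follows. This is not Arelle's actual code: the function names, the prefix table and the fetch callback are all hypothetical. The prefix table stands in for catalog rewriteURI entries (uriStartString mapped to rewritePrefix), where the longest matching prefix wins, and anything without a match falls back to a caller-supplied fetch function.

```python
from typing import Callable, Dict, Optional, Tuple


def rewrite_with_catalog(url: str, rewrites: Dict[str, str]) -> Optional[str]:
    """Apply the longest matching prefix rewrite, or return None if none match.

    `rewrites` maps a URL prefix to its local replacement prefix.
    """
    matches = [prefix for prefix in rewrites if url.startswith(prefix)]
    if not matches:
        return None
    best = max(matches, key=len)  # the longest matching prefix wins
    return rewrites[best] + url[len(best):]


def locate_schema(url: str,
                  rewrites: Dict[str, str],
                  fetch: Callable[[str], object]) -> Tuple[str, object]:
    """Prefer a catalog hit; otherwise fall back to fetching the URL."""
    local = rewrite_with_catalog(url, rewrites)
    if local is not None:
        return ("local", local)
    # No catalog entry: the package did not ship one for this URL.
    return ("remote", fetch(url))
```

A package without a catalog would simply pass an empty rewrites table, so every lookup goes through the fetch callback.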
participants (3)
- Abraham Polk
- Austin Matherne
- Stefan Behnel