Tested value modified in validation error message
Before I file a ticket, I wanted to check here to find out if this is a known problem (I searched the bug tracker, but didn't see any matches).

Our customer is using a commercial XML editor and the editor's implementation of the DOM API sometimes garbles whitespace inside text nodes when serializing XML documents for the document's xml property. This causes schema validation errors when we send that serialization off to be validated against the document type's schema using lxml.

The problem we have run into with lxml is that the message property of the error object which comes back in the schema object's error_log has altered the value being validated. So, for example, if the text content of an element in the serialized DOM is "Processing\ncomplete" and the value it was supposed to match in the enumerated valid values is "Processing complete" then the message attribute for the corresponding object in the error_log has

    Element 'ProcessingStatusValue': [facet 'enumeration'] The value 'Processing complete' is not an element of the set {'Ready for English peer review', 'Ready for English scientific review', 'Ready for English OCCM review', 'Ready for Spanish peer review', 'Ready for Spanish OCCM review', 'Ready for publishing', 'Ready for translation', 'Processing complete'}.

In effect saying "the value 'Processing complete' does not match the value 'Processing complete'."

Took a very long time for us to figure out that the error message was wrong, and that it should have said

    ... The value 'Processing\ncomplete' is not an element of the set ....

-- Bob Kline https://www.rksystems.com mailto:bkline@rksystems.com
On Sun, May 2, 2021 at 8:31 AM Bob Kline <bkline@rksystems.com> wrote:
... Element 'ProcessingStatusValue': [facet 'enumeration'] The value 'Processing complete' is not an element of the set {'Ready for English peer review', 'Ready for English scientific review', 'Ready for English OCCM review', 'Ready for Spanish peer review', 'Ready for Spanish OCCM review', 'Ready for publishing', 'Ready for translation', 'Processing complete'}. ...
I see that the email processing pipeline has somewhere along the way applied some word-wrapping to my previous message. The message attribute for the error object was a single-line string. So something like The value 'A B C' is not an element of the set {'A B C', 'D E F'} -- Bob Kline https://www.rksystems.com mailto:bkline@rksystems.com
Hi Bob

If the input is "Processing\ncomplete" and the match string is "Processing complete" then this will not be a match, but if your enumeration value is "Pattern\scomplete" it should be OK (and it will also match "Pattern complete").

(I guess the "sometimes garbles whitespace inside text nodes" is just the commercial XML editor wrapping lines that could exceed a 'pretty' length.)

Paul
On Sun, May 2, 2021 at 10:37 AM Paul Higgs <paul_higgs@hotmail.com> wrote:
If the input is "Processing\ncomplete" and the match string is "Processing complete" then this will not be a match, but if your enumeration value is "Pattern\scomplete" it should be OK (and it will also match "Pattern complete"). (I guess the "sometimes garbles whitespace inside text nodes" is just the commercial XML editor wrapping lines that could exceed a 'pretty' length.)
Hi, Paul.

We're dealing with two bugs. The first bug is the behavior of the editor vendor's implementation of the DOM API, which is giving us back corrupted XML when we ask for the serialized document via the Document object's xml property. We know that it's the implementation of the serialization for that property which is introducing the corruption, rather than corruption of the values in the DOM itself, because when we recursively walk through all the nodes of the DOM to implement our own serialization of the document, the corruption is not present, and the whitespace inside all of the text nodes is intact. We have reported that problem to the editor vendor, and I have implemented a workaround to avoid this first bug.

The bug I'm asking about in this forum is the second bug, which is producing an incorrect error message, pretending that the value being submitted for testing was "Processing complete" (without a newline character) when the value being tested was actually "Processing\ncomplete" (with a newline character). The confusion this misleading error message introduced made it much more difficult than it should have been to track down and identify the first bug.

Does that make things clearer?

-- Bob Kline https://www.rksystems.com mailto:bkline@rksystems.com
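A check along the following lines can make this kind of mismatch visible before the value ever reaches the validator. It is only a sketch: the helper name `whitespace_only_mismatches` is hypothetical, it is not part of lxml or libxml2, and the shortened list of allowed values is taken from the error message quoted above.

```python
import re


def whitespace_only_mismatches(value, allowed):
    """Return the allowed values that differ from `value` only in whitespace.

    Hypothetical diagnostic helper; returns an empty list when `value`
    is an exact member of `allowed` or when nothing is a near match.
    """
    def collapse(text):
        # Treat every run of whitespace as one space and trim the ends.
        return re.sub(r"\s+", " ", text).strip()

    if value in allowed:
        return []
    return [candidate for candidate in allowed
            if collapse(candidate) == collapse(value)]


allowed = ["Ready for publishing", "Ready for translation", "Processing complete"]
value = "Processing\ncomplete"

for candidate in whitespace_only_mismatches(value, allowed):
    # !r prints the repr, so the newline shows up as \n instead of a line break.
    print(f"{value!r} differs from {candidate!r} only in whitespace")
```

Printing with `!r` matters here: a plain `print(value)` would put the newline on screen as a line break, which is easy to miss.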
Thanks Bob, that's much clearer.

Rather than the current error message of

    Element 'ProcessingStatusValue': [facet 'enumeration'] The value 'Processing complete' is not an element of the set {'Ready for English peer review', 'Ready for English scientific review', 'Ready for English OCCM review', 'Ready for Spanish peer review', 'Ready for Spanish OCCM review', 'Ready for publishing', 'Ready for translation', 'Processing complete'}.

you would rather see

    Element 'ProcessingStatusValue': [facet 'enumeration'] The value 'Processing\ncomplete' is not an element of the set {'Ready for English peer review', 'Ready for English scientific review', 'Ready for English OCCM review', 'Ready for Spanish peer review', 'Ready for Spanish OCCM review', 'Ready for publishing', 'Ready for translation', 'Processing complete'}.

for example.

I know where this error message is being generated in libxml2 (in xmlschemas.c). If you could send a small XML instance and schema that fails, I can run the debugger against xmllint to understand the logic leading to this. It is probably an error in libxml2 rather than in the lxml Python bindings.

Cheers
Paul
Hi, Bob Kline wrote on 02.05.21 at 18:56:
The bug I'm asking about in this forum is the second bug, which is producing an incorrect error message, pretending that the value being submitted for testing was "Processing complete" (without a newline character) when the value being tested was actually "Processing\ncomplete" (with a newline character). The confusion this misleading error message introduced made it much more difficult than it should have been to track down and identify the first bug.
It might be that there is some form of space normalisation going on here. In any case, it's not lxml that comes up with this message but libxml2. (In which case this is the wrong place to report this.) You can check whether the "xmllint" tool (that comes with libxml2 as a frontend) shows the same behaviour when you run its XML-Schema validation. Note that lxml probably uses the latest libxml2 on your side (the wheels include it), which may not be the same as the library version installed on your system.

Stefan
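For background on the "space normalisation" idea: XML Schema defines a whiteSpace facet with three modes. xs:string uses "preserve", xs:normalizedString uses "replace", and xs:token uses "collapse". The sketch below restates those rules in plain Python as an illustration. The function name `apply_whitespace_facet` is made up for this example, and this is not how libxml2 implements the facet internally.

```python
import re


def apply_whitespace_facet(value, mode):
    """Illustrative restatement of the XML Schema whiteSpace facet rules."""
    if mode == "preserve":
        # xs:string: the value is left untouched, so a newline stays a newline.
        return value
    # "replace": every tab, line feed and carriage return becomes a space.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    if mode == "replace":
        return replaced
    if mode == "collapse":
        # "collapse": after replacement, squeeze runs of spaces and trim the ends.
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError(f"unknown whiteSpace mode: {mode!r}")


for mode in ("preserve", "replace", "collapse"):
    print(mode, repr(apply_whitespace_facet("Processing\ncomplete", mode)))
```

Under these rules a type restricted from xs:string keeps the newline, so "Processing\ncomplete" is compared as-is and does not equal the enumeration value "Processing complete". Had the base type been xs:token, the collapsed value would match.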
I just created the following

Bob1.xsd

    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">
      <xs:element name="root">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="Ready for English peer review"/>
            <xs:enumeration value="Ready for English scientific review"/>
            <xs:enumeration value="Ready for English OCCM review"/>
            <xs:enumeration value="Ready for Spanish peer review"/>
            <xs:enumeration value="Ready for Spanish OCCM review"/>
            <xs:enumeration value="Ready for publishing"/>
            <xs:enumeration value="Ready for translation"/>
            <xs:enumeration value="Processing complete"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
    </xs:schema>

Bob1.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="scema1.xsd">Processing
    complete</root>

.. with a CR after 'Processing' and the command "xmllint --schema Bob1.xsd Bob1.xml" fails with the message...
    xmllint --schema schema1.xsd doc1.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="scema1.xsd">Processing
    complete</root>
    doc1.xml:2: element root: Schemas validity error : Element 'root': [facet 'enumeration'] The value 'Processing
    complete' is not an element of the set {'Ready for English peer review', 'Ready for English scientific review', 'Ready for English OCCM review', 'Ready for Spanish peer review', 'Ready for Spanish OCCM review', 'Ready for publishing', 'Ready for translation', 'Processing complete'}.
    doc1.xml:2: element root: Schemas validity error : Element 'root': 'Processing
    complete' is not a valid value of the local atomic type.
    doc1.xml fails to validate
I will look into libxml2 to see where the \n is being converted to a <space>.

Paul
On Sun, May 2, 2021 at 1:45 PM Paul Higgs <paul_higgs@hotmail.com> wrote:
I just created the following ... I will look into the libxml2 to see where the \n is being converted to a <space>. ....
Actually, your repro case is showing the correct behavior. In other words, the \n is NOT being converted to a space character. I created a comparable repro case using lxml.etree, and observed the correct behavior there, as well. So I dug a little further, and found that the second bug is actually in our own code. So, I am suitably embarrassed, and I apologize for the wild goose chase. Thank you both for helping to nudge me back onto the path of truth. :-) Regards, Bob -- Bob Kline https://www.rksystems.com mailto:bkline@rksystems.com
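A comparable repro with lxml.etree might look like the sketch below. It is not Bob's actual test code, which was not posted. The schema is a shortened version of the one Paul posted, and the exact message text depends on the libxml2 version that lxml is built against.

```python
from lxml import etree

# Shortened version of the schema posted earlier in the thread.
XSD = b"""<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="Ready for publishing"/>
        <xs:enumeration value="Processing complete"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>"""

schema = etree.XMLSchema(etree.fromstring(XSD))

# The text node contains a newline where the enumeration has a space.
doc = etree.fromstring(b"<root>Processing\ncomplete</root>")

is_valid = schema.validate(doc)
messages = [error.message for error in schema.error_log]

print("valid:", is_valid)
for message in messages:
    # repr() shows the newline as \n rather than as a line break.
    print(repr(message))
```

If the newline appears in the repr of each message, the validator is reporting the tested value unchanged, and any alteration is happening later, in whatever displays or stores the message.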
Hi Bob

I'm glad you found a solution for your issue. I got stuck on a validation issue once which said "'<' expected" and there certainly was one - in the text editor, but in a binary file editor I could see a couple of non-printable characters.

It still strikes me that an error message reading "'a b' is not in the set {'a b', 'a c'}" is somewhat misleading, so I did some debugging and found that right the way up to the call to vfprintf, the error message contains the '\n'. I will likely raise something on the libxml2 lists/issue tracker to see if there is a belief that non-printable characters should be converted when outputting an error message.

Paul
Well, we wouldn't need any modification to libxml2 to see the non-printable characters. We can use repr() for that.

    for error in schema.error_log:
        print(repr(error.message))

    "Element 'root': [facet 'enumeration'] The value 'Processing\ncomplete' is not an element of the set {'Ready for English peer review', 'Ready for English scientific review', 'Ready for English OCCM review', 'Ready for Spanish peer review', 'Ready for Spanish OCCM review', 'Ready for publishing', 'Ready for translation', 'Processing complete'}."
    "Element 'root': 'Processing\ncomplete' is not a valid value of the local atomic type."

Cheers,
Bob
-- Bob Kline https://www.rksystems.com mailto:bkline@rksystems.com
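The thread does not say what the bug in Bob's own code was, so the following is purely hypothetical. It shows one way a display layer can hide a newline inside a message: the standard library's textwrap.fill() replaces every whitespace character with a space by default, whereas repr() leaves the escape visible. The message text is a shortened copy of the one above.

```python
import textwrap

# Shortened copy of the validation message, with the real newline in the value.
message = (
    "Element 'root': [facet 'enumeration'] The value 'Processing\ncomplete' "
    "is not an element of the set {'Ready for translation', 'Processing complete'}."
)

# fill() uses replace_whitespace=True by default, so the newline becomes a space
# and the tested value now looks identical to the enumeration value.
wrapped = textwrap.fill(message, width=200)
print(wrapped)

# repr() keeps the newline visible as the two characters \n.
print(repr(message))
```

Any step that re-wraps, re-flows or normalises text before showing it (a logger, an HTML page, an email client) can have the same effect, which is a good reason to log repr(error.message) when chasing whitespace problems.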