Need more information from the schema validator
I tried asking this a week ago, and I must not have done a good job of framing my question, so let's try again (I'll try asking my questions in German if you tell me that it will increase the chances of getting a reply). How do I get the schema validator to give me a reference to the element where an error was found? Python : sys.version_info(major=3, minor=6, micro=0, releaselevel='final', serial=0) lxml.etree : (3, 7, 3, 0) libxml used : (2, 9, 4) libxml compiled : (2, 9, 4) libxslt used : (1, 1, 29) libxslt compiled : (1, 1, 29) Besten Dank im Voraus! -- Bob Kline http://www.rksystems.com mailto:bkline@rksystems.com
Hi Bob,
I tried asking this a week ago, and I must not have done a good job of framing my question, so let's try again (I'll try asking my questions in German if you tell me that it will increase the chances of getting a reply).
(I thought the question perfectly fine but can't really give much help on the topic...)
How do I get the schema validator to give me a reference to the element where an error was found?
I'm no authority on this but I think that currently there's no possibility in lxml for programmatically getting at the node that triggered a validation error. Judging from the struct _xmlError definition, in principle there's libxml2 support for getting node information with an error. Don't know if this is provided for validation errors, though: http://www.xmlsoft.org/html/libxml-xmlerror.html#xmlError You might want to take a look at https://github.com/lxml/lxml/blob/master/src/lxml/xmlerror.pxi if you want tackle how lxml deals with the libxml2 error information. Holger Landesbank Baden-Wuerttemberg Anstalt des oeffentlichen Rechts Hauptsitze: Stuttgart, Karlsruhe, Mannheim, Mainz HRA 12704 Amtsgericht Stuttgart
On Mon, Apr 10, 2017 at 5:18 AM, Holger Joukl <Holger.Joukl@lbbw.de> wrote:
You might want to take a look at https://github.com/lxml/lxml/blob/master/src/lxml/xmlerror.pxi if you want tackle how lxml deals with the libxml2 error information.
Thanks very much, Holger. I'll dig in and see what's possible. If I find that I can make it work I'll submit a patch. Danke sehr! -- Bob Kline http://www.rksystems.com mailto:bkline@rksystems.com
On Mon, Apr 10, 2017 at 12:09 PM, Bob Kline <bkline@rksystems.com> wrote:
On Mon, Apr 10, 2017 at 5:18 AM, Holger Joukl <Holger.Joukl@lbbw.de> wrote:
You might want to take a look at https://github.com/lxml/lxml/blob/master/src/lxml/xmlerror.pxi if you want tackle how lxml deals with the libxml2 error information.
Thanks very much, Holger. I'll dig in and see what's possible. If I find that I can make it work I'll submit a patch.
Well, I've been digging. The information I need is indeed exposed by libxml2's xmlError object. Not the column number (on which they also punt, always storing 0) but the information on the node where the error was found is present. The cleanest solution would be to have the error object carry a reference to the _Element where the error was found. I tried that, but didn't get very far. When I called getProxy() I got back None. That happens because the schema validator creates a copy of what I pass in for the parsed document to be validated, and that copy has been stripped of what it needs for getProxy() to give back an _Element object. Attempts to use _elementFactory() (and _documentFactory(), which I would need to call to get the first argument for _elementFactory(), at least for the first error), but that ran into complaints about the GIL (I haven't yet achieved the level of enlightenment where the first sentence of the Cython documentation ("[Cython] is a programming language that makes writing C extensions for the Python language as easy as Python itself.") rings true). There might be a way to implement that clean solution -- possibly with some optimization which would defer the heavy lifting until the new properties are actually used, or with a flag passed to XMLSchema.__call__() to indicate whether to override the current behavior of passing in a copy of the xmlDoc which doesn't have the 'secret' pointer set to NULL -- but I would have to crawl much further up the learning curve than I have so far to be able to tackle that with any confidence. So what I have done instead is to extract the name of the node where the error was found and the node's attributes -- assuming the node is an element which has attributes -- and store them in the _LogEntry object. It appears to work (testing under Python 2.7 and 3.5), and it does exactly what I need (the element name might be superfluous, as it seems to be present in a consistently parseable position in the message property, but the attributes allow me to implement reliable navigation through the validation errors). At this point I feel the need to come back to the mailing list and ask if I'm wandering down a completely wrong path. If so, I'd very much appreciate some guidance to get me pointed in the right direction. I'm happy to contribute to the project. I know both C and Python pretty well, but I haven't done any serious integration between the two before, and I'm getting exposed to Cython and the lxml internals for the first time. Here's what my work looks like so far: https://github.com/bkline/lxml/commit/a060d0f7b9c6b74654ebc7ce5ba491b54b1ad8... Many thanks for this excellent package, and thanks in advance for any suggestions, Bob
Hi! Thanks for bringing up this topic and looking into it. I can see that this would be a nice feature. Bob Kline schrieb am 17.04.2017 um 17:54:
On Mon, Apr 10, 2017 at 12:09 PM, Bob Kline wrote:
On Mon, Apr 10, 2017 at 5:18 AM, Holger Joukl wrote:
You might want to take a look at https://github.com/lxml/lxml/blob/master/src/lxml/xmlerror.pxi if you want tackle how lxml deals with the libxml2 error information.
Thanks very much, Holger. I'll dig in and see what's possible. If I find that I can make it work I'll submit a patch.
Well, I've been digging. The information I need is indeed exposed by libxml2's xmlError object. Not the column number (on which they also punt, always storing 0) but the information on the node where the error was found is present.
The cleanest solution would be to have the error object carry a reference to the _Element where the error was found.
This has a downside: it would store a reference to the Element in the error log, and thus keep the entire XML tree alive in a difficult to control place. A weakref would be better, but then again, _Elements are not currently weak-referencible... Would it help you to get an XPath expression targeting the node in error? That would allow later lookups in the tree by only storing a single small string value. That expression can be calculated, see xmlGetNodePath() as used in _Element.getpath(). I can see two drawbacks of that proposal: it takes a bit of time to calculate that expression (even though we could store the UTF-8 C string in the end and postpone the conversion to a Python object), and it might not always be clear which tree to apply it to, e.g. in the case of XInclude failures or other cases where multiple trees are involved (schema imports?). But both could be considered acceptable. What do you think about that approach?
I tried that, but didn't get very far. When I called getProxy() I got back None. That happens because the schema validator creates a copy of what I pass in for the parsed document to be validated, and that copy has been stripped of what it needs for getProxy() to give back an _Element object.
Yes, and simply calling getProxy() also isn't the right thing to do. There is some non-trivial machinery involved in the connection between C nodes and their proxy objects, which you obviously couldn't know about.
Attempts to use _elementFactory() (and _documentFactory(), which I would need to call to get the first argument for _elementFactory(), at least for the first error), but that ran into complaints about the GIL
That's not the real problem here, but you can generally use the with-statement to acquire the GIL in Cython when you need it (and don't have it, although you'd normally own it by default): with gil: ... The Cython project is pretty serious about the "writing C extensions for the Python language as easy as Python itself" goal. :)
There might be a way to implement that clean solution -- possibly with some optimization which would defer the heavy lifting until the new properties are actually used
Not easy, because we cannot simply keep a pointer to the xmlNode - it's not reference counted or anything, so it might already be deallocated by the time we try to access it later on.
So what I have done instead is to extract the name of the node where the error was found and the node's attributes -- assuming the node is an element which has attributes -- and store them in the _LogEntry object. It appears to work (testing under Python 2.7 and 3.5), and it does exactly what I need
That's also the drawback: it solves the exact problem that you are facing (and maybe that of some other people), but it's not a general solution to the problem of figuring out which element in the tree produced an error. It could be necessary to look at the parent element in other cases, for example, and that's not covered. And as soon as you start going down that road of covering more use cases, it becomes obvious that you'd really want access to the Element object at some point.
the element name might be superfluous, as it seems to be present in a consistently parseable position in the message property, but the attributes allow me to implement reliable navigation through the validation errors
The element name feels like an obvious, simple and safe addition to the LogEntry API, though. This could be added independently. Stefan
On Fri, Apr 21, 2017 at 9:10 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Thanks for bringing up this topic and looking into it. I can see that this would be a nice feature.
Thanks very much for your very thorough reply.
Would it help you to get an XPath expression targeting the node in error?
That would be a perfect solution.
I can see two drawbacks of that proposal: it takes a bit of time to calculate that expression (even though we could store the UTF-8 C string in the end and postpone the conversion to a Python object), and it might not always be clear which tree to apply it to, e.g. in the case of XInclude failures or other cases where multiple trees are involved (schema imports?). But both could be considered acceptable.
I agree that postponing conversion (as you have done for a couple of the other properties) would be helpful. There might even be a way to let lxml know globally whether to collect this information at all (though I'll leave it to you to decide whether that optimization is worth the extra clutter). I'm not sure I understand the second issue. Isn't libxml2 sort of making that decision for us, but giving us the xmlNode pointer, which we'd be passing to their xmlGetNodePath function?
What do you think about that approach?
Ausgezeichnet! Wish I'd thought of it myself. :-) Let me see what I can come up with. Cheers, Bob
On Fri, Apr 21, 2017 at 10:26 AM, Bob Kline <bkline@rksystems.com> wrote:
Let me see what I can come up with.
This is working exactly the way I need it to. https://github.com/bkline/lxml/blob/error-location-info/src/lxml/xmlerror.px... https://github.com/bkline/lxml/commit/a21281020e4565839c727a81fdcc5d85d7f8b6... I haven't done much contribution to open-source projects since before git/github took over the world. I can imagine that looking at the diff from the link I gave above would be a little annoying, since it gives the deltas with my previous attempt, which went in a somewhat different direction than the one Stefan in which which Stefan pointed me. What's the best approach here? Should I create a fresh clone and apply my proposed enhancement to that so you'll have a clean diff to look at? Any pointers to guides to github patch etiquette? Thanks, Bob
On Fri, Apr 21, 2017 at 1:46 PM, Bob Kline <bkline@rksystems.com> wrote:
I haven't done much contribution to open-source projects since before git/github took over the world. I can imagine that looking at the diff from the link I gave above would be a little annoying, since it gives the deltas with my previous attempt, which went in a somewhat different direction than the one Stefan in which which Stefan pointed me. What's the best approach here? Should I create a fresh clone and apply my proposed enhancement to that so you'll have a clean diff to look at? Any pointers to guides to github patch etiquette?
Ah, I just stumbled across git's interactive rebase, which is what I think I needed here. I was able to squash my two commits into one, which results in a more useful diff for the patch. https://github.com/bkline/lxml/commit/e37ed9c6f7c1b6b8f481b6c34dc7919550e8a3... The only odd thing with it is that GitHub shows the commit as having been made four days ago (even though it's the result of an action performed today), but I bet there's a way I could have adjusted that (and I bet I'll find it eventually). At any rate, if I get confirmation that this is heading in a useful direction, I'll put in a pull request. Thanks for your patience, Bob
Bob Kline schrieb am 21.04.2017 um 21:44:
On Fri, Apr 21, 2017 at 1:46 PM, Bob Kline wrote:
I haven't done much contribution to open-source projects since before git/github took over the world. I can imagine that looking at the diff from the link I gave above would be a little annoying, since it gives the deltas with my previous attempt, which went in a somewhat different direction than the one Stefan in which which Stefan pointed me. What's the best approach here? Should I create a fresh clone and apply my proposed enhancement to that so you'll have a clean diff to look at? Any pointers to guides to github patch etiquette?
Ah, I just stumbled across git's interactive rebase, which is what I think I needed here. I was able to squash my two commits into one, which results in a more useful diff for the patch.
https://github.com/bkline/lxml/commit/e37ed9c6f7c1b6b8f481b6c34dc7919550e8a3...
The only odd thing with it is that GitHub shows the commit as having been made four days ago (even though it's the result of an action performed today), but I bet there's a way I could have adjusted that (and I bet I'll find it eventually).
At any rate, if I get confirmation that this is heading in a useful direction, I'll put in a pull request.
I left some comments, looks good otherwise. Could you also add a test for the new feature? I can't say in which cases libxml2 provides a node in the error info, but I think implementing the check as part of the schema tests in test_xmlschema.py would be a good start. Stefan
On Sat, Apr 22, 2017 at 2:31 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
I left some comments, looks good otherwise. Could you also add a test for the new feature?
I'm about to head out of town for the weekend. When I get back I'll add the test and prepare and submit the pull request. Thanks, Bob
On Sat, Apr 22, 2017 at 8:18 AM, Bob Kline <bkline@rksystems.com> wrote:
On Sat, Apr 22, 2017 at 2:31 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
I left some comments, looks good otherwise. Could you also add a test for the new feature?
I'm about to head out of town for the weekend. When I get back I'll add the test and prepare and submit the pull request.
Well, that's what I thought would happen. I did add the test, and the test passes perfectly. However, when I run the entire suite of tests, a segmentation fault dumps core. More specifically, if I run any of test_etree.py, test_xslt.py, or test_xpathevaluator.py, I get a core dump. All the others do fine. Here's my modified xmlerror.pxi: https://github.com/bkline/lxml/blob/error-location/src/lxml/xmlerror.pxi And here's the diff for the three changed files: https://github.com/lxml/lxml/compare/master...bkline:error-location I have done as much sleuthing as I could think to do (short of running Python under gdb), but without success. I tried a bunch of different things, described below. I wondered if libxml2 did something bizarre and undocumented instead of just returning NULL when it couldn't come up with a path for the node, so I tried adding a test to avoid calling xmlGetNodePath() unless _isElement(self._c_path) was true. Didn't eliminate the core dumps. I wondered if some code somewhere was instantiating _LogEntry objects without ever calling _setError() or _setGeneric(), thus avoiding the code which initializes self._c_path to NULL, so I tried adding a __cinit__() method, which the Cython docs [1] say is guaranteed to called exactly once, to do that initialization for all instances of the class. Didn't help. Besides, I saw that that same paragraph in the Cython docs says "By the time your __cinit__() method is called, memory has been allocated for the object and any C attributes it has have been initialised to 0 or null." That would mean that what I was doing in the __cinit__() method had already been done. I would also interpret that sentence in the Cython documentation to imply that the C attributes for the object are automatically initialized to 0 or null even if a __cinit__() method is not explicitly provided (though the path for extension types [2] oddly makes no mention of this initialization, unless it's there and I just missed it). If my interpretation is correct, the assignment of self._c_path (and self._c_message and self._c_filename, for that matter) to NULL in self._setError() and self._setGeneric() are superfluous, no? I wondered if xmlFree() (for which I couldn't find much documentation) did not behave the way free() in the standard library does (doing nothing if the passed pointer is NULL). So I added a test in the __dealloc__() method to avoid calling xmlFree() unless self._c_path is not NULL. But that didn't eliminate the core dumps, either. Besides, if this had been the problem, self._c_filename and self._c_message would have been causing segmentation faults. And in fact, even commenting out the call to xmlFree() altogether didn't eliminate the problem (though it would add another problem, leaking memory). I tried some other, even sillier things, but those were all in the "grasping at straws" category, and not worth mentioning. The only way to make the core dumps go away was to comment out the call to xmlGetNodePath() altogether, hard-coding the return value for the property getter. But that's obviously not much of a solution. :-) I did confirm that the baseline code on top of which I added my patch runs the test suite without any errors or core dumps. This is so tantalizing, because Stephan's proposed approach, using xmlGetNodePath(), is so clean and does exactly what I need it to do. At this point, the only thing I can think of is to fall back on the idea I mentioned earlier of making the capture of the node's location conditional, based on some thread-based setting turning it on. But I would much rather understand why xmlGetNodePath() is blowing up, and I would guess that would be everyone else's preference as well. So, what am I missing, Cython/lxml gurus? Bewildered, Bob [1] http://cython.readthedocs.io/en/latest/src/userguide/special_methods.html [2] http://cython.readthedocs.io/en/latest/src/userguide/extension_types.html
On Mon, Apr 24, 2017 at 8:55 AM, Bob Kline <bkline@rksystems.com> wrote:
On Sat, Apr 22, 2017 at 8:18 AM, Bob Kline <bkline@rksystems.com> wrote:
On Sat, Apr 22, 2017 at 2:31 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
I left some comments, looks good otherwise. Could you also add a test for the new feature?
I'm about to head out of town for the weekend. When I get back I'll add the test and prepare and submit the pull request.
Well, that's what I thought would happen. I did add the test, and the test passes perfectly. However, when I run the entire suite of tests, a segmentation fault dumps core. More specifically, if I run any of test_etree.py, test_xslt.py, or test_xpathevaluator.py, I get a core dump. All the others do fine.
Here's my modified xmlerror.pxi:
https://github.com/bkline/lxml/blob/error-location/src/lxml/xmlerror.pxi
And here's the diff for the three changed files:
https://github.com/lxml/lxml/compare/master...bkline:error-location
I have done as much sleuthing as I could think to do (short of running Python under gdb), but without success. .... [more blah blah blah about what I tried]
OK, I grabbed the libxml2 source code and fired up gdb. It looks like when the tests crash the error.node pointer isn't NULL, but it doesn't point to memory space that belongs to the program. In fact, the "pointer" is an odd number, so obviously garbage, as an odd number would never match the alignment for a struct. The three explanations I can think of are (1) libxml2 has forgotten to make sure the node member of the xmlError struct is set to NULL or a valid pointer to an xmlNode struct, but instead left garbage from the stack or heap in that slot; (2) libxml2 did set the node member of the xmlError struct properly, but subsequently trashed it by some sloppy code (this is why we like Python, right?); or (3) the lxml code got the xmlError struct with a correctly initialized node member, but somehow overwrote the value for the node member with garbage. I would think that these three possible causes of the problem are listed in decreasing order of probability. What I propose to do next is to examine the `domain` member of the xmlError struct (hoping that it hasn't also been stepped on) and see which domains are in play when the corruption occurs. It may be that I can just use the domain to decide when to capture the location of the error, avoiding the need for the thread-specific flag I described in my previous message. I will likely get some additional clues about what's going on by paying attention to the domain values. Wish me luck! Bob
On Mon, Apr 24, 2017 at 11:28 AM, Bob Kline <bkline@rksystems.com> wrote:
Wish me luck!
Well, it turns out that the problematic xmlError structs aren't coming from libxml2 at all, but are instead being assembled fresh in the lxml code. I've found a couple of places where this is happening, and I'm hopeful that fixing them so they're not leaving garbage at the end of the structs (or at least not for the node member) will eliminate the core dumps. In the meantime, I just wanted to confirm that the warnings that come out when I run python (or python3) setup.py build_ext -i --with-cython are expected and benign. Warning: Extension name 'lxml.etree' does not match fully qualified name 'lxml.lxml.etree' of 'src/lxml/lxml.etree.pyx' src/lxml/lxml.etree.pyx: cannot find cimported module 'lxml' src/lxml/lxml.etree.pyx: cannot find cimported module 'lxml.includes' Warning: Extension name 'lxml.objectify' does not match fully qualified name 'lxml.lxml.objectify' of 'src/lxml/lxml.objectify.pyx' src/lxml/lxml.objectify.pyx: cannot find cimported module 'lxml' src/lxml/lxml.objectify.pyx: cannot find cimported module 'lxml.includes' /home/bkline/src/github/lxml/src/lxml/includes/etreepublic.pxd: cannot find cimported module 'lxml.includes' Compiling src/lxml/lxml.etree.pyx because it depends on src/lxml/extensions.pxi. [1/1] Cythonizing src/lxml/lxml.etree.pyx warning: src/lxml/xmlerror.pxi:640:26: local variable 'args' referenced before assignment warning: src/lxml/xmlerror.pxi:648:48: local variable 'args' referenced before assignment warning: src/lxml/xmlerror.pxi:662:44: local variable 'args' referenced before assignment warning: src/lxml/xmlerror.pxi:672:24: local variable 'args' referenced before assignment Thanks, Bob
On Mon, Apr 24, 2017 at 11:58 AM, Bob Kline <bkline@rksystems.com> wrote:
Well, it turns out that the problematic xmlError structs aren't coming from libxml2 at all, but are instead being assembled fresh in the lxml code. I've found a couple of places where this is happening, and I'm hopeful that fixing them so they're not leaving garbage at the end of the structs (or at least not for the node member) will eliminate the core dumps.
Bingo! https://github.com/lxml/lxml/pull/244 I fixed two places that were creating xmlError structs with garbage in the node member. I wasn't absolutely certain that the node in the incoming c_error was safe, so I erred on the side of caution and just set the node member to NULL in _forwardXPathError() (in extensions.pxi). Didn't look like there was anything to work with for a non-NULL node member of the struct created in _receiveXSLTError() (in xmlerror.pxi), so that got a NULL as well. Let me know if I've done anything questionable. Thanks, Bob
Bob Kline schrieb am 21.04.2017 um 16:26:
On Fri, Apr 21, 2017 at 9:10 AM, Stefan Behnel wrote:
Would it help you to get an XPath expression targeting the node in error?
That would be a perfect solution.
I can see two drawbacks of that proposal: it takes a bit of time to calculate that expression (even though we could store the UTF-8 C string in the end and postpone the conversion to a Python object), and it might not always be clear which tree to apply it to, e.g. in the case of XInclude failures or other cases where multiple trees are involved (schema imports?). But both could be considered acceptable.
I agree that postponing conversion (as you have done for a couple of the other properties) would be helpful. There might even be a way to let lxml know globally whether to collect this information at all (though I'll leave it to you to decide whether that optimization is worth the extra clutter).
This would be possible, but might require storing configuration in the thread-local context. I'd rather wait until it's clear that the slowdown is really not acceptable. We are talking about error cases here, after all.
I'm not sure I understand the second issue. Isn't libxml2 sort of making that decision for us, but giving us the xmlNode pointer, which we'd be passing to their xmlGetNodePath function?
I meant the user side, where you get an error back and an XPath expression with it, and then have to know what tree to apply the expression to in order to find the failing element. My intuition tells me that it will be clear most of the time, that's why I think it should be acceptable. Stefan
On Fri, Apr 21, 2017 at 9:10 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
The element name feels like an obvious, simple and safe addition to the LogEntry API, though. This could be added independently.
Sorry, meant to respond to this last bit earlier. Assuming we are providing the xpath to the node, I would think the element name would be redundant, unless we make capturing the xpath optional. Bob
Bob Kline schrieb am 21.04.2017 um 16:51:
On Fri, Apr 21, 2017 at 9:10 AM, Stefan Behnel wrote:
The element name feels like an obvious, simple and safe addition to the LogEntry API, though. This could be added independently.
Sorry, meant to respond to this last bit earlier.
Assuming we are providing the xpath to the node, I would think the element name would be redundant, unless we make capturing the xpath optional.
I just checked libxml2's source code and the "node" is only really passed during validation. That suggests that it's always clear for users what tree to look at, and that they can apply the XPath expression when they need further information about the node. Let's leave out the name property then. Stefan
participants (3)
-
Bob Kline
-
Holger Joukl
-
Stefan Behnel