
On Mon, Apr 10, 2017 at 12:09 PM, Bob Kline <bkline@rksystems.com> wrote:
On Mon, Apr 10, 2017 at 5:18 AM, Holger Joukl <Holger.Joukl@lbbw.de> wrote:
You might want to take a look at https://github.com/lxml/lxml/blob/master/src/lxml/xmlerror.pxi if you want to tackle how lxml deals with the libxml2 error information.
Thanks very much, Holger. I'll dig in and see what's possible. If I find that I can make it work I'll submit a patch.
Well, I've been digging. The information I need is indeed exposed by libxml2's xmlError object. Not the column number (on which they also punt, always storing 0), but the information on the node where the error was found is present.

The cleanest solution would be to have the error object carry a reference to the _Element where the error was found. I tried that, but didn't get very far. When I called getProxy() I got back None. That happens because the schema validator creates a copy of what I pass in for the parsed document to be validated, and that copy has been stripped of what it needs for getProxy() to give back an _Element object. I also tried to use _elementFactory() (and _documentFactory(), which I would need to call to get the first argument for _elementFactory(), at least for the first error), but that ran into complaints about the GIL (I haven't yet achieved the level of enlightenment where the first sentence of the Cython documentation ("[Cython] is a programming language that makes writing C extensions for the Python language as easy as Python itself.") rings true).

There might be a way to implement that clean solution -- possibly with some optimization which would defer the heavy lifting until the new properties are actually used, or with a flag passed to XMLSchema.__call__() to indicate whether to override the current behavior and pass in a copy of the xmlDoc which doesn't have the 'secret' pointer set to NULL -- but I would have to crawl much further up the learning curve than I have so far to be able to tackle that with any confidence.

So what I have done instead is to extract the name of the node where the error was found and the node's attributes -- assuming the node is an element which has attributes -- and store them in the _LogEntry object.
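
As a rough illustration of how calling code might use a stored element name and attribute set to get back to the offending element, here is a minimal sketch. It is not taken from the patch: the function name find_error_element, the sample document, and the idea that an "id" attribute identifies the element are all assumptions made up for this example, and it uses the standard library's xml.etree.ElementTree rather than lxml so that it runs on its own.

```python
import xml.etree.ElementTree as ET


def find_error_element(root, name, attributes):
    """Return the elements under root (root included) whose tag is `name`
    and whose attributes include every key/value pair in `attributes`.

    `name` and `attributes` stand in for whatever a log entry would carry;
    both are hypothetical inputs for this sketch.
    """
    matches = []
    for elem in root.iter(name):
        if all(elem.get(key) == value for key, value in attributes.items()):
            matches.append(elem)
    return matches


# Made-up document in which each paragraph carries a unique id attribute.
doc = ET.fromstring(
    '<doc><para id="p1">ok</para><para id="p2">bad</para></doc>'
)

# Pretend a validation error was reported for <para id="p2">.
hits = find_error_element(doc, "para", {"id": "p2"})
for elem in hits:
    print(elem.tag, elem.attrib, elem.text)
```

The lookup is only as reliable as the attributes are distinctive; with no attributes at all it falls back to every element of that name. This shows the consuming side only; the patch itself is what records the name and attributes on the _LogEntry object.
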
It appears to work (testing under Python 2.7 and 3.5), and it does exactly what I need (the element name might be superfluous, as it seems to be present in a consistently parseable position in the message property, but the attributes allow me to implement reliable navigation through the validation errors).

At this point I feel the need to come back to the mailing list and ask if I'm wandering down a completely wrong path. If so, I'd very much appreciate some guidance to get me pointed in the right direction. I'm happy to contribute to the project. I know both C and Python pretty well, but I haven't done any serious integration between the two before, and I'm getting exposed to Cython and the lxml internals for the first time.

Here's what my work looks like so far:

https://github.com/bkline/lxml/commit/a060d0f7b9c6b74654ebc7ce5ba491b54b1ad8...

Many thanks for this excellent package, and thanks in advance for any suggestions,

Bob