
Hi! Thanks for bringing up this topic and looking into it. I can see that this would be a nice feature.
Bob Kline wrote on 17.04.2017 at 17:54:
On Mon, Apr 10, 2017 at 12:09 PM, Bob Kline wrote:
On Mon, Apr 10, 2017 at 5:18 AM, Holger Joukl wrote:
You might want to take a look at https://github.com/lxml/lxml/blob/master/src/lxml/xmlerror.pxi if you want to tackle how lxml deals with the libxml2 error information.
Thanks very much, Holger. I'll dig in and see what's possible. If I find that I can make it work I'll submit a patch.
Well, I've been digging. The information I need is indeed exposed by libxml2's xmlError object. Not the column number (on which they also punt, always storing 0) but the information on the node where the error was found is present.
The cleanest solution would be to have the error object carry a reference to the _Element where the error was found.
This has a downside: it would store a reference to the Element in the error log, and thus keep the entire XML tree alive in a place that is difficult to control. A weakref would be better, but then again, _Elements are not currently weak-referenceable...

Would it help you to get an XPath expression targeting the node in error? That would allow later lookups in the tree while storing only a single small string value. That expression can be calculated, see xmlGetNodePath() as used in _Element.getpath(). I can see two drawbacks of that proposal: it takes a bit of time to calculate that expression (even though we could store the UTF-8 C string in the end and postpone the conversion to a Python object), and it might not always be clear which tree to apply it to, e.g. in the case of XInclude failures or other cases where multiple trees are involved (schema imports?). But both could be considered acceptable. What do you think about that approach?
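To make the idea concrete from the user's side, here is a rough sketch of what "store a path string, look the element up later" could look like with the existing public API. It is only an illustration: the sample document and the variable names are made up, and it uses getpath() on the ElementTree object rather than anything in the error log, which does not carry such a path today.

```python
from lxml import etree

# Made-up sample document, for illustration only.
root = etree.fromstring(
    "<root><item id='a'/><item id='b'><child/></item></root>"
)
tree = root.getroottree()

# Pretend this is the element in which an error was reported.
node_in_error = root[1][0]

# Keep only a small string instead of a reference to the element.
stored_path = tree.getpath(node_in_error)

# Later: resolve the string again. This only works as long as we still
# know which tree the path belongs to.
matches = tree.xpath(stored_path)
found = matches[0] if matches else None
```

Note that the lookup needs the right tree at hand, which is exactly the second drawback mentioned above.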
I tried that, but didn't get very far. When I called getProxy() I got back None. That happens because the schema validator creates a copy of what I pass in for the parsed document to be validated, and that copy has been stripped of what it needs for getProxy() to give back an _Element object.
Yes, and simply calling getProxy() also isn't the right thing to do. There is some non-trivial machinery involved in the connection between C nodes and their proxy objects, which you obviously couldn't know about.
I also tried to use _elementFactory() (and _documentFactory(), which I would need to call to get the first argument for _elementFactory(), at least for the first error), but that ran into complaints about the GIL.
That's not the real problem here, but you can generally use the with-statement to acquire the GIL in Cython when you need it (and don't have it, although you'd normally own it by default): with gil: ... The Cython project is pretty serious about the "writing C extensions for the Python language as easy as Python itself" goal. :)
There might be a way to implement that clean solution -- possibly with some optimization which would defer the heavy lifting until the new properties are actually used
Not easy, because we cannot simply keep a pointer to the xmlNode - it's not reference counted or anything, so it might already be deallocated by the time we try to access it later on.
So what I have done instead is to extract the name of the node where the error was found and the node's attributes -- assuming the node is an element which has attributes -- and store them in the _LogEntry object. It appears to work (testing under Python 2.7 and 3.5), and it does exactly what I need
That's also the drawback: it solves the exact problem that you are facing (and maybe that of some other people), but it's not a general solution to the problem of figuring out which element in the tree produced an error. It could be necessary to look at the parent element in other cases, for example, and that's not covered. And as soon as you start going down that road of covering more use cases, it becomes obvious that you'd really want access to the Element object at some point.
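For readers following along, the snapshot approach under discussion can be sketched in plain Python roughly as below. This is not the actual patch and not part of lxml's _LogEntry: the ErrorNodeSnapshot class and the snapshot() helper are hypothetical names, and the stdlib xml.etree.ElementTree is used only to keep the sketch self-contained (lxml elements expose the same .tag and .attrib).

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class ErrorNodeSnapshot:
    """Hypothetical record of the node in error; not an lxml class."""
    name: str
    attributes: dict = field(default_factory=dict)


def snapshot(element):
    # Copy the values so that no reference to the element or its tree is kept.
    return ErrorNodeSnapshot(name=element.tag, attributes=dict(element.attrib))


# Made-up sample document, for illustration only.
root = ET.fromstring('<doc><para id="p1" lang="en">text</para></doc>')
entry = snapshot(root[0])

# The snapshot stays usable after the tree is gone, but it knows nothing
# about the parent or the siblings of the element.
del root
```

The last comment is the limitation described above: anything beyond the element's own name and attributes is lost once the copy has been taken.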
the element name might be superfluous, as it seems to be present in a consistently parseable position in the message property, but the attributes allow me to implement reliable navigation through the validation errors
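As an illustration of pulling the element name out of the message text, a small sketch follows. It assumes the message starts with "Element '<name>'", which is how the schema validation messages I have seen are laid out; the sample message and the helper name are made up, and the pattern is an assumption rather than a documented format.

```python
import re

# Assumed message layout: "Element '<name>': <description>".
_ELEMENT_NAME = re.compile(r"^Element '(?P<name>[^']+)'")


def element_name_from_message(message):
    """Return the element name at the start of the message, or None."""
    match = _ELEMENT_NAME.match(message)
    return match.group("name") if match else None


# Made-up sample message, for illustration only.
sample = "Element 'para': This element is not expected."
name = element_name_from_message(sample)
```

Parsing the message like this depends on wording that could change between libxml2 versions, which is one argument for exposing the name as a proper property instead.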
The element name feels like an obvious, simple and safe addition to the LogEntry API, though. This could be added independently.

Stefan