Schema validation fails after replace() call?

Hello,

I'm not quite sure whether I'm asking a question or sharing an observation. Is it possible that an lxml.etree instance validates before a replace() call, but not after?

The error messages I get from the lxml validation are almost 200 of the same:

  <string>:440:0:ERROR:SCHEMASV:SCHEMAV_CVC_IDC: Element 'deviation': No match found for key-sequence ['WtC2eoepX'] of keyref 'deviationstyle-refer'.

Looking at the actual XML, I can positively confirm that the IDs and IDREFs exist and are valid, before and after the replace() call. The new subtree is equivalent to the old one, but there are elements elsewhere in the tree that refer to elements in the replaced subtree. I suspect that this causes the problem.

Interestingly:
- xmllint validates the new tree when written to a file, and
- if I serialize the entire tree (including the new replaced subtree) and parse it back, it validates.

Is this intended behavior, an odd side effect, or a bug?

Thanks!
Jens

--
Jens Tröger
http://savage.light-speed.de/
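For readers who want to try the sequence described above, here is a minimal sketch. The schema, the element names other than 'deviation', and the key name 'style-key' are invented for illustration; they are not the poster's actual schema. Whether the second validation fails probably depends on the lxml and libxml2 versions in use, so the script only prints both results rather than asserting a failure.

```python
import copy
from lxml import etree

# Hypothetical toy schema: every <deviation style="..."> must refer to
# an existing <style id="..."> through a key/keyref pair.
XSD = b"""\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="doc">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="styles">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="style" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:attribute name="id" type="xs:string" use="required"/>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="deviation" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="style" type="xs:string" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <xs:key name="style-key">
      <xs:selector xpath="styles/style"/>
      <xs:field xpath="@id"/>
    </xs:key>
    <xs:keyref name="deviationstyle-refer" refer="style-key">
      <xs:selector xpath="deviation"/>
      <xs:field xpath="@style"/>
    </xs:keyref>
  </xs:element>
</xs:schema>
"""

XML = b"""\
<doc>
  <styles><style id="WtC2eoepX"/></styles>
  <deviation style="WtC2eoepX"/>
</doc>
"""

schema = etree.XMLSchema(etree.fromstring(XSD))
root = etree.fromstring(XML)

# 1. Validate the freshly parsed tree.
before = schema.validate(root)

# 2. Replace the referenced subtree with an equivalent copy.
old_styles = root.find("styles")
new_styles = copy.deepcopy(old_styles)
root.replace(old_styles, new_styles)

# 3. Validate the same tree again, in memory.
after = schema.validate(root)

print("valid before replace():", before)
print("valid after replace(): ", after)
for error in schema.error_log:
    print(error)
```

If the second result differs from the first, the error log should show which key sequences could not be matched.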

Jens Tröger wrote on 25.05.2016 at 19:48:
My guess at what happens is: libxml2 can use an ID index internally that it builds up at parse time, but this mapping isn't updated when lxml changes the tree. This can lead to unexpected behaviour in cases where libxml2 relies entirely on that index and doesn't fall back to searching the tree. Either maintaining or deleting that index could potentially fix this problem. Deleting it is obviously easier and safer, but it remains to be seen whether libxml2 really does the right thing if it's not there.

Stefan

Thanks Stefan. So your suspicion is that this problem lies with libxml2, but do you know if this is a bug, or an odd side-effect?

On my end, as a Py/lxml user, there doesn't seem to be a good workaround for this other than serializing/parsing it back. Other suggestions?

Jens

On Fri, May 27, 2016 at 09:38:33PM +0200, Stefan Behnel wrote:

--
Jens Tröger
http://savage.light-speed.de/

Jens Tröger wrote on 27.05.2016 at 22:01:
It lies in the interaction of lxml and libxml2 and can thus (most likely) be improved in lxml. Patches welcome.

> On my end, as a Py/lxml user, there doesn't seem to be a good workaround for this other than serializing/parsing it back. Other suggestions?

A serialise-parse cycle sounds just fine as a workaround. Both are fast operations, so you won't lose much time. Note also that you can do the schema validation right in the parser.

Stefan
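The workaround suggested above could look roughly like the following sketch. It serializes the modified tree, then parses it back with a parser that validates against the schema while parsing. The schema, the sample document, and the helper name reparse_and_validate are made up for illustration. etree.tostring(), etree.XMLParser(schema=...) and etree.XMLSyntaxError are regular lxml APIs.

```python
import copy
from lxml import etree

# Hypothetical toy schema with a key/keyref pair, for illustration only.
XSD = b"""\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="doc">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="styles">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="style" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:attribute name="id" type="xs:string" use="required"/>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="deviation" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="style" type="xs:string" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <xs:key name="style-key">
      <xs:selector xpath="styles/style"/>
      <xs:field xpath="@id"/>
    </xs:key>
    <xs:keyref name="deviationstyle-refer" refer="style-key">
      <xs:selector xpath="deviation"/>
      <xs:field xpath="@style"/>
    </xs:keyref>
  </xs:element>
</xs:schema>
"""

schema = etree.XMLSchema(etree.fromstring(XSD))


def reparse_and_validate(root, schema):
    """Serialize the tree and parse it back, validating during the parse.

    Returns the root of a freshly parsed tree. Raises etree.XMLSyntaxError
    if the serialized document does not conform to the schema.
    """
    validating_parser = etree.XMLParser(schema=schema)
    return etree.fromstring(etree.tostring(root), validating_parser)


root = etree.fromstring(
    b'<doc><styles><style id="WtC2eoepX"/></styles>'
    b'<deviation style="WtC2eoepX"/></doc>'
)

# Modify the tree in memory: swap the referenced subtree for a copy.
old_styles = root.find("styles")
root.replace(old_styles, copy.deepcopy(old_styles))

# Validate via a serialize/parse round trip instead of in place.
fresh_root = None
try:
    fresh_root = reparse_and_validate(root, schema)
    print("valid after round trip")
except etree.XMLSyntaxError as exc:
    print("invalid after round trip:", exc)
```

Note that the round trip returns a new tree, so any references held to elements of the old tree keep pointing at the old one. Continue working with the returned root if later validations should run against freshly parsed data.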
