html diff changes the structure of the html (and a mystery about debugging)
Hi,

I'm using lxml's html diff functionality in a project and it has been working well so far. Some time ago, though, I noticed that the generated diff changes the structure of the HTML in a way that is less than ideal for our use case. An example:
As you can see, the div with id=middle has been inserted at the beginning of the document, where it encloses the div with id=first. I believe this happens because lxml unconditionally inserts 'unbalanced tags' at the beginning of a set of 'chunks' when it surrounds the chunks with `<ins>` tags: https://github.com/lxml/lxml/blob/master/src/lxml/html/diff.py#L241

Could we be a bit smarter about this and insert the unbalanced tags as we encounter them instead? For instance, by closing the open `<ins>` tag, inserting the unbalanced tag and then opening a new `<ins>` tag. Something like:

Secondly, it would be great if someone could help me understand (or give me pointers to) why I'm seeing a difference in behaviour between what the compiled version of the code returns and what I get when executing the exact same set of functions from the REPL (assuming the same context in the REPL as above).
htmldiff_tokens() is evidently getting two opcodes, either an insert or a replace followed by a delete, but if I instantiate InsensitiveSequenceMatcher() from the REPL, it generates only the delete! This is driving me nuts! Am I doing something wrong?

cheers,
Steve
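One thing worth ruling out for the REPL puzzle, offered as a guess since the REPL session itself is not shown: the first positional parameter of `difflib.SequenceMatcher` is `isjunk`, not `a`. Assuming InsensitiveSequenceMatcher keeps that constructor signature, passing the two token lists positionally would compare the second list against an empty sequence, and the only opcode produced would be a delete. The sketch below uses plain `difflib.SequenceMatcher` and made-up token lists (`old_tokens`, `new_tokens` are illustrative names, not lxml data) to show the difference.

```python
import difflib

# Hypothetical token lists standing in for the tokenized documents.
old_tokens = ["first", "second", "third"]
new_tokens = ["first", "middle", "third"]

# Keyword arguments: a and b are the two sequences being compared.
by_keyword = difflib.SequenceMatcher(a=old_tokens, b=new_tokens)
keyword_opcodes = by_keyword.get_opcodes()

# Positional arguments: the signature is SequenceMatcher(isjunk, a, b),
# so old_tokens is taken as isjunk, new_tokens becomes a, and b stays ''.
by_position = difflib.SequenceMatcher(old_tokens, new_tokens)
positional_opcodes = by_position.get_opcodes()

print([tag for tag, *_ in keyword_opcodes])
print([tag for tag, *_ in positional_opcodes])
```

If the REPL call already used `a=` and `b=` keywords, this is not the cause and the difference lies elsewhere.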
Hi,

Just following up on this. I noticed that I hadn't provided the actual alternate code in my previous mail, although I referenced it. Sorry about that. In any case, I've submitted a PR with a potential fix for the issue as described: https://github.com/lxml/lxml/pull/350

I'm unsure why the diff generation bothers to adjust the leading and trailing spaces around the tags, but I've retained that behaviour. Note that the implementation passes all existing tests and also includes an additional test with the example from my previous mail.

I'd be happy to receive any kind of feedback about the issue. As an aside, the mystery of the REPL debugging continues to confound me!

cheers,
Steve

On Mon, 2022-09-05 at 19:13 +0100, Steve wrote:
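For readers without the PR at hand, the idea from the first mail (close the open `<ins>`, emit the unbalanced tag, then open a new `<ins>`) can be sketched roughly as below. This is not the code from the PR and not lxml's implementation: `find_unbalanced` and `wrap_inserted` are hypothetical helpers, chunks are assumed to be plain strings, tag matching is a naive stack, and the whitespace handling mentioned above is left out.

```python
import re

_TAG_NAME = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)")
# Elements that never take an end tag; treated as balanced in this sketch.
_VOID = {"br", "hr", "img", "input", "link", "meta"}


def find_unbalanced(chunks):
    """Return indices of tag chunks whose partner is not in `chunks`."""
    open_stack = []  # (name, index) of start tags still waiting for an end tag
    unbalanced = set()
    for index, chunk in enumerate(chunks):
        match = _TAG_NAME.match(chunk)
        if match is None or chunk.rstrip().endswith("/>"):
            continue  # plain text, comment, or self-closing tag
        name = match.group(1).lower()
        if name in _VOID:
            continue
        if chunk.startswith("</"):
            if open_stack and open_stack[-1][0] == name:
                open_stack.pop()  # matched the most recent start tag
            else:
                unbalanced.add(index)  # end tag with no start tag here
        else:
            open_stack.append((name, index))
    # Start tags that were never closed inside `chunks`.
    unbalanced.update(index for _, index in open_stack)
    return unbalanced


def wrap_inserted(chunks):
    """Wrap balanced runs in <ins>, leaving unbalanced tags in place."""
    unbalanced = find_unbalanced(chunks)
    out = []
    ins_open = False
    for index, chunk in enumerate(chunks):
        if index in unbalanced:
            if ins_open:
                out.append("</ins>")  # close before the unbalanced tag
                ins_open = False
            out.append(chunk)
        else:
            if not ins_open:
                out.append("<ins>")  # (re)open for the next balanced run
                ins_open = True
            out.append(chunk)
    if ins_open:
        out.append("</ins>")
    return out


example = ["Hello ", "</div>", '<div id="middle">', "World"]
result = "".join(wrap_inserted(example))
print(result)
```

With this approach the unbalanced `</div>` and `<div id="middle">` stay where they occur in the inserted chunks instead of being hoisted to the front, at the cost of producing several shorter `<ins>` runs.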