
Hi,

I'm using lxml's HTML diff functionality in a project and it has been working well so far. Some time ago, though, I noticed that the generated diff changes the structure of the HTML in a way that is less than ideal for our use case. An example:
>>> from lxml.html import diff
>>> a = "<div id='first'>some old text</div><div id='last'>more old text</div>"
>>> b = "<div id='first'>some old text</div><div id='middle'>and new text</div><div id='last'>more old text</div>"
>>> diff.htmldiff(a, b)
('<div id="middle"> <div id="first"><ins>some old text</ins></div><ins>and new</ins> <del>some old</del> text</div><div id="last">more old '
 'text</div>')
As you can see, the div with id=middle has been inserted at the beginning of the document, and it encloses the div with id=first. I believe this happens because lxml unconditionally inserts 'unbalanced tags' at the beginning of a set of 'chunks' when it surrounds the chunks with the <ins> tags:

https://github.com/lxml/lxml/blob/master/src/lxml/html/diff.py#L241

Could we be a bit smarter about this and insert the unbalanced tags as we encounter them instead? For instance, by closing the currently open `<ins>` tag, inserting the unbalanced tag, and then opening a new `<ins>` tag.

Secondly, it would be great if someone could help me understand (or give me pointers to) why I'm seeing a difference in behaviour between what the compiled version of the code returns and what I get when I execute the exact same set of functions from the REPL (assuming the same context in the REPL as above):
>>> tokens_a = diff.tokenize(a)
>>> tokens_b = diff.tokenize(b)
>>> diff.htmldiff_tokens(tokens_a, tokens_b)
['<div id="middle"> ', '<ins>', '<div id="first">', 'some ', 'old ', 'text', '</div>', 'and ', 'new', '</ins> ', '<del>', 'some ', 'old', '</del> ', 'text', '</div>', '<div id="last">', 'more ', 'old ', 'text', '</div>']
>>> s = diff.InsensitiveSequenceMatcher(tokens_a, tokens_b)
>>> commands = s.get_opcodes()
>>> list(commands)
[('delete', 0, 9, 0, 0)]
htmldiff_tokens() is obviously getting two opcodes (either an insert or a replace, followed by a delete), but if I instantiate InsensitiveSequenceMatcher() from the REPL, it generates only the delete! This is driving me nuts! Am I doing something wrong?

cheers,
Steve
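
P.S. To make the first suggestion more concrete, here is a minimal standalone sketch of the "close <ins>, emit the unbalanced tag, reopen <ins>" idea. It does not touch lxml's internals: find_unbalanced and wrap_inserted are hypothetical names of my own, chunks are assumed to be plain strings, and void elements such as <br> are not handled, so please treat it as an illustration rather than a patch.

```python
import re

# Matches the element name of a start or end tag, e.g. "<div id='x'>" or "</div>".
_TAG_NAME = re.compile(r"</?\s*([A-Za-z][\w:-]*)")


def find_unbalanced(chunks):
    """Return the indexes of tag chunks that have no partner within `chunks`.

    Hypothetical helper: any chunk starting with "<" is treated as a tag.
    """
    open_stack = []   # (name, index) of start tags still waiting for a close
    unbalanced = set()
    for i, chunk in enumerate(chunks):
        if not chunk.startswith("<"):
            continue  # plain text chunk
        match = _TAG_NAME.match(chunk)
        if match is None:
            continue  # comments, doctypes etc. are left alone
        name = match.group(1).lower()
        if chunk.startswith("</"):
            if open_stack and open_stack[-1][0] == name:
                open_stack.pop()       # closes a tag opened in this run
            else:
                unbalanced.add(i)      # end tag whose start is elsewhere
        else:
            open_stack.append((name, i))
    # Start tags that were never closed inside this run are unbalanced too.
    unbalanced.update(i for _, i in open_stack)
    return unbalanced


def wrap_inserted(chunks):
    """Wrap inserted chunks in <ins>, keeping unbalanced tags where they occur."""
    unbalanced = find_unbalanced(chunks)
    out = []
    ins_open = False
    for i, chunk in enumerate(chunks):
        if i in unbalanced:
            if ins_open:               # close the current <ins> first
                out.append("</ins>")
                ins_open = False
            out.append(chunk)          # emit the unbalanced tag in place
        else:
            if not ins_open:           # (re)open <ins> for balanced content
                out.append("<ins>")
                ins_open = True
            out.append(chunk)
    if ins_open:
        out.append("</ins>")
    return out


inserted = ['<div id="first">', 'some ', 'old ', 'text', '</div>',
            '<div id="middle">', 'and ', 'new']
print(wrap_inserted(inserted))
```

With this ordering the opening tag for id=middle stays after the closed id=first div instead of being hoisted to the front, at the cost of producing two <ins> runs rather than one.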