lxml.html.diff.htmldiff does not handle HTML comments well.
Hi, basically, HTML comments are currently handled like elements in html.diff's flatten_el, which leads to interesting results:
I added a check to not include comments in the generated token list, which effectively strips all comments from the diff:

diff --git a/src/lxml/html/diff.py b/src/lxml/html/diff.py
index 5d143bd2..9d4a4f72 100644
--- a/src/lxml/html/diff.py
+++ b/src/lxml/html/diff.py
@@ -4,7 +4,7 @@ from __future__ import absolute_import
 
 import difflib
 from lxml import etree
-from lxml.html import fragment_fromstring
+from lxml.html import fragment_fromstring, HtmlComment
 import re
 
 __all__ = ['html_annotate', 'htmldiff']
@@ -688,6 +688,14 @@ def flatten_el(el, include_hrefs, skip_tag=False):
 
     If skip_tag is true, then the outermost container tag is
     not returned (just its contents)."""
+
+    if isinstance(el, HtmlComment):
+        if el.tail:
+            end_words = split_words(el.tail)
+            for word in end_words:
+                yield html_escape(word)
+        return
+
+
     if not skip_tag:
         if el.tag == 'img':
             yield ('img', el.get('src'), start_tag(el))

jens
On 11 February 2019 at 17:33:29 CET, Jens Quade wrote:
Yes, that's a bug.
> I added a check to not include comments in the generated token list, which effectively strips all comments from the diff
Well, I would rather have them display correctly. Their .tag property is the etree.Comment constructor, to distinguish them from tags. Any chance you could come up with a patch for that?

Stefan
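To illustrate the point about the .tag property, here is a minimal sketch of telling comment nodes apart from real elements. The helper name is_comment and the sample markup are made up for this example and are not part of lxml or of the proposed patch.

```python
from lxml import etree
from lxml.html import fragment_fromstring


def is_comment(el):
    """Return True if el is a comment node rather than a real element."""
    # Comment nodes expose the etree.Comment factory as their .tag,
    # whereas real elements expose a tag name string such as 'div'.
    return el.tag is etree.Comment


# Hypothetical sample input: one comment sitting between two text runs.
fragment = fragment_fromstring("<div>before<!-- note -->after</div>")

# iter() walks the element itself and all descendants, comments included.
comments = [node for node in fragment.iter() if is_comment(node)]

for node in comments:
    # .text is the comment body; .tail is the visible text that follows
    # the comment and would still need to end up in the diff.
    print(repr(node.text), repr(node.tail))
```

The tail is the part that matters for a diff of visible text: it hangs off the comment node, so any handling of comments still has to emit it.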
On 12 Feb 2019, at 07:18, Stefan Behnel <stefan_ml@behnel.de> wrote:
> On 11 February 2019 at 17:33:29 CET, Jens Quade wrote:
What would the output look like if a comment differs, is added or removed? htmldiff shows the differences in the visible text of HTML content by adding <ins> and <del>. I do not really see the point in trying to diff the invisible parts.

jens
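For reference, a small sketch of the behaviour described above, where htmldiff marks changes in visible text with <ins> and <del>. The two input strings are invented for illustration, and the exact layout of the output may vary between lxml versions.

```python
from lxml.html.diff import htmldiff

# Two hypothetical versions of the same paragraph, differing in one word.
old = "<p>The quick brown fox</p>"
new = "<p>The quick red fox</p>"

# htmldiff returns an HTML string in which added text is wrapped in
# <ins> and removed text in <del>.
result = htmldiff(old, new)
print(result)
```

Both markers apply to rendered text, which is why it is unclear what they should mean for a comment that never shows up on the page.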