Htmldiff stripping newlines from pre tags
Hello, I'm using lxml's htmldiff function to (surprise) diff two HTML snippets. However I have found that if the snippet includes a <pre> tag then any newlines are stripped from it. For example:
Is there a reason for this, or better yet, is there a way I can stop it from occurring? I was under the impression that newlines and whitespace inside a pre tag are handled differently and shouldn't be stripped the way they can be inside other HTML tags.

Tom
Well, perhaps I am going crazy (or making a silly mistake?), but I tested this on my Linux box with lxml versions from 3.0.0 through to the latest sources on GitHub, with the same results. Here is the test script I used:

    from lxml.html.diff import htmldiff

    html = u"""<pre>test
    test2
    test3</pre>"""
    result = htmldiff(html, html)
    assert "\n" in result, "Newline not found in %s" % result

Inside htmldiff it calls tokenize() on both inputs, which appears to lose all notion of whitespace:

It then calls htmldiff_tokens, which produces output like this:

    ['<pre>', u'test ', u'test2 ', u'test3', '</pre>']

It then joins those with an empty string. Each token carries only a single trailing space by that point, so the original newlines are lost.

On 26 July 2013 14:49, Simon Sapin <simon.sapin@exyr.org> wrote:
Tom ., 26.07.2013 15:21:
Thanks for your patch, I merged it into current master.

https://github.com/lxml/lxml/pull/124

I think the change is a good thing, but playing with it a bit, it doesn't really feel completely right yet. Consider this example:

    >>> print(htmldiff('<p> first\nsecond\nthird</p>',
    ...                '<p> first\n second\nthird </p>'))
    <p>first second third </p>

It still drops the whitespace at the beginning, but not at the end. It also seems to copy the content from the second argument, not the first. Not sure if that's good or bad, but it seems surprising. What do others think here?

Stefan
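
The behaviour discussed in this thread can be illustrated without lxml. The following is only a rough sketch under stated assumptions: `tokenize`, `join_collapsed` and `join_preserved` are hypothetical helpers written for this illustration, not lxml's actual code. It assumes, as the thread suggests, that word tokens remember their trailing whitespace while tag tokens do not, which would explain why whitespace directly after an opening tag disappears even when trailing whitespace is kept.

```python
import re

# Hypothetical, simplified tokenizer (NOT lxml's implementation).
# A token is either a tag, or a word followed by optional whitespace.
TOKEN_RE = re.compile(r"(<[^>]+>)|([^\s<]+)(\s*)")


def tokenize(html):
    """Return (text, trailing_whitespace) pairs.

    Assumption for this sketch: tags never carry trailing whitespace,
    so whitespace that directly follows a tag is silently skipped.
    """
    tokens = []
    for match in TOKEN_RE.finditer(html):
        tag, word, trailing = match.groups()
        if tag is not None:
            tokens.append((tag, ""))
        else:
            tokens.append((word, trailing))
    return tokens


def join_collapsed(tokens):
    """Reduce any trailing whitespace to one space, then join with ''."""
    return "".join(text + (" " if ws else "") for text, ws in tokens)


def join_preserved(tokens):
    """Keep each token's trailing whitespace exactly as it was found."""
    return "".join(text + ws for text, ws in tokens)


pre = "<pre>test\ntest2\ntest3</pre>"
print(repr(join_collapsed(tokenize(pre))))   # newlines become spaces
print(repr(join_preserved(tokenize(pre))))   # newlines survive

para = "<p> first\n second\nthird </p>"
print(repr(join_preserved(tokenize(para))))  # space after <p> is still lost
```

In this sketch, keeping the leading whitespace as well would mean letting tag tokens carry trailing whitespace too; whether that matches the approach taken in the actual patch is not something this example can confirm.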
I've modified my fork to preserve whitespace after a tag. It's a bit hacky (https://github.com/orf/lxml/blob/master/src/lxml/html/diff.py#L731), but it works. I've sent a pull request with the changes on GitHub.

Tom

On 1 August 2013 05:29, Stefan Behnel <stefan_ml@behnel.de> wrote:
participants (3)
- Simon Sapin
- Stefan Behnel
- Tom .