Well, perhaps I am going crazy (or making a silly mistake?), but I tested this on my Linux box with every lxml version from 3.0.0 through to the latest sources on GitHub, and I get the same result each time. Here is the test script I used:

from lxml.html.diff import htmldiff

html = u"""<pre>test
test2
test3</pre>"""

result = htmldiff(html, html)
assert "\n" in result, "Newline not found in %s" % result

htmldiff calls tokenize() on both inputs, which appears to lose all notion of whitespace:

>>> tokenize("""<pre>test
... test2
... test3
... """)
[token(u'test', ['<pre>'], []), token(u'test2', [], []), token(u'test3', [], ['</pre>'])]

It then calls htmldiff_tokens, which produces output like this:

['<pre>', u'test ', u'test2 ', u'test3', '</pre>']

Those strings are then joined with an empty string, so the newlines never come back: by this point each one has already been replaced by a single trailing space.
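To illustrate what I think is happening, here is a minimal stand-alone sketch. It is not lxml's code: the function names are made up, and it only models the idea that a tokenizer which remembers whether a word was followed by whitespace, rather than which whitespace, cannot give the newlines back when the tokens are joined.

```python
import re

# Hypothetical model, NOT lxml's implementation: every word is paired with
# information about the whitespace that followed it in the input.
# Note: leading whitespace before the first word is ignored by both variants.
WORD_RE = re.compile(r"(\S+)(\s*)")


def tokenize_flag(text):
    """Keep only a boolean: was the word followed by any whitespace?"""
    return [(m.group(1), bool(m.group(2))) for m in WORD_RE.finditer(text)]


def tokenize_keep(text):
    """Keep the exact whitespace string that followed the word."""
    return [(m.group(1), m.group(2)) for m in WORD_RE.finditer(text)]


def join_flag(tokens):
    """Rebuild the text; any trailing whitespace becomes a single space."""
    return "".join(word + (" " if had_ws else "") for word, had_ws in tokens)


def join_keep(tokens):
    """Rebuild the text with the original whitespace restored."""
    return "".join(word + ws for word, ws in tokens)


text = "test\ntest2\ntest3"
lossy = join_flag(tokenize_flag(text))
lossless = join_keep(tokenize_keep(text))

print(repr(lossy))     # 'test test2 test3'
print(repr(lossless))  # 'test\ntest2\ntest3'
```

If the real tokenizer behaves like the first variant, the fix would presumably have to be made at tokenization time, since nothing downstream can recover the newlines.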

On 26 July 2013 14:49, Simon Sapin <simon.sapin@exyr.org> wrote:

On 26/07/2013 14:21, Tom . wrote:

Hello,
I'm using lxml's htmldiff function to (surprise) diff two HTML snippets.
However I have found that if the snippet includes a <pre> tag then any
newlines are stripped from it. For example:

>>> html = "<pre>test\ntest2\ntest3</pre>"
>>> print repr(htmldiff(html, html))
u'<pre>test test2 test3</pre>'

Is there a reason for this, or better yet, is there a way I can stop this
from occurring? I was under the impression that newlines and whitespace
inside a pre tag are significant and shouldn't be stripped the way they
can be inside other HTML tags.

IMO htmldiff shouldn’t touch whitespace anywhere. The whitespace handling in pre elements is actually a CSS property that can be applied to any element.

--
Simon Sapin