[IPython-dev] Trying to diff ipynb after converting it to HTML

Tue Nov 26 16:11:31 EST 2013

On 11/26/2013 11:48 AM, Matthias BUSSONNIER wrote:
>
> On 26 Nov 2013, at 19:22, Raniere Silva wrote:
>
>> Hi all,
>>
>> I'm trying to get a diff of two ipynb files after converting them to HTML. The
>> script I'm using can be found at https://github.com/r-gaia-cs/diffipynb but the
>> div.input_area.box-flex1 elements are not shown properly (no line breaks) in the
>> output.
>>
>> You can find the HTML output from the test instructions of the README at
>> http://www.ime.unicamp.br/~ra092767/diffipynb/diff.html.
>>
>> Does anyone know how I can fix this?
>
> You should ask on the lxml mailing list,
> but you are apparently not the only one with this problem:
>
> https://mailman-mail5.webfaction.com/pipermail/lxml/2013-July/006914.html
>

Looks like all of the relevant code is in the self-contained lxml/html/diff.py.
The loss of information (newlines, etc.) happens when tokenizing the input
for a word-based diff (similar to "git diff --word-diff"):

----
end_whitespace_re = re.compile(r'[ \t\n\r]$')

...

def split_words(text):
    """ Splits some text into words. Includes trailing whitespace (one
    space) on each word when appropriate.  """
    if not text or not text.strip():
        return []
    words = [w + ' ' for w in text.strip().split()]
    if not end_whitespace_re.search(text):
        words[-1] = words[-1][:-1]
    return words

start_whitespace_re = re.compile(r'^[ \t\n\r]')
----
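
A quick way to see the loss is to run the tokenizer on something shaped like a
notebook input cell. The snippet below is only a sketch: it copies split_words
from the excerpt above so that it runs without lxml installed, and the sample
cell text is made up for illustration.

```python
import re

end_whitespace_re = re.compile(r'[ \t\n\r]$')


def split_words(text):
    """Copy of the function quoted above, reproduced so this runs standalone."""
    if not text or not text.strip():
        return []
    words = [w + ' ' for w in text.strip().split()]
    if not end_whitespace_re.search(text):
        words[-1] = words[-1][:-1]
    return words


# Made-up contents of a notebook input cell (two lines, one indented).
cell = "for i in range(3):\n    print(i)\n"

tokens = split_words(cell)
rejoined = ''.join(tokens)

print(tokens)
print(repr(rejoined))  # the newline and the indentation are gone
```

Every run of whitespace, including newlines and indentation, collapses to a
single space, which matches the symptom in the input areas.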

So it might be reasonable to write a local module that imports most of the functions from diff.py
and overrides split_words and the top-level function(s) for <pre> elements (either just dropping
the whitespace stripping or switching to a line-based diff for <pre>).
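
As a rough sketch of the two options (not tested against lxml itself): the
names split_words_pre and diff_pre_lines below are hypothetical, and how to
hook them into the top-level functions of diff.py is left open, since that
depends on internals I have not traced. The first keeps whitespace attached to
each token; the second skips word tokens and diffs <pre> text line by line
with difflib from the standard library.

```python
import difflib
import re

# Hypothetical replacement for split_words, intended only for <pre> text.
# A token is either a leading run of whitespace, or a run of non-whitespace
# followed by whatever whitespace comes after it, so joining the tokens
# reproduces the input exactly.
_pre_token_re = re.compile(r'\s+|\S+\s*')


def split_words_pre(text):
    """Tokenize text without discarding newlines or indentation."""
    if not text:
        return []
    return _pre_token_re.findall(text)


def diff_pre_lines(old, new):
    """Line-based diff of two <pre> bodies, returned as a list of lines."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile='old', tofile='new', lineterm=''))


# Made-up cell contents for illustration.
old_cell = "x = 1\nprint(x)\n"
new_cell = "x = 2\nprint(x)\n"

pre_tokens = split_words_pre(old_cell)
line_diff = diff_pre_lines(old_cell, new_cell)

print(pre_tokens)
print('\n'.join(line_diff))
```

For code cells the line-based variant is probably closer to what people expect
from a diff, while the token variant stays closer to what htmldiff already does.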

htmldiff appears to be written under the assumption of poorly formed input, so another approach
(given good markup from IPython) would be to use lxml to extract more structure from the
inputs (rather than falling back on regexes).
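
To illustrate the idea of pulling the input areas out by structure first, here
is a minimal sketch. It uses html.parser from the standard library as a
stand-in so that it runs without extra dependencies; with lxml, an XPath query
over the parsed tree would play the same role. The class name input_area comes
from the original post, but the inner markup of the sample (the highlight div
and the span) is an assumption, and InputAreaExtractor and extract_input_cells
are hypothetical names.

```python
from html.parser import HTMLParser


class InputAreaExtractor(HTMLParser):
    """Collect the text of every <div> whose class list has 'input_area'."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cells = []    # finished cell texts, in document order
        self._depth = 0    # <div> nesting depth inside the current input area
        self._parts = []   # text fragments of the current input area

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        classes = (dict(attrs).get('class') or '').split()
        if self._depth:
            self._depth += 1          # nested div inside an input area
        elif 'input_area' in classes:
            self._depth = 1           # entering a new input area
            self._parts = []

    def handle_endtag(self, tag):
        if tag == 'div' and self._depth:
            self._depth -= 1
            if self._depth == 0:      # the input area itself was closed
                self.cells.append(''.join(self._parts))

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)  # whitespace is kept untouched


def extract_input_cells(html):
    """Return the raw text of each input area found in an HTML string."""
    parser = InputAreaExtractor()
    parser.feed(html)
    parser.close()
    return parser.cells


# Made-up fragment resembling converted notebook output.
page = ('<div class="input_area box-flex1"><div class="highlight">'
        '<pre><span class="k">for</span> i in range(3):\n    print(i)\n</pre>'
        '</div></div>')

cells = extract_input_cells(page)
print(cells)
```

Once the cell texts are extracted with their newlines intact, each pair of
cells can be diffed on its own, and the result wrapped back into HTML.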

--Mark