[IPython-dev] Trying to diff ipynb after converting it to HTML

Mark Voorhies mark.voorhies at ucsf.edu
Tue Nov 26 16:11:31 EST 2013

On 11/26/2013 11:48 AM, Matthias BUSSONNIER wrote:
> On 26 Nov 2013, at 19:22, Raniere Silva wrote:
>> Hi all,
>> I'm trying to get a diff of two ipynb files after converting them to HTML. The
>> script I'm using can be found at https://github.com/r-gaia-cs/diffipynb but the
>> div.input_area.box-flex1 elements are not shown properly (no line breaks) in the output.
>> You can find the HTML output from the test instructions of the README at
>> http://www.ime.unicamp.br/~ra092767/diffipynb/diff.html.
>> Does anyone know how I can fix it?
> You should ask on the lxml mailing list,
> but you are apparently not the only one:
> https://mailman-mail5.webfaction.com/pipermail/lxml/2013-July/006914.html

Looks like all of the relevant code is in the self-contained lxml/html/diff.py.
The loss of information (newlines, etc.) happens when the input is tokenized
for a word-based diff (e.g., like "git diff --word-diff"):

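import re

# Matches a single whitespace character at the very end of the text.
end_whitespace_re = re.compile(r'[ \t\n\r]$')


def split_words(text):
    """ Splits some text into words. Includes trailing whitespace (one
    space) on each word when appropriate.  """
    if not text or not text.strip():
        return []
    # str.split() collapses every whitespace run (newlines and indentation
    # included) into a single separator; this is where line breaks are lost.
    words = [w + ' ' for w in text.strip().split()]
    if not end_whitespace_re.search(text):
        # No trailing whitespace in the input: drop the space added above.
        words[-1] = words[-1][:-1]
    return words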

start_whitespace_re = re.compile(r'^[ \t\n\r]')
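
As a quick check of where the newlines go, here is a small standalone sketch
that copies split_words as quoted above and feeds it a two-line code cell. The
sample cell text is made up for illustration; nothing here touches lxml itself.

```python
import re

end_whitespace_re = re.compile(r'[ \t\n\r]$')


def split_words(text):
    # Same logic as the split_words quoted from lxml/html/diff.py.
    if not text or not text.strip():
        return []
    words = [w + ' ' for w in text.strip().split()]
    if not end_whitespace_re.search(text):
        words[-1] = words[-1][:-1]
    return words


# Made-up cell source with a line break and four spaces of indentation.
cell = "for i in range(3):\n    print(i)\n"
tokens = split_words(cell)

# Joining the tokens back together shows what the word diff actually sees:
# the newline and the indentation have both become a single space.
rejoined = ''.join(tokens)
print(tokens)
print(repr(rejoined))
```

After tokenizing, every run of whitespace is a single space, so no step after
the tokenizer can put the line breaks back.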

So it might be reasonable to write a local module that imports most of the functions from diff.py
and customizes split_words and the top-level function(s) for <pre> elements (either just removing
the whitespace stripping or switching to a line-based diff for <pre>).
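
A rough sketch of both options, using only the standard library. The names
split_words_keep_whitespace and diff_pre_lines are hypothetical, and I have not
checked how the rest of diff.py would cope with tokens that carry newlines, so
treat this as a starting point rather than a drop-in patch.

```python
import difflib
import re

# One token is a word plus the whitespace that follows it, or a run of
# leading whitespace on its own. Nothing is stripped or collapsed.
token_re = re.compile(r'\S+\s*|\s+')


def split_words_keep_whitespace(text):
    """Hypothetical replacement for split_words that keeps all whitespace."""
    if not text:
        return []
    return token_re.findall(text)


def diff_pre_lines(old, new):
    """Hypothetical line-based diff for the text of a <pre> element."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile='old', tofile='new', lineterm=''))


# Made-up cell contents for illustration.
old_cell = "a = 1\nprint(a)\n"
new_cell = "a = 2\nprint(a)\n"

tokens = split_words_keep_whitespace(old_cell)
assert ''.join(tokens) == old_cell  # round-trips, newlines included

for line in diff_pre_lines(old_cell, new_cell):
    print(line)
```

The first function keeps the word-based diff but makes the tokens lossless; the
second gives up on word granularity for <pre> and reports whole changed lines.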

htmldiff appears to be written under the assumption of poorly formed input, so another approach
(given good markup from IPython) would be to use lxml to extract more structure from the
inputs (rather than falling back on regexes).
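
A minimal sketch of that second approach, assuming the code cells can be found
by the input_area class mentioned in the original question and that cells in the
two files can be paired by position. The sample HTML strings and the function
names are invented for illustration; real nbconvert output has more nesting.

```python
import difflib

import lxml.html

# Matches any <div> whose class list contains "input_area".
INPUT_AREA_XPATH = (
    '//div[contains(concat(" ", normalize-space(@class), " "), " input_area ")]'
)


def input_area_texts(html):
    """Return the text of every input_area div, whitespace intact."""
    doc = lxml.html.fromstring(html)
    return [div.text_content() for div in doc.xpath(INPUT_AREA_XPATH)]


def diff_cells(old_html, new_html):
    """Line-based diff of code cells, paired by position (an assumption)."""
    out = []
    pairs = zip(input_area_texts(old_html), input_area_texts(new_html))
    for old, new in pairs:
        out.extend(difflib.unified_diff(
            old.splitlines(), new.splitlines(), lineterm=''))
    return out


# Invented sample markup, much simpler than a real converted notebook.
OLD_HTML = ('<html><body><div class="input_area box-flex1">'
            '<pre>a = 1\nprint(a)\n</pre></div></body></html>')
NEW_HTML = ('<html><body><div class="input_area box-flex1">'
            '<pre>a = 2\nprint(a)\n</pre></div></body></html>')

for line in diff_cells(OLD_HTML, NEW_HTML):
    print(line)
```

Pairing cells by position breaks as soon as a cell is inserted or deleted, so a
real script would probably want to align the list of cells first (for example
with difflib.SequenceMatcher) before diffing their contents.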

