[IPython-dev] Thoughts on the notebook format for version control

MinRK benjaminrk at gmail.com
Sat Nov 5 22:41:31 EDT 2011


On Sat, Nov 5, 2011 at 18:58, Fernando Perez <fperez.net at gmail.com> wrote:

> Hi folks,
>
> I wanted to start a discussion on the notebook format regarding its
> suitability for version control.  I see the notebook format as the way
> in which I'll likely keep (and hopefully many others too) most of my
> research notes/work, and therefore it's important that it's as easy as
> possible to version control notebooks and use them smoothly in a
> version-controlled workflow.  Unfortunately right now, the format we
> have simply doesn't fit that, mainly for two reasons:
>
> 1. The cell inputs (code and text) are stored as a single line in the
> json format.  This means that virtually any edits anywhere in a cell
> will immediately produce VC conflicts.  Furthermore, they are nearly
> impossible to resolve by hand because you have to scan very long lines
> by eye, and can only apply wholesale one version or the other.
>
> 2. The presence of outputs stored inside the file causes two separate
> problems:
> a) The large binary blobs make the files often quite large.
> b) Changes in the binary blobs can't really be inspected by hand, but
> tend to easily cause conflicts.
>
> To get a sense of the problem, here's the diff from a pull request
> made on a simple (mostly for testing purposes) repo:
>
> https://github.com/fperez/nipy-notebooks/pull/1/files
>
> That diff is more or less useless: note the huge horizontal scroll
> bar, and changes in inputs are impossible to understand.
>
> So I think we need to find a solution.  This doesn't have to happen
> necessarily right away, since we're trying to put 0.12 out; I think
> it's OK if for now our format is mostly treated as a binary blob.  But
> we do need to come up with a plan for the medium term.
>
> Here's my proposal, with full credit going to Yarik who suggested the
> idea of splitting outputs into a separate file.  There are basically
> two changes against what we have now:
>
> 1. The notebook would *always* be split into two files, the .ipynb
> containing only inputs, and a companion (say .ipynbo) file with all
> outputs.  If an output file is not available or is detected to have
> problems such as cell number mismatch, it is simply ignored (it can
> always be recreated by rerunning the notebook).
>

There is a *huge* disadvantage in portability to notebooks not being
single files.  I think this still makes sense, though.  I would treat
the output as a 'cache' (along the lines of .pyc / __pycache__), rather
than considering the notebook itself as a multi-file format.  And you
should be able to embed the outputs in a single file if you want, for
easier portability.

Doing it this way would not require changing the notebook format,
because current (output-included) notebooks would still comply with the
spec.
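
A rough sketch of the 'outputs as cache' idea, to make it concrete.  The
notebook layout used here (a dict with a "cells" list, each cell having
"input" and "outputs" keys) and the function names are assumptions made
up for illustration, not the actual nbformat structure or API:

```python
import copy

def split_outputs(notebook):
    """Return (inputs_only, outputs_cache) without modifying `notebook`.

    Assumes a hypothetical layout: {"cells": [{"input": ..., "outputs": ...}]}.
    """
    stripped = copy.deepcopy(notebook)
    cache = []
    for cell in stripped["cells"]:
        # Move each cell's outputs into the cache, in cell order.
        cache.append(cell.pop("outputs", []))
    return stripped, cache

def merge_outputs(stripped, cache):
    """Re-attach cached outputs; ignore the cache on a cell count mismatch."""
    merged = copy.deepcopy(stripped)
    if cache is None or len(cache) != len(merged["cells"]):
        # A stale or missing cache is simply dropped; rerunning recreates it.
        return merged
    for cell, outputs in zip(merged["cells"], cache):
        cell["outputs"] = outputs
    return merged

nb = {"cells": [
    {"input": ["1 + 1"], "outputs": [{"text": "2"}]},
    {"input": ["x = 3"], "outputs": []},
]}
inputs_only, cache = split_outputs(nb)
```

Merging the cache back gives the single-file form again, so the
output-included notebook stays a valid notebook and the cache is purely
optional.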


> 2. All inputs would be stored in a json list of strings instead of a
> single string.
>

I like this - code.splitlines() / '\n'.join(lines) makes it easy.  This
change does mean that we need it to be nbformat v3.
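
For example, something along these lines.  The cell dicts and the
`cell_source` helper are hypothetical, only meant to show the round trip,
not the real schema:

```python
import json

# Hypothetical cell layout, for illustration only (not the real schema).
code = "import numpy as np\nx = np.arange(3)\nprint(x)"

# Current style: the whole input is one JSON string, so one very long line.
cell_single = {"cell_type": "code", "input": code}

# Proposed style: one JSON string per source line.
cell_lines = {"cell_type": "code", "input": code.splitlines()}

def cell_source(cell):
    """Return the input as a single string, whichever style was stored."""
    source = cell["input"]
    if isinstance(source, list):
        return "\n".join(source)
    return source

# With indent set, each source line lands on its own line in the file,
# which is what makes line-by-line diffs possible.
print(json.dumps(cell_lines, indent=1))
```

One caveat: splitlines() drops a trailing newline, so a strict round trip
would need to either keep line endings (splitlines(True) with ''.join) or
decide that trailing newlines are not significant.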


>
> With #1, one would naturally only commit to VC the ipynb file, leaving
> the output ones to be always ignored.  People could obviously choose
> to commit the output as well, at their own risk. #2 would make it much
> easier to get line-by-line diffs of any input (code or text).
>
> I think together, these two changes mostly solve the problems I've
> encountered in practice so far.  I'm trying really hard to eat our own
> dogfood by using these tools in actual, everyday research work, so
> that we see the problems first.  And while I think the notebook is
> reaching a point where it's a great working environment (even if we
> have a ton of ideas for improvements already and things we know need
> fixing), it's clear now to me that we fail pretty badly as a
> version-controllable format.
>
> I realize that implementing something like this will add non-trivial
> complexity to the format read/write code in a number of places, so if
> anyone sees a simpler solution to the problem, we're all ears.  But we
> do need to figure out how to make the notebooks first-class citizens
> in a VC world; the (effectively) opaque binary blobs they are now just
> won't cut it in the long run.


Yes, we do need to do better.


>
> Thoughts, ideas?
>

I think this sounds like a good start, with the only change that we still
allow (optionally) outputs in a single file via the download button, rather
than the notebook format being canonically multifile, which is just too
painful.

I think the key-order issue you mention in the addendum is easily fixed by
specifying `sort_keys=True` in the json dump.
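
A quick illustration with the standard library json module (the cell
contents and the `dump_stable` name are made up for the example):

```python
import json

# Same data, keys inserted in a different order.
cell_a = {"input": ["x = 1"], "cell_type": "code", "language": "python"}
cell_b = {"language": "python", "cell_type": "code", "input": ["x = 1"]}

def dump_stable(obj):
    """Serialize with sorted keys so key order never shows up in a diff."""
    return json.dumps(obj, sort_keys=True, indent=1)

same = dump_stable(cell_a) == dump_stable(cell_b)
print(same)
```

With sorted keys the two dumps are byte-for-byte identical, so saving the
same notebook twice should not produce spurious changes in VC.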


>
> Cheers,
>
> f