[IPython-dev] Some new cell types for describing data analyses in IPy. Notebook

Gabriel Becker gmbecker at ucdavis.edu
Fri Jul 5 15:28:55 EDT 2013


 Brian,

Thank you for taking the time to watch my video and think about the ideas
I'm presenting. It is appreciated.

I am cognizant of the added complexity I am talking about. I disagree a bit
with the size/damage being attributed to it, but the main place we seem to
be on different pages is regarding what it buys us.

The alt cells I'm talking about are not a cool rendering/interaction trick
for linear notebooks. If they were I would absolutely agree that the
nesting is overkill and not worth its cost.

Linear/sequential notebooks and other dynamic document systems reproduce
computational *results*. The documents my advisors (Duncan Temple Lang and
Deborah Nolan) and I are working towards aim to describe and reproduce *the
research itself*.

I once asked a room full of quants to raise their hands if their standard
operating procedure was to read in the data, maybe clean it a bit, fit a
single model, write up the results and be done. Not only did no one raise
their hand, but the mere suggestion that that could be how it worked got a
substantial laugh from the audience. Even though this is not how the work
is done, however, it is the narrative encoded into a linear notebook.

Scientists and data analysts are already doing branching in their work, but
they don't have any good tools to record or describe it that way. So they
comment out large blocks of code in linear scripts, or they save old
attempts and alternate approaches in separate files, or they (hopefully
not) simply delete or overwrite old parameter configurations with new ones
when they decide the old one wasn't right. Not because they think these are
good ideas, but because it's all they have available and they are
scientists/analysts.

Their job is to analyze data, extract insight and share their work in a
useful manner; our job is to conceive and implement the tools they need to
do that. They are good at their job, but we are struggling with ours.

In my opinion, the single most important feature on display in my video is
not that the alternative cells are rendered side-by-side, or that they can
be executed and the software knows what that means in terms of executing
their content; it is that the IPython notebook has become an authoring tool
which analysts can use to easily create documents which describe what they
actually did in a way that is reproducible, distributable, and most
importantly, useful to the analyst.

Suddenly they don't need to comment out or overwrite code when they take a
different approach. Suddenly they can distribute a document that actually
describes what they did, instead of just what they found. Suddenly referees
don't need to wonder or ask about whether an alternative analysis method
was investigated. Suddenly professors can show statistics students what
statisticians actually *do*, instead of how final results are generated.

You may read all this and think "that is great but way out of scope for
what we want to achieve with IPython notebook". That is your right. And if
the IPython team feels that way there is, of course, nothing I can do other
than what I have done: explore these ideas on my own. I just want to be
sure we're all talking about the same thing before that decision gets made.

Detailed responses below.


On Wed, Jul 3, 2013 at 6:59 PM, Brian Granger <ellisonbg at gmail.com> wrote:

> Gabriel,
>
> I watched your video and there are some nice ideas here.  We are not
> headed in this direction in terms of *implementation* but I think you
> will find that similar *capabilities* will show up in the notebook
> over time.  A few comments about the implementation aspects:
>
> First, the benefits of having a notebook be a linear sequence of cells
> are massive:
>
> * Simple, simple, simple - this makes it very easy to reason about
> notebooks in code.  Nesting leads to complexity that is not worth the
> cost.
>

It does lead to complexity. Whether it is worth the cost depends on what we
gain. In my opinion we gain a massive amount, as I tried to
describe above.


> * You can get most of the benefits of nesting without the complexity.
> As Min mentioned, there is an implied hierarchy in the heading cells.
> We plan on using that to allow group level actions - show/hide, run
> group, move, cut/copy/paste, etc.
>

With respect, I don't know that you will get the actual benefits I am
aiming for this way. Branches (and thus the code contained in them) in this
context are mutually exclusive.  A document which has branching that leads
back into a single computation, such as the example in my
video/screenshots, *no longer makes sense to think about or execute as a
sequential set of code blocks*.

It is not that we *can* view it as a branching structure and only execute
one branch at a time, but that we *must*. Again, people do this already via
deletion or commenting because all the code for all the branches cannot
live in the same block of code. I seek to give them a tool to do it that is
not damaging to the record or reproducibility of what they did.

If you are familiar with R code and look carefully at the code for the data
cleaning branches in my example, you will see that if we put all of that
code into a sequential notebook, with heading cells to differentiate the
branches, and ran the code start to finish it would generate a plot not
created by *any* of the 18 paths through the structured document. In other
words, it generates the wrong plot regardless of what the "right" set of
choices at the branching points is in that execution context.
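
As a minimal illustration of that point, here is a hypothetical Python
sketch (not the R code from the video; the data, the two cleaning steps and
all names are made up). Two mutually exclusive cleaning branches are run
one at a time, and then "flattened" into a single start-to-finish run as a
sequential notebook would do:

```python
# Hypothetical example: two alternative, mutually exclusive cleaning steps.

def cap_at_ten(values):
    """Branch 1: cap every reading at 10."""
    return [min(v, 10) for v in values]


def scale_by_max(values):
    """Branch 2: rescale readings by the largest value."""
    top = max(values)
    return [v / top for v in values]


# The branches of one (assumed) alternative set, in document order.
BRANCHES = {"cap": cap_at_ten, "scale": scale_by_max}


def run_selected(values, choice):
    """Run only the chosen branch, as a branching document would."""
    return BRANCHES[choice](values)


def run_flattened(values):
    """Run every branch in order, as a purely sequential document would."""
    for step in BRANCHES.values():
        values = step(values)
    return values


raw = [2, 4, 20]
print(run_selected(raw, "cap"))
print(run_selected(raw, "scale"))
print(run_flattened(raw))
```

In this toy case the flattened run matches neither branch, which is the
same failure described above for the sequential version of the example.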


> * It is not difficult to think about building a proper diff tool for
> notebooks.  With nested cells this becomes horrific.
>

This does become much more difficult, but there are still things which can
be done. Archambault and others have done work on differencing graphs under
the assumption that one of them was modified to create the other (see
here <http://hal.inria.fr/docs/00/51/41/50/PDF/diffMapExperiment.pdf> and
here <http://www.cs.ubc.ca/nest/imager/tr/2009/Archambault_StructDiff_GI/structDiffPres.pdf>
for quickly retrieved examples).

Thus we can combine graph differencing with differencing of the actual
code, e.g. a combination of difference maps and normal code diffing.

This would require unique identifiers for all notebook cells, but from
something said previously in this thread I gather that is likely to happen
anyway.
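
A rough sketch of the simplest form this could take, assuming every cell
carries a unique identifier. This is only an illustration in Python: the
"id" and "source" field names are hypothetical, cells are kept flat rather
than nested, and matching by id stands in for real graph differencing (it
is not the method from the Archambault references). Cells are first matched
structurally by id, then the code of each matched cell is diffed with the
standard difflib module:

```python
import difflib


def diff_notebooks(old_cells, new_cells):
    """Compare two lists of cells, each a dict with 'id' and 'source'."""
    old = {cell["id"]: cell["source"] for cell in old_cells}
    new = {cell["id"]: cell["source"] for cell in new_cells}

    report = {
        # Structural changes: cells present in only one version.
        "removed": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
        # Code changes: line diffs for cells present in both versions.
        "changed": {},
    }
    for cell_id in sorted(old.keys() & new.keys()):
        if old[cell_id] != new[cell_id]:
            report["changed"][cell_id] = list(
                difflib.unified_diff(
                    old[cell_id].splitlines(),
                    new[cell_id].splitlines(),
                    fromfile=cell_id + " (old)",
                    tofile=cell_id + " (new)",
                    lineterm="",
                )
            )
    return report


# Made-up notebook versions for illustration.
old_nb = [
    {"id": "load", "source": "x = read()"},
    {"id": "fit", "source": "m = fit(x)"},
]
new_nb = [
    {"id": "load", "source": "x = read()"},
    {"id": "fit", "source": "m = fit(x, robust=True)"},
    {"id": "plot", "source": "plot(m)"},
]

report = diff_notebooks(old_nb, new_nb)
print(report["added"], report["removed"])
print("\n".join(report["changed"]["fit"]))
```

A nested document would need the structural step to compare parent/child
relationships as well, which is where the graph differencing work comes in.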


> * Hierarchy puts a significant cognitive load on users
>

Maybe. Again, this is something people are already implicitly doing. It is
possible that users would have trouble thinking about what they are doing
more explicitly in these terms, but it's also entirely possible that this
will actually simplify their thinking process.

I admittedly think about these types of documents a lot, but I was
pleasantly surprised at how easy it was to construct and interact with the
example I made for the video once the UI hooks were in place. Constructing
such documents by hand, which I have done a bit, IS quite unpleasant, but
that is one of the issues I seek to address.

Nested branching and other more complex structures are more difficult to
think about, but users would only create those if they had a really good
reason to do so (i.e., if they were already thinking in those terms) and just
a couple levels of branching buy us an enormous amount of fidelity when
describing data analyses.



>
> Because of these things we don't have any plans on changing the
> notebook document format or notebook UI to allow nested cells.
>
>
As I said above, it is absolutely the right of the IPython core team to
make that call. I just want to make sure we are all talking about the same
costs and the same gains when the cost-benefit analysis to make that
decision is performed.

I intend to continue pursuing these concepts. I would be thrilled to
collaborate with the IPython team if you guys decide it is something you
are interested in, but I will understand and carry on myself if you do not.

Thanks for reading and for the discussion. It's great to talk to people with
differing opinions about this instead of sitting alone in my thesis-filled
bubble stewing :).

~G


> Second, while it is tempting to generalize the notion of input to
> include widgety things, it is more appropriate to put these things in
> the output:
>
> * Our output architecture has the notion of multiple representations.
> This allows us to build rich widgets as you have done, but to still
> provide static representations (png, jpg, latex).
> * Having the multiple representations of output allows us to build the
> rich widgets, but maintain a clear path for converting notebooks to
> static formats (pdf, html, word, powerpoint).
> * Insisting that input cells are pure code allows you to reason in a
> clear manner about how a notebook works: code runs and leads to
> output.  That reasoning can be applied in an automated manner by
> running notebooks in batch mode, or building a test system based on
> them.
> * Putting widgets in the input area forces you to do regular
> expression matching to replace those variables in the code.  This
> limits you to an extremely simple event model where the only possible
> event you can know about is "substitute the regular expression and run
> all the code".  What if you want different UI controls in the browser
> to trigger different bits of code in the kernels when different
> fine-grained events happen?  Making the UI controls live on the Python and
> JS side allows us to build this in a natural way.
>
> The alt-cells you show bring up the issue of provenance.  We have some
> very initial thoughts about that, but it is way out of scope for the
> project right now - our plates are 10x overfull already.  We will
> get there eventually, though.
>
> Thanks for sharing your ideas.
>
> PS - for a bit more background about the context of our saying "no" to
> this feature request, see this blog post:
>
> http://brianegranger.com/?p=249
>
> I also gave a talk about this at SciPy and will be posting my slides soon.
>
> Cheers,
>
> Brian
>
> On Wed, Jul 3, 2013 at 6:04 PM, Gabriel Becker <gmbecker at ucdavis.edu>
> wrote:
> > Matthias,
> >
> > Thanks for your detailed response.
> >
> >
> > On Wed, Jul 3, 2013 at 1:25 AM, Matthias BUSSONNIER
> > <bussonniermatthias at gmail.com> wrote:
> >>
> >> Gabriel,
> >>
> >> Your screenshots are interesting.
> >> At some point I played with gridster[1]
> >>
> >> and was more or less able to get cells to rearrange, but didn't keep the
> >> code.
> >> You might be interested.
> >>
> >> Keep in mind that the notebook browser-interface we ship is only one
> >> possible frontend that can interpret ipynb files; nothing prevents you
> >> from writing a different frontend that displays the notebook in a
> >> different format.
> >>
> >> This, added to the fact that each cell can support arbitrary metadata,
> >> means you should be able to arrange preexisting cells in structures that
> >> work together. It might be a little difficult to do right now, as our
> >> javascript is not yet modular enough to be easily reused, but we are
> >> moving toward it.
> >
> >
> > Respectfully, rolling my own frontend for ipynb files given all the work
> > the IPython team has done on the excellent notebook browser interface
> > would be an enormous and extremely wasteful duplication of effort. I
> > don't think it's the right way to pursue these features.
> >
> > Furthermore, if I were going to write an application offering the types
> > of features I am talking about from scratch, there wouldn't be any good
> > reason to base it on the unaltered ipynb format, as it doesn't easily
> > support the structure required by those features without the additional
> > cell types I implemented in my forked version.
> >
> >>
> >> Right now I think storing the notebook as a directed graph is
> >> problematic in a few ways,
> >
> >
> > I'm not talking about storing the notebook as an actual directed graph
> > data structure. There would be benefits to that, but it's not necessary
> > and it isn't what I did in my forked version.
> >
> > The ability to have nested cells (cells which contain other cells) gets
> > us everything we need structure-wise, and is the basis of everything
> > seen in both the video (other than interactive code cell stuff) and the
> > screenshots I posted. The ipynb file for the notebook pictured in the
> > screenshot looks exactly like a normal ipynb file, except that in the
> > json there are cell declarations which have a cells field which contains
> > the json descriptions of the cells contained in that cell.
> >
> >
> >>
> >> the first being that it is incompatible with the fact that people want
> >> to be able to run notebooks in a headless manner, which is not possible
> >> if you add explicit choice.
> >
> >
> > This isn't the case. The json saved versions of notebooks with branching
> > remember which version was most recently run. When an altset cell is
> > executed, it runs only the most recently run (or currently "selected",
> > though that means something else internally) branch. Thus by doing the
> > naive thing and looping through all top level cells and executing them,
> > the currently chosen path through the notebook can easily be run in a
> > headless environment and give the correct results.
> >
> >>
> >> This also contradicts the fact that the notebook captures
> >> both the input and the output of the computation.
> >
> >
> > I don't really understand what you mean by this. In the JSON
> > representation of an executed code cell, the input field contains the
> > code, but not any values of variables used by the code, nor any
> > indication of code which was run before executing the code cell.
> >
> > Changing and rerunning an earlier code cell without re-executing the
> > cell in question can easily invalidate the output stored in the JSON,
> > even without the concept of branching or choice.
> >
> >
> >>
> >> As you showed there are
> >> actually 18 different combinations of data analysis, and they are not
> >> all stored in the notebook.
> >
> >
> > The notebook knows and records which choices were made. There are 18
> > different combinations of data analysis but only one was chosen by the
> > analyst as generating the final/most recent result.
> >
> > In the case of "publishing" about an analysis the notebook stores the
> > path chosen by the analyst, while retaining information about what else
> > he or she did during the decision process.
> >
> > In the case of instruction, imagine how much easier it would be to teach
> > data analysis if the students could actually see what data analysts do,
> > instead of simply the final method they choose in a particular analysis.
> >
> >
> >>
> >>
> >> I really think this is an interesting project, and reusing only our
> >> metadata in the notebook, you should be able to simulate it (store the
> >> dag in notebook level metadata, and cell id in cell metadata) then
> >> reconstruct the graph when needed. Keep in mind that at some point we
> >> might/will add cell id to the notebook.
> >>
> >> To sum up, I don't think the current JS client is, in its current
> >> state, the place to implement such an idea. The DAG for cell order
> >> might be an idea for a future notebook format, but it needs to be well
> >> thought out, and to wait for cell IDs.
> >
> >
> > I apologize for not being clear. As I said in a response above, the
> > directed graph idea was intended to be conceptual for thinking about the
> > documents, not structural for actually storing them.
> >
> > What I actually did was simply allow cell nesting and change indexing so
> > that it is with respect to the parent/container (cell or notebook)
> > instead of always with respect to the notebook. This required some
> > machinery changes, but not too many, and it is an extension in the
> > mathematical sense in that indexing will behave identically to the old
> > system for notebooks without any nesting while now meaningfully
> > functioning for notebooks with nesting.
> >
> > ~G
> >>
> >>
> >> --
> >> Matthias
> >>
> >>
> >>
> >> [1] http://gridster.net/
> >>
> >> _______________________________________________
> >> IPython-dev mailing list
> >> IPython-dev at scipy.org
> >> http://mail.scipy.org/mailman/listinfo/ipython-dev
> >>
> >
> >
> >
> > --
> > Gabriel Becker
> > Graduate Student
> > Statistics Department
> > University of California, Davis
> >
> >
>
>
>
> --
> Brian E. Granger
> Cal Poly State University, San Luis Obispo
> bgranger at calpoly.edu and ellisonbg at gmail.com
>



-- 
Gabriel Becker
Graduate Student
Statistics Department
University of California, Davis