[IPython-dev] Some new cell types for describing data analyses in IPy. Notebook

Gabriel Becker gmbecker at ucdavis.edu
Thu Jul 11 18:04:24 EDT 2013


Brian,

(Specific responses out of order for clarity and idea/narrative flow)

Last things first: I cannot speak for my advisors of course, but I would
love to come down to Berkeley and chat with you guys about this (and other)
stuff!

Now on to the incredibly long (sorry, but there was a lot of ground to
cover! :( ) response to the rest of your mail...

On Tue, Jul 9, 2013 at 7:32 PM, Brian Granger <ellisonbg at gmail.com> wrote:

> Gabriel,
>
> > Thank you for taking the time to watch my video and think about the ideas
> > I'm presenting. It is appreciated.
>
> Cool, I enjoyed it.  Fantastic discussion!
>

I agree. I'm finding this extremely valuable. Being forced to articulate my
ideas/understanding of the situation is extremely useful for me personally.
(I'll probably be cutting and pasting large chunks of what I find myself
writing here when I go to write this stuff up).


>
> Before you get too discouraged, please read on :-)
>
> <snip>
>
> > I once asked a room full of quants to raise their hands if their standard
> > operating procedure was to read in the data, maybe clean it a bit, fit a
> > single model, write up the results and be done. Not only did no one raise
> > their hand, but the mere suggestion that that could be how it worked got
> a
> > substantial laugh from the audience. Even though this is not how the
> work is
> > done, however, it is the narrative encoded into a linear notebook.
>
> Here is my experience of this.  I start out working in a very
> non-linear manner.  As I work I discover things and change my code.
> As I approach the point where I want to share my work, I start to
> linearize it, otherwise it is very difficult for someone to take in.
> In this context branching can be done, but it has to be explicit.  In my
> experience this is good.  If I want to run the analysis using 3
> different algorithms, I run them in sequence and then show the results
> of all three in the same place and draw conclusions.  All of this is
> done - at the end of the day - in a linear notebook.
>

That isn't quite the use-case I see branching/alternative-supporting
notebooks targeting directly. It is an important one, and one that
branching can be leveraged for, but as you point out, it can be achieved
via careful "linearization" of the analysis.

A main difference between this and what I have in mind is that all three
"branches" are things that need to be run and are intended to be compared
as a necessary component of the final results. So they aren't actually
alternatives in the sense I tend to mean when using that word.

I am targeting situations where the not-chosen-as-final branches *don't*
contribute directly to the final results: things that give you insight
into the data and inform your decision on a final model/algorithm/etc.,
but aren't actually used to generate the results.

One simple example of this is statistical models which you fit because they
appear appropriate *a priori* based on the type of data and question you
have, but that fail after-the-fact diagnostics. The fact that you fit that
model, the results it gave you, the fact that it failed, and the insight
the particular way it failed gave you into your data are crucial pieces of
your *research*, but reproducing them is not necessary to reproduce
your *results*.

People viewing/running your notebook don't need to look at or execute the
cells where you fit that model ... unless they do.

Branching/DAG notebooks allow a single document to encompass the
*research* you did, while providing easy access to various views
corresponding to the generation of intermediate, alternative, and final
*results*.

These more complex notebooks allow the viewer to ask and answer important
questions such as "What else did (s)he try here?" and potentially even "Why
did (s)he choose this particular analysis strategy?". These questions can
be answered in the text or in external supplementary materials in a
linear notebook, but that places a significant barrier to
reproducibility of the research process (as opposed to the analysis
results).

Note that sometimes the viewer is you coming back to an analysis after some
span of time so that the reasoning behind your decisions is no longer fresh.

From a practical/UI standpoint, unselected branches can be hidden almost
entirely (in theory; not currently in my PoC :p), resulting in a view
equivalent to the only view offered by a linear notebook. This means
that from a viewer's standpoint (and an author's, since a straight line
IS a DAG and nesting isn't forced), what I'm describing is in essence a
strict extension of what the notebook does now, rather than a change.


> BUT, I completely agree that the notebook does not handle certain
> types of branching very well.  Where the notebook starts to really
> suck is for longer analyses that you want to repeat for differing
> parameters or algorithms.  You talk more about this usage case below
> and we have started to think about how we would handle this.  Here are
> our current thoughts:
>
> It would be nice to write a long notebook and then add metadata to the
> notebook that indicates that some variables are to be treated as
> "templated" variables.  Then we would create tools that would enable a
> user to run a notebook over a range of templates:
>
> for x in xvars:
>   for y in yvars:
>     for algo in myalgos:
>       run_notebook('MyCoolCode', x, y, algo)
>
> The result would be **something** that allows the user to explore the
> parameter space represented.  A single notebook would be used as the
> "source" for this analysis and the result would be the set of all
> paths through the notebook.  We have even thought about using our
> soon-to-be-designed interactive widget architecture to enable the
> results to be explored using different UI controls (sliders, etc) for
> the xvar, yvar, algos.  This way you could somehow "load" the
> resulting analysis into another notebook and explore things
> interactively - with all of the computations already done.
>
>
This is a very powerful and exciting use-case. In fact it is one I am
investigating myself in the context of a different project unrelated to
IPython notebook. I call the set of results generated by such repeated runs
with different input sets (ie paths through the document) the "robustness
set" of the notebook with respect to the particular output variable being
investigated.
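
To make that concrete, a minimal sketch of what I mean, borrowing the
hypothetical run_notebook() from Brian's example above (everything
here, including the assumption that a run returns a dict of output
values, is invented for illustration):

    # Sketch: build the "robustness set" of a notebook for one output
    # variable by re-running it over a grid of inputs. run_notebook()
    # is the hypothetical helper from the example above, assumed here
    # to return a dict of the run's output values.
    from itertools import product

    xvars = [0.1, 0.5]
    yvars = [10, 100]
    myalgos = ['knn', 'svm', 'glm']

    robustness_set = {}
    for x, y, algo in product(xvars, yvars, myalgos):
        outputs = run_notebook('MyCoolCode', x, y, algo)
        # key each result by the input set, i.e. the path taken
        robustness_set[(x, y, algo)] = outputs['variable_of_interest']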

The key here is that the robustness we are talking about is not only with
respect to data/tuning parameters, but also with respect to *the
decisions/choices made during the analysis process* *itself*. These
decisions are often the difference between valid and invalid conclusions,
but are rarely talked about during discussions about reproducible
research/science AFAIK (I'd love to be wrong about that, even if it would
make me look silly/foolish here).

The DAG conceptual model buys us a lot here too, though. Instead of
having to run the entire notebook, you can calculate all possible paths
through the DAG between any arbitrary (connected) starting and ending
points. So we can rerun only pieces of large notebooks to investigate
any variable/plot, regardless of whether it constitutes a final result
of the notebook/analysis.
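
Enumerating those paths is cheap, for what it's worth. A toy sketch
(the adjacency-dict representation is an illustrative stand-in, not my
PoC's actual data structure):

    # Sketch: enumerate every path between two cells in a notebook DAG.
    # Cells are represented by ids in a plain adjacency dict; this is
    # an illustrative stand-in, not my PoC's actual data structure.
    def all_paths(graph, start, end, path=()):
        path = path + (start,)
        if start == end:
            return [path]
        found = []
        for child in graph.get(start, ()):
            found.extend(all_paths(graph, child, end, path))
        return found

    # A notebook that branches at 'load' and rejoins at 'report':
    dag = {'load': ['knn', 'svm'], 'knn': ['report'], 'svm': ['report']}
    print(all_paths(dag, 'load', 'report'))
    # -> [('load', 'knn', 'report'), ('load', 'svm', 'report')]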



> We have other people interested in this type of workflow and it can
> all be done within the context of our existing linear notebook model.
> It is just assembling the existing abstractions in different ways.
>
>
That is a plus. There is what I consider to be a pretty major drawback to
this approach though.

It is easy to see how this would work in the case of variables representing
individual number/string/boolean valued parameters without much
perturbation of the code.

Trying to write an analysis script that can gracefully handle
substantially dissimilar analysis methods, on the other hand, is more
problematic. We can do it, of course, but at that point we are moving
much more into the realm of a program rather than an analysis script.

Consider the example of classifying new data based on a training set via
KNN, SVM, and GLM approaches. These approaches all need different sets of
parameters, return different types of objects as the output of the fitting
function, may have subtly different behaviour when being used for
prediction, etc.

The abstractions necessary to deal with these differences are, in my
opinion, likely to be quite costly in terms of how easy it is for
readers of the notebook to follow and understand what the code is doing.
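
To sketch the kind of scaffolding I mean (hedged and invented for
illustration; only the scikit-learn estimators and their parameters are
real, and scikit-learn's uniform API already absorbs much of the pain
that environments without such a layer would feel in full):

    # Sketch of the dispatch scaffolding a single templated notebook
    # would need to absorb three dissimilar methods. Invented for
    # illustration; only the scikit-learn estimators are real.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression

    def make_classifier(algo):
        # each method takes its own, mutually incompatible parameters
        if algo == 'knn':
            return KNeighborsClassifier(n_neighbors=5)
        elif algo == 'svm':
            return SVC(kernel='rbf', C=1.0)
        elif algo == 'glm':
            return LogisticRegression()
        raise ValueError("unknown algorithm: %s" % algo)

    # toy stand-ins for the real training and new data
    train_X = np.random.rand(40, 3)
    train_y = np.random.randint(0, 2, 40)
    new_X = np.random.rand(5, 3)

    clf = make_classifier('svm')      # 'svm' is the templated variable
    clf.fit(train_X, train_y)         # different object types come back...
    predictions = clf.predict(new_X)  # ...with subtly different behaviour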

With actual branching, the code in each branch is exactly the same as if it
were in a normal linear notebook which implemented only that one branch,
making it much more likely to be straightforward and easy to read.

One of my targeted use-cases is publications which can more accurately
convey the research that was done while still offering the clarity and
focus of what we do now, so I think that is quite important. YMMV.

And now the sticking point.

>
> <snip>

> Q: does the new feature violate important abstractions we have in place?
>
> If the answer is no, then we do our normal job of considering the
> costs of adding the feature versus the benefits.
>
> If the answer is yes, then we *stop*.
>

I really do appreciate the IPython team's position. I think there is
some relevant nuance involved in this particular case, however, which
makes a binary "does it violate? yes/no" test overly coarse. I attempt
to make my case for this below.

I think the answers to the questions "does this new feature violate
important abstractions?" and "is it impossible/burdensomely difficult to
alter important existing abstractions in a way that supports this
feature without affecting existing uses of the abstraction?" may be
different here, despite being the same in the overwhelming majority of
cases. And I would argue the second test offers the same protections as
the first against the various pitfalls of making major changes to large
projects willy-nilly (which, I assure you, I do understand are very
real).

I'm not advocating a dramatic about-face on the issue, complete with
parade and skywriting that "IPython is pursuing an exciting new thing
starting today!". I do, however, think it is perhaps worth consideration
at a somewhat narrower and more immediate scale than it would be
otherwise.


> <snip>
>
> Thinking about your proposed feature from this perspective: both the
> task cells and alt cells introduce hierarchy and nesting into the
> notebook.  This breaks our core abstraction that cells are not nested.
>  In Jan-Feb our core development team had a discussion about this
> abstraction exactly.  We decided that we definitely don't want to move
> in the direction of allowing nesting in the notebook.  Because of this
> we are in the process of removing the 1 level of nesting our notebook
> format currently has, namely worksheets.  So for us, it is not just
> about complexity - it is about breaking the abstractions.
>

I do understand this position. I'd like to think I am bringing up
points not raised during that meeting, but whether or not that is the
case, abstractions ARE important.

I guess I am/was thinking about the abstraction in place in IPython
notebook a bit differently than you are describing it.

For the next few paragraphs: process == (render|transform|execute|*)

In my mind the abstraction/computational model is that a notebook is an
ordered set of cells and to process the notebook you simply go through the
cells in order and process them. What process means is dependent on the
type of cell, and there are various pieces of code in various places
(mostly the frontends and nbconvert AFAIK) that know how to handle each
cell type.
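
In (invented, schematic) code, the current model as I understand it:

    # Schematic of the current abstraction: a notebook is an ordered
    # list of cells and processing it is a single flat loop.
    # process_cell() stands in for whatever "process" means in context
    # (render, transform, execute, ...).
    def process_notebook(notebook):
        for cell in notebook.cells:
            process_cell(cell)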

Under this formulation the change in abstraction is actually pretty
small. The only addition is the statement that code which processes
cells is responsible for initiating/handling the processing of any child
cells those cells contain. The easiest example of this is the execute
method on my task cells, which simply loops through each of its children
and (if applicable) calls their execute method.
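
Paraphrased from my PoC (not a verbatim excerpt), the whole change
amounts to something like:

    # The one addition: a container cell processes its own children.
    class TaskCell(object):
        def __init__(self, children):
            self.children = children

        def execute(self):
            for child in self.children:
                # markdown/heading children have nothing to execute
                if hasattr(child, 'execute'):
                    child.execute()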

With this change we still have a notebook defined as an ordered set of (top
level) cells, and we can still process a notebook by stepping through each
of them in order and processing that cell.

Some changes to the concept of next/previous cells and cell position
(for positional insertion, etc.) were required, and cells must be aware
of their direct parent (which will be either a cell or the notebook
itself), but I would argue these aren't actually important attributes of
the abstraction itself, and the changes were fairly narrow and (AFAICS)
pretty painless and straightforward after some careful thought/planning.



> The reason that these abstractions are so important is that they
> provide powerful foundations for us to build on.  One place the
> "notebook as a linear sequence of cell" abstraction comes into play is
> in our work on nbconvert that will appear in 1.0 in the next few
> weeks.  This allows us to convert notebooks very easily to a number of
> different formats.


I haven't tackled nbconvert yet on my experimental fork, but I fully intend
to as I agree entirely that the ability to generate things like linear pdfs
and other static views is utterly crucial. The fact that a notebook with
branches can generate a pdf that looks like it came from a linear notebook
(ie the "static article" view) is a *major* selling point/core feature of
what I'm trying to do with branching notebooks. It is key that people
be able to meet the needs they are meeting now; if we can't support
those, then meeting the more nebulous needs they aren't currently
meeting isn't likely to save us (me) from irrelevance.

Under my alternate description of the computational model above,
nbconvert will behave pretty much as it always has: step through the
notebook and process the cells in order into whatever format you are
targeting. The one exception is the cells processing their children, but
the scale of this change is not particularly large for the specific types
of nesting I'm going for.

Tasks would likely simply render their children without making any mark
themselves in most cases, while altsets would do the same, but only for the
"active" branch. This involves a bit of looping and a bunch of calls to
existing code that knows how to transform the existing (core) cell types,
but really not much else.
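
In the same schematic style (names invented; this is not nbconvert's
actual API):

    # Schematic of how container cells might render under nbconvert:
    # delegate to the existing per-cell-type renderer, with altsets
    # emitting only their currently active branch.
    def render_task(task, render_cell):
        return "".join(render_cell(child) for child in task.children)

    def render_altset(altset, render_cell):
        active = altset.branches[altset.active_index]
        return "".join(render_cell(child) for child in active.children)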




>  The other place this abstraction comes into play
> is in our keyboard shortcuts.  We are striving for the notebook to be
> usable for people who don't touch the mouse (your traditional vi/emacs
> users).  Nesting makes that very difficult.
>

I admit this one is tougher, though I've done a small amount of
thinking about it (currently, hitting "down" on a container cell enters
it, while hitting "down" on the last cell in a container navigates to
the cell "after" the container in my PoC).
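
Simplified from the PoC, the "down" rule is roughly:

    # Sketch of the "down" navigation rule: down on a container enters
    # its first child; down on the last child of a container steps past
    # the container itself.
    def next_cell(cell):
        children = getattr(cell, 'children', None)
        if children:                  # container: enter it
            return children[0]
        return step_past(cell)

    def step_past(cell):
        parent = cell.parent          # a container cell or the notebook
        sibs = parent.children if hasattr(parent, 'children') else parent.cells
        i = sibs.index(cell)
        if i + 1 < len(sibs):         # ordinary sibling step
            return sibs[i + 1]
        if hasattr(parent, 'parent'): # last child: continue after the
            return step_past(parent)  #   enclosing container
        return None                   # fell off the end of the notebook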

I think this is surmountable though, and worth the effort if it were the
only thing holding IPython notebook back from offering a
change/alternative/"fix" to how we talk about research and what we can do
with the documents we use to describe it.

Wow that was a lot of text. Thanks for making it all the way to the end!

~G


<snip>

-- 
Gabriel Becker
Graduate Student
Statistics Department
University of California, Davis

