[IPython-dev] Some new cell types for describing data analyses in IPy. Notebook

Brian Granger ellisonbg at gmail.com
Fri Jul 12 12:21:50 EDT 2013


Gabriel,

Due to some travel for IPython and our family moving, I will probably
drop off the face of the planet in the next few days.  I will reemerge
in early-mid August if you want to continue this discussion at that
point.  Replies inline below...

> Last things first: I cannot speak for my advisors of course, but I would
> love to come down to Berkeley and chat with you guys about this (and other)
> stuff!

Great, let's talk in Sept. to figure out a time that would work.

>> Here is my experience of this.  I start out working in a very
>> non-linear manner.  As I work I discover things and change my code.
>> As I approach the point where I want to share my work, I start to
>> linearize it, otherwise it is very difficult for someone to take in.
>> In this context branching can be done but it has to be explicit.  In my
>> experience this is good.  If I want to run the analysis using 3
>> different algorithms, I run them in sequence and then show the results
>> of all three in the same place and draw conclusions.  All of this is
>> done - at the end of the day - in a linear notebook.
>
>
> That isn't quite the use-case I see branching/alternate supporting notebooks
> targeting directly. It is an important one, and one that branching can be
> leveraged for, but as you point out, it can be achieved via careful
> "linearization" of the analysis.
>
> A main difference between this and what I have in mind is that all three
> "branches" are things that need to be run and are intended to be compared as
> a necessary component of the final results. In this sense they aren't
> actually alternatives in the sense I tend to mean when using that word.
>
> I am targeting situations where the not-chosen-as-final branches don't
> contribute directly to the final results. Things that give you insight into
> the data which inform your decision on a final model/algorithm/etc but
> aren't actually used to generate the results.

I am following this and this usage case makes sense - although I
haven't found it as often in my own work.

> One simple example of this is statistical models which you fit because they
> appear appropriate a priori based on the type of data and question you have,
> but that fail after-the-fact diagnostics. The fact that you fit that model,
> the results it gave you, the fact that it failed, and the insight the
> particular way it failed gave you into your data are crucial pieces of your
> research, but reproducing them is not necessary to reproduce your results.
>
> People viewing/running your notebook don't need to look at or execute the
> cells where you fit that model ... unless they do.

Yes, in statistical modeling this issue would definitely come up - it
is part of providing the full record of what was done - even the
negative results.

> Branching/DAG notebooks allow a single document to encompass the research
> you did, while providing easy access to various views corresponding to the
> generation of intermediate, alternative, and final results.
>
> These more complex notebooks allow the viewer to ask and answer important
> questions such as "What else did (s)he try here?" and potentially even "Why
> did (s)he choose this particular analysis strategy?". These questions can be
> answered in the text or external supplementary materials in a linear
> notebook, but this is a significant barrier to reproducibility of the
> research process (as opposed to the analysis results).

I can see that, however, I think the pure alt cells lack a critical
feature.  They treat all branches as being equally important.  In
reality, the branch that is chosen as the "best" one will likely
require further analysis and discussion that the other branches
don't.  Putting the different branches side by side makes it a little
like "choose your own adventure" - when in reality, the author of the
research wants to steer the reader along a very particular path.  The
alternative paths may be useful to have around, but they should not be
given equal weight as the "best" one.  But, maybe it is just
presentation and can be accounted for in descriptive text.

At the same time I can imagine situations where the author really does
want the different branches to receive equal weight as alternatives.

> Note that sometimes the viewer is you coming back to an analysis after some
> span of time so that the reasoning behind your decisions is no longer fresh.

Yes, this is an extremely common - if not the most common - usage case of all...

> As a practical/UI standpoint unselected branches can be hidden almost
> entirely (in theory, not currently in my PoC :p), resulting in a view
> equivalent to (any) the only view offered by a linear notebook. This means
> that from a viewer (and author since a straight line IS a DAG and nesting
> isn't forced) standpoint, what I'm describing is in essence a strict
> extension of what the notebook does now, rather than a change.

I would be *more* interested in alt-cell approaches that present the
notebook as a linear entity in all cases, but that has the alt-cell
logic underneath.  For example, what about the following:

* A user writes the different N alt cells in linear sequence
* The result is a purely linear notebook where one of the N cells should be run.
* We write a JavaScript plugin for the notebook that does a couple of things:

1. It provides a cell toolbar for marking those cells as members of an
alt-set.  This would simply modify the cell level metadata and allow
the author to provide titles of each alt-member.
2. It provides the logic for building a UI for viewing one of the
alt-set members at a time.  It could be as simple as injecting a drop
down menu that shows one and hides the rest.

* This plugin could simply walk the notebook cells and find all the
alt-cell sets and build this supplementary UI.
* This plugin could also have settings that allow the author to select
the "best" member of the alt-set.
* nbconvert Transformers could use the cell level metadata to export
the notebook in different formats.
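
A rough sketch of the metadata side of this idea, in plain Python rather than JavaScript.  It assumes a hypothetical "altset" key in the cell level metadata (with made-up "id", "title" and "best" fields); none of these names are part of the notebook format, and the cells are plain dicts standing in for a loaded notebook.

```python
# Hypothetical metadata layout: the "altset" key and its fields are
# assumptions made for illustration, not part of the notebook format.
notebook_cells = [
    {"cell_type": "markdown", "source": "Load the data", "metadata": {}},
    {"cell_type": "code", "source": "fit_knn()",
     "metadata": {"altset": {"id": "model", "title": "KNN", "best": False}}},
    {"cell_type": "code", "source": "fit_svm()",
     "metadata": {"altset": {"id": "model", "title": "SVM", "best": True}}},
    {"cell_type": "code", "source": "report()", "metadata": {}},
]


def find_altsets(cells):
    """Walk the cells and group the indices of alt-set members by set id."""
    altsets = {}
    for index, cell in enumerate(cells):
        info = cell["metadata"].get("altset")
        if info is not None:
            altsets.setdefault(info["id"], []).append(index)
    return altsets


def select_best(cells):
    """Return a linear list of cells keeping only the "best" alt-set members."""
    selected = []
    for cell in cells:
        info = cell["metadata"].get("altset")
        if info is None or info["best"]:
            selected.append(cell)
    return selected
```

The notebook stays a flat list of cells throughout; an exporter could call something like select_best() to produce the "static article" view, while a UI could use find_altsets() to build its drop down menus.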

As I write about this - I think this would be extremely nice, and it
would not be difficult to write at all.  Because of how our JavaScript
plugins work, it could be developed outside IPython initially.  The
question of inclusion in the official code base could be handled
later.  Honestly, this approach should be much easier than the work
you have already done.

Best of all the resulting notebooks would remain standard linear
notebooks that could be shared today on nbviewer, etc.  It would just
work.

Are you interested in taking a shot at this?  I think that would be awesome.

>>
>> BUT, I completely agree that the notebook does not handle certain
>> types of branching very well.  Where the notebook starts to really
>> suck is for longer analyses that you want to repeat for differing
>> parameters or algorithms.  You talk more about this usage case below
>> and we have started to think about how we would handle this.  Here are
>> our current thoughts:
>>
>> It would be nice to write a long notebook and then add metadata to the
>> notebook that indicates that some variables are to be treated as
>> "templated" variables.  Then we would create tools that would enable a
>> user to run a notebook over a range of templates:
>>
>> for x in xvars:
>>   for y in yvars:
>>     for algo in myalgos:
>>       run_notebook('MyCoolCode', x, y, algo)
>>
>> The result would be **something** that allows the user to explore the
>> parameter space represented.  A single notebook would be used as the
>> "source" for this analysis and the result would be the set of all
>> paths through the notebook.  We have even thought about using our
>> soon-to-be-designed interactive widget architecture to enable the
>> results to be explored using different UI controls (sliders, etc) for
>> the xvar, yvar, algos.  This way you could somehow "load" the
>> resulting analysis into another notebook and explore things
>> interactively - with all of the computations already done.
>>
>
> This is a very powerful and exciting use-case. In fact it is one I am
> investigating myself in the context of a different project unrelated to
> IPython notebook. I call the set of results generated by such repeated runs
> with different input sets (ie paths through the document) the "robustness
> set" of the notebook with respect to the particular output variable being
> investigated.

Yes, this is a sort of batch mode for the notebook.
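
To make the batch idea a bit more concrete, here is a minimal sketch of running one notebook over every combination of templated variables.  run_notebook does not exist yet, so the sketch takes it as an argument and uses a stand-in (fake_run_notebook, a made-up name) that only records what it was asked to run.

```python
import itertools


def run_grid(run_notebook, name, **grids):
    """Call run_notebook once per combination of the templated variables."""
    keys = sorted(grids)
    results = {}
    for values in itertools.product(*(grids[k] for k in keys)):
        params = dict(zip(keys, values))
        # Results are keyed by the parameter values, in sorted key order.
        results[values] = run_notebook(name, **params)
    return results


# Stand-in for the real thing: run_notebook is hypothetical here and this
# version just returns a description of the requested run.
def fake_run_notebook(name, **params):
    args = ", ".join("%s=%r" % item for item in sorted(params.items()))
    return "%s(%s)" % (name, args)


results = run_grid(fake_run_notebook, "MyCoolCode",
                   x=[1, 2], y=[10], algo=["knn", "svm"])
```

The results dict is the "set of all paths" for this parameter space; it could later be loaded elsewhere and explored without rerunning anything.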

> The key here is that the robustness we are talking about is not only with
> respect to data/tuning parameters, but also with respect to the
> decisions/choices made during the analysis process itself. These decisions
> are often the difference between valid and invalid conclusions, but are
> rarely talked about during discussions about reproducible research/science
> AFAIK (I'd love to be wrong about that, even if it would make me look
> silly/foolish here).
>
> The DAG conceptual model buys us a lot here too though. Instead of having to
> run the entire notebook, you can calculate all possible paths through the
> DAG for any arbitrary (connected) starting and ending points. So we can
> rerun only pieces of large notebooks to investigate any variable/plot
> regardless of whether it constitutes a final result of the
> notebook/analysis.

Yes, this type of analysis could also be done by the JavaScript plugin
approach above.
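
For illustration, enumerating all paths between two points of a DAG is a short recursive function.  This sketch is in Python and assumes the DAG is given as a dict mapping each node to its children; the node names are made up.

```python
def all_paths(dag, start, end):
    """Return every path from start to end in a DAG given as {node: [children]}."""
    if start == end:
        return [[end]]
    paths = []
    for child in dag.get(start, []):
        for tail in all_paths(dag, child, end):
            paths.append([start] + tail)
    return paths


# Hypothetical notebook: after "load", either "knn" or "svm" is run,
# then "plot", then "report".
dag = {
    "load": ["knn", "svm"],
    "knn": ["plot"],
    "svm": ["plot"],
    "plot": ["report"],
}
```

Here all_paths(dag, "load", "plot") gives the two paths that would need to be run to investigate the plot, without touching "report".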

>
>>
>> We have other people interested in this type of workflow and it can
>> all be done within the context of our existing linear notebook model.
>> It is just assembling the existing abstractions in different ways.
>>
>
> That is a plus. There is what I consider to be a pretty major drawback to
> this approach though.
>
> It is easy to see how this would work in the case of variables representing
> individual number/string/boolean valued parameters without much perturbation
> of the code.
>
> Trying to write an analysis script that can graciously handle substantially
> dissimilar analysis methods, on the other hand, is more problematic. We can
> do it, of course, but at that point we are moving much more into the realm
> of a program rather than an analysis script.

Yes, definitely.

> Consider the example of classifying new data based on a training set via
> KNN, SVM, and GLM approaches. These approaches all need different sets of
> parameters, return different types of objects as the output of the fitting
> function, may have subtly different behaviour when being used for
> prediction, etc.

Yep, that is the big challenge with the branching idea in general.  It
is not always true that the members of the alt sets can be swapped
out.

> The abstractions necessary to deal with these differences are likely in my
> opinion to be highly costly in terms of how easy it is for readers of the
> notebook to follow and understand what the code is doing.
>
> With actual branching, the code in each branch is exactly the same as if it
> were in a normal linear notebook which implemented only that one branch,
> making it much more likely to be straightforward and easy to read.

But I think the same issue exists with any approach to branching
right?  I am thinking the scripted notebook could have the same type
of API - the important point is that the templated variables, while
simple types, could trigger different code paths.

algo = 0  # a template variable

if algo == 0:
  pass  # alt-cell #1
elif algo == 1:
  pass  # alt-cell #3
...

This is not pretty but it would work...
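
A slightly tidier variant of the same idea, as a sketch only: a lookup table maps the simple templated value to the code for one branch.  The branch functions here (run_knn, run_svm) are placeholders that return dummy values; in practice each would hold the contents of one alt-cell.

```python
# The branch functions are placeholders; in a real notebook each would hold
# the code of one alt-cell.
def run_knn(data):
    return ("knn", len(data))


def run_svm(data):
    return ("svm", len(data))


BRANCHES = {0: run_knn, 1: run_svm}


def run_branch(algo, data):
    """Pick the code path for a templated variable of simple type."""
    try:
        branch = BRANCHES[algo]
    except KeyError:
        raise ValueError("unknown algo: %r" % (algo,))
    return branch(data)
```

Each branch keeps its own straightforward code, and only the one-line lookup knows that alternatives exist.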

> One of my targeted use-cases is publications which can more accurately
> convey the research which was done while still able to offer the clarity of
> focus of what we do now, so I think that is quite important. YMMV.
>
> And now the sticking point.
>>
>>
>> <snip>
>>
>> Q: does the new feature violate important abstractions we have in place.
>>
>> If the answer is no, then we do our normal job of considering the
>> costs of adding the feature versus the benefits.
>>
>> If the answer is yes, then we *stop*.
>
>
> I really do appreciate the IPython team's position. I think there is some
> relevant nuance involved in this particular case, however, which makes the
> does it change? yes:no test overly coarse. I attempt to make my case for
> this below.
>
> I think the answer to the questions "does this new feature violate important
> abstractions?" and "is it impossible/burdensomely difficult to alter
> important existing abstractions in a way that supports this feature without
> affecting the uses of the abstraction?" , may be different here, despite
> being the same in the overwhelming majority of cases.  And I would argue the
> second test offers identical protections as the first against the various
> pitfalls of making major changes to large projects willy-nilly (which I
> assure you I do understand are very real).
>
> I'm not advocating a dramatic about-face on the issue complete with parade
> and skywriting that "IPython is pursuing an exciting new thing starting
> today!". I do, however, think it is perhaps worth consideration at a
> somewhat narrower and more immediate scale than it would be otherwise.

I hope you can see that I really like the general idea and think the
usage cases you are describing are really important.  I think I can
speak for the project in saying that we want the notebook to be useful
for things like this.  But I think our abstractions are important
enough that we make every attempt to see how we can do these while
leveraging our existing abstractions.  This is partially a question
about implementation, but also partly a question about how the new
features are thought about.  The reason we don't like to break
abstractions for new features is that we have found an interesting
relationship between abstraction breaking and new features.  We have
found that when a new feature/idea breaks a core abstraction that we
have thought about very carefully, it is usually because the feature
has not been fully understood.  Time and time again, we have found
that when we take the time to fully understand the feature, it usually
fits within our abstractions beautifully and is even much better than
we ever imagined it could be.

The plugin idea above is a perfect example of this.  By preserving the
abstractions, the new feature itself leads to a multiplication of even
more new functionality:

* The resulting notebooks can still be version controlled.  This means
that the different alt-cells can be thrown into git and when we develop
a visual diff tool for notebooks, they will *just work*.
* The notebooks can immediately leverage the abstractions we have put
into place for converting notebooks to different formats.  You could
write custom transformers to present the notebook in a reveal.js
presentation, giving alt-cells special treatment.
* All of this can be done, and put into the hands of users, without going
through those overly conservative IPython developers ;-)
* It will just work with nbviewer as well.
* It provides a cleanly abstracted foundation for other people to build upon

In summary, we are trying to build an architecture that allows a few
simple abstractions (we actually don't have that many!) to combine in
boundless ways to create features we never planned on, but that "just
work".

The upside of this, is that when we have encountered features that are
important to us that really do require us to break or re-vision core
abstractions - we gladly undertake this work.  Mainly because we feel
that the new abstractions will be even more powerful.

>>
>> <snip>
>>
>>
>> Thinking about your proposed feature from this perspective: both the
>> task cells and alt cells introduce hierarchy and nesting into the
>> notebook.  This breaks our core abstraction that cells are not nested.
>>  In Jan-Feb our core development team had a discussion about this
>> abstraction exactly.  We decided that we definitely don't want to move
>> in the direction of allowing nesting in the notebook.  Because of this
>> we are in the process of removing the 1 level of nesting our notebook
>> format currently has, namely worksheets.  So for us, it is not just
>> about complexity - it is about breaking the abstractions.
>
>
> I do understand this position. I'd like to think I am bringing up points not
> raised during that meeting, but whether or not that is the case abstractions
> ARE important.
>
> I guess I am/was thinking about the abstraction in place in IPython notebook
> a bit differently than you are describing it.
>
> For the next few paragraphs: process == (render|transform|execute|*)
>
> In my mind the abstraction/computational model is that a notebook is an
> ordered set of cells and to process the notebook you simply go through the
> cells in order and process them. What process means is dependent on the type
> of cell, and there are various pieces of code in various places (mostly the
> frontends and nbconvert AFAIK) that know how to handle each cell type.
>
> Under this formulation the change in abstraction is actually pretty small.
> The only addition is the statement that code which processes cells is
> responsible for initiating/handling the processing of any child cells those
> cells contain. The easiest example of this is the execute method on my
> task cells, which simply loops through each of its children and (if
> applicable) calls their execute method.
>
> With this change we still have a notebook defined as an ordered set of (top
> level) cells, and we can still process a notebook by stepping through each
> of them in order and processing that cell.
>
> Some changes to the concept of next/previous cells and cell position (for
> positional insertion, etc) were required and cells must be aware of their
> direct parent (which will be either a cell or the notebook itself), but I
> would argue these aren't actually important attributes of the abstraction
> itself and the changes were actually fairly narrow and (AFAICS) pretty
> painless and straightforward after some careful thought/planning.

This is an interesting way of thinking about nesting that I had not
thought about.
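
A rough sketch of the model as described above, to make it concrete.  The class names (Cell, TaskCell, AltSetCell) and the log argument are made up for illustration and are not taken from the PoC; "executing" a leaf cell here just records its source.

```python
class Cell(object):
    """A leaf cell; "executing" it just records its source."""

    def __init__(self, source):
        self.source = source

    def execute(self, log):
        log.append(self.source)


class TaskCell(Cell):
    """A container that executes each of its children in order."""

    def __init__(self, children):
        self.children = children

    def execute(self, log):
        for child in self.children:
            child.execute(log)


class AltSetCell(Cell):
    """A container that executes only its active branch."""

    def __init__(self, branches, active=0):
        self.branches = branches
        self.active = active

    def execute(self, log):
        self.branches[self.active].execute(log)


def execute_notebook(cells):
    """Step through the top-level cells in order, as in a linear notebook."""
    log = []
    for cell in cells:
        cell.execute(log)
    return log


notebook = [
    Cell("load"),
    AltSetCell([TaskCell([Cell("fit_knn"), Cell("check_knn")]),
                TaskCell([Cell("fit_svm")])], active=1),
    Cell("report"),
]
```

The top-level loop in execute_notebook() is unchanged from the linear case; all knowledge of nesting lives in the container cells themselves.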

>
>>
>> The reason that these abstractions are so important is that they
>> provide powerful foundations for us to build on.  One place the
>> "notebook as a linear sequence of cells" abstraction comes into play is
>> in our work on nbconvert that will appear in 1.0 in the next few
>> weeks.  This allows us to convert notebooks very easily to a number of
>> different formats.
>
>
> I haven't tackled nbconvert yet on my experimental fork, but I fully intend
> to as I agree entirely that the ability to generate things like linear pdfs
> and other static views is utterly crucial. The fact that a notebook with
> branches can generate a pdf that looks like it came from a linear notebook
> (ie the "static article" view) is a *major* selling point/core feature of
> what I'm trying to do with branching notebooks. It is key that people be
> able to meet the needs they are meeting now; if we can't, meeting the more
> nebulous needs they aren't isn't likely to save us (me) from irrelevance.
>
> Under my alternate description of the computational model described above
> nbconvert will behave pretty much as it always has: step through the
> notebook and process the cells in order into whatever format you are
> targeting. The one exception is the cells processing their children, but the
> scale of this change is not particularly large for the specific types of
> nesting I'm going for.
>
> Tasks would likely simply render their children without making any mark
> themselves in most cases, while altsets would do the same, but only for the
> "active" branch. This involves a bit of looping and a bunch of calls to
> existing code that knows how to transform the existing (core) cell types,
> but really not much else.
>
>
>
>>
>>  The other place this abstraction comes into play
>> is in our keyboard shortcuts.  We are striving for the notebook to be
>> usable for people who don't touch the mouse (your traditional vi/emacs
>> users).  Nesting makes that very difficult.
>
>
> I admit this one is tougher, though I've done some small amount of thinking
> about it (currently hitting "down" on a container cell enters it while
> hitting down on the last cell in a container navigates to the cell "after"
> the container in my PoC).
>
> I think this is surmountable though, and worth the effort if it were the
> only thing holding IPython notebook back from offering a
> change/alternative/"fix" to how we talk about research and what we can do
> with the documents we use to describe it.
>
> Wow that was a lot of text. Thanks for making it all the way to the end!

I made it!

Cheers,

Brian

> ~G
>
>
>> <snip>
>>
>
>
> --
> Gabriel Becker
> Graduate Student
> Statistics Department
> University of California, Davis
>
> _______________________________________________
> IPython-dev mailing list
> IPython-dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/ipython-dev
>



--
Brian E. Granger
Cal Poly State University, San Luis Obispo
bgranger at calpoly.edu and ellisonbg at gmail.com


