[IPython-dev] Some new cell types for describing data analyses in IPy. Notebook

Gabriel Becker gmbecker at ucdavis.edu
Sat Jul 13 19:29:33 EDT 2013


On Fri, Jul 12, 2013 at 9:21 AM, Brian Granger <ellisonbg at gmail.com> wrote:

> Gabriel,
>
> Due to some travel for IPython and our family moving, I will probably
> drop off the face of the planet in the next few days.  I will reemerge
> in early-mid August if you want to continue this discussion at that
> point.  Replies inline below...
>

I will look forward to that. My research plan calls for me to be pushing
hard on these concepts this summer, so there will likely be even more to
talk about then :)



>
> > Last things first: I cannot speak for my advisors of course, but I would
> > love to come down to Berkeley and chat with you guys about this (and
> > other) stuff!
>
> Great, let's talk in Sept. to figure out a time that would work.
>

Definitely.


>
> >> Here is my experience of this.  I start out working in a very
> >> non-linear manner.  As I work I discover things and change my code.
> >> As I approach the point where I want to share my work, I start to
> >> linearize it, otherwise it is very difficult for someone to take in.
> >> In this context branching can be done but it has to be explicit.  In my
> >> experience this is good.  If I want to run the analysis using 3
> >> different algorithms, I run them in sequence and then show the results
> >> of all three in the same place and draw conclusions.  All of this is
> >> done - at the end of the day - in a linear notebook.
> >
> >
> > That isn't quite the use-case I see branching/alternate supporting
> > notebooks targeting directly. It is an important one, and one that
> > branching can be leveraged for, but as you point out, it can be achieved
> > via careful "linearization" of the analysis.
> >
> > A main difference between this and what I have in mind is that all three
> > "branches" are things that need to be run and are intended to be
> > compared as a necessary component of the final results. In this sense
> > they aren't actually alternatives in the sense I tend to mean when using
> > that word.
> >
> > I am targeting situations where the not-chosen-as-final branches don't
> > contribute directly to the final results. Things that give you insight
> > into the data which inform your decision on a final model/algorithm/etc
> > but aren't actually used to generate the results.
>
> I am following this and this usage case makes sense - although I
> haven't found it as often in my own work.
>
> > One simple example of this is statistical models which you fit because
> > they appear appropriate a priori based on the type of data and question
> > you have, but that fail after-the-fact diagnostics. The fact that you fit
> > that model, the results it gave you, the fact that it failed, and the
> > insight the particular way it failed gave you into your data are crucial
> > pieces of your research, but reproducing them is not necessary to
> > reproduce your results.
> >
> > People viewing/running your notebook don't need to look at or execute the
> > cells where you fit that model ... unless they do.
>
> Yes, in statistical modeling this issue would definitely come up - it
> is part of providing the full record of what was done - even the
> negative results.
>
> > Branching/DAG notebooks allow a single document to encompass the research
> > you did, while providing easy access to various views corresponding to
> > the generation of intermediate, alternative, and final results.
> >
> > These more complex notebooks allow the viewer to ask and answer important
> > questions such as "What else did (s)he try here?" and potentially even
> > "Why did (s)he choose this particular analysis strategy?". These
> > questions can be answered in the text or external supplementary materials
> > in a linear notebook, but this is a significant barrier to
> > reproducibility of the research process (as opposed to the analysis
> > results).
>
> I can see that, however, I think the pure alt cells lack a critical
> feature.  They treat all branches as being equally important.  In
> reality, the branch that is chosen as the "best" one will likely
> require further analysis and discussion that the other branches
> don't.  Putting the different branches side by side makes it a little
> like "choose your own adventure" - when in reality, the author of the
> research wants to steer the reader along a very particular path.  The
> alternative paths may be useful to have around, but they should not be
> given equal weight as the "best" one.  But, maybe it is just
> presentation and can be accounted for in descriptive text.
>
> At the same time I can imagine situations where the author really does
> want the different branches to receive equal weight as alternatives.
>

I agree. The alt/altset cell mechanism (whatever form it takes in the
actual objects) is the scaffolding on which we will build what we actually
want. It definitely is important to have "primary" and "non-primary"
branches in many cases, and other types of differentiation which inform
behavior are also pretty important.

What exactly we want in terms of different types of branches, different
behaviors when authoring/processing/querying non-linear notebook
structures, etc is a major component of what I'm working on right now. So
more on this soon.

I will think about the rest of what you said and suggested and we can take
it back up when you get back.

Have a good trip/move

~G


>
> > Note that sometimes the viewer is you coming back to an analysis after
> > some span of time so that the reasoning behind your decisions is no
> > longer fresh.
>
> Yes, this is an extremely common - if not the most common - usage case of
> all...
>
> > From a practical/UI standpoint unselected branches can be hidden almost
> > entirely (in theory, not currently in my PoC :p), resulting in a view
> > equivalent to the only view offered by a linear notebook. This means
> > that from a viewer (and author since a straight line IS a DAG and nesting
> > isn't forced) standpoint, what I'm describing is in essence a strict
> > extension of what the notebook does now, rather than a change.
>
> I would be *more* interested in alt-cell approaches that present the
> notebook as a linear entity in all cases, but that has the alt-cell
> logic underneath.  For example, what about the following:
>
> * A user writes the different N alt cells in linear sequence
> * The result is a purely linear notebook where one of the N cells should
> be run.
> * We write a JavaScript plugin for the notebook that does a couple of
> things:
>
> 1. It provides a cell toolbar for marking those cells as members of an
> alt-set.  This would simply modify the cell level metadata and allow
> the author to provide titles of each alt-member.
> 2. It provides the logic for building a UI for viewing one of the
> alt-set members at a time.  It could be as simple as injecting a drop
> down menu that shows one and hides the rest.
>
> * This plugin could simply walk the notebook cells and find all the
> alt-cell sets and build this supplementary UI.
> * This plugin could also have settings that allow the author to select
> the "best" member of the alt-set.
> * nbconvert Transformers could use the cell level metadata to export
> the notebook in different formats.
>
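To make the metadata idea above concrete, here is a minimal sketch, written
in Python rather than JavaScript only to keep it short. The metadata key
"altset" and its fields "id", "title", and "primary" are hypothetical names
invented for this illustration; nothing in the notebook format defines them.
Cells are plain dicts shaped roughly like the cells in a notebook's JSON.

```python
from collections import OrderedDict


def find_altsets(cells):
    """Group cell indices by the (hypothetical) alt-set id in their metadata."""
    altsets = OrderedDict()
    for index, cell in enumerate(cells):
        alt = cell.get("metadata", {}).get("altset")
        if alt is not None:
            altsets.setdefault(alt["id"], []).append(index)
    return altsets


def linear_view(cells, chosen=None):
    """Return the cells a reader would see: one member per alt-set.

    `chosen` maps an alt-set id to the title of the member to show. When an
    alt-set is not listed in `chosen`, the member marked primary is shown.
    """
    chosen = chosen or {}
    visible = []
    for cell in cells:
        alt = cell.get("metadata", {}).get("altset")
        if alt is None:
            visible.append(cell)  # ordinary cell, always shown
        elif alt["id"] in chosen:
            if alt["title"] == chosen[alt["id"]]:
                visible.append(cell)
        elif alt.get("primary", False):
            visible.append(cell)
    return visible


# The notebook itself stays a flat, linear list of cells.
cells = [
    {"cell_type": "code", "source": "data = load()", "metadata": {}},
    {"cell_type": "code", "source": "fit = knn(data)",
     "metadata": {"altset": {"id": "model", "title": "KNN", "primary": False}}},
    {"cell_type": "code", "source": "fit = svm(data)",
     "metadata": {"altset": {"id": "model", "title": "SVM", "primary": True}}},
    {"cell_type": "code", "source": "report(fit)", "metadata": {}},
]

default_sources = [c["source"] for c in linear_view(cells)]
knn_sources = [c["source"] for c in linear_view(cells, {"model": "KNN"})]
```

A drop-down in the UI would then amount to calling linear_view with a
different `chosen` mapping, and a tool that ignores the metadata still sees
an ordinary linear notebook.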
> As I write about this - I think this would be extremely nice, and it
> would not be difficult to write at all.  Because of how our JavaScript
> plugins work, it could be developed outside IPython initially.  The
> question of inclusion in the official code base could be handled
> later.  Honestly, this approach should be much easier than the work
> you have already done.
>
> Best of all the resulting notebooks would remain standard linear
> notebooks that could be shared today on nbviewer, etc.  It would just
> work.
>
> Are you interested in taking a shot at this?  I think that would be
> awesome.
>
> >>
> >> BUT, I completely agree that the notebook does not handle certain
> >> types of branching very well.  Where the notebook starts to really
> >> suck is for longer analyses that you want to repeat for differing
> >> parameters or algorithms.  You talk more about this usage case below
> >> and we have started to think about how we would handle this.  Here are
> >> our current thoughts:
> >>
> >> It would be nice to write a long notebook and then add metadata to the
> >> notebook that indicates that some variables are to be treated as
> >> "templated" variables.  Then we would create tools that would enable a
> >> user to run a notebook over a range of templates:
> >>
> >> for x in xvars:
> >>   for y in yvars:
> >>     for algo in myalgos:
> >>       run_notebook('MyCoolCode', x, y, algo)
> >>
> >> The result would be **something** that allows the user to explore the
> >> parameter space represented.  A single notebook would be used as the
> >> "source" for this analysis and the result would be the set of all
> >> paths through the notebook.  We have even thought about using our
> >> soon-to-be-designed interactive widget architecture to enable the
> >> results to be explored using different UI controls (sliders, etc) for
> >> the xvar, yvar, algos.  This way you could somehow "load" the
> >> resulting analysis into another notebook and explore things
> >> interactively - with all of the computations already done.
> >>
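A rough sketch of what such a batch run might look like. The run_notebook
function here is a hypothetical stand-in, not an existing IPython API; a real
tool would execute the notebook with the templated variables filled in and
collect its outputs. The point is only that the results end up keyed by the
parameter combination, so they can be explored afterwards without recomputing.

```python
from itertools import product


def run_notebook(name, x, y, algo):
    """Hypothetical stand-in: a real tool would execute the notebook `name`
    with the templated variables set and return its outputs."""
    return {"notebook": name, "x": x, "y": y, "algo": algo}


def run_grid(name, xvars, yvars, algos):
    """Run one notebook for every combination of the templated variables."""
    results = {}
    for x, y, algo in product(xvars, yvars, algos):
        results[(x, y, algo)] = run_notebook(name, x, y, algo)
    return results


results = run_grid("MyCoolCode", xvars=[1, 2], yvars=[10], algos=["knn", "svm"])
```

Looking up results[(2, 10, "svm")] afterwards is the kind of access a slider
or drop-down widget could drive.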
> >
> > This is a very powerful and exciting use-case. In fact it is one I am
> > investigating myself in the context of a different project unrelated to
> > IPython notebook. I call the set of results generated by such repeated
> > runs with different input sets (ie paths through the document) the
> > "robustness set" of the notebook with respect to the particular output
> > variable being investigated.
>
> Yes, this is a sort of batch mode for the notebook.
>
> > The key here is that the robustness we are talking about is not only with
> > respect to data/tuning parameters, but also with respect to the
> > decisions/choices made during the analysis process itself. These
> > decisions are often the difference between valid and invalid conclusions,
> > but are rarely talked about during discussions about reproducible
> > research/science AFAIK (I'd love to be wrong about that, even if it would
> > make me look silly/foolish here).
> >
> > The DAG conceptual model buys us a lot here too though. Instead of
> > having to run the entire notebook, you can calculate all possible paths
> > through the DAG for any arbitrary (connected) starting and ending points.
> > So we can rerun only pieces of large notebooks to investigate any
> > variable/plot regardless of whether it constitutes a final result of the
> > notebook/analysis.
>
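For illustration, enumerating every path between two cells of a DAG can be
sketched in a few lines. The dict-of-successors representation and the cell
names are assumptions made up for this example, not how any existing notebook
stores its structure.

```python
def all_paths(dag, start, end):
    """Enumerate every path from `start` to `end` in a DAG.

    `dag` maps each cell name to the list of cells that may follow it.
    Because the graph is acyclic, the recursion always terminates.
    """
    if start == end:
        return [[end]]
    paths = []
    for successor in dag.get(start, []):
        for tail in all_paths(dag, successor, end):
            paths.append([start] + tail)
    return paths


# Hypothetical notebook: two alternative models between cleaning and plotting.
dag = {
    "load": ["clean"],
    "clean": ["fit_knn", "fit_glm"],
    "fit_knn": ["plot"],
    "fit_glm": ["plot"],
    "plot": ["report"],
}

paths_to_plot = all_paths(dag, "clean", "plot")
```

Each returned path is a linear sequence of cells that could be rerun on its
own, starting and ending wherever the question of interest lies.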
> Yes, this type of analysis could also be done by the JavaScript plugin
> approach above.
>
> >
> >>
> >> We have other people interested in this type of workflow and it can
> >> all be done within the context of our existing linear notebook model.
> >> It is just assembling the existing abstractions in different ways.
> >>
> >
> > That is a plus. There is what I consider to be a pretty major drawback to
> > this approach though.
> >
> > It is easy to see how this would work in the case of variables
> > representing individual number/string/boolean valued parameters without
> > much perturbation of the code.
> >
> > Trying to write an analysis script that can gracefully handle
> > substantially dissimilar analysis methods, on the other hand, is more
> > problematic. We can do it, of course, but at that point we are moving
> > much more into the realm of a program rather than an analysis script.
>
> Yes, definitely.
>
> > Consider the example of classifying new data based on a training set via
> > KNN, SVM, and GLM approaches. These approaches all need different sets of
> > parameters, return different types of objects as the output of the
> > fitting function, may have subtly different behaviour when being used for
> > prediction, etc.
>
> Yep, that is the big challenge with the branching idea in general.  It
> is not always true that the members of the alt sets can be swapped
> out.
>
> > The abstractions necessary to deal with these differences are likely in
> > my opinion to be highly costly in terms of how easy it is for readers of
> > the notebook to follow and understand what the code is doing.
> >
> > With actual branching, the code in each branch is exactly the same as if
> > it were in a normal linear notebook which implemented only that one
> > branch, making it much more likely to be straightforward and easy to
> > read.
>
> But I think the same issue exists with any approach to branching
> right?  I am thinking the scripted notebook could have the same type
> of API - the important point is that the templated variables, while
> simple types, could trigger different code paths.
>
> algo = 0  # a template variable
>
> if algo == 0:
>   pass  # alt-cell #1
> elif algo == 1:
>   pass  # alt-cell #3
> ...
>
> This is not pretty but it would work...
>
> > One of my targeted use-cases is publications which can more accurately
> > convey the research which was done while still being able to offer the
> > clarity of focus of what we do now, so I think that is quite important.
> > YMMV.
> >
> > And now the sticking point.
> >>
> >>
> >> <snip>
> >>
> >> Q: does the new feature violate important abstractions we have in place.
> >>
> >> If the answer is no, then we do our normal job of considering the
> >> costs of adding the feature versus the benefits.
> >>
> >> If the answer is yes, then we *stop*.
> >
> >
> > I really do appreciate the IPython team's position. I think there is some
> > relevant nuance involved in this particular case, however, which makes
> > the "does it change? yes:no" test overly coarse. I attempt to make my
> > case for this below.
> >
> > I think the answer to the questions "does this new feature violate
> > important abstractions?" and "is it impossible/burdensomely difficult to
> > alter important existing abstractions in a way that supports this feature
> > without affecting the uses of the abstraction?" may be different here,
> > despite being the same in the overwhelming majority of cases.  And I
> > would argue the second test offers the same protections as the first
> > against the various pitfalls of making major changes to large projects
> > willy-nilly (which I assure you I do understand are very real).
> >
> > I'm not advocating a dramatic about-face on the issue complete with
> > parade and skywriting that "IPython is pursuing an exciting new thing
> > starting today!". I do, however, think it is perhaps worth consideration
> > at a somewhat narrower and more immediate scale than it would be
> > otherwise.
>
> I hope you can see that I really like the general idea and think the
> usage cases you are describing are really important.  I think I can
> speak for the project in saying that we want the notebook to be useful
> for things like this.  But I think our abstractions are important
> enough that we make every attempt to see how we can do these while
> leveraging our existing abstractions.  This is partially a question
> about implementation, but also partly a question about how the new
> features are thought about.  The reason we don't like to break
> abstractions for new features is that we have found an interesting
> relationship between abstraction breaking and new features.  We have
> found that when a new feature/idea breaks a core abstraction that we
> have thought about very carefully, it is usually because the feature
> has not been fully understood.  Time and time again, we have found
> that when we take the time to fully understand the feature, it usually
> fits within our abstractions beautifully and is even much better than
> we ever imagined it could be.
>
> The plugin idea above is a perfect example of this.  By preserving the
> abstractions, the new feature itself becomes a multiplier of even more
> new functionality:
>
> * The resulting notebooks can still be version controlled.  This means
> that the different alt-cells can be thrown into git and when we develop
> a visual diff tool for notebooks, they will *just work*.
> * The notebooks can immediately leverage the abstractions we have put
> into place for converting notebooks to different formats.  You could
> write custom transformers to present the notebook in a reveal.js
> presentation, giving alt-cells special treatment.
> * All of this can be done, and put into the hands of users, without going
> through those overly conservative IPython developers ;-)
> * It will just work with nbviewer as well.
> * It provides a cleanly abstracted foundation for other people to build
> upon
>
> In summary, we are trying to build an architecture that allows a few
> simple abstractions (we actually don't have that many!) to combine in
> boundless ways to create features we never planned on, but that "just
> work".
>
> The upside of this, is that when we have encountered features that are
> important to us that really do require us to break or re-vision core
> abstractions - we gladly undertake this work.  Mainly because we feel
> that the new abstractions will be even more powerful.
>
> >>
> >> <snip>
> >>
> >>
> >> Thinking about your proposed feature from this perspective: both the
> >> task cells and alt cells introduce hierarchy and nesting into the
> >> notebook.  This breaks our core abstraction that cells are not nested.
> >>  In Jan-Feb our core development team had a discussion about this
> >> abstraction exactly.  We decided that we definitely don't want to move
> >> in the direction of allowing nesting in the notebook.  Because of this
> >> we are in the process of removing the 1 level of nesting our notebook
> >> format currently has, namely worksheets.  So for us, it is not just
> >> about complexity - it is about breaking the abstractions.
> >
> >
> > I do understand this position. I'd like to think I am bringing up points
> > not raised during that meeting, but whether or not that is the case
> > abstractions ARE important.
> >
> > I guess I am/was thinking about the abstraction in place in IPython
> > notebook a bit differently than you are describing it.
> >
> > For the next few paragraphs: process == (render|transform|execute|*)
> >
> > In my mind the abstraction/computational model is that a notebook is an
> > ordered set of cells and to process the notebook you simply go through
> > the cells in order and process them. What process means is dependent on
> > the type of cell, and there are various pieces of code in various places
> > (mostly the frontends and nbconvert AFAIK) that know how to handle each
> > cell type.
> >
> > Under this formulation the change in abstraction is actually pretty
> > small. The only addition is the statement that code which processes cells
> > is responsible for initiating/handling the processing of any child cells
> > those cells contain. The easiest example of this is the execute method on
> > my task cells, which simply loops through each of its children and (if
> > applicable) calls their execute method.
> >
> > With this change we still have a notebook defined as an ordered set of
> > (top level) cells, and we can still process a notebook by stepping
> > through each of them in order and processing that cell.
> >
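The "containers process their own children" rule can be sketched as follows.
The class names Cell and TaskCell and the logging execute method are
illustrative assumptions, not the code in the experimental fork; execute only
records what would be run so the ordering is easy to see.

```python
class Cell:
    """A leaf cell; `execute` records the source in a log instead of running it."""

    def __init__(self, source):
        self.source = source
        self.parent = None  # set when the cell is added to a container

    def execute(self, log):
        log.append(self.source)


class TaskCell(Cell):
    """A container cell that is responsible for executing its own children."""

    def __init__(self, children):
        super().__init__(source=None)
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def execute(self, log):
        for child in self.children:
            child.execute(log)


def execute_notebook(top_level_cells):
    """The top-level loop is unchanged: step through the cells in order."""
    log = []
    for cell in top_level_cells:
        cell.execute(log)
    return log


notebook = [
    Cell("import data"),
    TaskCell([Cell("clean data"), TaskCell([Cell("fit model")]), Cell("plot fit")]),
    Cell("write report"),
]
order = execute_notebook(notebook)
```

Note that execute_notebook never inspects nesting itself; it only sees the
three top-level cells, and the task cells handle everything below them.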
> > Some changes to the concept of next/previous cells and cell position (for
> > positional insertion, etc) were required and cells must be aware of their
> > direct parent (which will be either a cell or the notebook itself), but I
> > would argue these aren't actually important attributes of the abstraction
> > itself and the changes were actually fairly narrow and (AFAICS) pretty
> > painless and straightforward after some careful thought/planning.
>
> This is an interesting way of thinking about nesting that I had not
> thought about.
>
> >
> >>
> >> The reason that these abstractions are so important is that they
> >> provide powerful foundations for us to build on.  One place the
> >> "notebook as a linear sequence of cells" abstraction comes into play is
> >> in our work on nbconvert that will appear in 1.0 in the next few
> >> weeks.  This allows us to convert notebooks very easily to a number of
> >> different formats.
> >
> >
> > I haven't tackled nbconvert yet on my experimental fork, but I fully
> > intend to as I agree entirely that the ability to generate things like
> > linear pdfs and other static views is utterly crucial. The fact that a
> > notebook with branches can generate a pdf that looks like it came from a
> > linear notebook (ie the "static article" view) is a *major* selling
> > point/core feature of what I'm trying to do with branching notebooks. It
> > is key that people be able to meet the needs they are meeting now; if we
> > can't, meeting the more nebulous needs they aren't isn't likely to save
> > us (me) from irrelevance.
> >
> > Under my alternate description of the computational model described above
> > nbconvert will behave pretty much as it always has: step through the
> > notebook and process the cells in order into whatever format you are
> > targeting. The one exception is the cells processing their children, but
> > the scale of this change is not particularly large for the specific types
> > of nesting I'm going for.
> >
> > Tasks would likely simply render their children without making any mark
> > themselves in most cases, while altsets would do the same, but only for
> > the "active" branch. This involves a bit of looping and a bunch of calls
> > to existing code that knows how to transform the existing (core) cell
> > types, but really not much else.
> >
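As a sketch of that conversion step: the nested structure can be flattened
into the linear sequence a converter already expects, keeping only the active
branch of each altset. The dict layout and the keys "children", "branches",
and "active" are hypothetical, and to_text is a toy stand-in for a real
converter, not nbconvert's API.

```python
def flatten(cells):
    """Return the leaf cells of a nested notebook in reading order.

    Task containers contribute all of their children; altset containers
    contribute only the branch named by their `active` key.
    """
    linear = []
    for cell in cells:
        kind = cell["cell_type"]
        if kind == "task":
            linear.extend(flatten(cell["children"]))
        elif kind == "altset":
            linear.extend(flatten(cell["branches"][cell["active"]]))
        else:
            linear.append(cell)  # core cell types pass through untouched
    return linear


def to_text(cells):
    """Toy converter: emit the source of each leaf cell, one per line."""
    return "\n".join(cell["source"] for cell in flatten(cells))


nested = [
    {"cell_type": "markdown", "source": "# Analysis"},
    {"cell_type": "task", "children": [
        {"cell_type": "code", "source": "data = load()"},
        {"cell_type": "altset", "active": "glm", "branches": {
            "knn": [{"cell_type": "code", "source": "fit = knn(data)"}],
            "glm": [{"cell_type": "code", "source": "fit = glm(data)"}],
        }},
    ]},
    {"cell_type": "code", "source": "plot(fit)"},
]

article = to_text(nested)
```

The containers leave no mark in the output, so the result reads like the
"static article" view of a linear notebook.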
> >
> >
> >>
> >>  The other place this abstraction comes into play
> >> is in our keyboard shortcuts.  We are striving for the notebook to be
> >> usable for people who don't touch the mouse (your traditional vi/emacs
> >> users).  Nesting makes that very difficult.
> >
> >
> > I admit this one is tougher, though I've done some small amount of
> > thinking about it (currently hitting "down" on a container cell enters it
> > while hitting down on the last cell in a container navigates to the cell
> > "after" the container in my PoC).
> >
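The "down" behaviour described above can be sketched as a small function.
Representing containers as nested lists and cell positions as tuples of
indices is an assumption made for this example only; the PoC may track
position quite differently.

```python
def next_cell(notebook, path):
    """Return the path of the cell reached by pressing "down" at `path`.

    A path is a tuple of indices, e.g. (1, 0) is the first child of the
    second top-level cell. Containers are plain lists; leaves are strings.
    Returns None at the end of the notebook.
    """
    def node_at(p):
        node = notebook
        for index in p:
            node = node[index]
        return node

    current = node_at(path)
    if isinstance(current, list) and current:
        return path + (0,)  # "down" on a container enters it
    # Otherwise move to the next sibling, climbing out of finished containers.
    while path:
        parent, index = path[:-1], path[-1]
        if index + 1 < len(node_at(parent)):
            return parent + (index + 1,)
        path = parent
    return None


# Hypothetical notebook: cell "a", a container holding "b" and "c", then "d".
example = ["a", ["b", "c"], "d"]
```

With this layout, pressing "down" on the last cell inside the container
lands on "d", the cell after the container, without any mouse interaction.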
> > I think this is surmountable though, and worth the effort if it were the
> > only thing holding IPython notebook back from offering a
> > change/alternative/"fix" to how we talk about research and what we can do
> > with the documents we use to describe it.
> >
> > Wow that was a lot of text. Thanks for making it all the way to the end!
>
> I made it!
>
> Cheers,
>
> Brian
>
> > ~G
> >
> >
> >> <snip>
> >>
> >
> >
> > --
> > Gabriel Becker
> > Graduate Student
> > Statistics Department
> > University of California, Davis
> >
> > _______________________________________________
> > IPython-dev mailing list
> > IPython-dev at scipy.org
> > http://mail.scipy.org/mailman/listinfo/ipython-dev
> >
>
>
>
> --
> Brian E. Granger
> Cal Poly State University, San Luis Obispo
> bgranger at calpoly.edu and ellisonbg at gmail.com
>



-- 
Gabriel Becker
Graduate Student
Statistics Department
University of California, Davis