[IPython-dev] Some new cell types for describing data analyses in IPy. Notebook

Brian Granger ellisonbg at gmail.com
Tue Jul 9 22:32:47 EDT 2013


Gabriel,

> Thank you for taking the time to watch my video and think about the ideas
> I'm presenting. It is appreciated.

Cool, I enjoyed it.  Fantastic discussion!

> I am cognizant of the added complexity I am talking about. I disagree a bit
> with the size/damage being attributed to it, but the main place we seem to
> be on different pages is regarding what it buys us.

I half agree with you.  Obviously, every feature we add might add
complexity to the code base.  At the same time, we are always striving
to reduce complexity with better design as we add new
features.  Here is one of the ways we think about new features:

Q: does the new feature violate important abstractions we have in place?

If the answer is no, then we do our normal job of considering the
costs of adding the feature versus the benefits.

If the answer is yes, then we *stop*.  The cost of changing one of our
core abstractions happens at an entirely different level.  It is the
type of thing we would think about for a very long time and have lots
of conversations with lots of different people.  We would possibly
even seek out major (>$100k) funding for the effort.  For these
discussions, we think about the bigger picture outside of the context
of that one feature.  If we decide to make a major change to one of
our core abstractions, we would probably plan anywhere from 3 months to 2
years in advance.  Here is the roadmap for IPython for this type of
work through the end of 2014:

https://github.com/ipython/ipython/wiki/Roadmap:-IPython

Thinking about your proposed feature from this perspective: both the
task cells and alt cells introduce hierarchy and nesting into the
notebook.  This breaks our core abstraction that cells are not nested.
In Jan-Feb our core development team had a discussion about exactly this
abstraction.  We decided that we definitely don't want to move
in the direction of allowing nesting in the notebook.  Because of this
we are in the process of removing the one level of nesting our notebook
format currently has, namely worksheets.  So for us, it is not just
about complexity - it is about breaking the abstractions.

The reason that these abstractions are so important is that they
provide powerful foundations for us to build on.  One place the
"notebook as a linear sequence of cells" abstraction comes into play is
in our work on nbconvert, which will appear in 1.0 in the next few
weeks.  This allows us to convert notebooks very easily to a number of
different formats.  The other place this abstraction comes into play
is in our keyboard shortcuts.  We are striving for the notebook to be
usable for people who don't touch the mouse (your traditional vi/emacs
users).  Nesting makes that very difficult.

Before you get too discouraged, please read on :-)

> The alt cells I'm talking about are not a cool rendering/interaction trick
> for linear notebooks. If they were I would absolutely agree that the nesting
> is overkill and not worth its cost.
>
> Linear/sequential notebooks and other dynamic document systems reproduce
> computational results. The documents my advisors (Duncan Temple Lang and
> Deborah Nolan) and I are working towards aim to describe and reproduce the
> research itself.

I appreciate your sharing more of the context.

> I once asked a room full of quants to raise their hands if their standard
> operating procedure was to read in the data, maybe clean it a bit, fit a
> single model, write up the results and be done. Not only did no one raise
> their hand, but the mere suggestion that that could be how it worked got a
> substantial laugh from the audience. Even though this is not how the work is
> done, it is the narrative encoded into a linear notebook.

Here is my experience of this.  I start out working in a very
non-linear manner.  As I work I discover things and change my code.
As I approach the point where I want to share my work, I start to
linearize it; otherwise it is very difficult for someone to take in.
In this context branching can be done, but it has to be explicit.  In my
experience this is good.  If I want to run the analysis using 3
different algorithms, I run them in sequence and then show the results
of all three in the same place and draw conclusions.  All of this is
done - at the end of the day - in a linear notebook.
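
To make that concrete, here is a minimal sketch of what that explicit,
linear branching can look like in a single code cell.  The data and the
algo_* functions are trivial stand-ins, not anything from the notebook
machinery itself:

data = [1.0, 2.0, 3.0, 4.0, 5.0]

def algo_a(xs):
    # plain mean
    return sum(xs) / len(xs)

def algo_b(xs):
    # trimmed mean: drop the smallest and largest value
    trimmed = sorted(xs)[1:-1]
    return sum(trimmed) / len(trimmed)

def algo_c(xs):
    # midrange
    return (min(xs) + max(xs)) / 2.0

# run every alternative in sequence and keep the results side by side
results = {}
for name, algo in [('algo_a', algo_a), ('algo_b', algo_b), ('algo_c', algo_c)]:
    results[name] = algo(data)

# a later cell can then compare all three in the same place
for name, value in sorted(results.items()):
    print("%s: %s" % (name, value))

The branching is still there, it is just explicit and captured in a
single linear pass through the notebook.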

BUT, I completely agree that the notebook does not handle certain
types of branching very well.  Where the notebook starts to really
suck is for longer analyses that you want to repeat for differing
parameters or algorithms.  You talk more about this usage case below
and we have started to think about how we would handle this.  Here are
our current thoughts:

It would be nice to write a long notebook and then add metadata to the
notebook that indicates that some variables are to be treated as
"templated" variables.  Then we would create tools that would enable a
user to run a notebook over a range of templates:

for x in xvars:
    for y in yvars:
        for algo in myalgos:
            run_notebook('MyCoolCode', x, y, algo)

The result would be **something** that allows the user to explore the
parameter space represented.  A single notebook would be used as the
"source" for this analysis and the result would be the set of all
paths through the notebook.  We have even thought about using our
soon-to-be-designed interactive widget architecture to enable the
results to be explored using different UI controls (sliders, etc) for
the xvar, yvar, algos.  This way you could somehow "load" the
resulting analysis into another notebook and explore things
interactively - with all of the computations already done.
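
To be concrete about the mechanics, here is a rough sketch of the
driver side of that idea.  It only uses the standard library;
execute_headless() is a placeholder for whatever headless runner ends
up being used, and the keys in the cell dict are written from memory
of the current (v3) .ipynb layout, so treat them as approximate:

import itertools
import json

def parametrize_notebook(path, out_path, **params):
    # copy a notebook, prepending a code cell that binds the "templated"
    # variables to concrete values for this run
    with open(path) as f:
        nb = json.load(f)

    source = "\n".join("%s = %r" % (k, v) for k, v in sorted(params.items()))
    param_cell = {
        "cell_type": "code",
        "input": source,          # v3 code cells keep their code in "input"
        "language": "python",
        "outputs": [],
        "collapsed": False,
        "metadata": {"templated": True},
    }
    # v3 notebooks store their cells inside a single worksheet
    nb["worksheets"][0]["cells"].insert(0, param_cell)

    with open(out_path, "w") as f:
        json.dump(nb, f, indent=1)
    return out_path

# sweep the whole parameter space from a single "source" notebook
xvars, yvars, myalgos = [0.1, 0.5], [10, 20], ["ols", "ridge"]
for x, y, algo in itertools.product(xvars, yvars, myalgos):
    out = "MyCoolCode_x%s_y%s_%s.ipynb" % (x, y, algo)
    parametrize_notebook("MyCoolCode.ipynb", out, x=x, y=y, algo=algo)
    # execute_headless(out)  # placeholder: run the copy and collect outputs

The pile of output notebooks is then the "set of all paths" described
above, and a widget front end could index into it by (x, y, algo).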

We have other people interested in this type of workflow and it can
all be done within the context of our existing linear notebook model.
It is just assembling the existing abstractions in different ways.

> Scientists and data analysts are already doing branching in their work, but
> they don't have any good tools to record or describe it that way. So they
> comment out large blocks of code in linear scripts, or they save old
> attempts and alternate approaches in separate files, or they (hopefully not)
> simply delete or overwrite old parameter configurations with new ones when
> they decide the old one wasn't right. Not because they think these are good
> ideas, but because it's all they have available and they are
> scientists/analysts.

I completely agree with this.

> Their job is to analyze data, extract insight and share their work in a
> useful manner; our job is to conceive and implement the tools they need to
> do that. They are good at their job, but we are struggling with ours.
>
> In my opinion, the single most important feature on display in my video is
> not that the alternative cells are rendered side-by-side, or that they can
> be executed and the software knows what that means in terms of executing
> their content; it is that the IPython notebook has become an authoring tool
> which analysts can use to easily create documents which describe what they
> actually did in a way that is reproducible, distributable, and most
> importantly, useful to the analyst.

I completely agree with this - I just think it can be done within our
existing abstractions - or at least I want to see us fail before
breaking the abstractions.

> Suddenly they don't need to comment out or overwrite code when they take a
> different approach. Suddenly they can distribute a document that actually
> describes what they did, instead of just what they found. Suddenly referees
> don't need to wonder or ask about whether an alternative analysis method was
> investigated. Suddenly professors can show statistics students what
> statisticians actually do, instead of how final results are generated.

Amen!

> You may read all this and think "that is great but way out of scope for what
> we want to achieve with IPython notebook". That is your right. And if the
> IPython team feels that way there is, of course, nothing I can do other than
> what I have done: explore these ideas on my own. I just want to be sure
> we're all talking about the same thing before that decision gets made.

No, I don't think the usage case is out of scope.  I hope I have
convinced you that we are already thinking along these lines.  The
reason that IPython exists is because we are all working scientists
who have used crappy tools for years.  I wrote my entire Ph.D. thesis
and postdoc codes using a rat's nest of C++, perl, bash and Makefiles.
It was even *advanced* for the day - I used version control and had
some tests.  But if I had to reproduce this research today, I would
start from scratch.  This wasn't my fault per se - it was the fact
that my tools didn't accurately express the reality of the
abstractions in my workflow.  But you know all of this...

Here is a question: are you and your advisors interested in meeting
with us at Berkeley and talking more about these things - with the
understanding that we are very much interested in the usage case you
describe - but probably not the nested implementation at this point?
Fernando is at Berkeley with much of the IPython dev team and I am 3
hours south at Cal Poly with another part of the team.  I don't think
it will be difficult to write prototypes of these capabilities using
the linear notebook; we just have to think about what the user
experience would look like.  We could probably do this after the
summer madness ends (probably Sept, Oct).

Even if you are not interested in talking with us in person, I hope
you are still willing to continue the discussion. The branching usage
case is very important to us and will be a part of our future work.
It really helps to think more about the design and user experience
questions.

Cheers,

Brian



> Detailed responses below.
>
>
> On Wed, Jul 3, 2013 at 6:59 PM, Brian Granger <ellisonbg at gmail.com> wrote:
>>
>> Gabriel,
>>
>> I watched your video and there are some nice ideas here.  We are not
>> headed in this direction in terms of *implementation* but I think you
>> will find that similar *capabilities* will show up in the notebook
>> over time.  A few comments about the implementation aspects:
>>
>> First, the benefits of having a notebook be a linear sequence of cells
>> are massive:
>>
>> * Simple, simple, simple - this makes it very easy to reason about
>> notebooks in code.  Nesting leads to complexity that is not worth the
>> cost.
>
>
> It does lead to complexity. Whether it is worth the cost depends on what we
> gain. In my opinion we gain a massive amount, as I tried to describe
> above.
>
>>
>> * You can get most of the benefits of nesting without the complexity.
>> As Min mentioned, there is an implied hierarchy in the heading cells.
>> We plan on using that to allow group level actions - show/hide, run
>> group, move, cut/copy/paste, etc.
>
>
> With respect, I don't know that you will get the actual benefits I am aiming
> for this way. Branches (and thus the code contained in them) in this context
> are mutually exclusive.  A document which has branching that leads back into
> a single computation, such as the example in my video/screenshots, no longer
> makes sense to think about or execute as a sequential set of code blocks.
>
> It is not that we can view it as a branching structure and only execute one
> branch at a time, but that we must. Again, people do this already via
> deletion or commenting because all the code for all the branches cannot live
> in the same block of code. I seek to give them a tool to do it that is not
> damaging to the record or reproducibility of what they did.
>
> If you are familiar with R code and look carefully at the code for the data
> cleaning branches in my example, you will see that if we put all of that
> code into a sequential notebook, with heading cells to differentiate the
> branches, and ran the code start to finish it would generate a plot not
> created by any of the 18 paths through the structured document. In other
> words, it generates the wrong plot regardless of what the "right" set of
> choices at the branching points is in that execution context.
>
>>
>> * It is not difficult to think about building a proper diff tool for
>> notebooks.  With nested cells this becomes horrific.
>
>
> This does become much more difficult, but there are still things which can
> be done. Archambault and others have done work on differencing graphs under
> the assumption that one of them was modified to create the other (see here and
> here for quickly retrieved examples).
>
> Thus we can combine graph differencing with differencing of the actual code,
> e.g. a combination of difference maps and normal code diffing.
>
> This would require unique identifiers for all notebook cells, but from
> something said previously in this thread I gather that is likely to happen
> anyway.
>
>>
>> * Hierarchy puts a significant cognitive load on users
>
>
> Maybe. Again, this is something people are already implicitly doing. It is
> possible that users would have trouble thinking about what they are doing
> more explicitly in these terms, but it's also entirely possible that this
> will actually simplify their thinking process.
>
> I admittedly think about these types of documents a lot, but I was
> pleasantly surprised at how easy it was to construct and interact with the
> example I made for the video once the UI hooks were in place. Constructing
> such documents by hand, which I have done a bit, IS quite unpleasant, but
> that is one of the issues I seek to address.
>
> Nested branching and other more complex structures are more difficult to
> think about, but users would only create those if they had a really good
> reason to do so (i.e. if they were already thinking in those terms) and just a
> couple levels of branching buy us an enormous amount of fidelity when
> describing data analyses.
>
>
>>
>>
>> Because of these things we don't have any plans on changing the
>> notebook document format or notebook UI to allow nested cells.
>>
>
> As I said above, it is absolutely the right of the IPython core team to make
> that call. I just want to make sure we are all talking about the same costs
> and the same gains when the cost-benefit analysis to make that decision is
> performed.
>
> I intend to continue pursuing these concepts. I would be thrilled to
> collaborate with the IPython team if you guys decide it is something you are
> interested in, but I will understand and carry on myself if you do not.
>
> Thanks for reading and for the discussion. It's great to talk to people with
> differing opinions about this instead of sitting alone in my thesis-filled
> bubble stewing :).
>
> ~G
>
>
>> Second, while it is tempting to generalize the notion of input to
>> include widgety things, it is more appropriate to put these things in
>> the output:
>>
>> * Our output architecture has the notion of multiple representations.
>> This allows us to build rich widgets as you have done, but to still
>> provide static representations (png, jpg, latex).
>> * Having the multiple representations of output allows us to build the
>> rich widgets, but maintain a clear path for converting notebooks to
>> static formats (pdf, html, word, powerpoint).
>> * Insisting that input cells are pure code allows you to reason in a
>> clear manner about how a notebook works: code runs and leads to
>> output.  That reasoning can be applied in an automated manner by
>> running notebooks in batch mode, or building a test system based on
>> them.
>> * Putting widgets in the input area forces you to do regular
>> expression matching to replace those variables in the code.  This
>> limits you to an extremely simple event model where the only possible
>> event you can know about is substitute the regular expression and run
>> all the code.  What if you want different UI controls in the browser
>> to trigger different bits of code in the kernels when different
>> fine-grained events happen?  Making the UI controls live on the Python and
>> JS side allows us to build this in a natural way.
>>
>> The alt-cells you show bring up the issue of provenance.  We have some
>> very initial thoughts about that, but it is way out of scope for the
>> project right now - our plates are 10x overfull already.  We will
>> get there though eventually.
>>
>> Thanks for sharing your ideas.
>>
>> PS - for a bit more background about the context of our saying "no" to
>> this feature request, see this blog post:
>>
>> http://brianegranger.com/?p=249
>>
>> I also gave a talk about this at SciPy and will be posting my slides soon.
>>
>> Cheers,
>>
>> Brian
>>
>> On Wed, Jul 3, 2013 at 6:04 PM, Gabriel Becker <gmbecker at ucdavis.edu>
>> wrote:
>> > Matthias,
>> >
>> > Thanks for your detailed response.
>> >
>> >
>> > On Wed, Jul 3, 2013 at 1:25 AM, Matthias BUSSONNIER
>> > <bussonniermatthias at gmail.com> wrote:
>> >>
>> >> Gabriel,
>> >>
>> >> Your screenshots are interesting.
>> >> At some point I played with gridster[1]
>> >>
>> >> and was more or less able to get cells to rearrange, but didn't keep
>> >> the
>> >> code.
>> >> You might be interested.
>> >>
>> >> Keep in mind that the notebook browser-interface we ship is only one
>> >> possible
>> >> frontend that can interpret ipynb files; nothing prevents you from writing a
>> >> different frontend that displays the notebook in a different format.
>> >>
>> >> Add to this the fact that each cell can support arbitrary metadata, and
>> >> you
>> >> should be able to arrange preexisting cells in structures that work together.
>> >> It
>> >> might
>> >> be a little difficult to do it right now as our javascript is not yet
>> >> modular
>> >> enough to be easily reused, but we are moving toward it.
>> >
>> >
>> > Respectfully, rolling my own frontend for ipynb files given all the work
>> > the
>> > IPython team has done on the excellent notebook browser interface would
>> > be
>> > an enormous and extremely wasteful duplication of effort. I don't think
>> > it's
>> > the right way to pursue these features.
>> >
>> > Furthermore, if I were going to write an application offering the types
>> > of
>> > features I am talking about from scratch, there wouldn't be any good
>> > reason
>> > to base it on the unaltered ipynb format, as it doesn't easily support
>> > the
>> > structure required by those features without the additional cell types I
>> > implemented in my forked version.
>> >
>> >>
>> >> Right now I think storing the notebook as a directed graph is
>> >> problematic
>> >> in a
>> >> few ways,
>> >
>> >
>> > I'm not talking about storing the notebook as an actual directed graph
>> > data
>> > structure. There would be benefits to that but it's not necessary and it
>> > isn't what I did in my forked version.
>> >
>> > The ability to have nested cells (cells which contain other cells) gets
>> > us
>> > everything we need structure-wise, and is the basis of everything seen
>> > in
>> > both the video (other than interactive code cell stuff) and screenshots
>> > I
>> > posted. The ipynb file for the notebook pictured in the screenshot looks
>> > exactly like a normal ipynb file except that in the json there are cell
>> > declarations which have a cells field which contains the json
>> > descriptions
>> > of the cells contained in that cell.
>> >
>> >
>> >>
>> >> the first being that it is incompatible with the fact that people want
>> >> to be able to run notebooks in a headless manner, which if you add
>> >> explicit
>> >> choice is not possible.
>> >
>> >
>> > This isn't the case. The json saved versions of notebooks with branching
>> > remember which version was most recently run. When an altset cell is
>> > executed, it runs only the most recently run (or currently "selected",
>> > though that means something else internally) branch. Thus by doing the
>> > naive
>> > thing and looping through all top level cells and executing them, the
>> > currently chosen path through the notebook can easily be run in a
>> > headless
>> > environment and give the correct results.
>> >
>> >>
>> >> This also contradicts the fact that the notebook captures
>> >> both the input and the output of the computation.
>> >
>> >
>> > I don't really understand what you mean by this. In the JSON
>> > representation
>> > of an executed code cell, the input field contains the code, but not any
>> > values of variables used by the code, nor any indication of code which
>> > was
>> > run before executing the code cell.
>> >
>> > Changing and rerunning an earlier code cell without re-executing the
>> > cell in
>> > question can easily invalidate the output stored in the JSON, even
>> > without
>> > the concept of branching or choice.
>> >
>> >
>> >>
>> >> As you showed there are
>> >> actually 18 different combinations of data analysis, and they are not
>> >> all
>> >> stored in the notebook.
>> >
>> >
>> > The notebook knows and records which choices were made. There are 18
>> > different combinations of data analysis but only one was chosen by the
>> > analyst
>> > as generating the final/most recent result.
>> >
>> > In the case of "publishing" about an analysis the notebook stores the
>> > path
>> > most recently chosen by the analyst, while retaining information about what else
>> > he
>> > or she did during the decision process.
>> >
>> > In the case of instruction, imagine how much easier it would be to teach
>> > data analysis if the students could actually see what data analysts do,
>> > instead of simply the final method they choose in a particular analysis.
>> >
>> >
>> >>
>> >>
>> >> I really think this is an interesting project, and reusing only our
>> >> metadata in
>> >> the notebook, you should be able to simulate it (store the DAG in
>> >> notebook-level metadata, and cell IDs in cell metadata), then reconstruct
>> >> the graph
>> >> when
>> >> needed. Keep in mind that at some point we might/will add cell IDs to
>> >> the
>> >> notebook.
>> >>
>> >> To sum up, I don't think the current JS client, in its current state, is
>> >> the
>> >> place to implement such an idea. A DAG for cell order might be an
>> >> idea
>> >> for a
>> >> future notebook format, but it needs to be well thought out and to wait
>> >> for cell IDs.
>> >
>> >
>> > I apologize for not being clear. As I said in a response above, the
>> > directed
>> > graph idea was intended to be conceptual for thinking about the
>> > documents,
>> > not structural for actually storing them.
>> >
>> > What I actually did was simply allow cell nesting and change indexing so
>> > that it is with respect to the parent/container (cell or notebook)
>> > instead
>> > of always with respect to the notebook. This required some machinery
>> > changes,
>> > but not too many, and it is an extension in the mathematical sense in
>> > that
>> > indexing behaves identically to the old system for notebooks without
>> > any
>> > nesting while now functioning meaningfully for notebooks with nesting.
>> >
>> > ~G
>> >>
>> >>
>> >> --
>> >> Matthias
>> >>
>> >>
>> >>
>> >> [1] http://gridster.net/
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> > Gabriel Becker
>> > Graduate Student
>> > Statistics Department
>> > University of California, Davis
>> >
>> >
>>
>>
>>
>> --
>> Brian E. Granger
>> Cal Poly State University, San Luis Obispo
>> bgranger at calpoly.edu and ellisonbg at gmail.com
>
>
>
>
> --
> Gabriel Becker
> Graduate Student
> Statistics Department
> University of California, Davis
>
>



--
Brian E. Granger
Cal Poly State University, San Luis Obispo
bgranger at calpoly.edu and ellisonbg at gmail.com


