[IPython-dev] Some new cell types for describing data analyses in IPy. Notebook

Brian Granger ellisonbg at gmail.com
Mon Oct 7 14:36:15 EDT 2013


I think we are pushing the limits of email on this discussion.  I
think it would be great to continue the discussion in person or on our
Google Hangouts, as Fernando mentions below.

> sorry to have been silent, but everyone else is doing a great job on this
> discussion...
> I just wanted to say that we'd love to talk to you at Berkeley, but I'm
> leaving town tonight for a couple weeks, so it won't work until late October
> or more likely November.  But in Nov. I'm giving a talk at Davis, in J.
> Eisen's group. Perhaps at least you and I could meet for coffee while I'm
> there and cover some ground.

That would be a great start to the in person discussions...

> Another alternative for a higher-bandwidth technical discussion is to
> schedule a slot into one of our public dev meetings on Thursdays. This week
> we had Peter Krautzberger, from MathJax, join us and it was very useful.
> That will decouple us from finding a time when everyone can meet in
> Berkeley, and more importantly, will allow others who can't make it in
> person to also follow the discussion.

Let us know if/when you can join us on this.



> Cheers,
> f
> On Sun, Oct 6, 2013 at 4:39 PM, Gabriel Becker <gmbecker at ucdavis.edu> wrote:
>> Hey Brian et al,
>> Just checking in to see if you and/or other team members are still
>> interested in meeting in person and chatting about some of the ideas we had
>> been discussing in this thread.
>> Happy to also continue the conversation here in the meantime.
>> ~G
>> On Tue, Sep 10, 2013 at 6:32 PM, Gabriel Becker <gmbecker at ucdavis.edu>
>> wrote:
>>> Brian et al,
>>> Brian I hope your move/travel/etc was as pleasant as such things can be.
>>> On Fri, Jul 12, 2013 at 9:21 AM, Brian Granger <ellisonbg at gmail.com>
>>> wrote:
>>>> Gabriel,
>>>> <snip>
>>>> Great, let's talk in Sept. to figure out a time that would work.
>>> I'm still quite interested in meeting with you guys. Somewhere near the
>>> end of the month would be best for me, but I'm pretty flexible.
>>>> <snip>
>>>> > Branching/DAG notebooks allow a single document to encompass the
>>>> > research
>>>> > you did, while providing easy access to various views corresponding to
>>>> > the
>>>> > generation of intermediate, alternative, and final results.
>>>> >
>>>> > These more complex notebooks allow the viewer to ask and answer
>>>> > important
>>>> > questions such as "What else did (s)he try here?" and potentially even
>>>> > "Why
>>>> > did (s)he choose this particular analysis strategy?". These questions
>>>> > can be
>>>> > answered in the text or external supplementary materials in a linear
>>>> > notebook, but this is a significant barrier to reproducibility of the
>>>> > research process (as opposed to the analysis results).
>>>> I can see that, however, I think the pure alt cells lack a critical
>>>> feature.  They treat all branches as being equally important.  In
>>>> reality, the branch that is chosen as the "best" one will likely
>>>> require further analysis and discussion that the other branches
>>>> don't.  Putting the different branches side by side makes it a little
>>>> like "choose your own adventure" - when in reality, the author of the
>>>> research wants to steer the reader along a very particular path.  The
>>>> alternative paths may be useful to have around, but they should not be
>>>> given equal weight as the "best" one.  But, maybe it is just
>>>> presentation and can be accounted for in descriptive text.
>>> This is very true. My current thinking calls for both a "default"
>>> designation and a "most recently selected/run" designation, which I believe
>>> deals with the valid concern you raise above.
>>> There are also other important designations for "branch types". The most
>>> notable/easily explained of these is the concept of a "terminal" branch,
>>> which is a branch that records important computations (and prose), and which
>>> a viewer of the notebook  (be it the original author, a reviewer, a student,
>>> or someone looking to extend the work) may want to look at or run, but whose
>>> output is not compatible with the subsequent computations. This arises most
>>> commonly when one analysis strategy is implemented and pursued, but
>>> ultimately abandoned  (hopefully for good reasons, and with this we can
>>> check!) in favor of a different final strategy which produces incompatible
>>> output. The subsequent code then makes assumptions about the output which
>>> are compatible with the final strategy computations, but not the original
>>> strategy ones. A way to gracefully deal with this case is important for any
>>> document/processing/rendering system attempting to pursue these concepts.
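
A minimal sketch of the branch designations described above ("default", "most recently selected/run", and "terminal"). The class and field names (Branch, AltSet, last_selected, can_continue) are hypothetical and chosen only for illustration; they are not taken from the fork discussed in this thread.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Branch:
    title: str
    cells: List[str] = field(default_factory=list)
    default: bool = False   # the author's chosen "best" path
    terminal: bool = False  # output is incompatible with later cells


@dataclass
class AltSet:
    branches: List[Branch]
    last_selected: Optional[int] = None  # index of most recently run branch

    def active(self) -> Branch:
        """Return the most recently selected branch, else the default one."""
        if self.last_selected is not None:
            return self.branches[self.last_selected]
        for branch in self.branches:
            if branch.default:
                return branch
        raise ValueError("alt-set has no default branch")

    def can_continue(self) -> bool:
        """Later cells may run only if the active branch is not terminal."""
        return not self.active().terminal


altset = AltSet([
    Branch("abandoned strategy", ["fit_old()"], terminal=True),
    Branch("final strategy", ["fit_new()"], default=True),
])
print(altset.active().title, altset.can_continue())

altset.last_selected = 0  # the reader chooses to inspect the abandoned branch
print(altset.active().title, altset.can_continue())
```

With a flag like this, a renderer or executor could refuse to run the cells that follow a terminal branch rather than fail somewhere downstream.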
>>> There are other cases that arise with these documents, but I will omit a
>>> detailed discussion of them and what I think should be done to support them
>>> here, as that would make this mail burdensomely long and it is not my
>>> primary message.
>>> I will note, though, that while I agree that the final/core/whathaveyou
>>> and secondary/informative/archival branches should not be indistinguishable,
>>> it is important for my usecase that they be easily accessible whenever
>>> the reader wants them, in both interactive (notebook) and headless
>>> (nbconvert) modes.
>>>> <snip>
>>>> > As a practical/UI standpoint unselected branches can be hidden almost
>>>> > entirely (in theory, not currently in my PoC :p), resulting in a view
>>>> > equivalent to the only view offered by a linear notebook. This
>>>> > means
>>>> > that from a viewer (and author since a straight line IS a DAG and
>>>> > nesting
>>>> > isn't forced) standpoint, what I'm describing is in essence a strict
>>>> > extension of what the notebook does now, rather than a change.
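
The claim that a linear notebook is a special case of a nested one can be sketched as follows. The representation is an assumption made for illustration only: an ordinary cell is a string, and an alt-set is a dict with hypothetical keys "alts" (a list of branches) and "selected" (the index of the branch currently shown).

```python
def linear_view(node):
    """Flatten one node into the cells a reader would currently see.

    A node is either a string (an ordinary cell) or a dict of the form
    {"alts": [branch, ...], "selected": i}, where each branch is a list
    of nodes. Unselected branches are hidden entirely.
    """
    if isinstance(node, str):
        return [node]
    cells = []
    for child in node["alts"][node["selected"]]:
        cells.extend(linear_view(child))
    return cells


def notebook_view(nodes):
    """Flatten a whole notebook (a list of nodes) into a linear cell list."""
    cells = []
    for node in nodes:
        cells.extend(linear_view(node))
    return cells


# A purely linear notebook has no alt-sets, so its view is itself.
linear = ["load", "clean", "plot"]

# Branching two levels deep needs no extra bookkeeping in this form.
nested = [
    "load",
    {"selected": 1, "alts": [
        ["trim 1%"],
        ["trim 5%", {"selected": 0, "alts": [["fit knn"], ["fit svm"]]}],
    ]},
    "plot",
]

print(notebook_view(linear))
print(notebook_view(nested))
```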
>>>> I would be *more* interested in alt-cell approaches that present the
>>>> notebook as a linear entity in all cases, but that has the alt-cell
>>>> logic underneath.  For example, what about the following:
>>>> * A user writes the different N alt cells in linear sequence
>>>> * The result is a purely linear notebook where one of the N cells should
>>>> be run.
>>>> * We write a JavaScript plugin for the notebook that does a couple of
>>>> things:
>>>> 1. It provides a cell toolbar for marking those cells as members of an
>>>> alt-set.  This would simply modify the cell level metadata and allow
>>>> the author to provide titles of each alt-member.
>>> What about branching that is 2 or more levels deep? That happens
>>> naturally with my approach but sounds difficult/annoying to keep track of in
>>> the one you are describing.
>>>> 2. It provides the logic for building a UI for viewing one of the
>>>> alt-set members at a time.  It could be as simple as injecting a drop
>>>> down menu that shows one and hides the rest.
>>> I have an ugly but functional version of this now in my implementation.
>>>> * This plugin could simply walk the notebook cells and find all the
>>>> alt-cell sets and build this supplementary UI.
>>>> * This plugin could also have settings that allow the author to select
>>>> the "best" member of the alt-set.
>>>> * nbconvert Transformers could use the cell level metadata to export
>>>> the notebook in different formats.
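
The metadata-based proposal above can be sketched in a few lines. This is an illustration under stated assumptions, not the plugin or an nbconvert transformer: the notebook is simplified to a dict holding a flat list of cells, and the metadata key "alt_set" with its fields "id", "title" and "best" is hypothetical.

```python
def find_alt_sets(notebook):
    """Walk the cells and group cell indices by alt-set id."""
    sets = {}
    for index, cell in enumerate(notebook["cells"]):
        alt = cell.get("metadata", {}).get("alt_set")  # hypothetical key
        if alt is not None:
            sets.setdefault(alt["id"], []).append(index)
    return sets


def keep_best(notebook):
    """Return the cells with every alt-set member not marked "best" dropped."""
    kept = []
    for cell in notebook["cells"]:
        alt = cell.get("metadata", {}).get("alt_set")
        if alt is None or alt.get("best", False):
            kept.append(cell)
    return kept


nb = {"cells": [
    {"source": "x = load()", "metadata": {}},
    {"source": "x = trim(x, 0.01)",
     "metadata": {"alt_set": {"id": "outliers", "title": "1%", "best": True}}},
    {"source": "x = trim(x, 0.05)",
     "metadata": {"alt_set": {"id": "outliers", "title": "5%"}}},
    {"source": "plot(x)", "metadata": {}},
]}

print(find_alt_sets(nb))
print([cell["source"] for cell in keep_best(nb)])
```

A tool that ignores the "alt_set" key would still see four ordinary cells in sequence.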
>>>> As I write about this - I think this would be extremely nice, and it
>>>> would not be difficult to write at all.  Because of how our JavaScript
>>>> plugins work, it could be developed outside IPython initially.  The
>>>> question of inclusion in the official code base could be handled
>>>> later.  Honestly, this approach should be much easier than the work
>>>> you have already done.
>>> Well, editing the notebook once it exists in this form seems like it
>>> would be much less fun, in terms of adding new cells.
>>> What you're describing is also much more onerous for the author. With
>>> what I have now, you declare a cell to be an altset or task and everything
>>> just sort of works. New cells are inserted in the right places, cells
>>> trivially know who their parents are, etc.
>>> If I understand you correctly, the author would have to write all the
>>> alternatives in a big linear document (not fun or easy to test, see
>>> discussion below) and then click a bunch of buttons to manually select what
>>> cells go in which alternate. That is a much larger cognitive burden on the
>>> author (as well as probably being really annoying...).
>>>> Best of all the resulting notebooks would remain standard linear
>>>> notebooks that could be shared today on nbviewer, etc.  It would just
>>>> work.
>>> Respectfully, this is actually the fatal flaw of this approach IMO, both
>>> in this case and in other cases where a JS plugin/extension uses the
>>> metadata approach to fundamentally modify behavior (as opposed to
>>> aesthetics/UI) of the IPython Notebook.
>>> The issue, stated in the context of the nesting/alts/etc cells extension,
>>> is that a notebook that has branching/alternates *requires* that they be
>>> understood as such, rather than simply benefiting from it.
>>> The ability to distribute notebooks I write and have them work properly
>>> is entirely core to my usecase for IPython. If I can't do so, what I
>>> personally can get IPython or IPython notebooks to do on my own machine is
>>> not something I have any real interest in. Now you may be thinking to
>>> yourself "But Gabe, no one is using your fork so you can't do that now with
>>> your implementation anyway". That is true, but if someone without my fork
>>> installed manages to get their hands on a notebook which uses the nesting
>>> features, it will break when they try to load it.
>>> If I create an extension as you are describing, create a complex notebook
>>> using it, and someone without the plugin installed finds it, downloads it,
>>> and runs it, it will run fine and happily give them incorrect results
>>> without even noticing the extra bits I stuck in the metadata.
>>> The core issue here is that running a notebook with branching as a linear
>>> notebook by executing each of the branches in sequence is actually erroneous
>>> and will produce undefined, untrustworthy, and likely incorrect, behavior
>>> and output. The reason for this is that branches/alternatives are assumed to
>>> be mutually exclusive by the computational model, and can alter objects
>>> in-place in manners that can have unintended cumulative effects.
>>> As a very simple example consider branches which handle outliers in a
>>> certain variable by modifying the variable in-place and trimming its values
>>> by .1, 1, 5, and 10%, respectively,  using quantiles and then consider what
>>> would happen if these branches were all run in an arbitrary order.
>>> It is easy to see that the outcome from running all the branches (which
>>> is what will silently happen if the notebook is treated as a standard linear
>>> notebook because the plugin is not being used) does not reflect any of the
>>> choices intended by the author and more complex situations could be
>>> difficult to predict at all without sitting down and thinking about it.
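
The outlier example can be made concrete with a small sketch. The trimming rule is an assumption made for illustration: each branch sorts the values and drops the given fraction from both tails, modifying the list in place. The function names are hypothetical.

```python
def trim_in_place(values, fraction):
    """Drop the lowest and highest `fraction` of the values, in place."""
    values.sort()
    k = int(len(values) * fraction)
    if k > 0:
        del values[:k]
        del values[-k:]


FRACTIONS = [0.001, 0.01, 0.05, 0.10]  # the 0.1%, 1%, 5% and 10% branches


def run_one_branch(data, fraction):
    """What the author intends: exactly one branch is applied."""
    values = list(data)
    trim_in_place(values, fraction)
    return values


def run_all_branches(data):
    """What a linear run does: every branch is applied, one after another."""
    values = list(data)
    for fraction in FRACTIONS:
        trim_in_place(values, fraction)
    return values


data = list(range(1000))
single = {f: len(run_one_branch(data, f)) for f in FRACTIONS}
cumulative = len(run_all_branches(data))
print("values left after one branch:", single)
print("values left after all branches in sequence:", cumulative)
```

Running every branch leaves fewer values than any single branch would, so the result matches none of the alternatives the author defined.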
>>> As such, I would not be comfortable distributing branching notebooks
>>> using the extension mechanism as I understand it to exist now because a) I
>>> feel it indirectly damages the type of scientific reproducibility and result
>>> trustworthiness I seek to advance, and b) I don't want to spend all my time
>>> fielding angry emails/bugreports from notebook authors who sent their
>>> notebooks to collaborators who didn't have the plugin installed.
>>>> <snip>
>>>> > Consider the example of classifying new data based on a training set
>>>> > via
>>>> > KNN, SVM, and GLM approaches. These approaches all need different sets
>>>> > of
>>>> > parameters, return different types of objects as the output of the
>>>> > fitting
>>>> > function, may have subtly different behaviour when being used for
>>>> > prediction, etc.
>>>> Yep, that is the big challenge with the branching idea in general.  It
>>>> is not always true that the members of the alt sets can be swapped
>>>> out.
>>> And under the model I am envisioning, that is actually an informative
>>> and queriable feature, rather than a drawback. See my discussion above
>>> regarding terminal branches.
>>>> <snip>
>>>> I hope you can see that I really like the general idea and think the
>>>> usage cases you are describing are really important.  I think I can
>>>> speak for the project in saying that we want the notebook to be useful
>>>> for things like this.  But I think our abstractions are important
>>>> enough that we make every attempt to see how we can do these while
>>>> leveraging our existing abstractions.  This is partially a question
>>>> about implementation, but also partly a question about how the new
>>>> features are thought about.  The reason we don't like to break
>>>> abstractions for new features is that we have found an interesting
>>>> relationship between abstraction breaking and new features.  We have
>>>> found that when a new feature/idea breaks a core abstraction that we
>>>> have thought about very carefully, it is usually because the feature
>>>> has not been fully understood.  Time and time again, we have found
>>>> that when we take the time to fully understand the feature, it usually
>>>> fits within our abstractions beautifully and is even much better than
>>>> we ever imagined it could be.
>>>> The plugin idea above is a perfect example of this.  By preserving the
>>>> abstractions, the new feature itself becomes a multiplier of even more
>>>> functionality:
>>>> * The resulting notebooks can still be version controlled.  This means
>>>> that the different alt-cells can be thrown into git and when we develop
>>>> a visual diff tool for notebooks, they will *just work*.
>>> I don't really understand this point. I have numerous fork-based
>>> non-linear notebooks under version control.
>>> Also, when you have a visual diff tool, it will successfully do something
>>> when given a linear+metadata branching notebook, but whether that something
>>> would be to deliver the information required to understand changes to
>>> non-linear notebooks  is less clear (and seems somewhat unlikely).
>>>> * The notebooks can immediately leverage the abstractions we have put
>>>> into place for converting notebooks to different formats.  You could
>>>> write custom transformers to present the notebook in reveal.js,
>>>> giving alt-cells special treatment.
>>> I could write custom transformers, this is true, but the default behavior
>>> would treat the notebook as if it actually were linear (instead of just
>>> being stored that way) which is problematic.
>>>> * All of this can be done, and put into the hands of users, without going
>>>> through those overly conservative IPython developers ;-)
>>>> * It will just work with nbviewer as well.
>>> Again, I disagree. It would *display* in nbviewer, but not work, in that
>>> the display would be actively misleading regarding what the notebook would
>>> do when executed properly.
>>>> * It provides a cleanly abstracted foundation for other people to build
>>>> upon
>>> I agree that this is important, but it is not clear to me that it would
>>> be more true in the case that I created the extension via custom JS than it
>>> would if nesting were supported in the actual ipynb format and core notebook
>>> mechanisms.
>>>> In summary, we are trying to build an architecture that allows a few
>>>> simple abstractions (we actually don't have that many!) to combine in
>>>> boundless ways to create features we never planned on, but that "just
>>>> work".
>>> I agree that the customjs + metadata extensions approach is very powerful
>>> and almost infinitely versatile. I think it is great for extensions which
>>> change appearance/rendering/UI details of how the notebook behaves.
>>> As far as I can see, however, it has some significant problems with
>>> regard to extensions which fundamentally change non-rendering behavior of
>>> notebooks (please correct me if I'm wrong), namely:
>>> 1. There is no guarantee that notebooks authored using an extension which
>>>    alters fundamental behaviors will work or visibly fail in the absence of
>>>    that extension
>>> 2. There is no way for an individual notebook to require a particular
>>>    extension
>>> 3. There is no way to ensure that two extensions are compatible with
>>>    each-other
>>> 4. There is no standard/unified way for end-users to install extensions
>>> 5. There is no way for users to determine which extensions they have
>>> The first point is not true of extensions which exclusively affect
>>> rendering and UI, making the rest of the points minor nuisances rather than
>>> critical issues.
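
One way to picture a notebook that "requires" an extension and fails visibly without it is sketched below. Nothing like this is described in the thread as existing: the notebook-level metadata key "requires_extensions", the exception class, and the check function are all hypothetical.

```python
class MissingExtensionError(RuntimeError):
    """Raised when a notebook declares an extension that is not installed."""


def check_required_extensions(notebook, installed):
    """Fail loudly instead of silently running the notebook as linear."""
    required = notebook.get("metadata", {}).get("requires_extensions", [])
    missing = [name for name in required if name not in installed]
    if missing:
        raise MissingExtensionError(
            "notebook requires extensions that are not installed: "
            + ", ".join(missing)
        )
    return required


nb = {"metadata": {"requires_extensions": ["altsets"]}, "cells": []}

check_required_extensions(nb, installed={"altsets", "slideshow"})  # passes

try:
    check_required_extensions(nb, installed={"slideshow"})
except MissingExtensionError as err:
    print("refusing to load:", err)
```

A check of this kind addresses only the first two points in the list; compatibility between extensions, installation, and discovery would need separate mechanisms.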
>>> Looking forward to hearing your (further) thoughts about this stuff and
>>> hopefully meeting you in person soon.
>>> ~G
>>> --
>>> Gabriel Becker
>>> Graduate Student
>>> Statistics Department
>>> University of California, Davis
>> _______________________________________________
>> IPython-dev mailing list
>> IPython-dev at scipy.org
>> http://mail.scipy.org/mailman/listinfo/ipython-dev
> --
> Fernando Perez (@fperez_org; http://fperez.org)
> fperez.net-at-gmail: mailing lists only (I ignore this when swamped!)
> fernando.perez-at-berkeley: contact me here for any direct mail

Brian E. Granger
Cal Poly State University, San Luis Obispo
bgranger at calpoly.edu and ellisonbg at gmail.com
