# [IPython-dev] Some new cell types for describing data analyses in IPy. Notebook

Fernando Perez fperez.net at gmail.com
Sun Oct 6 20:51:52 EDT 2013

Hi Gabriel,

sorry to have been silent, but everyone else is doing a great job on this
discussion...

I just wanted to say that we'd love to talk to you at Berkeley, but I'm
leaving town tonight for a couple weeks, so it won't work until late
October or more likely November.  But in Nov. I'm giving a talk at Davis,
in J. Eisen's group. Perhaps at least you and I could meet for coffee while
I'm there and cover some ground.

Another alternative for a higher-bandwidth technical discussion is to
schedule a slot into one of our public dev meetings on Thursdays. This week
That will decouple us from finding a time when everyone can meet in
Berkeley, and more importantly, will allow others who can't make it in
person to also follow the discussion.

Cheers,

f

On Sun, Oct 6, 2013 at 4:39 PM, Gabriel Becker <gmbecker at ucdavis.edu> wrote:

> Hey Brian et al,
>
> Just checking in to see if you and/or other team members are still
> interested in meeting in person and chatting about some of the ideas we had
> been discussing in this thread.
>
> Happy to also continue the conversation here in the meantime.
>
> ~G
>
>
> On Tue, Sep 10, 2013 at 6:32 PM, Gabriel Becker <gmbecker at ucdavis.edu> wrote:
>
>> Brian et al,
>>
>> Brian I hope your move/travel/etc was as pleasant as such things can be.
>>
>>
>> On Fri, Jul 12, 2013 at 9:21 AM, Brian Granger <ellisonbg at gmail.com> wrote:
>>
>>>  Gabriel,
>>> <snip>
>>>
>>>
>>> Great, let's talk in Sept. to figure out a time that would work.
>>>
>>
>> I'm still quite interested in meeting with you guys. Somewhere near the
>> end of the month would be best for me, but I'm pretty flexible.
>>
>>
>>> <snip>
>>>
>>>  > Branching/DAG notebooks allow a single document to encompass the
>>> research
>>> > you did, while providing easy access to various views corresponding to
>>> the
>>> > generation of intermediate, alternative, and final results.
>>> >
>>> > These more complex notebooks allow the viewer to ask and answer
>>> important
>>> > questions such as "What else did (s)he try here?" and potentially even
>>> "Why
>>> > did (s)he choose this particular analysis strategy?". These questions
>>> can be
>>> > answered in the text or external supplementary materials in a linear
>>> > notebook, but this is a significant barrier to reproducibility of the
>>> > research process (as opposed to the analysis results).
>>>
>>> I can see that, however, I think the pure alt cells lack a critical
>>> feature.  They treat all branches as being equally important.  In
>>> reality, the branch that is chosen as the "best" one will likely
>>> require further analysis and discussion that the other branches
>>> don't.  Putting the different branches side by side makes it a little
>>> like "choose your own adventure" - when in reality, the author of the
>>> research wants to steer the reader along a very particular path.  The
>>> alternative paths may be useful to have around, but they should not be
>>> given equal weight as the "best" one.  But, maybe it is just
>>> presentation and can be accounted for in descriptive text.
>>>
>>
>> This is very true. My current thinking calls for both a "default"
>> designation and a "most recently selected/run" designation, which I believe
>> deals with the valid concern you raise above.
>>
>> There are also other important designations for "branch types". The most
>> notable/easily explained of these is the concept of a "terminal" branch,
>> which is a branch that records important computations (and prose), and
>> which a viewer of the notebook  (be it the original author, a reviewer, a
>> student, or someone looking to extend the work) may want to look at or run,
>> but whose output is not compatible with the subsequent computations. This
>> arises most commonly when one analysis strategy is implemented and pursued,
>> but ultimately abandoned  (hopefully for good reasons, and with this we can
>> check!) in favor of a different final strategy which produces incompatible
>> output. The subsequent code then makes assumptions about the output which
>> are compatible with the final strategy computations, but not the original
>> strategy ones. A way to gracefully deal with this case is important for any
>> document/processing/rendering system attempting to pursue these concepts.
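
The branch designations described above ("default", "most recently selected/run", and "terminal") could be modelled roughly as follows. This is only a minimal sketch: the class and field names (`Branch`, `AltSet`, `terminal`, `last_selected`) are hypothetical and are not taken from Gabriel's fork or from IPython.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Branch:
    """One alternative inside an alt-set (hypothetical structure)."""
    title: str
    cells: List[str] = field(default_factory=list)
    # A terminal branch is worth viewing or running, but its output is
    # not compatible with the computations that follow the alt-set.
    terminal: bool = False


@dataclass
class AltSet:
    """A group of mutually exclusive branches (hypothetical structure)."""
    branches: List[Branch]
    default: int = 0                      # the author's chosen branch
    last_selected: Optional[int] = None   # most recently selected/run branch

    def active(self) -> Branch:
        """Return the last selected branch, falling back to the default."""
        index = self.default if self.last_selected is None else self.last_selected
        return self.branches[index]

    def downstream_can_run(self) -> bool:
        """Cells after the alt-set should only run if the active branch is not terminal."""
        return not self.active().terminal


alts = AltSet(
    branches=[
        Branch("final strategy"),
        Branch("abandoned strategy", terminal=True),
    ],
    default=0,
)
```

With this layout, selecting the abandoned branch (`alts.last_selected = 1`) makes `downstream_can_run()` return `False`, so a front end could refuse to execute the subsequent cells, or at least warn, instead of feeding them incompatible output.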
>>
>> There are other cases that arise with these documents, but I will omit a
>> detailed discussion of them and what I think should be done to support them
>> here, as that would make this mail burdensomely long and it is not my
>> primary message.
>>
>> I will note, though, that while I agree that the final/core/whathaveyou
>> and secondary/informative/archival branches should not be
>> indistinguishable, it is important for my usecase that they be easily
>> accessible when the reader wants to in both interactive (notebook) and
>>
>>
>>> <snip>
>>>
>>>
>>> > As a practical/UI standpoint unselected branches can be hidden almost
>>> > entirely (in theory, not currently in my PoC :p), resulting in a view
>>> > equivalent to (any) the only view offered by a linear notebook. This
>>> means
>>> > that from a viewer (and author since a straight line IS a DAG and
>>> nesting
>>> > isn't forced) standpoint, what I'm describing is in essence a strict
>>> > extension of what the notebook does now, rather than a change.
>>>
>>> I would be *more* interested in alt-cell approaches that present the
>>> notebook as a linear entity in all cases, but that has the alt-cell
>>> logic underneath.  For example, what about the following:
>>>
>>> * A user writes the different N alt cells in linear sequence
>>> * The result is a purely linear notebook where one of the N cells should
>>> be run.
>>> * We write a JavaScript plugin for the notebook that does a couple of
>>> things:
>>>
>>> 1. It provides a cell toolbar for marking those cells as members of an
>>> alt-set.  This would simply modify the cell level metadata and allow
>>> the author to provide titles of each alt-member.
>>>
>>
>> What about branching that is 2 or more levels deep? That happens
>> naturally with my approach but sounds difficult/annoying to keep track of
>> in the one you are describing.
>>
>>
>>> 2. It provides the logic for building a UI for viewing one of the
>>> alt-set members at a time.  It could be as simple as injecting a drop
>>> down menu that shows one and hides the rest.
>>>
>>
>> I have an ugly but functional version of this now in my implementation.
>>
>>
>>>
>>> * This plugin could simply walk the notebook cells and find all the
>>> alt-cell sets and build this supplementary UI.
>>> * This plugin could also have settings that allow the author to select
>>> the "best" member of the alt-set.
>>> * nbconvert Transformers could use the cell level metadata to export
>>> the notebook in different formats.
>>>
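
The cell-walking logic that Brian outlines can be sketched in a few lines. The sketch below is in Python rather than JavaScript and rests on assumptions: the notebook is treated as a plain dict with a top-level `cells` list, and the metadata keys `alt_set`, `alt_title`, and `alt_best` are hypothetical names invented for illustration, not part of any IPython format.

```python
from collections import OrderedDict


def find_alt_sets(notebook):
    """Map each alt-set id to the indices of its member cells, in document order."""
    alt_sets = OrderedDict()
    for index, cell in enumerate(notebook["cells"]):
        set_id = cell.get("metadata", {}).get("alt_set")
        if set_id is not None:
            alt_sets.setdefault(set_id, []).append(index)
    return alt_sets


def select_view(notebook, choices=None):
    """Return a linear view: ordinary cells plus one member of each alt-set.

    `choices` maps an alt-set id to the title of the member to show; sets
    without an explicit choice fall back to the member flagged `alt_best`.
    """
    choices = choices or {}
    view = []
    for cell in notebook["cells"]:
        meta = cell.get("metadata", {})
        set_id = meta.get("alt_set")
        if set_id is None:
            view.append(cell)          # ordinary cell, always shown
            continue
        wanted = choices.get(set_id)
        if wanted is None:
            keep = meta.get("alt_best", False)
        else:
            keep = meta.get("alt_title") == wanted
        if keep:
            view.append(cell)
    return view


notebook = {
    "cells": [
        {"source": "load data", "metadata": {}},
        {"source": "trim 1%", "metadata": {
            "alt_set": "outliers", "alt_title": "trim 1%", "alt_best": True}},
        {"source": "trim 5%", "metadata": {
            "alt_set": "outliers", "alt_title": "trim 5%"}},
        {"source": "fit model", "metadata": {}},
    ]
}

default_view = [c["source"] for c in select_view(notebook)]
other_view = [c["source"] for c in select_view(notebook, {"outliers": "trim 5%"})]
```

Note that a tool unaware of these metadata keys would still see, and run, all four cells in sequence, which is the concern raised in the reply further down.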
>>> would not be difficult to write at all.  Because of how our JavaScript
>>> plugins work, it could be developed outside IPython initially.  The
>>> question of inclusion in the official code base could be handled
>>> later.  Honestly, this approach should be much easier than the work
>>>
>>
>> Well, editing the notebook once it exists in this form seems like it
>> would be much less fun, in terms of adding new cells.
>>
>> What you're describing is also much more onerous for the author. With
>> what I have now, you declare a cell to be an altset or task and everything
>> just sort of works. New cells are inserted in the right places, cells
>> trivially know who their parents are, etc.
>>
>> If I understand you correctly, the author would have to write all the
>> alternatives in a big linear document (not fun or easy to test, see
>> discussion below) and then click a bunch of buttons to manually select what
>> cells go in which alternate. That is a much larger cognitive burden on the
>> author (as well as probably being really annoying...).
>>
>>
>>>
>>> Best of all the resulting notebooks would remain standard linear
>>> notebooks that could be shared today on nbviewer, etc.  It would just
>>> work.
>>>
>>
>> Respectfully, this is actually the fatal flaw of this approach IMO, both
>> in this case and in other cases where a JS plugin/extension uses the
>> metadata approach to fundamentally modify *behavior* (as opposed to
>> aesthetics/UI) of the IPython Notebook.
>>
>> The issue, stated in the context of the nesting/alts/etc cells extension,
>> is that a notebook that has branching/alternates *requires* that they be
>> understood as such, rather than simply benefiting from it.
>>
>> The ability to distribute notebooks I write and have them work properly
>> is entirely core to my usecase for IPython. If I can't do so, what I
>> personally can get IPython or IPython notebooks to do on my own machine is
>> not something I have any real interest in. Now you may be thinking to
>> yourself "But Gabe, no one is using your fork so you can't do that now with
>> your implementation anyway". That is true, but if someone without my fork
>> installed manages to get their hands on a notebook which uses the nesting
>> features, it will break when they try to load it.
>>
>> If I create an extension as you are describing, create a complex notebook
>> using it, and someone without the plugin installed finds it, downloads it,
>> and runs it, it will *run fine and happily give them incorrect results
>> without even noticing the extra bits I stuck in the metadata*.
>>
>> The core issue here is that running a notebook with branching as a linear
>> notebook by executing each of the branches in sequence is actually
>> erroneous and will produce undefined, untrustworthy, and likely incorrect,
>> behavior and output. The reason for this is that branches/alternatives are
>> assumed to be mutually exclusive by the computational model, and can alter
>> objects in-place in manners that can have unintended cumulative effects.
>>
>> As a very simple example consider branches which handle outliers in a
>> certain variable by modifying the variable in-place and trimming its
>> values  by .1, 1, 5, and 10%, respectively,  using quantiles and then
>> consider what would happen if these branches were all run in an arbitrary
>> order.
>>
>> It is easy to see that the outcome from running all the branches (which
>> is what will silently happen if the notebook is treated as a standard
>> linear notebook because the plugin is not being used) does not reflect any
>> of the choices intended by the author and more complex situations could be
>> difficult to predict at all without sitting down and thinking about it.
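
The outlier example above is easy to reproduce. The following is a minimal sketch with made-up data: `trim_in_place` is a hypothetical helper that sorts the list and drops the requested fraction of values, split evenly between the two tails, which is only one of several ways the trimming could be defined.

```python
import random


def trim_in_place(values, fraction):
    """Drop `fraction` of the values, half from each tail. Mutates `values`."""
    values.sort()
    k = round(len(values) * fraction / 2)   # number of values cut per tail
    values[:] = values[k:len(values) - k]


FRACTIONS = [0.001, 0.01, 0.05, 0.10]       # .1, 1, 5, and 10%

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# What the author intends: exactly one branch is applied to the data.
single = {}
for fraction in FRACTIONS:
    branch = list(data)
    trim_in_place(branch, fraction)
    single[fraction] = len(branch)

# What a linear notebook does: every branch runs, one after another,
# each trimming the output of the previous one.
linear = list(data)
for fraction in FRACTIONS:
    trim_in_place(linear, fraction)

print("values kept by each single branch:", single)
print("values kept after running all branches:", len(linear))
```

Running every branch leaves fewer values than even the most aggressive single branch, so the linear result corresponds to none of the alternatives the author wrote down.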
>>
>> As such, I would not be comfortable distributing branching notebooks
>> using the extension mechanism as I understand it to exist now because a) I
>> feel it indirectly damages the type of scientific reproducibility and
>> result trustworthiness I seek to advance, and b) I don't want to spend all
>> my time fielding angry emails/bugreports from notebook authors who sent
>> their notebooks to collaborators who didn't have the plugin installed.
>>
>>
>>
>>>
>>> <snip>
>>>
>>> > Consider the example of classifying new data based on a training set
>>> via
>>> > KNN, SVM, and GLM approaches. These approaches all need different sets
>>> of
>>> > parameters, return different types of objects as the output of the
>>> fitting
>>> > function, may have subtly different behaviour when being used for
>>> > prediction, etc.
>>>
>>> Yep, that is the big challenge with the branching idea in general.  It
>>> is not always true that the members of the alt sets can be swapped
>>> out.
>>>
>>
>> And under the model I am envisioning, that is actually an informative
>> and queriable feature, rather than a drawback. See my discussion above
>> regarding terminal branches.
>>
>>
>>>
>>> <snip>
>>>
>>> I hope you can see that I really like the general idea and think the
>>> usage cases you are describing are really important.  I think I can
>>> speak for the project in saying that we want the notebook to be useful
>>> for things like this.  But I think our abstractions are important
>>> enough that we make every attempt to see how we can do these while
>>> leveraging our existing abstractions.  This is partially a question
>>> about implementation, but also partly a question about how the new
>>> features are thought about.  The reason we don't like to break
>>> abstractions for new features is that we have found an interesting
>>> relationship between abstraction breaking and new features.  We have
>>> found that when a new feature/idea breaks a core abstraction that we
>>> have thought about very carefully, it is usually because the feature
>>> has not been fully understood.  Time and time again, we have found
>>> that when we take the time to fully understand the feature, it usually
>>> fits within our abstractions beautifully and is even much better than
>>> we ever imagined it could be.
>>>
>>> The plugin idea above is a perfect example of this.  By preserving the
>>> abstractions, the new feature itself yields a multiplication of even
>>> more new functionality:
>>>
>>> * The resulting notebooks can still be version controlled.  This means
>>> that the different alt-cells can be thrown into git and when we develop
>>> a visual diff tool for notebooks, they will *just work*.
>>>
>>
>> I don't really understand this point. I have numerous fork-based
>> non-linear notebooks under version control.
>>
>> Also, when you have a visual diff tool, it will successfully do
>> *something* when given a linear+metadata branching notebook, but whether
>> that something would be to deliver the information required to understand
>> changes to non-linear notebooks  is less clear (and seems somewhat
>> unlikely).
>>
>>
>>> * The notebooks can immediately leverage the abstractions we have put
>>> into place for converting notebooks to different formats.  You could
>>> write custom transformers to present the notebook in a reveal.js
>>> presentation, giving alt-cells special treatment.
>>>
>>
>>
>> I could write custom transformers, this is true, but the default behavior
>> would treat the notebook as if it actually were linear (instead of just
>> being stored that way) which is problematic.
>>
>>
>>
>>> * All of this can be done, and put into the hands of users, without going
>>> through those overly conservative IPython developers ;-)
>>> * It will just work with nbviewer as well.
>>>
>>
>> Again, I disagree. It would *display* in nbviewer, but not work, in that
>> the display would be actively misleading regarding what the notebook would
>> do when executed properly.
>>
>>
>>>  * It provides a cleanly abstracted foundation for other people to build
>>> upon
>>>
>>
>> I agree that this is important, but it is not clear to me that it would
>> be more true in the case that I created the extension via custom JS than it
>> would if nesting were supported in the actual ipynb format and core
>> notebook mechanisms.
>>
>>
>>>
>>> In summary, we are trying to build an architecture that allows a few
>>> simple abstractions (we actually don't have that many!) to combine in
>>> boundless ways to create features we never planned on, but that "just
>>> work".
>>>
>>
>> I agree that the customjs + metadata extensions approach is very powerful
>> and almost infinitely versatile. I think it is great for extensions which
>> change appearance/rendering/UI details of how the notebook behaves.
>>
>> As far as I can see, however, it has some significant problems with
>> regard to extensions which fundamentally change non-rendering behavior of
>> notebooks (please correct me if I'm wrong), namely:
>>
>>    - There is no guarantee that notebooks authored using an extension
>>    which alters fundamental behaviors will work or visibly fail in the absence
>>    of that extension
>>    - There is no way for an individual notebook to require a particular
>>    extension
>>    - There is no way to ensure that two extensions are compatible with
>>    each-other
>>    - There is no standard/unified way for end-users to install extensions
>>    - There is no way for users to determine which extensions they have
>>
>> The first point is not true of extensions which exclusively affect
>> rendering and UI, making the rest of the points minor nuisances rather than
>> critical issues.
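
One way to picture the "visibly fail" behaviour that the first two points ask for is a check run before a notebook is executed. The sketch below is purely illustrative: the `required_extensions` metadata key, the `MissingExtensionError` class, and the extension name `alt-cells` are all hypothetical, and no such mechanism is claimed to exist in IPython.

```python
class MissingExtensionError(RuntimeError):
    """Raised when a notebook declares an extension that is not installed."""


def check_required_extensions(notebook, installed):
    """Fail loudly instead of silently running the notebook as linear.

    `notebook` is assumed to be a plain dict; `installed` is any iterable
    of extension names available on the reader's machine.
    """
    required = notebook.get("metadata", {}).get("required_extensions", [])
    missing = sorted(set(required) - set(installed))
    if missing:
        raise MissingExtensionError(
            "notebook requires extensions that are not installed: "
            + ", ".join(missing)
        )
    return list(required)


notebook = {
    "metadata": {"required_extensions": ["alt-cells"]},
    "cells": [],
}

try:
    check_required_extensions(notebook, installed=[])
    status = "ran"
except MissingExtensionError as error:
    status = "refused: " + str(error)
```

A check like this only helps if the core notebook machinery performs it, since a reader without the extension is, by definition, not running the extension's code.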
>> hopefully meeting you in person soon.
>>
>> ~G
>>
>> --
>> Gabriel Becker
>> Statistics Department
>> University of California, Davis
>>
>
>
>
> --
> Gabriel Becker
> Statistics Department
> University of California, Davis
>
> _______________________________________________
> IPython-dev mailing list
> IPython-dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/ipython-dev
>
>

--
Fernando Perez (@fperez_org; http://fperez.org)
fperez.net-at-gmail: mailing lists only (I ignore this when swamped!)
fernando.perez-at-berkeley: contact me here for any direct mail