[IPython-dev] Some new cell types for describing data analyses in IPy. Notebook

Sun Oct 6 19:39:59 EDT 2013

Hey Brian et al,

Just checking in to see if you and/or other team members are still
interested in meeting in person and chatting about some of the ideas we had
been discussing in this thread.

Happy to also continue the conversation here in the meantime.

~G

On Tue, Sep 10, 2013 at 6:32 PM, Gabriel Becker <gmbecker at ucdavis.edu> wrote:

> Brian et al,
>
> Brian, I hope your move/travel/etc was as pleasant as such things can be.
>
>
> On Fri, Jul 12, 2013 at 9:21 AM, Brian Granger <ellisonbg at gmail.com> wrote:
>
>>  Gabriel,
>> <snip>
>>
>>
>> Great, let's talk in Sept. to figure out a time that would work.
>>
>
> I'm still quite interested in meeting with you guys. Somewhere near the
> end of the month would be best for me, but I'm pretty flexible.
>
>
>> <snip>
>>
>> > Branching/DAG notebooks allow a single document to encompass the research
>> > you did, while providing easy access to various views corresponding to the
>> > generation of intermediate, alternative, and final results.
>> >
>> > These more complex notebooks allow the viewer to ask and answer important
>> > questions such as "What else did (s)he try here?" and potentially even "Why
>> > did (s)he choose this particular analysis strategy?". These questions can be
>> > answered in the text or external supplementary materials in a linear
>> > notebook, but this is a significant barrier to reproducibility of the
>> > research process (as opposed to the analysis results).
>>
>> I can see that, however, I think the pure alt cells lack a critical
>> feature.  They treat all branches as being equally important.  In
>> reality, the branch that is chosen as the "best" one will likely
>> require further analysis and discussion that the other branches
>> don't.  Putting the different branches side by side makes it a little
>> like "choose your own adventure" - when in reality, the author of the
>> research wants to steer the reader along a very particular path.  The
>> alternative paths may be useful to have around, but they should not be
>> given equal weight as the "best" one.  But, maybe it is just
>> presentation and can be accounted for in descriptive text.
>>
>
> This is very true. My current thinking calls for both a "default"
> designation and a "most recently selected/run" designation, which I believe
> deals with the valid concern you raise above.
>
> There are also other important designations for "branch types". The most
> notable/easily explained of these is the concept of a "terminal" branch,
> which is a branch that records important computations (and prose), and
> which a viewer of the notebook  (be it the original author, a reviewer, a
> student, or someone looking to extend the work) may want to look at or run,
> but whose output is not compatible with the subsequent computations. This
> arises most commonly when one analysis strategy is implemented and pursued,
> but ultimately abandoned  (hopefully for good reasons, and with this we can
> check!) in favor of a different final strategy which produces incompatible
> output. The subsequent code then makes assumptions about the output which
> are compatible with the final strategy computations, but not the original
> strategy ones. A way to gracefully deal with this case is important for any
> document/processing/rendering system attempting to pursue these concepts.
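
One way to picture the "terminal" designation is as a flag that stops the
execution plan at the end of the branch. The sketch below is only an
illustration under stated assumptions: the `Branch` class, its `terminal`
field, and `execution_plan` are hypothetical names, not part of IPython or
of the fork discussed in this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Branch:
    """One alternative at a branch point (hypothetical structure)."""
    name: str
    cells: List[str]
    terminal: bool = False  # True: output is incompatible with later cells


def execution_plan(before: List[str], branch: Branch, after: List[str]) -> List[str]:
    """Return the cells to run, in order, for the selected branch."""
    plan = list(before) + list(branch.cells)
    if not branch.terminal:
        # Only non-terminal branches feed the cells after the branch point.
        plan += after
    return plan


final = Branch("final strategy", ["fit_final"])
abandoned = Branch("abandoned strategy", ["fit_old"], terminal=True)

print(execution_plan(["load"], final, ["report"]))
print(execution_plan(["load"], abandoned, ["report"]))
```

Selecting the abandoned strategy still runs and records its computations,
but the downstream cells that assume the final strategy's output are left
out of the plan rather than run against incompatible data.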
>
> There are other cases that arise with these documents, but I will omit a
> detailed discussion of them and what I think should be done to support them
> here, as that would make this mail burdensomely long and it is not my
> primary message.
>
> I will note, though, that while I agree that the final/core/whathaveyou
> and secondary/informative/archival branches should not be
> indistinguishable, it is important for my usecase that the latter be easily
> accessible whenever the reader wants them, in both interactive (notebook)
> and headless (nbconvert) modes.
>
>
>> <snip>
>>
>>
>> > From a practical/UI standpoint unselected branches can be hidden almost
>> > entirely (in theory, not currently in my PoC :p), resulting in a view
>> > equivalent to the only view offered by (any) linear notebook. This means
>> > that from a viewer (and author since a straight line IS a DAG and nesting
>> > isn't forced) standpoint, what I'm describing is in essence a strict
>> > extension of what the notebook does now, rather than a change.
>>
>> I would be *more* interested in alt-cell approaches that present the
>> notebook as a linear entity in all cases, but that has the alt-cell
>> logic underneath.  For example, what about the following:
>>
>> * A user writes the different N alt cells in linear sequence
>> * The result is a purely linear notebook where one of the N cells should
>> be run.
>> * We write a JavaScript plugin for the notebook that does a couple of
>> things:
>>
>> 1. It provides a cell toolbar for marking those cells as members of an
>> alt-set.  This would simply modify the cell level metadata and allow
>> the author to provide titles of each alt-member.
>>
>
> What about branching that is 2 or more levels deep? That happens naturally
> with my approach but sounds difficult/annoying to keep track of in the one
> you are describing.
>
>
>> 2. It provides the logic for building a UI for viewing one of the
>> alt-set members at a time.  It could be as simple as injecting a drop
>> down menu that shows one and hides the rest.
>>
>
> I have an ugly but functional version of this now in my implementation.
>
>
>>
>> * This plugin could simply walk the notebook cells and find all the
>> alt-cell sets and build this supplementary UI.
>> * This plugin could also have settings that allow the author to select
>> the "best" member of the alt-set.
>> * nbconvert Transformers could use the cell level metadata to export
>> the notebook in different formats.
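
The linear-plus-metadata idea in the list above can be sketched in a few
lines. This is a minimal illustration, not the plugin or an nbconvert
transformer: cells are plain dicts, and the `alt` metadata key with its
`set`, `title`, and `best` fields is a hypothetical layout invented for the
example.

```python
def select_path(cells, choices=None):
    """Keep ordinary cells plus one member of each alt-set.

    `choices` maps an alt-set name to the title of the member to show;
    alt-sets without an explicit choice fall back to the member marked "best".
    """
    choices = choices or {}
    kept = []
    for cell in cells:
        alt = cell.get("metadata", {}).get("alt")
        if alt is None:
            kept.append(cell)  # ordinary cell: always part of the path
            continue
        wanted = choices.get(alt["set"])
        if wanted is None:
            keep = alt.get("best", False)
        else:
            keep = alt["title"] == wanted
        if keep:
            kept.append(cell)
    return kept


notebook_cells = [
    {"source": "data = load()", "metadata": {}},
    {"source": "trim(data, 0.01)",
     "metadata": {"alt": {"set": "outliers", "title": "1%", "best": True}}},
    {"source": "trim(data, 0.05)",
     "metadata": {"alt": {"set": "outliers", "title": "5%"}}},
    {"source": "summarize(data)", "metadata": {}},
]

print([c["source"] for c in select_path(notebook_cells)])
print([c["source"] for c in select_path(notebook_cells, {"outliers": "5%"})])
```

A tool that ignores the `alt` key entirely would simply see four cells in
sequence, which is the behaviour the rest of this thread argues about.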
>>
>> As I write about this - I think this would be extremely nice, and it
>> would not be difficult to write at all.  Because of how our JavaScript
>> plugins work, it could be developed outside IPython initially.  The
>> question of inclusion in the official code base could be handled
>> later.  Honestly, this approach should be much easier than the work
>> you have already done.
>>
>
> Well, editing the notebook once it exists in this form seems like it would
> be much less fun, in terms of adding new cells.
>
> What you're describing is also much more onerous for the author. With what
> I have now, you declare a cell to be an altset or task and everything just
> sort of works. New cells are inserted in the right places, cells trivially
> know who their parents are, etc.
>
> If I understand you correctly, the author would have to write all the
> alternatives in a big linear document (not fun or easy to test, see
> discussion below) and then click a bunch of buttons to manually select what
> cells go in which alternate. That is a much larger cognitive burden on the
> author (as well as probably being really annoying...).
>
>
>>
>> Best of all the resulting notebooks would remain standard linear
>> notebooks that could be shared today on nbviewer, etc.  It would just
>> work.
>>
>
> Respectfully, this is actually the fatal flaw of this approach IMO, both
> in this case and in other cases where a JS plugin/extension uses the
> metadata approach to fundamentally modify *behavior* (as opposed to
> aesthetics/UI) of the IPython Notebook.
>
> The issue, stated in the context of the nesting/alts/etc cells extension,
> is that a notebook that has branching/alternates *requires* that they be
> understood as such, rather than simply benefiting from it.
>
> The ability to distribute notebooks I write and have them work properly is
> entirely core to my usecase for IPython. If I can't do so, what I
> personally can get IPython or IPython notebooks to do on my own machine is
> not something I have any real interest in. Now you may be thinking to
> yourself "But Gabe, no one is using your fork so you can't do that now with
> your implementation anyway". That is true, but if someone without my fork
> installed manages to get their hands on a notebook which uses the nesting
> features, it will break when they try to load it.
>
> If I create an extension as you are describing, create a complex notebook
> using it, and someone without the plugin installed finds it, downloads it,
> and runs it, it will *run fine and happily give them incorrect results
> without even noticing the extra bits I stuck in the metadata*.
>
> The core issue here is that running a notebook with branching as a linear
> notebook by executing each of the branches in sequence is actually
> erroneous and will produce undefined, untrustworthy, and likely incorrect,
> behavior and output. The reason for this is that branches/alternatives are
> assumed to be mutually exclusive by the computational model, and can alter
> objects in-place in manners that can have unintended cumulative effects.
>
> As a very simple example, consider branches which handle outliers in a
> certain variable by modifying the variable in-place and trimming its
> values by 0.1, 1, 5, and 10%, respectively, using quantiles, and then
> consider what would happen if these branches were all run in an arbitrary
> order.
>
> It is easy to see that the outcome from running all the branches (which is
> what will silently happen if the notebook is treated as a standard linear
> notebook because the plugin is not being used) does not reflect any of the
> choices intended by the author and more complex situations could be
> difficult to predict at all without sitting down and thinking about it.
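
The cumulative effect can be made concrete with a small sketch. It assumes,
purely for illustration, that each branch drops the given fraction from
both tails of the data; `trim_tails` and `run` are hypothetical helpers, and
fractions are written in per-mille to keep the arithmetic in integers.

```python
def trim_tails(values, per_mille):
    """Drop the lowest and highest per_mille/1000 of the values, in place."""
    values.sort()
    k = len(values) * per_mille // 1000
    if k:
        del values[:k]   # remove the k smallest values
        del values[-k:]  # remove the k largest values


def run(branches, n=1000):
    """Run the given branches in sequence on fresh data; return what is left."""
    data = list(range(n))
    for per_mille in branches:
        trim_tails(data, per_mille)
    return len(data)


BRANCHES = [1, 10, 50, 100]  # 0.1%, 1%, 5%, 10%

single = {b: run([b]) for b in BRANCHES}  # each branch run on its own
everything = run(BRANCHES)                # all branches run back to back

print(single)
print(everything)
```

Run alone, the branches leave 998, 980, 900, and 800 of the 1000 values;
run one after another they leave 706, a result that matches none of the
alternatives the author actually wrote.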
>
> As such, I would not be comfortable distributing branching notebooks using
> the extension mechanism as I understand it to exist now because a) I feel
> it indirectly damages the type of scientific reproducibility and result
> trustworthiness I seek to advance, and b) I don't want to spend all my time
> fielding angry emails/bugreports from notebook authors who sent their
> notebooks to collaborators who didn't have the plugin installed.
>
>
>
>>
>> <snip>
>>
>> > Consider the example of classifying new data based on a training set via
>> > KNN, SVM, and GLM approaches. These approaches all need different sets of
>> > parameters, return different types of objects as the output of the fitting
>> > function, may have subtly different behaviour when being used for
>> > prediction, etc.
>>
>> Yep, that is the big challenge with the branching idea in general.  It
>> is not always true that the members of the alt sets can be swapped
>> out.
>>
>
> And under the model I am envisioning, that is actually an informative and
> queryable feature, rather than a drawback. See my discussion above
> regarding terminal branches.
>
>
>>
>> <snip>
>>
>> I hope you can see that I really like the general idea and think the
>> usage cases you are describing are really important.  I think I can
>> speak for the project in saying that we want the notebook to be useful
>> for things like this.  But I think our abstractions are important
>> enough that we make every attempt to see how we can do these while
>> leveraging our existing abstractions.  This is partially a question
>> about implementation, but also partly a question about how the new
>> features are thought about.  The reason we don't like to break
>> abstractions for new features is that we have found an interesting
>> relationship between abstraction breaking and new features.  We have
>> found that when a new feature/idea breaks a core abstraction that we
>> have thought about very carefully, it is usually because the feature
>> has not been fully understood.  Time and time again, we have found
>> that when we take the time to fully understand the feature, it usually
>> fits within our abstractions beautifully and is even much better than
>> we ever imagined it could be.
>>
>> The plugin idea above is a perfect example of this.  By preserving the
>> abstractions, the new feature itself yields a multiplication of even
>> more new functionality:
>>
>> * The resulting notebooks can still be version controlled.  This means
>> that the different alt-cells can be thrown into git and when we develop
>> a visual diff tool for notebooks, they will *just work*.
>>
>
> I don't really understand this point. I have numerous fork-based
> non-linear notebooks under version control.
>
> Also, when you have a visual diff tool, it will successfully do *something*
> when given a linear+metadata branching notebook, but whether that
> something would be to deliver the information required to understand
> changes to non-linear notebooks is less clear (and seems somewhat
> unlikely).
>
>
>> * The notebooks can immediately leverage the abstractions we have put
>> into place for converting notebooks to different formats.  You could
>> write custom transformers to present the notebook as a reveal.js
>> slideshow, giving alt-cells special treatment.
>>
>
>
> I could write custom transformers, this is true, but the default behavior
> would treat the notebook as if it actually were linear (instead of just
> being stored that way) which is problematic.
>
>
>
>> * All of this can be done, and put into the hands of users, without going
>> through those overly conservative IPython developers ;-)
>> * It will just work with nbviewer as well.
>>
>
> Again, I disagree. It would *display* in nbviewer, but not work, in that
> the display would be actively misleading regarding what the notebook would
> do when executed properly.
>
>
>>  * It provides a cleanly abstracted foundation for other people to build
>> upon
>>
>
> I agree that this is important, but it is not clear to me that it would be
> more true in the case that I created the extension via custom JS than it
> would if nesting were supported in the actual ipynb format and core
> notebook mechanisms.
>
>
>>
>> In summary, we are trying to build an architecture that allows a few
>> simple abstractions (we actually don't have that many!) to combine in
>> boundless ways to create features we never planned on, but that "just
>> work".
>>
>
> I agree that the customjs + metadata extensions approach is very powerful
> and almost infinitely versatile. I think it is great for extensions which
> change appearance/rendering/UI details of how the notebook behaves.
>
> As far as I can see, however, it has some significant problems with regard
> to extensions which fundamentally change non-rendering behavior of
> notebooks (please correct me if I'm wrong), namely:
>
>    - There is no guarantee that notebooks authored using an extension
>    which alters fundamental behaviors will work or visibly fail in the absence
>    of that extension
>    - There is no way for an individual notebook to require a particular
>    extension
>    - There is no way to ensure that two extensions are compatible with
>    each-other
>    - There is no standard/unified way for end-users to install extensions
>    - There is no way for users to determine which extensions they have
>
> The first point is not true of extensions which exclusively affect
> rendering and UI, making the rest of the points minor nuisances rather than
> critical issues.
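
As an illustration of what "visibly fail" could mean for the first two
points, here is a minimal sketch of a load-time check. It is an assumption
from top to bottom: the `required_extensions` metadata key,
`MissingExtensionError`, and `check_required_extensions` are hypothetical
and do not correspond to any existing IPython mechanism.

```python
class MissingExtensionError(RuntimeError):
    """Raised when a notebook declares an extension that is not installed."""


def check_required_extensions(notebook, installed):
    """Refuse to proceed unless every declared extension is available."""
    required = notebook.get("metadata", {}).get("required_extensions", [])
    missing = sorted(set(required) - set(installed))
    if missing:
        raise MissingExtensionError(
            "notebook requires missing extensions: " + ", ".join(missing)
        )
    return list(required)


nb = {"metadata": {"required_extensions": ["alt-cells"]}, "cells": []}

try:
    check_required_extensions(nb, installed=[])
except MissingExtensionError as err:
    print(err)

print(check_required_extensions(nb, installed=["alt-cells"]))
```

The point of the sketch is only the failure mode: a notebook that depends
on behaviour-changing extensions stops loudly instead of running as an
ordinary linear notebook.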
>
> Looking forward to hearing your (further) thoughts about this stuff and
> hopefully meeting you in person soon.
>
> ~G
>
> --
> Gabriel Becker
> Graduate Student
> Statistics Department
> University of California, Davis
>

-- 
Gabriel Becker
Graduate Student
Statistics Department
University of California, Davis