[IPython-dev] Some new cell types for describing data analyses in IPy. Notebook

Wed Oct 23 04:17:41 EDT 2013

Yes, live meetings are at 10am PST every Thursday.

You can edit the hackpad and add a short summary of what you would like to discuss in the section for the relevant day. Add a small "don't forget to invite me" near your edits. 
-- 
M

Sent from my iPhone

> On Oct 23, 2013, at 00:03, Gabriel Becker <gmbecker at ucdavis.edu> wrote:
> 
> Hi all,
> 
> Sorry I disappeared for a bit. I would still very much like to take part in one of the g+ dev hangouts and discuss these ideas.
> 
> Are they still held at 10am on Thursdays? If so I have a previous obligation this week but would be happy to schedule for next week if the itinerary is not yet full. If it is, let me know when there is an opening and I should be able to make it work.
> 
> Thanks and looking forward to talking to you all.
> 
> ~G
> 
> 
>> On Mon, Oct 7, 2013 at 3:54 PM, Damián Avila <damianavila at gmail.com> wrote:
>> >What format would this and my participation be in? Would I be presenting something to get people up to speed or assuming that they have read the novel that this thread has turned into?
>> 
>> Most of the core developers have probably read this one, but it is a long thread... and it began some months ago.
>> I think that a quick summary covering the main issues and a little demo can be a nice way to present this "novel" ;-)
>> 
>> 
>> 2013/10/7 Gabriel Becker <gmbecker at ucdavis.edu>
>>> I'm happy to jump in on one of the hangouts to discuss these ideas. I could probably manage this Thursday but next Thursday might be better. I do agree that the discrete post/respond cycle of emails does prove a bit cumbersome for large detailed discussions like this. I still hope to meet the local(ish) portions of the team in person at some point, but it sounds like the logistics of that are tough and it is of course important to include and engage the non-local people as well.
>>> 
>>> What format would this and my participation be in? Would I be presenting something to get people up to speed or assuming that they have read the novel that this thread has turned into?
>>> 
>>> Also, @fperez, I'd love to grab coffee and sit down with you when you're in Davis. 
>>> 
>>> ~G
>>> 
>>> 
>>>> On Mon, Oct 7, 2013 at 11:36 AM, Brian Granger <ellisonbg at gmail.com> wrote:
>>>> Gabriel,
>>>> 
>>>> I think we are pushing the limits of email on this discussion.  I
>>>> think it would be great to continue the discussion in person or our
>>>> Google Hangouts as Fernando mentions below.
>>>> 
>>>> > sorry to have been silent, but everyone else is doing a great job on this
>>>> > discussion...
>>>> >
>>>> > I just wanted to say that we'd love to talk to you at Berkeley, but I'm
>>>> > leaving town tonight for a couple weeks, so it won't work until late October
>>>> > or more likely November.  But in Nov. I'm giving a talk at Davis, in J.
>>>> > Eisen's group. Perhaps at least you and I could meet for coffee while I'm
>>>> > there and cover some ground.
>>>> 
>>>> That would be a great start to the in person discussions...
>>>> 
>>>> > Another alternative for a higher-bandwidth technical discussion is to
>>>> > schedule a slot into one of our public dev meetings on Thursdays. This week
>>>> > we had Peter Krautzberger, from MathJax, join us and it was very useful.
>>>> > That will decouple us from finding a time when everyone can meet in
>>>> > Berkeley, and more importantly, will allow others who can't make it in
>>>> > person to also follow the discussion.
>>>> 
>>>> Let us know if/when you can join us on this.
>>>> 
>>>> Cheers,
>>>> 
>>>> Brian
>>>> 
>>>> > Cheers,
>>>> >
>>>> > f
>>>> >
>>>> >
>>>> > On Sun, Oct 6, 2013 at 4:39 PM, Gabriel Becker <gmbecker at ucdavis.edu> wrote:
>>>> >>
>>>> >> Hey Brian et al,
>>>> >>
>>>> >> Just checking in to see if you and/or other team members are still
>>>> >> interested in meeting in person and chatting about some of the ideas we had
>>>> >> been discussing in this thread.
>>>> >>
>>>> >> Happy to also continue the conversation here in the meantime.
>>>> >>
>>>> >> ~G
>>>> >>
>>>> >>
>>>> >> On Tue, Sep 10, 2013 at 6:32 PM, Gabriel Becker <gmbecker at ucdavis.edu>
>>>> >> wrote:
>>>> >>>
>>>> >>> Brian et al,
>>>> >>>
>>>> >>> Brian I hope your move/travel/etc was as pleasant as such things can be.
>>>> >>>
>>>> >>>
>>>> >>> On Fri, Jul 12, 2013 at 9:21 AM, Brian Granger <ellisonbg at gmail.com>
>>>> >>> wrote:
>>>> >>>>
>>>> >>>> Gabriel,
>>>> >>>> <snip>
>>>> >>>>
>>>> >>>>
>>>> >>>> Great, let's talk in Sept. to figure out a time that would work.
>>>> >>>
>>>> >>>
>>>> >>> I'm still quite interested in meeting with you guys. Somewhere near the
>>>> >>> end of the month would be best for me, but I'm pretty flexible.
>>>> >>>
>>>> >>>>
>>>> >>>> <snip>
>>>> >>>>
>>>> >>>> > Branching/DAG notebooks allow a single document to encompass the
>>>> >>>> > research
>>>> >>>> > you did, while providing easy access to various views corresponding to
>>>> >>>> > the
>>>> >>>> > generation of intermediate, alternative, and final results.
>>>> >>>> >
>>>> >>>> > These more complex notebooks allow the viewer to ask and answer
>>>> >>>> > important
>>>> >>>> > questions such as "What else did (s)he try here?" and potentially even
>>>> >>>> > "Why
>>>> >>>> > did (s)he choose this particular analysis strategy?". These questions
>>>> >>>> > can be
>>>> >>>> > answered in the text or external supplementary materials in a linear
>>>> >>>> > notebook, but this is a significant barrier to reproducibility of the
>>>> >>>> > research process (as opposed to the analysis results).
>>>> >>>>
>>>> >>>> I can see that, however, I think the pure alt cells lack a critical
>>>> >>>> feature.  They treat all branches as being equally important.  In
>>>> >>>> reality, the branch that is chosen as the "best" one will likely
>>>> >>>> require further analysis and discussion that the other branches
>>>> >>>> don't.  Putting the different branches side by side makes it a little
>>>> >>>> like "choose your own adventure" - when in reality, the author of the
>>>> >>>> research wants to steer the reader along a very particular path.  The
>>>> >>>> alternative paths may be useful to have around, but they should not be
>>>> >>>> given equal weight as the "best" one.  But, maybe it is just
>>>> >>>> presentation and can be accounted for in descriptive text.
>>>> >>>
>>>> >>>
>>>> >>> This is very true. My current thinking calls for both a "default"
>>>> >>> designation and a "most recently selected/run" designation, which I believe
>>>> >>> deals with the valid concern you raise above.
>>>> >>>
>>>> >>> There are also other important designations for "branch types". The most
>>>> >>> notable/easily explained of these is the concept of a "terminal" branch,
>>>> >>> which is a branch that records important computations (and prose), and which
>>>> >>> a viewer of the notebook  (be it the original author, a reviewer, a student,
>>>> >>> or someone looking to extend the work) may want to look at or run, but whose
>>>> >>> output is not compatible with the subsequent computations. This arises most
>>>> >>> commonly when one analysis strategy is implemented and pursued, but
>>>> >>> ultimately abandoned  (hopefully for good reasons, and with this we can
>>>> >>> check!) in favor of a different final strategy which produces incompatible
>>>> >>> output. The subsequent code then makes assumptions about the output which
>>>> >>> are compatible with the final strategy computations, but not the original
>>>> >>> strategy ones. A way to gracefully deal with this case is important for any
>>>> >>> document/processing/rendering system attempting to pursue these concepts.
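
One way to picture the branch designations described above ("default", "most recently selected/run", and "terminal") is as flags on a small data structure. This is only a sketch: the class and field names (`Branch`, `AltSet`, `last_selected`, `can_continue`) are hypothetical and are not taken from the fork discussed in this thread.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Branch:
    title: str
    default: bool = False    # the branch the author steers readers toward
    terminal: bool = False   # output is NOT compatible with later cells


@dataclass
class AltSet:
    branches: List[Branch]
    last_selected: Optional[str] = None  # title of most recently run branch

    def active(self) -> Branch:
        """Return the most recently selected branch, else the default one."""
        if self.last_selected is not None:
            for branch in self.branches:
                if branch.title == self.last_selected:
                    return branch
            raise KeyError(self.last_selected)
        for branch in self.branches:
            if branch.default:
                return branch
        raise ValueError("alt-set has no default branch")

    def can_continue(self) -> bool:
        """Downstream cells may only run after a non-terminal branch."""
        return not self.active().terminal


alts = AltSet([
    Branch("final strategy", default=True),
    Branch("abandoned strategy", terminal=True),
])
print(alts.active().title, alts.can_continue())

alts.last_selected = "abandoned strategy"
print(alts.active().title, alts.can_continue())
```

Under these assumptions, a viewer can still inspect or run the abandoned strategy, but a front end or converter would know not to feed its output to the cells that follow.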
>>>> >>>
>>>> >>> There are other cases that arise with these documents, but I will omit a
>>>> >>> detailed discussion of them and what I think should be done to support them
>>>> >>> here, as that would make this mail burdensomely long and it is not my
>>>> >>> primary message.
>>>> >>>
>>>> >>> I will note, though, that while I agree that the final/core/whathaveyou
>>>> >>> and secondary/informative/archival branches should not be indistinguishable,
>>>> >>> it is important for my usecase that they be easily accessible when the
>>>> >>> reader wants to in both interactive (notebook) and headless (nbconvert)
>>>> >>> modes.
>>>> >>>
>>>> >>>>
>>>> >>>> <snip>
>>>> >>>>
>>>> >>>>
>>>> >>>> > As a practical/UI standpoint unselected branches can be hidden almost
>>>> >>>> > entirely (in theory, not currently in my PoC :p), resulting in a view
>>>> >>>> > equivalent to (any) the only view offered by a linear notebook. This
>>>> >>>> > means
>>>> >>>> > that from a viewer (and author since a straight line IS a DAG and
>>>> >>>> > nesting
>>>> >>>> > isn't forced) standpoint, what I'm describing is in essence a strict
>>>> >>>> > extension of what the notebook does now, rather than a change.
>>>> >>>>
>>>> >>>> I would be *more* interested in alt-cell approaches that present the
>>>> >>>> notebook as a linear entity in all cases, but that has the alt-cell
>>>> >>>> logic underneath.  For example, what about the following:
>>>> >>>>
>>>> >>>> * A user writes the different N alt cells in linear sequence
>>>> >>>> * The result is a purely linear notebook where one of the N cells should
>>>> >>>> be run.
>>>> >>>> * We write a JavaScript plugin for the notebook that does a couple of
>>>> >>>> things:
>>>> >>>>
>>>> >>>> 1. It provides a cell toolbar for marking those cells as members of an
>>>> >>>> alt-set.  This would simply modify the cell level metadata and allow
>>>> >>>> the author to provide titles of each alt-member.
>>>> >>>
>>>> >>>
>>>> >>> What about branching that is 2 or more levels deep? That happens
>>>> >>> naturally with my approach but sounds difficult/annoying to keep track of in
>>>> >>> the one you are describing.
>>>> >>>
>>>> >>>>
>>>> >>>> 2. It provides the logic for building a UI for viewing one of the
>>>> >>>> alt-set members at a time.  It could be as simple as injecting a drop
>>>> >>>> down menu that shows one and hides the rest.
>>>> >>>
>>>> >>>
>>>> >>> I have an ugly but functional version of this now in my implementation.
>>>> >>>
>>>> >>>>
>>>> >>>>
>>>> >>>> * This plugin could simply walk the notebook cells and find all the
>>>> >>>> alt-cell sets and build this supplementary UI.
>>>> >>>> * This plugin could also have settings that allow the author to select
>>>> >>>> the "best" member of the alt-set.
>>>> >>>> * nbconvert Transformers could use the cell level metadata to export
>>>> >>>> the notebook in different formats.
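
The metadata-driven approach proposed above can be sketched in a few lines. The example below is written in Python rather than JavaScript, and it operates on plain dicts shaped only loosely like notebook JSON. The metadata key `alt_set` and its fields `name`, `title`, and `best` are assumptions made for illustration; they are not an existing IPython convention.

```python
ALT_KEY = "alt_set"  # hypothetical cell-metadata key


def find_alt_sets(cells):
    """Walk the cells and group their indices by alt-set name."""
    alt_sets = {}
    for index, cell in enumerate(cells):
        info = cell.get("metadata", {}).get(ALT_KEY)
        if info is not None:
            alt_sets.setdefault(info["name"], []).append(index)
    return alt_sets


def linear_view(cells, chosen=None):
    """Keep ordinary cells plus one member of each alt-set.

    `chosen` maps an alt-set name to a member title; alt-sets that are
    not listed fall back to the member flagged as "best".
    """
    chosen = chosen or {}
    kept = []
    for cell in cells:
        info = cell.get("metadata", {}).get(ALT_KEY)
        if info is None:
            kept.append(cell)
            continue
        wanted = chosen.get(info["name"])
        if wanted is None:
            if info.get("best", False):
                kept.append(cell)
        elif info["title"] == wanted:
            kept.append(cell)
    return kept


cells = [
    {"source": "load()", "metadata": {}},
    {"source": "trim(1)", "metadata": {
        ALT_KEY: {"name": "outliers", "title": "trim 1%", "best": True}}},
    {"source": "trim(5)", "metadata": {
        ALT_KEY: {"name": "outliers", "title": "trim 5%"}}},
    {"source": "fit()", "metadata": {}},
]

print(find_alt_sets(cells))
print([c["source"] for c in linear_view(cells)])
print([c["source"] for c in linear_view(cells, {"outliers": "trim 5%"})])
```

A converter step could apply something like `linear_view` before export. A tool that ignores the metadata would still see all four cells as an ordinary linear notebook.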
>>>> >>>>
>>>> >>>> As I write about this - I think this would be extremely nice, and it
>>>> >>>> would not be difficult to write at all.  Because of how our JavaScript
>>>> >>>> plugins work, it could be developed outside IPython initially.  The
>>>> >>>> question of inclusion in the official code base could be handled
>>>> >>>> later.  Honestly, this approach should be much easier than the work
>>>> >>>> you have already done.
>>>> >>>
>>>> >>>
>>>> >>> Well, editing the notebook once it exists in this form seems like it
>>>> >>> would be much less fun, in terms of adding new cells.
>>>> >>>
>>>> >>> What you're describing is also much more onerous for the author. With
>>>> >>> what I have now, you declare a cell to be an altset or task and everything
>>>> >>> just sort of works. New cells are inserted in the right places, cells
>>>> >>> trivially know who their parents are, etc.
>>>> >>>
>>>> >>> If I understand you correctly, the author would have to write all the
>>>> >>> alternatives in a big linear document (not fun or easy to test, see
>>>> >>> discussion below) and then click a bunch of buttons to manually select what
>>>> >>> cells go in which alternate. That is a much larger cognitive burden on the
>>>> >>> author (as well as probably being really annoying...).
>>>> >>>
>>>> >>>>
>>>> >>>>
>>>> >>>> Best of all the resulting notebooks would remain standard linear
>>>> >>>> notebooks that could be shared today on nbviewer, etc.  It would just
>>>> >>>> work.
>>>> >>>
>>>> >>>
>>>> >>> Respectfully, this is actually the fatal flaw of this approach IMO, both
>>>> >>> in this case and in other cases where a JS plugin/extension uses the
>>>> >>> metadata approach to fundamentally modify behavior (as opposed to
>>>> >>> aesthetics/UI) of the IPython Notebook.
>>>> >>>
>>>> >>> The issue, stated in the context of the nesting/alts/etc cells extension,
>>>> >>> is that a notebook that has branching/alternates *requires* that they be
>>>> >>> understood as such, rather than simply benefiting from it.
>>>> >>>
>>>> >>> The ability to distribute notebooks I write and have them work properly
>>>> >>> is entirely core to my usecase for IPython. If I can't do so, what I
>>>> >>> personally can get IPython or IPython notebooks to do on my own machine is
>>>> >>> not something I have any real interest in. Now you may be thinking to
>>>> >>> yourself "But Gabe, no one is using your fork so you can't do that now with
>>>> >>> your implementation anyway". That is true, but if someone without my fork
>>>> >>> installed manages to get their hands on a notebook which uses the nesting
>>>> >>> features, it will break when they try to load it.
>>>> >>>
>>>> >>> If I create an extension as you are describing, create a complex notebook
>>>> >>> using it, and someone without the plugin installed finds it, downloads it,
>>>> >>> and runs it, it will run fine and happily give them incorrect results
>>>> >>> without even noticing the extra bits I stuck in the metadata.
>>>> >>>
>>>> >>> The core issue here is that running a notebook with branching as a linear
>>>> >>> notebook by executing each of the branches in sequence is actually erroneous
>>>> >>> and will produce undefined, untrustworthy, and likely incorrect, behavior
>>>> >>> and output. The reason for this is that branches/alternatives are assumed to
>>>> >>> be mutually exclusive by the computational model, and can alter objects
>>>> >>> in-place in manners that can have unintended cumulative effects.
>>>> >>>
>>>> >>> As a very simple example consider branches which handle outliers in a
>>>> >>> certain variable by modifying the variable in-place and trimming its values
>>>> >>> by .1, 1, 5, and 10%, respectively,  using quantiles and then consider what
>>>> >>> would happen if these branches were all run in an arbitrary order.
>>>> >>>
>>>> >>> It is easy to see that the outcome from running all the branches (which
>>>> >>> is what will silently happen if the notebook is treated as a standard linear
>>>> >>> notebook because the plugin is not being used) does not reflect any of the
>>>> >>> choices intended by the author and more complex situations could be
>>>> >>> difficult to predict at all without sitting down and thinking about it.
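
The cumulative effect in the outlier example above can be reproduced with a toy data set. This sketch assumes symmetric trimming (the same share dropped from each tail) and expresses the .1, 1, 5, and 10% branches as parts per thousand to avoid floating-point rounding. The function name and the data are illustrative only.

```python
def trim_in_place(values, per_mille):
    """Sort, then drop `per_mille`/1000 of the values from EACH tail, in place."""
    values.sort()
    k = len(values) * per_mille // 1000
    if k:
        del values[:k]
        del values[-k:]


BRANCHES = [1, 10, 50, 100]  # .1%, 1%, 5% and 10%, in parts per thousand
data = list(range(1000))     # toy stand-in for the variable being cleaned

# Intended use: exactly one branch is run on the original data.
single_branch_sizes = []
for per_mille in BRANCHES:
    result = list(data)
    trim_in_place(result, per_mille)
    single_branch_sizes.append(len(result))
print("one branch at a time:", single_branch_sizes)

# Linear execution: every branch runs in turn on the same object.
linear_result = list(data)
for per_mille in BRANCHES:
    trim_in_place(linear_result, per_mille)
print("all branches in sequence:", len(linear_result))
```

The sequential run keeps fewer values than even the most aggressive single branch, so its result corresponds to none of the choices the author offered.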
>>>> >>>
>>>> >>> As such, I would not be comfortable distributing branching notebooks
>>>> >>> using the extension mechanism as I understand it to exist now because a) I
>>>> >>> feel it indirectly damages the type of scientific reproducibility and result
>>>> >>> trustworthiness I seek to advance, and b) I don't want to spend all my time
>>>> >>> fielding angry emails/bugreports from notebook authors who sent their
>>>> >>> notebooks to collaborators who didn't have the plugin installed.
>>>> >>>
>>>> >>>
>>>> >>>>
>>>> >>>>
>>>> >>>> <snip>
>>>> >>>>
>>>> >>>> > Consider the example of classifying new data based on a training set
>>>> >>>> > via
>>>> >>>> > KNN, SVM, and GLM approaches. These approaches all need different sets
>>>> >>>> > of
>>>> >>>> > parameters, return different types of objects as the output of the
>>>> >>>> > fitting
>>>> >>>> > function, may have subtly different behaviour when being used for
>>>> >>>> > prediction, etc.
>>>> >>>>
>>>> >>>> Yep, that is the big challenge with the branching idea in general.  It
>>>> >>>> is not always true that the members of the alt sets can be swapped
>>>> >>>> out.
>>>> >>>
>>>> >>>
>>>> >>> And under the model I am envisioning, that is actually an informative
>>>> >>> and queriable feature, rather than a drawback. See my discussion above
>>>> >>> regarding terminal branches.
>>>> >>>
>>>> >>>>
>>>> >>>>
>>>> >>>> <snip>
>>>> >>>>
>>>> >>>> I hope you can see that I really like the general idea and think the
>>>> >>>> usage cases you are describing are really important.  I think I can
>>>> >>>> speak for the project in saying that we want the notebook to be useful
>>>> >>>> for things like this.  But I think our abstractions are important
>>>> >>>> enough that we make every attempt to see how we can do these while
>>>> >>>> leveraging our existing abstractions.  This is partially a question
>>>> >>>> about implementation, but also partly a question about how the new
>>>> >>>> features are thought about.  The reason we don't like to break
>>>> >>>> abstractions for new features is that we have found an interesting
>>>> >>>> relationship between abstraction breaking and new features.  We have
>>>> >>>> found that when a new feature/idea breaks a core abstraction that we
>>>> >>>> have thought about very carefully, it is usually because the feature
>>>> >>>> has not been fully understood.  Time and time again, we have found
>>>> >>>> that when we take the time to fully understand the feature, it usually
>>>> >>>> fits within our abstractions beautifully and is even much better than
>>>> >>>> we ever imagined it could be.
>>>> >>>>
>>>> >>>> The plugin idea above is a perfect example of this.  By preserving the
>>>> >>>> abstractions, the new feature itself becomes a multiplier of even more
>>>> >>>> new functionality:
>>>> >>>>
>>>> >>>> * The resulting notebooks can still be version controlled.  This means
>>>> >>>> that the different alt-cells can be thrown into git and when we develop
>>>> >>>> a visual diff tool for notebooks, they will *just work*.
>>>> >>>
>>>> >>>
>>>> >>> I don't really understand this point. I have numerous fork-based
>>>> >>> non-linear notebooks under version control.
>>>> >>>
>>>> >>> Also, when you have a visual diff tool, it will successfully do something
>>>> >>> when given a linear+metadata branching notebook, but whether that something
>>>> >>> would be to deliver the information required to understand changes to
>>>> >>> non-linear notebooks  is less clear (and seems somewhat unlikely).
>>>> >>>
>>>> >>>>
>>>> >>>> * The notebooks can immediately leverage the abstractions we have put
>>>> >>>> into place for converting notebooks to different formats.  You could
>>>> >>>> write custom transformers to present the notebook as a reveal.js
>>>> >>>> presentation, giving alt-cells special treatment.
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> I could write custom transformers, this is true, but the default behavior
>>>> >>> would treat the notebook as if it actually were linear (instead of just
>>>> >>> being stored that way) which is problematic.
>>>> >>>
>>>> >>>
>>>> >>>>
>>>> >>>> * All of this can be done, and put into the hands of users, without going
>>>> >>>> through those overly conservative IPython developers ;-)
>>>> >>>> * It will just work with nbviewer as well.
>>>> >>>
>>>> >>>
>>>> >>> Again, I disagree. It would *display* in nbviewer, but not work, in that
>>>> >>> the display would be actively misleading regarding what the notebook would
>>>> >>> do when executed properly.
>>>> >>>
>>>> >>>>
>>>> >>>> * It provides a cleanly abstracted foundation for other people to build
>>>> >>>> upon
>>>> >>>
>>>> >>>
>>>> >>> I agree that this is important, but it is not clear to me that it would
>>>> >>> be more true in the case that I created the extension via custom JS than it
>>>> >>> would if nesting were supported in the actual ipynb format and core notebook
>>>> >>> mechanisms.
>>>> >>>
>>>> >>>>
>>>> >>>>
>>>> >>>> In summary, we are trying to build an architecture that allows a few
>>>> >>>> simple abstractions (we actually don't have that many!) to combine in
>>>> >>>> boundless ways to create features we never planned on, but that "just
>>>> >>>> work".
>>>> >>>
>>>> >>>
>>>> >>> I agree that the customjs + metadata extensions approach is very powerful
>>>> >>> and almost infinitely versatile. I think it is great for extensions which
>>>> >>> change appearance/rendering/UI details of how the notebook behaves.
>>>> >>>
>>>> >>> As far as I can see, however,  it has some significant problems with
>>>> >>> regard to extensions which fundamentally change non-rendering behavior of
>>>> >>> notebooks (please correct me if I'm wrong), namely:
>>>> >>>
>>>> >>> * There is no guarantee that notebooks authored using an extension which
>>>> >>>   alters fundamental behaviors will work or visibly fail in the absence of
>>>> >>>   that extension
>>>> >>> * There is no way for an individual notebook to require a particular
>>>> >>>   extension
>>>> >>> * There is no way to ensure that two extensions are compatible with
>>>> >>>   each-other
>>>> >>> * There is no standard/unified way for end-users to install extensions
>>>> >>> * There is no way for users to determine which extensions they have
>>>> >>>
>>>> >>> The first point is not true of extensions which exclusively affect
>>>> >>> rendering and UI, making the rest of the points minor nuisances rather than
>>>> >>> critical issues.
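
To make the first two points concrete, here is a sketch of what "visibly fail" could look like if a notebook were able to declare the extensions it depends on. No such mechanism is described in this thread: the `required_extensions` metadata key, the function name, and the exception class are all hypothetical.

```python
class MissingExtensionError(RuntimeError):
    """Raised when a notebook needs an extension that is not installed."""


def check_required_extensions(notebook, installed):
    """Refuse to proceed unless every declared extension is available.

    `notebook` is a plain dict; `required_extensions` is a hypothetical
    notebook-level metadata key holding a list of extension names.
    """
    required = notebook.get("metadata", {}).get("required_extensions", [])
    missing = sorted(set(required) - set(installed))
    if missing:
        raise MissingExtensionError(
            "notebook requires missing extension(s): " + ", ".join(missing)
        )
    return list(required)


notebook = {"metadata": {"required_extensions": ["alt-cells"]}, "cells": []}

try:
    check_required_extensions(notebook, installed=[])
except MissingExtensionError as err:
    print("refusing to run:", err)

print(check_required_extensions(notebook, installed=["alt-cells"]))
```

A check of this kind would turn the silent wrong-result case into a loud failure, which is the property the list above says is missing for behavior-changing extensions.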
>>>> >>>
>>>> >>> Looking forward to hearing your (further) thoughts about this stuff and
>>>> >>> hopefully meeting you in person soon.
>>>> >>>
>>>> >>> ~G
>>>> >>>
>>>> >>> --
>>>> >>> Gabriel Becker
>>>> >>> Graduate Student
>>>> >>> Statistics Department
>>>> >>> University of California, Davis
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Gabriel Becker
>>>> >> Graduate Student
>>>> >> Statistics Department
>>>> >> University of California, Davis
>>>> >>
>>>> >> _______________________________________________
>>>> >> IPython-dev mailing list
>>>> >> IPython-dev at scipy.org
>>>> >> http://mail.scipy.org/mailman/listinfo/ipython-dev
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Fernando Perez (@fperez_org; http://fperez.org)
>>>> > fperez.net-at-gmail: mailing lists only (I ignore this when swamped!)
>>>> > fernando.perez-at-berkeley: contact me here for any direct mail
>>>> >
>>>> >
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Brian E. Granger
>>>> Cal Poly State University, San Luis Obispo
>>>> bgranger at calpoly.edu and ellisonbg at gmail.com
>>> 
>>> 
>>> 
>>> -- 
>>> Gabriel Becker
>>> Graduate Student
>>> Statistics Department
>>> University of California, Davis
>>> 
>> 
>> 
>> 
>> -- 
>> Damián Avila
>> Scientific Python Developer
>> Quantitative Finance Analyst
>> Statistics, Biostatistics and Econometrics Consultant
>> Biochemist
>> 
> 
> 
> 
> -- 
> Gabriel Becker
> Graduate Student
> Statistics Department
> University of California, Davis