[IPython-dev] Some new cell types for describing data analyses in IPy. Notebook

Tue Jul 2 23:44:14 EDT 2013

On Tue, Jul 2, 2013 at 12:56 PM, Gabriel Becker <gmbecker at ucdavis.edu> wrote:

> Min (or Ben, you didn't sign your mail so I'm not sure what you go by) et
> al.,
>
> I think I may not have done a good job describing the goals and potential
> uses of some of the features I've implemented. Please allow me to try
> again, after which I will respond to your specific comments.
>
> First off, I am thinking about notebooks in the context of data analysis
> scripts. This is not the only use for IPython notebooks, but I think you'll
> agree it is a major and important one.
>
> Imagine if authors could easily create documents allowing themselves,
> reviewers, students, and other researchers to see, explore, understand, and
> reproduce the *data analysis process itself*, rather than just the final
> results generated at the end of a long and complex process.
>
> Data analyses are not simple linear/sequential affairs. It is much more
> apt to view the actions taken during research/data analysis within the
> framework of a directed graph. For any non-trivial analysis, there is no
> single block of (non-repetitious) code which can encompass the *research
> process* which led to the generation of the final results.
>
> Generally a large amount of code and parameter configurations exist which
> contributed substantially to the analysis but would not appear in the
> linear block of code which generates the final computational results of the
> study from the raw data. By relaxing the linear/sequential structural
> assumption on notebooks, we can gain the ability to record, represent and
> explore the larger research process *without losing any existing
> functionality* (a set of nodes connected along a straight line is itself a
> special case of a directed graph).
>
> As an example see the attached screenshots.
>
> This notebook represents a quite trivial "data analysis" with three steps:
> read the data, clean the data, and create a plot of the data. As with any
> real dataset, however, the analyst has numerous options in exactly how to
> clean the data. In particular, this notebook records 3 particular choices
> the analyst makes: whether to remove rows with missing data, and
> whether/how to trim the data based on values of the two variables being
> plotted (price and lsqft). These three simple choices combine to create 18
> distinct data cleaning strategies.
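
The count of 18 can be checked with a short Python sketch. The option
names below are hypothetical, and the 2 x 3 x 3 split (two options for
missing rows, three trimming options for each of price and lsqft) is an
assumption; it is one factorization consistent with the figure quoted above.

```python
from itertools import product

# Hypothetical option labels. The 2 x 3 x 3 split is assumed, not stated
# in the original message.
missing_rows = ["keep", "drop"]
price_trim = ["none", "trim_a", "trim_b"]
lsqft_trim = ["none", "trim_a", "trim_b"]

# Every combination of the three choices is one cleaning strategy.
strategies = list(product(missing_rows, price_trim, lsqft_trim))
print(len(strategies))
```

Each tuple in `strategies` corresponds to one path an analyst could take
through the cleaning step.
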
>
> As you can see from the screenshots the resulting plot looks substantially
> different depending on the strategy that is chosen. This is a simplified,
> concrete example of the implications that choices often not explicitly
> reflected in the final code/discussion can have on analysis results. With
> non-sequential notebooks these choices can be represented by authors and
> reproduced/assessed by readers and reviewers.
>
> I hope that this offered a clearer picture of the goals and motivations
> behind what I'm doing. In terms of your specific comments, thank you for
> responding. Please see my inline comments below:
>
> On Mon, Jul 1, 2013 at 12:59 PM, MinRK <benjaminrk at gmail.com> wrote:
>
>> Very interesting!
>>
>> I think your interactive code cells and task cells will be covered by
>> plans we already have in the works.
>>
>> - with the rich widget API that is the primary milestone for 2.0, you
>> should be able to do all of the GUI-type controls via rich display data and
>> kernel callbacks.
>>
>
> I had heard that but had already implemented the cell type myself. If I
> understand correctly this will be possible with your 2.0, which is very
> exciting. The one thing that I feel is very important here is that the code
> which actually resides in the cell be identical between interactive and
> non-interactive versions. If that can/will be the case then it seems like
> you guys will have met this (perceived, by me) need.
>
>
>> - task cells encapsulating a segment of the notebook should be covered by
>> the plan to expose UI based on heading cells, to be able to treat
>> 'sections' of a notebook as discrete entities (cut/copy/move/hide, tab
>> view, run, etc.)
>>
>
> I'm not as sure about this one. Task cells can be nested within other task
> cells; will this be true of how you treat heading cell sections?
>

Yes, headings have levels and can thus express nested hierarchy. The UI
only exposes six levels, but there's no actual limit to the level of
nesting.
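
As a rough illustration of how heading levels imply nesting, here is a
minimal Python sketch. The dict-based cell layout and the name
`nest_by_heading` are made up for this example and are not the notebook's
actual format or API.

```python
def nest_by_heading(cells):
    """Group a flat list of cells into nested sections by heading level.

    Cells are hypothetical dicts: headings carry "level" and "text",
    all other cells are attached to the innermost open section.
    """
    root = {"level": 0, "title": None, "children": []}
    stack = [root]  # chain of currently open sections, outermost first
    for cell in cells:
        if cell["type"] == "heading":
            # A heading closes every open section at the same or deeper level.
            while stack[-1]["level"] >= cell["level"]:
                stack.pop()
            section = {"level": cell["level"], "title": cell["text"],
                       "children": []}
            stack[-1]["children"].append(section)
            stack.append(section)
        else:
            stack[-1]["children"].append(cell)
    return root


cells = [
    {"type": "heading", "level": 1, "text": "Analysis"},
    {"type": "code", "source": "read_data()"},
    {"type": "heading", "level": 2, "text": "Clean"},
    {"type": "code", "source": "drop_missing()"},
    {"type": "heading", "level": 3, "text": "Trim"},
    {"type": "code", "source": "trim_price()"},
    {"type": "heading", "level": 2, "text": "Plot"},
    {"type": "code", "source": "plot()"},
]
tree = nest_by_heading(cells)
```

Nothing in the sketch caps the heading level, so the depth of the
resulting tree is limited only by the levels present in the input.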

>
> Also, it seems like the plan is to get groups of cells to act like a
> single cell. With the task cell (you could easily call them section cells,
> or whathaveyou) approach, all of what you describe comes for free based
> on machinery you guys have already implemented. I haven't been a party to
> your discussions, so maybe I'm missing something, but I'm not clear what the
> benefit of implementing new machinery to simulate (a special case of) complex
> structured notebooks is over simply supporting the actual structure and
> getting the desired functionality for free along with a lot more.
>
>
>>
>> The altset is an interesting idea that we don't have a model for. My
>> guess is that you will actually be able to represent this with an
>> interactive widget via the new API, but I'm not certain as we haven't built
>> that yet.
>>
>
> Here I disagree that this would be the right approach even if possible. I
> might be able to hack something together using the future API that I could
> insert into a sequential document which will render it *as if* it had a
> branching structure, though this would require cells/controls to be able to
> affect the rendering of ostensibly unrelated cells which would be a bit
> odd. Even if it is possible, however, rendering is not the crux of the
> issue.
>
> There are other things which are natural to want to do given a
> non-sequential document. We will want to generate a linear article-like view
> (pdf) that is a rendering of a particular path through the document. We
> might also want to add the concepts of terminal branches (important things
> the analyst tried, but which don't fit back into the rest of the flow) or
> assertions regarding particular combinations of branch choices at different
> points in the document which must or cannot be made together.
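
The idea of rendering one particular path through a branching document
can be sketched in Python as follows. The node layout ("cell", "task",
"altset") and the function name `linearize` are hypothetical and chosen
only for illustration; this is not the data model of the fork described
in this thread.

```python
def linearize(node, choices):
    """Return the cell sources along one chosen path through the document.

    node: hypothetical dict whose "kind" is "cell", "task", or "altset".
    choices: maps an altset's "name" to the index of its active branch.
    """
    kind = node["kind"]
    if kind == "cell":
        return [node["source"]]
    if kind == "task":
        # A task contributes its children in order.
        out = []
        for child in node["children"]:
            out.extend(linearize(child, choices))
        return out
    if kind == "altset":
        # Only the selected branch of an altset contributes.
        branch = node["branches"][choices[node["name"]]]
        return linearize(branch, choices)
    raise ValueError(f"unknown node kind: {kind!r}")


doc = {"kind": "task", "children": [
    {"kind": "cell", "source": "read data"},
    {"kind": "altset", "name": "missing", "branches": [
        {"kind": "cell", "source": "keep rows with missing values"},
        {"kind": "task", "children": [
            {"kind": "cell", "source": "drop rows with missing values"},
            {"kind": "cell", "source": "report rows dropped"},
        ]},
    ]},
    {"kind": "cell", "source": "plot"},
]}

path = linearize(doc, {"missing": 1})
```

Because the branch structure is explicit in the tree, a different linear
view needs only a different `choices` mapping.
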
>

This, together with the Student/Teacher/Answer-key model, is an approach to
the notebook that is not yet well served, but highly attractive.  There is
resistance to adding new cell types, because there is great inertia in the
cell type part of the code (adding or removing cell types is a very significant
operation, with major backward-incompatible side effects).  That's why we
try to express ideas as much as we can through the dynamic output
mechanism, which is much more flexible, generic, and extensible.  It may
well be that new cell types are necessary, but we want to be sure that
existing (and planned) APIs are inadequate before making such a change.

-Min RK

>
> These things are relatively straightforward to do if we have access to the
> actual structure but much more complicated and difficult if the structure
> is simulated solely during rendering. Furthermore, if we did find a way to
> implement them we would be working very hard to simulate supporting
> structured documents without actually supporting them (which is actually
> much easier).
>
> I'm looking forward to continued dialogue about these ideas with you, the
> rest of the IPython team, and the IPython Notebook community at large.
>
> Thanks,
> ~G
>
>
>>
>>
>> On Mon, Jul 1, 2013 at 11:20 AM, Gabriel Becker <gmbecker at ucdavis.edu> wrote:
>>
>>> Hey all,
>>>
>>> As part of my research into capturing the data analysis process in
>>> documents, I have been working on some extensions to the IPython Notebook
>>> as a way of implementing proof-of-concepts for certain ideas my advisors
>>> and I have had. I think the subscribers of this list might find them
>>> interesting and would love to hear what you guys think.
>>>
>>> I have posted a screencast showcasing and explaining the work here:
>>> https://www.youtube.com/watch?v=iQPagwhad_8 and will "briefly" describe
>>> it in text below.
>>>
>>> I've implemented 3 fundamentally different new cell types in a fork of
>>> the IPython codebase: interactive code cells, task cells, and alternatives
>>> set cells. To be clear, my goal is absolutely not to replace IPython
>>> Notebook. I am simply leveraging their excellent core application to
>>> explore some new ideas about representing data analyses in documents.
>>> Descriptions of the cell types follow.
>>>
>>> *interactivecode cells:* Interactive code cells are code cells which
>>> have additional information attached to them allowing them to render a UI
>>> control which controls one (or more) values within the code and re-executes
>>> the code when the control is used to change the value. Example: controlling
>>> the bandwidth of a kernel regression estimator via a slider.
>>>
>>> *task cells*: Task cells are cells that can contain other cells
>>> (including nested task cells or altset cells). They are used to group
>>> conceptually linked content and can be executed in order to execute all the
>>> cells they contain with a single command. They are primarily for
>>> organization. Example: the data cleaning task during a data analysis would
>>> likely contain multiple code and exposition blocks which fit conceptually
>>> within a single goal.
>>>
>>> *altset cells:* Alternatives set (altset) cells represent a point in an
>>> analysis where multiple approaches were tried before the analyst decided on
>>> a final strategy. An altset contains two or more branches representing
>>> these different approaches, only one of which can be active at a time. This
>>> allows an analyst to capture the entire research process in their IPython
>>> notebook in a structurally meaningful way, rather than just the final
>>> approach.
>>>
>>> Finally, when the structure of a document actually contains information
>>> about the research process, there are a bunch of really cool things we can
>>> do when querying, processing, executing, and rendering the document which
>>> are difficult or impossible without this extra information. This email has
>>> already gotten quite long, however, so I will leave discussion of those to
>>> another time.
>>>
>>> I'd love to hear what people think of these concepts, so please share
>>> your thoughts.
>>>
>>> Thanks for reading and thanks to the IPython core team for their great
>>> work.
>>> ~G
>>>
>>>
>>> Gabriel Becker
>>> Graduate Student
>>> Statistics Department
>>> University of California, Davis
>>>
>>> _______________________________________________
>>> IPython-dev mailing list
>>> IPython-dev at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/ipython-dev
>>>
>>>
>>
>>
>>
>
>
> --
> Gabriel Becker
> Graduate Student
> Statistics Department
> University of California, Davis
>
>
>