[IPython-dev] Some new cell types for describing data analyses in IPy. Notebook

Gabriel Becker gmbecker at ucdavis.edu
Tue Jul 2 15:56:31 EDT 2013


Min (or Ben, you didn't sign your mail so I'm not sure what you go by) et
al.,

I think I may not have done a good job describing the goals and potential
uses of some of the features I've implemented. Please allow me to try
again, after which I will respond to your specific comments.

First off, I am thinking about notebooks in the context of data analysis
scripts. This is not the only use for IPython notebooks, but I think you'll
agree it is a major and important one.

Imagine if authors could easily create documents allowing themselves,
reviewers, students, and other researchers to see, explore, understand, and
reproduce the *data analysis process itself*, rather than just the final
results generated at the end of a long and complex process.

Data analyses are not simple linear/sequential affairs. It is much more apt
to view the actions taken during research/data analysis within the
framework of a directed graph. For any non-trivial analysis, there is no
single block of (non-repetitious) code which can encompass the *research
process* which led to the generation of the final results.

Generally a large amount of code and parameter configurations exist which
contributed substantially to the analysis but would not appear in the
linear block of code which generates the final computational results of the
study from the raw data. By relaxing the linear/sequential structural
assumption on notebooks, we can gain the ability to record, represent and
explore the larger research process *without losing any existing
functionality* (a set of nodes connected along a straight line is itself a
special case of a directed graph).

As an example see the attached screenshots.

This notebook represents a quite trivial "data analysis" with three steps:
read the data, clean the data, and create a plot of the data. As with any
real dataset, however, the analyst has numerous options in exactly how to
clean the data. In particular, this notebook records 3 particular choices
the analyst makes: whether to remove rows with missing data, and
whether/how to trim the data based on values of the two variables being
plotted (price and lsqft). These three simple choices combine to create 18
distinct data cleaning strategies.
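
To make the count concrete, here is a minimal Python sketch of how three
choices can multiply out to 18 strategies. It assumes two options for the
missing-data choice and three for each trimming choice (2 * 3 * 3 = 18). That
split and the option labels are illustrative assumptions, not taken from the
notebook in the screenshots.

```python
from itertools import product

# Hypothetical option labels. The text above only states that there are
# three choices and 18 combinations in total.
missing_rows = ["keep", "drop"]
price_trim = ["none", "trim_a", "trim_b"]
lsqft_trim = ["none", "trim_a", "trim_b"]

# Every combination of one option per choice is one cleaning strategy.
strategies = list(product(missing_rows, price_trim, lsqft_trim))

print(len(strategies))  # prints 18
```

Each tuple in `strategies` corresponds to one path an analyst could take
through the cleaning step.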

As you can see from the screenshots the resulting plot looks substantially
different depending on the strategy that is chosen. This is a simplified,
concrete example of the implications that choices often not explicitly
reflected in the final code/discussion can have on analysis results. With
non-sequential notebooks these choices can be represented by authors and
reproduced/assessed by readers and reviewers.

I hope that this offered a clearer picture of the goals and motivations
behind what I'm doing. In terms of your specific comments, thank you for
responding. Please see my inline comments below:

On Mon, Jul 1, 2013 at 12:59 PM, MinRK <benjaminrk at gmail.com> wrote:

> Very interesting!
>
> I think your interactive code cells and task cells will be covered by
> plans we already have in the works.
>
> - with the rich widget API that is the primary milestone for 2.0, you
> should be able to do all of the GUI-type controls via rich display data and
> kernel callbacks.
>

I had heard that but had already implemented the cell type myself. If I
understand correctly this will be possible with your 2.0, which is very
exciting. The one thing that I feel is very important here is that the code
which actually resides in the cell be identical between interactive and
non-interactive versions. If that can/will be the case then it seems like
you guys will have met this (perceived, by me) need.


> - task cells encapsulating a segment of the notebook should be covered by
> the plan to expose UI based on heading cells, to be able to treat
> 'sections' of a notebook as discrete entities (cut/copy/move/hide, tab
> view, run, etc.)
>

I'm not as sure about this one. Task cells can be nested within other task
cells; will this be true of how you treat heading cell sections?

Also, it seems like the plan is to get groups of cells to act like a single
cell. With the task cell (you could easily call them section cells, or
what have you) approach, all of what you describe comes for free based on
machinery you guys have already implemented. I haven't been a party to your
discussions, so maybe I'm missing something, but I'm not clear on what the
benefit of implementing new machinery to simulate (a special case of) complex
structured notebooks is over simply supporting the actual structure and
getting the desired functionality for free, along with a lot more.


>
> The altset is an interesting idea that we don't have a model for. My guess
> is that you will actually be able to represent this with an interactive
> widget via the new API, but I'm not certain as we haven't built that yet.
>

Here I disagree that this would be the right approach even if possible. I
might be able to hack something together using the future API that I could
insert into a sequential document which will render it *as if* it had a
branching structure, though this would require cells/controls to be able to
affect the rendering of ostensibly unrelated cells which would be a bit
odd. Even if it is possible, however, rendering is not the crux of the
issue.

There are other things which are natural to want to do given a
non-sequential document. We will want to generate a linear article-like view
(pdf) that is a rendering of a particular path through the document. We
might also want to add the concepts of terminal branches (important things
the analyst tried, but which don't fit back into the rest of the flow) or
assertions regarding particular combinations of branch choices at different
points in the document which must or cannot be made together.
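
As a rough illustration of why having the actual structure helps, the
following Python sketch models a document as a tree and flattens one path
through it into a linear list of cells. The class names (Cell, Task, AltSet)
and the linearize function are hypothetical, chosen for this example only.
They are not the code in the fork or any IPython API.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Cell:
    """An ordinary leaf cell (code or text)."""
    source: str


@dataclass
class Task:
    """A container cell grouping related cells, possibly nested."""
    children: List["Node"] = field(default_factory=list)


@dataclass
class AltSet:
    """A branch point: several alternatives, exactly one active."""
    branches: List[List["Node"]]
    active: int = 0


Node = Union[Cell, Task, AltSet]


def linearize(nodes: List[Node]) -> List[str]:
    """Flatten the tree, following only the active branch of each AltSet."""
    out: List[str] = []
    for node in nodes:
        if isinstance(node, Cell):
            out.append(node.source)
        elif isinstance(node, Task):
            out.extend(linearize(node.children))
        elif isinstance(node, AltSet):
            out.extend(linearize(node.branches[node.active]))
    return out


notebook = [
    Cell("read data"),
    Task([
        AltSet(
            branches=[
                [Cell("drop rows with missing data")],
                [Cell("keep rows with missing data")],
            ],
            active=1,
        ),
    ]),
    Cell("plot"),
]

print(linearize(notebook))
# ['read data', 'keep rows with missing data', 'plot']
```

Changing `active` on an AltSet selects a different path, and the same
traversal yields the corresponding linear view. Constraints across branch
choices could be checked against the same tree before rendering.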

These things are relatively straightforward to do if we have access to the
actual structure but much more complicated and difficult if the structure
is simulated solely during rendering. Furthermore, if we did find a way to
implement them we would be working very hard to simulate supporting
structured documents without actually supporting them (which is actually
much easier).

I'm looking forward to continued dialogue about these ideas with you, the
rest of the IPython team, and the IPython Notebook community at large.

Thanks,
~G


>
>
> On Mon, Jul 1, 2013 at 11:20 AM, Gabriel Becker <gmbecker at ucdavis.edu> wrote:
>
>> Hey all,
>>
>> As part of my research into capturing the data analysis process in
>> documents, I have been working on some extensions to the IPython Notebook
>> as a way of implementing proof-of-concepts for certain ideas my advisors
>> and I have had. I think the subscribers of this list might find them
>> interesting and would love to hear what you guys think.
>>
>> I have posted a screencast showcasing and explaining the work here:
>> https://www.youtube.com/watch?v=iQPagwhad_8 and will "briefly" describe
>> it in text below.
>>
>> I've implemented 3 fundamentally different new cell types in a fork of
>> the IPython codebase: interactive code cells, task cells, and alternatives
>> set cells. To be clear, my goal is absolutely not to replace IPython
>> Notebook. I am simply leveraging their excellent core application to
>> explore some new ideas about representing data analyses in documents.
>> Descriptions of the cell types follow.
>>
>> *interactivecode cells:* Interactive code cells are code cells which
>> have additional information attached to them allowing them to render a UI
>> control which controls one (or more) values within the code and re-executes
>> the code when the control is used to change the value. Example: controlling
>> the bandwidth of a kernel regression estimator via a slider.
>>
>> *task cells*: Task cells are cells that can contain other cells
>> (including nested task cells or altset cells). They are used to group
>> conceptually linked content and can be executed in order to execute all the
>> cells they contain with a single command. They are primarily for
>> organization. Example: the data cleaning task during a data analysis would
>> likely contain multiple code and exposition blocks which fit conceptually
>> within a single goal.
>>
>> *altset cells:* Alternatives set (altset) cells represent a point in an
>> analysis where multiple approaches were tried before the analyst decided on
>> a final strategy. An altset contains two or more branches representing
>> these different approaches, only one of which can be active at a time. This
>> allows an analyst to capture the entire research process in their IPython
>> notebook in a structurally meaningful way, rather than just the final
>> approach.
>>
>> Finally, when the structure of a document actually contains information
>> about the research process, there are a bunch of really cool things we can
>> do when querying, processing, executing, and rendering the document which
>> are difficult or impossible without this extra information. This email has
>> already gotten quite long, however, so I will leave discussion of those to
>> another time.
>>
>> I'd love to hear what people think of these concepts, so please share
>> your thoughts.
>>
>> Thanks for reading and thanks to the IPython core team for their great
>> work.
>> ~G
>>
>>
>> Gabriel Becker
>> Graduate Student
>> Statistics Department
>> University of California, Davis
>>
>> _______________________________________________
>> IPython-dev mailing list
>> IPython-dev at scipy.org
>> http://mail.scipy.org/mailman/listinfo/ipython-dev
>>
>>
>
>
>


-- 
Gabriel Becker
Graduate Student
Statistics Department
University of California, Davis
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BranchingAnalysisNotebook.png
Type: image/png
Size: 255859 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/ipython-dev/attachments/20130702/057e108e/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BranchingAnalysisNotebook2.png
Type: image/png
Size: 253708 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/ipython-dev/attachments/20130702/057e108e/attachment-0001.png>

