[IPython-dev] Fwd: [SciTech] ActivePapers Python edition

Mon Dec 9 12:14:34 EST 2013

Hey all,

I'm actually drafting a thesis chapter about this very subject right now in
the context of so-called "dynamic documents" in the R realm such as those
processed by Sweave[1] and knitr[2]. The issue is with how we determine and
track dependency.

The current practice is to think of dependencies in terms of cells; as in
Cell 2 depends on Cell 1, so if Cell 1 is changed (or has been run more
recently than Cell 2) then Cell 2 needs to be rerun. This is a proxy for
what we actually care about, which is whether our cached result for Cell 2
is safe to use.

For code implementing deterministic algorithms, and assuming any external
data and software versions remain unchanged, the result of running a piece
of code can change only if its inputs change. By inputs I mean the
variables the code uses without first assigning values to them, i.e. the
ones it assumes already exist.

Under the assumption that a dynamic document (notebook) will be run in
order from start to finish, checking for changes in cells previous to a
particular cell (either all of them or via more sophisticated dependency
detection) is sufficient to determine *whether it is possible that the
inputs to the cell have changed*. This is what allows the proxy of
tracking cell contents to be quite successful (and theoretically sound) at
predicting result changes.

In the case of interactive execution (or non-linear documents), however,
this proxy breaks down. It is not sufficient to check Cell 1 to determine
whether to rerun Cell 2 if we have no guarantee that Cell 1 was the last
cell executed.

The solution to this that I am pursuing is essentially memoization of code
cells: dependency detection and caching are keyed on the actual values of
the inputs used by a cell, rather than on either the names of the input
variables (as in weaver[3] and knitr[2]) or simple position in the
document. This means that if you have a certain input state, it doesn't
matter whether you arrived there by executing Cell 1 or by some more
convoluted method. The answer will be the same, so the same cache will be
used (think equivalence partitioning for code cells).

The input detection necessary to do this can be difficult in a general
language like Python, but with some assumptions about the type of code
being put in notebooks (and/or allowing authors to explicitly declare input
variables) it can be manageable.
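
For a sense of what such input detection might look like on the Python
side, here is a small sketch using the standard ast module. It is an
assumption-laden toy, not something from RCacheSuite: the InputFinder class
and find_inputs function are invented for this example, and it only
understands straight-line code with plain assignments, augmented
assignments, for loops and imports. Function definitions, comprehensions,
exec/eval, and anything that touches globals() would fool it.

```python
import ast
import builtins


class InputFinder(ast.NodeVisitor):
    """Collect names that are read before the cell itself assigns them."""

    def __init__(self):
        self.assigned = set()
        self.inputs = []

    def _load(self, name):
        if (name not in self.assigned and name not in self.inputs
                and not hasattr(builtins, name)):
            self.inputs.append(name)

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            self.assigned.add(node.id)
        else:
            self._load(node.id)

    def visit_Assign(self, node):
        self.visit(node.value)            # right-hand side is evaluated first
        for target in node.targets:
            self.visit(target)

    def visit_AugAssign(self, node):
        self.visit(node.value)
        if isinstance(node.target, ast.Name):
            self._load(node.target.id)    # `x += 1` reads x before writing it
        self.visit(node.target)

    def visit_For(self, node):
        self.visit(node.iter)             # iterable is read before the loop
        self.visit(node.target)           # loop variable is then assigned
        for stmt in node.body + node.orelse:
            self.visit(stmt)

    def visit_Import(self, node):
        for alias in node.names:
            self.assigned.add((alias.asname or alias.name).split(".")[0])

    visit_ImportFrom = visit_Import


def find_inputs(code):
    """Return the cell's apparent inputs, in order of first use."""
    finder = InputFinder()
    finder.visit(ast.parse(code))
    return finder.inputs


cell = """
import math
total = 0
for v in values:
    total += v * scale
result = math.sqrt(total) + offset
"""
print(find_inputs(cell))
```

Letting the author declare inputs explicitly sidesteps the cases this kind
of static analysis cannot see.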

I have implemented this concept for R code in the RCacheSuite R package[4]
(currently on GitHub, soon to be on CRAN), as well as in a custom
%%Rcaching magic that lives in my experimental fork of the IPython
project[5]. The magic simply uses the RCacheSuite function though, so
there is not much new there beyond what is in [4].

Happy to chat more about this (who doesn't want to talk about their
thesis?), but this is already getting a bit long, so I'll sign off for now.

~G

P.S. I'm sure this goes without saying, but please remember to cite me if
you end up pursuing this strategy.

[1] http://www.stat.uni-muenchen.de/~leisch/Sweave/
[2] http://yihui.name/knitr/
[3] http://www.bioconductor.org/packages/release/bioc/html/weaver.html
[4] https://github.com/gmbecker/RCacheSuite
[5] https://github.com/gmbecker/ipython

On Sun, Dec 8, 2013 at 11:40 AM, Thomas Kluyver <takowl at gmail.com> wrote:

>
> On 8 December 2013 03:16, Konrad Hinsen <konrad.hinsen at fastmail.net> wrote:
>
>> > Yes - the prompt numbers by the cells record the order of execution.
>>
>> If that's the same information shown in the browser, then it's only
>> the number of the last execution that remains available. And that
>> means that I cannot reconstruct the history of execution. To make it
>> worse, a cell could have been modified and re-executed, so the
>> original code is no longer available.
>>
>
> That's all the information available in the notebook frontend, but the
> kernel records every cell executed, so there's a record of all the code
> that has run in the present session. The kernel knows nothing about the
> notebook document, so if for instance a cell has been edited since it was
> executed, there's no easy way to know which cell that execution represents.
> However, you can easily enough tell whether the current notebook has been
> run through once with no editing, by comparing the list of code cells to
> the history.
>
>
>>
>>  > However, I think that the reproducible case should normally be the
>>  > one where the cells are run in order from top to bottom. Running
>>  > them out of order is convenient, but to check that a notebook all
>>  > works together, it's common to clear all output, reset the kernel,
>>  > and run all cells in order.
>>
>> That's what people should do, but I still need to decide what to do
>> when a user runs cells randomly and then wants to save the notebook to
>> an ActivePapers backend. Since any result datasets are in a
>> non-reproducible state, the options are
>>
>>   1) Refuse with an error message.
>>   2) Re-run the notebook in order before saving.
>>   3) Save the notebook, but mark it and all its results as stale.
>>      Stale datasets are marked as such and require an update
>>      before they can be used in some other computation.
>>
>> I'll probably go for 3) and see how this works out.
>
>
> 3 sounds like the best option to me: you can save the information, but it
> shouldn't be considered as the fixed form for later calculations.
> Presumably ActivePapers will be able to re-run the cells in order if it
> needs to freshen the results, without the user having to manually open the
> notebook and execute the cells.
>
> Thomas

-- 
Gabriel Becker
Graduate Student
Statistics Department
University of California, Davis