[Numpy-discussion] Thoughts on persistence/object tracking in scientific code

Luis Pedro Coelho lpc at cmu.edu
Mon Dec 29 18:25:05 EST 2008


Hello all,

On Monday 29 December 2008 17:40:07 Gael Varoquaux wrote:
> It is interesting to see that you take a slightly different approach than
> the others already discussed. This probably stems from the fact that you
> are mostly interested in parallelism, whereas there are other adjacent
> problems that can be solved by similar abstractions. In particular, I
> have the impression that you do not deal with what I call "lazy
> re-evaluation". In other words, I am not sure whether you track results
> closely enough to know whether an intermediate result should be re-run,
> or whether you run a 'clean' between each run to avoid this problem.

I do. As long as the hash (computed from the arguments to the function) is 
the same, the code loads objects from disk instead of recomputing them. I 
don't track the actual source code, though, only whether the parameters have 
changed (but this could be a later addition).
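
Schematically, the idea is something like the following (a minimal sketch, 
not jug's actual code; the task_hash() and load_or_compute() helpers are 
made-up names):

import hashlib
import os
import pickle

def task_hash(func, args, kwargs):
    # Hash the function name and a textual description of the
    # arguments, not the computed values.
    key = repr((func.__name__, args, sorted(kwargs.items())))
    return hashlib.sha1(key.encode('utf-8')).hexdigest()

def load_or_compute(func, *args, **kwargs):
    # If a result with this hash is already on disk, load it;
    # otherwise compute it and save it for the next run.
    fname = task_hash(func, args, kwargs) + '.pkl'
    if os.path.exists(fname):
        with open(fname, 'rb') as f:
            return pickle.load(f)
    result = func(*args, **kwargs)
    with open(fname, 'wb') as f:
        pickle.dump(result, f)
    return result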

> I must admit I moved away from using hashes to store objects on disk
> because I am very much interested in traceability, and I wanted my
> objects to have meaningful names, and to be stored in convenient formats
> (pickle, numpy .npy, hdf5, or domain-specific). I have now realized that
> explicit naming is convenient, but it should be optional.

But a hash is not so impenetrable as long as you can easily get at the files 
you want.

If I want to load the results of a partial computation, all I have to do is 
generate the same Task objects as the initial computation and load those: I 
can run jugfile.py inside ipython and call the appropriate load() methods 
(here tasks stands for the collection of Task objects defined in the 
jugfile):

ipython jugfile.py

In [1]: interesting = [t for t in tasks if t.name == 'something.other']
In [2]: intermediate = interesting[0].load()

> I did notice too that using the argument values for hashing was bound to
> fail in all but the simplest cases. This is the immediate limitation of
> the famous memoize pattern when applied to scientific code. If I
> understand correctly, what you do is track the 'history' of the
> object and use it as a hash for the object, right? I had come to the
> conclusion that the history of objects should be tracked, but I hadn't
> realized that using it as a hash was also a good way to solve the scoping
> problem. Thanks for the trick.

Yes, let's say I have the following (features and kmeans are user-defined 
functions):

from glob import glob
from jug import Task

feats = [Task(features, img) for img in glob('*.png')]
cluster = Task(kmeans, feats, k=10)

then the hash for cluster is computed from its arguments:
	* kmeans: the function name.
	* feats: this is a list of tasks, so I use their hashes, each of which is 
in turn defined by its own arguments (here, a simple filename string).
	* k=10: this is a literal.

I don't need the value computed by feats to compute the hash for cluster.
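
To make the recursion concrete, here is a rough sketch of the scheme 
(illustrative code, not jug's actual implementation; the describe() helper 
is hypothetical):

import hashlib

class Task(object):
    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def hash(self):
        # Hash the function name plus a description of each argument;
        # nested Tasks contribute their own hash, so computed values
        # are never needed.
        h = hashlib.sha1(self.func.__name__.encode('utf-8'))
        for arg in list(self.args) + sorted(self.kwargs.items()):
            h.update(describe(arg).encode('utf-8'))
        return h.hexdigest()

def describe(arg):
    # A nested Task is represented by its hash; a list by the
    # descriptions of its elements; anything else by its repr.
    if isinstance(arg, Task):
        return arg.hash()
    if isinstance(arg, list):
        return '[' + ','.join(describe(a) for a in arg) + ']'
    return repr(arg)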

> Your task-based approach, and the API you have built around it, reminds
> me a bit of Twisted's Deferred. Have you studied this API?

No. I will look into it. Thanks.

bye,
Luis


