[SciPy-User] tool for running simulations

Gael Varoquaux gael.varoquaux at normalesup.org
Mon Jun 20 04:12:11 EDT 2011


On Mon, Jun 20, 2011 at 06:39:14AM +0200, Dan Goodman wrote:
> * Is reading the data fast?

As long as most of the data is in numpy arrays, yes. You can make it
faster by passing "mmap_mode='r'" to the Memory object, but be aware
that the arrays you get back will then be read-only memmapped arrays.
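
Roughly, that looks like the sketch below (the cache directory is just a
placeholder, and depending on your joblib version the argument is called
"location" or "cachedir"):

    import numpy as np
    from joblib import Memory

    # Any writable path works, including a directory on an NFS share.
    mem = Memory(location='/tmp/joblib_cache', mmap_mode='r')

    @mem.cache
    def simulate(n):
        # stand-in for an expensive simulation
        rng = np.random.RandomState(0)
        return rng.rand(n, n)

    simulate(2000)       # first call: computed and written to disk
    a = simulate(2000)   # later calls: loaded from disk; with mmap_mode='r'
                         # 'a' is a read-only numpy.memmap, not a plain array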

> At the moment I have a system built on Python shelves, and the
> performance is not great.

:). On top of that, I ran into a fair amount of database corruption when
I was using shelves. This code is reasonably well isolated, so a failure
won't corrupt your complete cache, just the one result.

> * Can it be used on multiple computers?

If you have an NFS share between the computers, yes. The code works OK
in parallel. You will get race conditions, but it catches them and
recovers gracefully.

> If not at the moment, is there at least a way to easily combine data
> produced on multiple computers?

If you don't have a shared disk, I suggest that you use unison.

> * Can you browse the generated data easily?

No. This is something that could/should be improved (want to organize a
sprint in Paris, if you are still in Paris?).

> One thing I liked about the idea of doing it with HDF5 is that there
> are nice visual browsers and you can include metadata, search via
> metadata, remove parts of the data, etc.

Agreed. Actually, an HDF5 backend would probably be a good idea. But
first we would need to merge Dag's changes, which abstract the data
storage a bit.

> * If I change the code for a function, will that cause a recompute?

Yes, but only if it is the function that you have cached. It does not do
a deep inspection of the code.
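
Concretely (the cache path below is just a placeholder): only the code of
the decorated function itself is looked at, so editing a helper it calls
will not invalidate the cache, while editing the decorated function will.

    from joblib import Memory

    mem = Memory(location='/tmp/joblib_cache')

    def helper(x):
        # editing this helper does NOT trigger a recompute of run()
        return x + 1

    @mem.cache
    def run(x):
        # editing the body of run() itself DOES trigger a recompute
        return helper(x) * 2

    run(10)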

> I think it's better that it doesn't cause a recompute,

It should be an option. Also, it would be good to be able to version the
results with respect to the function code. This actually raises
non-trivial questions about cache flushing. Dag has been working on
these questions. Once again, I need to find time to review the code, and
for that I fear I need a couple of days, as these things are not trivial
at all.

> but given that, having the ability to easily browse the cached data and
> remove the cache for a function would be very handy.

Given a decorated function "g = mem.cache(f)", calling "g.clear()" will
flush the corresponding cache.
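
For instance (again with a placeholder cache directory):

    from joblib import Memory

    mem = Memory(location='/tmp/joblib_cache')

    def f(x):
        return x ** 2

    g = mem.cache(f)   # cached version of f
    g(3)               # computes and stores the result
    g.clear()          # flushes the cached results of f only
    # mem.clear() would wipe the whole cache directory instead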

The main issue with the code is that it has no cache replacement policy.
As a result, it will eventually fill up your disk. I have a pretty good
idea of how to implement one, but I need to find a full free week to
hack on it. The difficulty is keeping the cache in a sensible state
without introducing global locks that kill performance in parallel
computing settings. I am telling you this just to stress that I don't
believe the code is fully production-ready yet, although we have been
using it happily for a couple of years.

Gaël


