[Numpy-discussion] Thoughts on persistence/object tracking in scientific code

Sat Dec 27 10:59:25 EST 2008

I prototyped an approach last year that worked out well. I don't really know
what to call it - maybe something like "property based persistence."  It is
kind of strange and I am not sure how broadly applicable it is - I have only
used it for financial time series data.

I'll try to explain how the idea works.  I start with a python object that
has a number of properties and an associated large data set (in my case,
financial instruments and their associated time series in the form of numpy
arrays.)  I then created infrastructure that allowed me to define a simple
"mapper" function that used a subset of the object's properties to define a
"path" (expressible in the same form either as a file system path or as a
path in HDF to a table.) Then I persisted the bulky data set (again, time
series in my case) at that location.

This little piece of infrastructure is very lightweight and cuts the client
side persistence code down to only the small "mapper" functions.  The mapper
functions don't actually build up paths - they just specify the properties
and ordering that you want to use to build up the paths.  It also makes
querying very simple and fast because you don't really query at all -
instead the properties associated with the query directly express the path
at which the data is located.
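
A minimal sketch of the mapper idea described above, in Python. This is not
the original code: the Instrument class, its property names, and the helpers
instrument_mapper and build_path are hypothetical and exist only for
illustration. The mapper only names the properties and their order; the
infrastructure builds the path.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class Instrument:
    # Hypothetical properties, chosen only for this illustration.
    asset_class: str
    exchange: str
    symbol: str
    currency: str


def instrument_mapper():
    """Client side: name the properties, in order, that form the path.

    No path is built here; only the property names and their ordering.
    """
    return ("asset_class", "exchange", "symbol")


def build_path(obj, mapper, root="/"):
    """Infrastructure side: turn the mapper's property list into a path.

    The same string can serve as a file system path or as a node path
    inside an HDF file.
    """
    parts = [str(getattr(obj, name)) for name in mapper()]
    return str(PurePosixPath(root, *parts))


ibm = Instrument("equity", "NYSE", "IBM", "USD")
path = build_path(ibm, instrument_mapper)
print(path)  # /equity/NYSE/IBM

# A "query" is just another object carrying the same mapped properties.
# Properties outside the mapper (here, currency) do not affect the location.
query = Instrument("equity", "NYSE", "IBM", "EUR")
same_path = build_path(query, instrument_mapper)
```

The resulting path is where the bulky data would be written, for example with
numpy.save on a file system or as a table location in HDF. Looking data up
needs no search, because the properties alone determine the path.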

The drawback of this simplistic approach is that you need to add a second
level of path addressing if you deal with datasets so large that you cannot
realistically persist them under a single path.  If you have single multi-GB
or TB arrays you probably want to chunk things up a bit more, in the style of
GFS and its open source counterparts.
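
One possible shape for that second level of addressing, sketched in Python.
This assumes fixed-size chunks indexed by row number, which is far simpler
than what GFS does; the function names, the chunk_NNNNNN naming, and the chunk
size are all hypothetical.

```python
def chunk_path(base_path, row, rows_per_chunk):
    """Return the path of the chunk that holds the given row."""
    chunk = row // rows_per_chunk
    return f"{base_path}/chunk_{chunk:06d}"


def chunks_for_range(base_path, start, stop, rows_per_chunk):
    """Return the paths of all chunks overlapping rows [start, stop)."""
    if stop <= start:
        return []
    first = start // rows_per_chunk
    last = (stop - 1) // rows_per_chunk
    return [f"{base_path}/chunk_{i:06d}" for i in range(first, last + 1)]


base = "/equity/NYSE/IBM"
one = chunk_path(base, 2500, 1000)
several = chunks_for_range(base, 900, 2100, 1000)
print(one)      # /equity/NYSE/IBM/chunk_000002
print(several)  # chunk_000000, chunk_000001 and chunk_000002 under base
```

The property-derived path still locates the dataset; the chunk index only
narrows the read to the pieces that cover the requested rows.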

I still have the python code for this property-based time series
database.  It is a very small and simple piece of code, but I am happy to
give it a quick polish and open source it if anyone is interested in taking
a look.

I am also about to try this model using F# and db4o for a .Net project.

On Wed, Dec 24, 2008 at 2:21 PM, Gael Varoquaux <
gael.varoquaux at normalesup.org> wrote:

> On Tue, Dec 23, 2008 at 02:10:50AM +0100, Olivier Grisel wrote:
> >    Interesting topic indeed. I think I have been hit with similar problems
> >    on toy experimental scripts. So far the solution was always ad hoc FS
> >    caches of numpy arrays with manual filename management. Maybe the first
> >    step for designing a generic solution would be to list some
> >    representative yet simple enough use cases with real sample python code
> >    so as to focus on concrete matters and avoid over-engineering a general
> >    solution for philosophical problems.
>
> Yes, that's clearly a first step: list the use cases, and the way we would
> like them solved: think about the API.
>
> My internet connection is quite random currently, and I'll probably lose
> it for a week any time soon. Do you want to start such a page on the
> wiki? Mark it as a scratch page, and we'll delete it later.
>
> I should point out that joblib (on PyPI and launchpad) was a first
> attempt to solve this problem, so you could have a look at it. I have
> already identified things that are wrong with joblib (more on the API
> side than actual bugs), so I know it is not a final solution. Figuring
> out what was wrong only came from using it heavily in my work. I think
> the only way forward is to start something, use it, figure out what's
> wrong, and start again...
>
> Looking forward to your input,
>
> Gaël