(Summary: skip to the bottom and help decide what records should be in a minimal simulation database. Please take the time to contribute to this discussion; it likely affects you, even if you don't think it does!)

Currently, in order to support pickling data objects in external-to-the-.yt-file formats, yt keeps a record of the most recent N (where N is usually 500-1000) parameter files it has personally seen. This is stored in a CSV file, typically in ~/.yt/parameter_files.csv . The reasons for using CSV are pretty reactionary: there were for a while difficulties with getting sqlite everywhere, shelve was unreliable and difficult to sort unless you used a database backend, and I wanted a single file. So we *already* have a simulation output database -- it's just in CSV. If you check, you'll find you have one.

Very briefly, let me motivate why we have the .csv file (this is also covered in the method paper) and how it is modified. The idea is that, before we had UUIDs in parameter files, it was challenging to identify a parameter file uniquely. For instance, one occasionally needs to know "Did I mean *this* DD0070, or *that* DD0070?" So using the CurrentTimeIdentifier (falling back, for other codes, on the ST_CTIME, which is not immensely reliable) plus a couple other pieces of information -- the path and a few bits of metadata -- a hash was constructed for each parameter file. Every time you load a parameter file in yt, it looks in this .csv and, if necessary, updates the *path* for a given *hash*.

Since the time of implementation, UUIDs have been inserted into Enzo; I have been in contact with a couple of other code developers, and there is a remote, outside possibility that such a rider could be added to some of their codes. I am not holding my breath, so we will continue falling back on this hash, which is for all intents and purposes likely to be Universally Unique, although it may not follow the code and so may not serve as a consistent Identifier.

However, as shows up once in a while on the mailing lists, the .csv file can be problematic. My approach was conservative, which leads to lots of small bits of IO on that file. (Not cool.) Updating requires rewriting the entire file. (Not cool.) If you have multiple processes going, sometimes the file gets corrupted. (Not cool.) Plus, it's just not a terribly easily-queryable format.

Recently, inspired by something Tom talked about at the Enzo Spring Workshop, I put into Enzo the ability to write out a set of information about a simulation output at the same time the data itself is written. The format for this is SQLite3, which is simple, easy to install, and -- most importantly for my near-obsessive goals -- has been installed with yt for a couple of months now. This last weekend, I was able to insert the first pass at a transition to using this format of database in yt, in my fork:

https://bitbucket.org/MatthewTurk/yt/changeset/1b685b6bfbf2#chg-yt/utilities...

In this commit I also add an ORM called "peewee" that abstracts all of the SQL away; this way we can use a SQLite database without having any raw SQL strings in yt. (A rough sketch of what that looks like follows the list below.) The main changes boil down to:

 * No more CSV; everything is sqlite, which handles multi-process locking and concurrency.
 * yt reads from the same database that Enzo writes out, and that other codes could choose to write out if they wanted.
 * Uses UUIDs instead of hashes. (This could potentially break existing pickles, but I plan to provide a migration strategy.)
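To give a flavor of what the peewee side looks like, here is a rough, untested sketch -- not the code from the fork -- of a model covering a subset of the fields proposed further down, plus a query against it. The class name, database path, and example values are invented, and I'm writing against the public peewee API, so the details in the actual changeset will differ:

    import time
    import uuid

    from peewee import SqliteDatabase, Model, CharField, FloatField

    # Hypothetical location; in practice this would live under ~/.yt/
    db = SqliteDatabase("parameter_files.db")

    class OutputRecord(Model):
        # Field names follow the proposed schema below; the class name is made up.
        dset_uuid = CharField(primary_key=True)
        output_type = CharField()
        pf_path = CharField()
        creation_time = FloatField()
        last_seen_time = FloatField()
        simulation_uuid = CharField()
        redshift = FloatField()
        time = FloatField()

        class Meta:
            database = db

    db.connect()
    db.create_tables([OutputRecord], safe=True)

    # Registering (or re-seeing) an output -- note: no raw SQL strings anywhere.
    now = time.time()
    rec, created = OutputRecord.get_or_create(
        dset_uuid=str(uuid.uuid4()),
        defaults=dict(output_type="enzo", pf_path="/data/sim/DD0070/DD0070",
                      creation_time=now, last_seen_time=now,
                      simulation_uuid=str(uuid.uuid4()),
                      redshift=0.5, time=1.0))

    # The kind of query this buys us: all outputs from one simulation,
    # sorted by redshift, without touching the file system.
    outputs = (OutputRecord
               .select()
               .where(OutputRecord.simulation_uuid == rec.simulation_uuid)
               .order_by(OutputRecord.redshift))
    for o in outputs:
        print(o.pf_path, o.redshift)

The Reason "file open" idea mentioned below would then just be a select() over this table, ordered however the user likes.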
There are some big advantages to having an output database -- one could select all outputs based on the simulation they derived from (if such a field is available, as it is for Enzo), one could call load() with just a UUID or hash, and we can provide a Reason "file open" GUI that doesn't require touching the file system explicitly. (The idea there is that Reason would pop up a grid of all available parameter files, with their times, redshifts, etc., and you would choose from it.) Plus, *other* utilities like Stranger, Jacques, etc. could use it. And long term, Tom's vision of a universal simulation database becomes a *lot* more tractable if we have a firm starting point.

However, before this can be accepted into mainline, there are three things that need to be decided. Here are the currently-included fields:

  dset_uuid
  output_type
  pf_path
  creation_time
  last_seen_time
  simulation_uuid
  redshift
  time
  topgrid0
  topgrid1
  topgrid2

1) Thumbs up or thumbs down on moving to SQLite from CSV?
2) Should any additional fields be *added*?
3) Should any of these fields be *removed*?

(I have opinions on this, but I would like to hear from others first.) I think this is the wrong place to put everything there is to know about a simulation; the parameter file exists for that, and every field is an additional overhead of space, complexity, etc. But it would be nice if the things that people *commonly* want to query on or *sort* on were in here.

What else?

-Matt