Re: [yt-dev] sqlite parameter files storage

19 Dec 2011

Hi Stephen,

On Mon, Dec 19, 2011 at 5:14 PM, Stephen Skory wrote:
...
Hey all (Matt, in particular),
I've got some dead time waiting for jobs to run so I'd like to A)
discuss this topic and B) make some "final" decisions about this so I
can go ahead and do some coding on this. Sorry about the length of
this!
A) Briefly, for those of you who aren't aware of this topic, the idea
is to replace the ~/.yt/parameter_files.csv text file with a SQLite
database file. This has many advantages over a text file, too many to
list here. But in particular, it has built-in locks for writes (*),
which is especially useful for multi-level parallelism. This is
something we're currently addressing in "official" examples with a
kludge [0]. I think everyone is in agreement that this is a good
thing, no?
Yes.
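
For concreteness, here is a minimal sketch of what the CSV-to-SQLite switch could look like with Python's standard sqlite3 module. The table name, the column names, and the helpers open_store and record_parameter_file are assumptions made for illustration (they simply mirror the hash, name, location, time, and type fields of the current CSV); this is not yt's actual schema or API.

```python
import os
import sqlite3
import tempfile
import time

def open_store(path):
    # timeout is how long a writer waits on another process's lock before
    # sqlite3 raises OperationalError ("database is locked").
    conn = sqlite3.connect(path, timeout=30.0)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS parameter_files (
               hash      TEXT PRIMARY KEY,
               name      TEXT,
               path      TEXT,
               last_seen REAL,
               type      TEXT
           )"""
    )
    conn.commit()
    return conn

def record_parameter_file(conn, pf_hash, name, path, pf_type):
    # "with conn" wraps the statement in a transaction: it commits on
    # success and rolls back if an exception is raised.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO parameter_files VALUES (?, ?, ?, ?, ?)",
            (pf_hash, name, path, time.time(), pf_type),
        )

# Hypothetical location; the real file would live under ~/.yt/.
db_path = os.path.join(tempfile.mkdtemp(), "parameter_files.db")
conn = open_store(db_path)
record_parameter_file(conn, "abc123", "DD0010", "/data/DD0010", "enzo")
# Seeing the same hash again updates the row instead of duplicating it.
record_parameter_file(conn, "abc123", "DD0010", "/scratch/DD0010", "enzo")
rows = conn.execute("SELECT hash, path FROM parameter_files").fetchall()
```

Each write is a short transaction, so concurrent processes queue on SQLite's file lock rather than clobbering one another's lines as they can with a shared text file.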
...
The other big thing that this feeds into is a remote, centralized
storage point for a clone of this database. I've discussed this idea
before, sketched up a simple partially functional example, and made a
simple video cast of how it works. [1]
B) The final decisions that I'd like input on are these.
- What data fields should we include in the databases? There are three
ways to go with this.
 #1. The same amount of data that is in the current csv (basically:
hash, name, location on disk, time, type). This is probably too few
data fields, so I think we can scratch it off immediately.
 #2. Everything that can be gleaned from the dataset. This is
actually fine to do practically because of the database being binary
and searchable. However, because the fields in various datasets are so
different, this could result in a fairly unwieldy database with (in a
Chris Traeger voice) a literal ton of columns. This could be mitigated
by having different database tables for each type of dataset (Enzo,
Athena, etc...), but that really only swaps one kind of complexity for
another.
 #3. A minimal set of "interesting" fields (redshift, box resolution,
cosmological parameters, etc..) This is more attractive than #2 in
that it's very unlikely anyone will want to search over every field in
a dataset, so it keeps things more streamlined. But then we have to
agree to a reasonable set of parameters to include, and it makes
future changes a bit more difficult.
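
To make option #3 concrete, here is a sketch of a table with a small set of typed columns, a numeric search over it, and what adding a field later would involve. The particular columns (redshift, top_grid_dim, omega_matter, hubble_constant, sim_time) and the sample rows are made up for illustration; which fields count as "interesting" is exactly the open question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE parameter_files (
           hash            TEXT PRIMARY KEY,
           name            TEXT,
           type            TEXT,
           redshift        REAL,
           top_grid_dim    INTEGER,
           omega_matter    REAL,
           hubble_constant REAL
       )"""
)

# Made-up example rows, not real simulation outputs.
outputs = [
    ("h1", "DD0010", "enzo", 10.0, 128, 0.27, 0.71),
    ("h2", "DD0040", "enzo", 2.5, 128, 0.27, 0.71),
    ("h3", "DD0087", "enzo", 0.0, 256, 0.27, 0.71),
]
with conn:
    conn.executemany(
        "INSERT INTO parameter_files VALUES (?, ?, ?, ?, ?, ?, ?)", outputs
    )

# Typed columns let the database compare and sort numerically.
low_z = conn.execute(
    "SELECT name FROM parameter_files WHERE redshift < ? ORDER BY redshift",
    (3.0,),
).fetchall()

# A later schema change: the new column is NULL for rows already stored.
conn.execute("ALTER TABLE parameter_files ADD COLUMN sim_time REAL")
old_row_time = conn.execute(
    "SELECT sim_time FROM parameter_files WHERE hash = 'h1'"
).fetchone()[0]
```

Adding a column is cheap, but rows recorded before the change carry no value for it until the dataset is seen again, which is one face of the "future changes" cost mentioned above.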
Odd that you bring this up today!  This weekend I started work on a
project that ties into this.  It's in my repository "yt.hub" on
bitbucket, and it's a revamped hub -- unified pastebin, data pastebin
+ mapserver, and the project directory.  There is the possibility it
won't go anywhere, but I based it on Flask and I've got it supporting
mapserver of in-memory data components already.  I spent some time on
it today before preparing arxiv submissions, and I also created what I
am calling "minimal representations" in yt.  I have issued a pull
request for these:

https://bitbucket.org/yt_analysis/yt/pull-request/50/minimal-representation-...

Here's the architecture I was aiming for, although I was hoping to
have a lot more done before sharing it with anybody.  Additionally,
the idea would be to stage these items.

 * Authentication of users: either through OpenID or local items
 * Pastebinning of data up to a given size
 * Storing, as a side effect of data pastebins, a mapping between
simulation data and simulation parameters, mediated by the 'minimal
representation'
 * Projects, like we have on the hub now
 * Pastes, like we have on the pastebin now

The idea I was exploring was to take an in-memory yt data object --
not the raw data, but rather a projection, a slice, a 2D or 1D phase
profile -- and upload that to a remote data repository.  One could
then go to the hub, view the available data, and spawn the mapserver,
or a plot fiddler, or something.  As a side effect of the fundamental
operations of the hub -- the data pastebin, the pastes, the project
dir -- we need to track simulation outputs.  And I think that's where
this plays in, and why I think we should have an identical data model
between the two.

As I mentioned above, I have this working for projections (I don't yet
have the multipart chunked upload going for uploading directly from
yt, but for projections manually inserted it works great).  The
minimal representation provides an easy way to both pass around an
object that can be pickled and unpickled (without having access to the
original data) and a mechanism for taking an object and converting it
to a POST request sent to an http server -- from the same underlying
object.
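
A rough sketch of that idea, for anyone who has not looked at the pull request. Everything here is a hypothetical stand-in rather than the code in the pull request: the FakeOutput class, the attribute list, the JSON body, and the localhost URL are all assumptions chosen to keep the example self-contained.

```python
import json
import pickle
import urllib.request

class FakeOutput:
    """Hypothetical stand-in for a full dataset object tied to files on disk."""
    def __init__(self):
        self.name = "DD0010"
        self.output_hash = "abc123"
        self.redshift = 2.5
        self.raw_data = [0.0] * 1000  # the heavy part, deliberately left behind

# Assumed attribute list; the real set of attributes may differ.
MINIMAL_ATTRS = ("name", "output_hash", "redshift")

def minimal_representation(obj):
    # Copy only plain, lightweight values so the result is independent
    # of the original object and its data.
    return {attr: getattr(obj, attr) for attr in MINIMAL_ATTRS}

def as_post_request(rep, url):
    # Build (but do not send) a POST request carrying the same values.
    body = json.dumps(rep).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

rep = minimal_representation(FakeOutput())
restored = pickle.loads(pickle.dumps(rep))  # round-trips without the data
request = as_post_request(rep, "http://localhost:5000/upload")  # placeholder URL
```

The point of the design is that one small description of the object feeds both uses, so the local store and the remote hub can share a data model.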

The columns inside the yt.hub right now mirror those stored in the
minimal representation.  So my vote would be whatever local storage is
fine, but that it be a superset of the attributes in the
MinimalStaticOutput.  Nathan's concern is a valid one, and I would be
loath to require that someone manage lustre striping for fundamental
components of yt, but I am otherwise indifferent to the usage of
SQLite.  I would mainly like to emphasize that we try to focus on the
data, not the model.
...
What do we all think?
- Once we have the above settled, and working, I would like to extend
the functionality to the cloud bzzzzzzz. Get it? It's a buzz word. So
it buzzes. Thanks, I'll be here all week.
Your humor aside, there are very good reasons for hosting this in the
cloud.  The one that is most pressing, I feel, is that it allows for a
nice compartmentalization of a running server from everything else.
It's also potentially cheaper than buying hardware.
...
There are three (four) ways to do this that I can think of
 #1. Amazon Simple DB. The advantage of this is that it's offered
free to all up to 1GB of storage and some reasonable limit of
transactions per month. Each user sets up her own account on S3, and
no one else has to be involved. But the main disadvantage is that it
only supports storing things as strings, which makes numerical
searches and sorts less useful, more annoying, and slower.
 #1.5. Amazon Relational DB. This is not free at any level, but it
offers all the usual DB functionality. Amazon does offer some
educational grants, so we could apply for that. This service is
targeted at usage levels that we will never reach, but if we get free
time, that's fine. I think in this case (and the next two) user
accounts on the database would have to be created for yt users by
"us".
 #2. Google App Engine. Free right now in pre-beta invitation-only
phase. It will be similar to #1.5 above, as I understand things, and
not be free forever. Personally, I seriously doubt that we'd get in on
the pre-beta. I've looked at the application form [2] and I don't even
understand one of the questions.
 #3. Host a MySQL (or similar) database on one of our own servers
(yt-project or similar). The advantage is that the cost should be no
more than Matt is paying now. The disadvantage is, again, we have to
set up accounts. Also, I don't know if Dreamhost (is that where
yt-project is still?) allows open MySQL databases. Another advantage
is that unlike #1.5 or #2 above, costs should never rise suddenly when
an educational grant or beta period ends.
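
On the strings-only limitation noted under option #1: the small example below shows why numbers stored as strings sort badly, along with one common workaround (zero-padded, fixed-width encoding). The encode helper and the sample redshifts are illustrative assumptions and have nothing to do with any Amazon API.

```python
redshifts = [10.0, 2.5, 0.5, 100.0]

# Stored as plain strings, values compare character by character.
as_strings = [str(z) for z in redshifts]
string_sorted = sorted(as_strings)

# Stored as numbers, they compare by value.
numeric_sorted = sorted(redshifts)

def encode(z, width=10, places=4):
    # Zero-pad to a fixed width so string order matches numeric order.
    # This only holds for non-negative values that fit in the width.
    return f"{z:0{width}.{places}f}"

padded_sorted = sorted(encode(z) for z in redshifts)
decoded = [float(s) for s in padded_sorted]
```

Here string_sorted puts '100.0' ahead of '2.5', while the padded form recovers the numeric order at the cost of an encode/decode step on every read and write.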
For the newly re-envisioned hub, I was thinking EC2 instances.
...
Thanks for reading, and any and all comments are welcomed.
[0] http://yt-project.org/doc/advanced/parallel_computation.html#parallelizing-y...
[1] http://vimeo.com/28797703
[2] https://docs.google.com/spreadsheet/viewform?formkey=dHBwRmpHV2VicFVVNi1PaFB...
(As a note, I find it pretty jarring to have to dig up your footnotes.
 It's way easier if you include them inline in the text, set off with
spaces or something.  Maybe I'm the only one ...)
...
(*) There are issues with locks on parallel network file systems, but
most home partitions on supercomputers are NFS (not something like
Lustre) so this shouldn't be a problem.
--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)

Matthew Turk