On Mon, Dec 19, 2011 at 5:14 PM, Stephen Skory email@example.com wrote:
Hey all (Matt, in particular),
I've got some dead time waiting for jobs to run, so I'd like to A) discuss this topic and B) make some "final" decisions so I can go ahead and do some coding. Sorry about the length of this!
A) Briefly, for those of you who aren't aware of this topic, the idea is to replace the ~/.yt/parameter_files.csv text file with a SQLite database file. This has many advantages over a text file, too many to list here. In particular, SQLite has built-in locking for writes (*), which is especially useful for multi-level parallelism. This is something we're currently addressing in "official" examples with a kludge. I think everyone is in agreement that this is a good thing, no?
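To make this concrete, here's a minimal sketch of what the SQLite replacement could look like. The file path, table name, and columns are just illustrative (they mirror the fields in the current csv); the point is that concurrent writers from a parallel job serialize on SQLite's built-in lock via the `timeout` parameter, with no kludge needed on our side:

```python
import os
import sqlite3
import tempfile

# Illustrative stand-in for ~/.yt/parameter_files.db; columns mirror
# the current csv fields (hash, name, location on disk, time, type).
db_path = os.path.join(tempfile.mkdtemp(), "parameter_files.db")

# timeout=30.0 makes a writer wait up to 30s for another process's
# write lock to clear instead of failing immediately.
conn = sqlite3.connect(db_path, timeout=30.0)
conn.execute("""
    CREATE TABLE IF NOT EXISTS parameter_files (
        hash TEXT PRIMARY KEY,
        name TEXT,
        path TEXT,
        ctime REAL,
        type TEXT
    )
""")

# INSERT OR REPLACE keeps re-registering the same dataset idempotent,
# so every task in a parallel job can safely record its output.
with conn:
    conn.execute(
        "INSERT OR REPLACE INTO parameter_files VALUES (?, ?, ?, ?, ?)",
        ("deadbeef", "DD0040", "/scratch/run/DD0040", 1324336440.0,
         "EnzoStaticOutput"),
    )

rows = conn.execute("SELECT name, path FROM parameter_files").fetchall()
conn.close()
```

The `with conn:` block wraps the insert in a transaction, so a write either lands completely or not at all, which is exactly what the csv can't give us.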
The other big thing that this feeds into is a remote, centralized storage point for a clone of this database. I've discussed this idea before, sketched up a simple partially functional example, and made a simple video cast of how it works. 
B) The final decisions that I'd like input on are these.
- What data fields should we include in the databases? There are three ways to go with this:

#1. The same amount of data that is in the current csv (basically: hash, name, location on disk, time, type). This is probably too few fields, so I think we can scratch it off immediately.

#2. Everything that can be gleaned from the dataset. This is actually practical to do because the database is binary and searchable. However, because the fields in various datasets are so different, this could result in a fairly unwieldy database with (in a Chris Traeger voice) a literal ton of columns. This could be mitigated by having a different database table for each type of dataset (Enzo, Athena, etc...), but that really only swaps one kind of complexity for another.

#3. A minimal set of "interesting" fields (redshift, box resolution, cosmological parameters, etc...). This is more attractive than #2 in that it's very unlikely anyone will want to search over every field in a dataset, so it keeps things more streamlined. But then we have to agree on a reasonable set of parameters to include, and it makes future changes a bit more difficult.
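As a rough sketch of what #3 could buy us, here's a toy single-table schema with a handful of made-up "interesting" columns (the field names are illustrative, not a proposal for the final set). The payoff is the last line: a numerical search that the csv simply can't express:

```python
import sqlite3

# Toy version of option #3: one table, a small agreed-upon set of fields.
# Column names here are illustrative placeholders, not a final schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE outputs (
        hash TEXT PRIMARY KEY,
        name TEXT,
        path TEXT,
        type TEXT,
        current_time REAL,
        current_redshift REAL,
        omega_matter REAL,
        omega_lambda REAL
    )
""")
conn.executemany(
    "INSERT INTO outputs VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("aaaa", "DD0010", "/scratch/run/DD0010", "Enzo", 0.1, 4.2, 0.27, 0.73),
        ("bbbb", "DD0040", "/scratch/run/DD0040", "Enzo", 0.5, 0.8, 0.27, 0.73),
    ],
)

# The kind of query that motivates real typed columns: find every
# registered output below redshift 1.
low_z = [row[0] for row in conn.execute(
    "SELECT name FROM outputs WHERE current_redshift < 1.0")]
```

With strings-only storage (see the SimpleDB discussion below in spirit), `current_redshift < 1.0` would have to be done client-side; with typed columns the database does it for us.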
Odd that you bring this up today! This weekend I started work on a project that ties into this. It's in my repository "yt.hub" on bitbucket, and it's a revamped hub -- unified pastebin, data pastebin + mapserver, and the project directory. There is the possibility it won't go anywhere, but I based it on Flask and I've got it serving the mapserver from in-memory data components already. I spent some time on it today before preparing arxiv submissions, and I also created what I am calling "minimal representations" in yt. I have issued a pull request for these:
Here's the architecture I was aiming for, although I was hoping to have a lot more done before sharing it with anybody. Additionally, the idea would be to stage these items:
* Authentication of users: either through OpenID or local accounts
* Pastebinning of data up to a given size
* Storing, as a side effect of data pastebins, a mapping between simulation data and simulation parameters, mediated by the 'minimal representation'
* Projects, like we have on the hub now
* Pastes, like we have on the pastebin now
The idea I was exploring was to take an in-memory yt data object -- not the raw data, but rather a projection, a slice, a 2D or 1D phase profile -- and upload that to a remote data repository. One could then go to the hub, view the available data, and spawn the mapserver, or a plot fiddler, or something. As a side effect of the fundamental operations of the hub -- the data pastebin, the pastes, the project dir -- we need to track simulation outputs. And I think that's where this plays in, and why I think we should have an identical data model between the two.
As I mentioned above, I have this working for projections (I don't yet have the multipart chunked upload going for uploading directly from yt, but for projections inserted manually it works great). The minimal representation provides both an easy way to pass around an object that can be pickled and unpickled (without having access to the original data) and a mechanism for converting that object into a POST request sent to an http server, all from the same underlying object.
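The dual pickle/POST behavior Matt describes can be sketched in a few lines. To be clear, the class and method names below are hypothetical stand-ins, not yt's actual MinimalStaticOutput API; the sketch only shows the shape of the idea, a small attribute dictionary that round-trips through pickle without the raw data and can also flatten itself into POST form fields:

```python
import json
import pickle

class MinimalRepresentation:
    """Toy stand-in for yt's minimal representation (names hypothetical)."""

    def __init__(self, attrs):
        # A lightweight dict of metadata, e.g. name, redshift -- no raw data.
        self.attrs = dict(attrs)

    def to_post(self):
        # What would become the form fields of a POST to the hub.
        return {"metadata": json.dumps(self.attrs, sort_keys=True)}

mini = MinimalRepresentation({"name": "DD0040", "current_redshift": 0.8})

# Path 1: pickle round-trip -- pass the object around without the
# original dataset being present on disk.
restored = pickle.loads(pickle.dumps(mini))

# Path 2: the same underlying object becomes an HTTP POST payload.
payload = restored.to_post()
```

The point is that one object backs both transports, so the local store and the hub can share an identical data model.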
The columns inside the yt.hub right now mirror those stored in the minimal representation. So my vote would be that whatever local storage we choose is fine, as long as it holds a superset of the attributes in the MinimalStaticOutput. Nathan's concern is a valid one, and I would be loath to require that someone manage Lustre striping for fundamental components of yt, but I am otherwise indifferent to the use of SQLite. I would mainly like to emphasize that we try to focus on the data, not the model.
What do we all think?
- Once we have the above settled, and working, I would like to extend the functionality to the cloud bzzzzzzz. Get it? It's a buzz word. So it buzzes. Thanks, I'll be here all week.
Your humor aside, there are very good reasons for hosting this in the cloud. The one that is most pressing, I feel, is that it allows for a nice compartmentalization of a running server from everything else. It's also potentially cheaper than buying hardware.
There are three (four) ways to do this that I can think of:
#1. Amazon SimpleDB. The advantage of this is that it's offered free to all up to 1GB of storage and some reasonable limit of transactions per month. Each user sets up her own account on S3, and no one else has to be involved. But the main disadvantage is that it only supports storing things as strings, which makes numerical searches and sorts less useful, more annoying, and slower.

#1.5. Amazon Relational DB. This is not free at any level, but it offers all the usual DB functionality. Amazon does offer some educational grants, so we could apply for one. This service is targeted at usage levels that we will never reach, but if we get free time, that's fine. I think in this case (and the next two) user accounts on the database would have to be created for yt users by "us".

#2. Google App Engine. Free right now in a pre-beta, invitation-only phase. It will be similar to #1.5 above, as I understand things, and will not be free forever. Personally, I seriously doubt that we'd get in on the pre-beta. I've looked at the application form and I don't even understand one of the questions.

#3. Host a MySQL (or similar) database on one of our own servers (yt-project or similar). The advantage is that the cost should be no more than what Matt is paying now. The disadvantage is, again, that we have to set up accounts. Also, I don't know if Dreamhost (is that where yt-project is still?) allows open MySQL databases. Another advantage is that unlike #1.5 or #2 above, costs should never rise suddenly when an educational grant or beta period ends.
For the newly re-envisioned hub, I was thinking EC2 instances.
Thanks for reading, and any and all comments are welcomed.
(As a note, I find it pretty jarring to have to dig up your footnotes. It's way easier if you include them inline in the text, set off with spaces or something. Maybe I'm the only one ...)
(*) There are issues with locks on parallel network file systems, but most home partitions on supercomputers are NFS (not something like Lustre), so this shouldn't be a problem.
--
Stephen Skory
firstname.lastname@example.org
http://stephenskory.com/
510.621.3687 (google voice)
_______________________________________________
yt-dev mailing list
email@example.com
http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org