One computer I use regularly (JuRoPa) does use Lustre for the home directory.
It would be great if any database solution you come up with works on a parallel filesystem.
On Dec 19, 2011, at 2:14 PM, Stephen Skory wrote:
Hey all (Matt, in particular),
I've got some dead time waiting for jobs to run so I'd like to A) discuss this topic and B) make some "final" decisions about this so I can go ahead and do some coding on this. Sorry about the length of this!
A) Briefly, for those of you who aren't aware of this topic, the idea is to replace the ~/.yt/parameter_files.csv text file with a SQLite database file. This has many advantages over a text file, too many to list here. But in particular, it has built-in locks for writes (*), which is especially useful for multi-level parallelism. This is something we're currently addressing in "official" examples with a kludge. I think everyone is in agreement that this is a good thing, no?
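To make the idea concrete, here is a minimal sketch of what the SQLite store could look like, using only Python's standard sqlite3 module. The table name, column names, helper functions, and sample record are all hypothetical and chosen only to mirror the fields in the current csv; this is not existing yt code.

```python
import sqlite3
import time

def open_store(path):
    # timeout is how long a connection waits on SQLite's file lock
    # before giving up, which is what serializes concurrent writers.
    conn = sqlite3.connect(path, timeout=30.0)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS parameter_files (
               hash TEXT PRIMARY KEY,
               name TEXT,
               location TEXT,
               ctime REAL,
               type TEXT)"""
    )
    conn.commit()
    return conn

def add_record(conn, pf_hash, name, location, pf_type):
    # "with conn" wraps the write in a transaction:
    # commit on success, rollback if an exception is raised.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO parameter_files VALUES (?, ?, ?, ?, ?)",
            (pf_hash, name, location, time.time(), pf_type),
        )

def find_by_name(conn, name):
    cur = conn.execute(
        "SELECT hash, location, type FROM parameter_files WHERE name = ?",
        (name,),
    )
    return cur.fetchall()

# An in-memory database stands in for the real file here (assumption).
conn = open_store(":memory:")
add_record(conn, "abc123", "DD0010", "/scratch/run1/DD0010", "enzo")
add_record(conn, "abc123", "DD0010", "/scratch/run1/DD0010", "enzo")  # same hash, replaced
rows = find_by_name(conn, "DD0010")
```

Keying on the hash means re-registering the same dataset overwrites the old row instead of appending a duplicate line, which is one of the things the csv file cannot do on its own.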
The other big thing that this feeds into is a remote, centralized storage point for a clone of this database. I've discussed this idea before, sketched up a simple partially functional example, and made a simple video cast of how it works. 
B) The final decisions that I'd like input on are these.
- What data fields should we include in the databases? There are three ways to go with this.
#1. The same amount of data that is in the current csv (basically: hash, name, location on disk, time, type). This is probably too few data fields, so I think we can scratch it off immediately.
#2. Everything that can be gleaned from the dataset. This is actually fine to do in practice because the database is binary and searchable. However, because the fields in various datasets are so different, this could result in a fairly unwieldy database with (in a Chris Traeger voice) a literal ton of columns. This could be mitigated by having different database tables for each type of dataset (Enzo, Athena, etc.), but that really only swaps one kind of complexity for another.
#3. A minimal set of "interesting" fields (redshift, box resolution, cosmological parameters, etc.). This is more attractive than #2 in that it's very unlikely anyone will want to search over every field in a dataset, so it keeps things more streamlined. But then we have to agree on a reasonable set of parameters to include, and it makes future changes a bit more difficult.
What do we all think?
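For comparison, a rough sketch of what option #3 might look like as a table, again with the standard sqlite3 module. The column names (including omega_matter and hubble_constant as example cosmological parameters) and the sample rows are made up for illustration; which fields actually go in is exactly the open question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE datasets (
           hash TEXT PRIMARY KEY,
           name TEXT,
           type TEXT,
           redshift REAL,
           box_resolution INTEGER,
           omega_matter REAL,
           hubble_constant REAL)"""
)

# Made-up sample rows, not real simulation outputs.
sample = [
    ("h1", "DD0000", "enzo", 99.0, 128, 0.27, 0.71),
    ("h2", "DD0040", "enzo", 2.0, 128, 0.27, 0.71),
    ("h3", "DD0080", "enzo", 0.0, 256, 0.27, 0.71),
]
with conn:
    conn.executemany("INSERT INTO datasets VALUES (?, ?, ?, ?, ?, ?, ?)", sample)

def low_redshift(conn, z_max):
    # Numeric columns let the database do the comparison and the sort.
    cur = conn.execute(
        "SELECT name, redshift FROM datasets WHERE redshift <= ? ORDER BY redshift",
        (z_max,),
    )
    return cur.fetchall()

result = low_redshift(conn, 3.0)
```

With a fixed, small set of typed columns a query like this stays simple; the cost is that adding a new "interesting" field later means altering the table for everyone.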
- Once we have the above settled and working, I would like to extend the functionality to the cloud bzzzzzzz. Get it? It's a buzz word. So it buzzes. Thanks, I'll be here all week.
There are three (four) ways to do this that I can think of:
#1. Amazon Simple DB. The advantage of this is that it's offered free to all up to 1GB of storage and some reasonable limit of transactions per month. Each user sets up her own account on S3, and no one else has to be involved. But the main disadvantage is that it only supports storing things as strings, which makes numerical searches and sorts less useful, more annoying, and slower.
#1.5. Amazon Relational DB. This is not free at any level, but it offers all the usual DB functionality. Amazon does offer some educational grants, so we could apply for that. This service is targeted at usage levels that we will never reach, but if we get free time, that's fine. I think in this case (and the next two) user accounts on the database would have to be created for yt users by "us".
#2. Google App Engine. Free right now in pre-beta invitation-only phase. It will be similar to #1.5 above, as I understand things, and not be free forever. Personally, I seriously doubt that we'd get in on the pre-beta. I've looked at the application form and I don't even understand one of the questions.
#3. Host a MySQL (or similar) database on one of our own servers (yt-project or similar). The advantage is that the cost should be no more than Matt is paying now. The disadvantage is, again, we have to set up accounts. Also, I don't know if Dreamhost (is that where yt-project is still?) allows open MySQL databases. Another advantage is that unlike #1.5 or #2 above, costs should never rise suddenly when an educational grant or beta period ends.
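The strings-only drawback of #1 can be illustrated in plain Python, without touching any Amazon API. The sketch below assumes non-negative values and a hypothetical zero-padded encoding (the width and number of decimal places are arbitrary choices); it only shows why string storage makes numerical sorts awkward.

```python
# Made-up redshift values for illustration.
redshifts = [0.5, 10.0, 2.0, 99.0]

# Stored naively as strings, sorting is lexicographic, so "10.0" lands before "2.0".
naive = sorted(str(z) for z in redshifts)

def encode(z, width=10, places=4):
    # Zero-pad to a fixed width so string order matches numeric order.
    # Only valid for non-negative values that fit in the chosen width.
    return f"{z:0{width}.{places}f}"

padded = sorted(encode(z) for z in redshifts)

# Every value read back has to be converted again before it can be used as a number.
decoded = [float(s) for s in padded]
```

Every numeric field would need an encode step on write and a decode step on read, and negative numbers or very different magnitudes need further tricks, which is the "more annoying, and slower" part.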
Thanks for reading, and any and all comments are welcomed.
(*) There are issues with locks on parallel network file systems, but most home partitions on supercomputers are NFS (not something like Lustre) so this shouldn't be a problem.
--
Stephen Skory
email@example.com
http://stephenskory.com/
510.621.3687 (google voice)
_______________________________________________
yt-dev mailing list
firstname.lastname@example.org
http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org