Hi all,

This week, after Cameron announced and Jeff demo'd the new GUI for yt, we've had some thoughts about other obvious places to go with the interface. Tonight Tom and I did some brainstorming about what a *personal* database of simulation outputs might look like; this isn't so much about large, comprehensive databases of simulations, but rather about a pragmatic convenience for people who keep their data on a single file system.

As it stands, yt dumps the last 200 parameter files into a .csv file in your ~/.yt directory. These include the unique id, the parameter file name on the file system, and some other miscellaneous data that's not terribly important. We were thinking it might be neat to have a really simple database that files never really leave. It could be ephemeral -- so that files that get removed or moved around would simply be removed from the DB -- and maybe it would contain information (if available from the simulation code) about the previous output in the simulation. Enzo has this, and it may be coming to other codes, in the form of UUIDs that get tracked and carried along by the sim.

There are a couple of ways this might work. We came up with a few:

* A "locate" command that just returns all the basic parameter files / outputs that could be loaded into yt (or other tools), maybe based on a quick search
* A registering of these outputs into the database at write time by the simulation code
* An intentional import of, say, a directory -- simply type "yt import" in the base dir and it'll find all the outputs below that. (This used to exist in the form of "fimport" but it has bitrotted a bit. There's a rough sketch of this in the P.S. below.)
* Unlimited parameter file storage, where all loaded parameter files get added (unlike the 200 limit we have now)

What do people think? With the new GUI, this becomes, I think, a lot more interesting -- particularly if you could re-assemble a graph of the outputs, and maybe even query them, if you were to store the parameter file in full.

Anyway, just a handful of thoughts. Stephen, you have quite a bit of experience with SQLite; do you think this could be a fun application of it? Or would that be overkill? Maybe the whole system could be handled with just filenames?

-Matt
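P.S. To make the "yt import" idea a bit more concrete, here's a rough, untested sketch of the directory walk I have in mind. The Enzo-specific detection (look for a companion .hierarchy file) is just a heuristic, and none of the names here are real yt API:

    import os

    def find_outputs(base_dir):
        # Walk base_dir and yield anything that looks like a loadable
        # output.  Heuristic: an Enzo parameter file has a companion
        # ".hierarchy" file sitting next to it.
        for root, dirs, files in os.walk(base_dir):
            for fn in files:
                if fn + ".hierarchy" in files:
                    yield os.path.join(root, fn)

    def import_directory(base_dir):
        # Everything returned here would get registered in the
        # hypothetical personal database.
        return list(find_outputs(base_dir))

Other frontends would presumably each contribute their own detection heuristic.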
Matt,
Anyway, just a handful of thoughts. Stephen, you have quite a bit of experience with SQLite; do you think this could be a fun application of it? Or would that be overkill? Maybe the whole system could be handled with just filenames?
I think there is a fair bit of upside to something like this. Since it would be nice to have (semi-)permanent storage of records in the SQLite database, we could leverage the synergy of the collaboration... wait, I lapsed into corporate-speak. Getting back on track: we could use the UUIDs of simulations and create a single table per simulation (MetaDataSimulationUUID) with entries for each MetaDataDatasetUUID. The tables would have columns with useful things: redshift, etc. We could then write some functions that would very easily return everything yt knows about a dataset and all of its peers, which could then be used to drive time-ordered analysis, or simply to inform a user about the Big Picture of this dataset & simulation. (I've put a rough sketch of this in a P.S. below.)

It could also be wise to include some date information in the database (the date first read in, or the last accessed date), which could be useful for user information or housekeeping. Gmail tells you when you last accessed your account; why can't yt tell you how lazy you've been?

We could do some neat things with a SQLite database, and the drawbacks are mostly already leapt over now that yt includes SQLite. I have told Matt that I would implement a SQLite abstraction layer for the merger tree, and this may be an opportunity for us to dive into that.

-- 
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
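P.S. To sketch what I'm thinking (untested, and every column name here is invented for illustration), creating a per-simulation table might look something like:

    import sqlite3

    def create_simulation_table(conn, sim_uuid):
        # One table per simulation (named after MetaDataSimulationUUID),
        # one row per MetaDataDatasetUUID.  The table name is quoted
        # because UUIDs contain hyphens.
        conn.execute("""
            CREATE TABLE IF NOT EXISTS "sim_%s" (
                dataset_uuid  TEXT PRIMARY KEY,  -- MetaDataDatasetUUID
                path          TEXT,              -- where the output lives now
                redshift      REAL,
                current_time  REAL,
                box_size      REAL,
                first_seen    TEXT,              -- housekeeping dates
                last_accessed TEXT
            )""" % sim_uuid)

    conn = sqlite3.connect("simulation_outputs.db")  # could live under ~/.yt
    create_simulation_table(conn, "11111111-2222-3333-4444-555555555555")
    conn.commit()

One table per simulation keeps each run's outputs together, at the cost of a little dynamic SQL to build the table name.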
Hi Matt & others,

I've done some more thinking about this stuff and I have some questions/thoughts I'd like your input on. I would like to set down some solid ideas of what we would like to get out of this. This will allow us to design the system better from the get-go. Here's what I've got:

- The ability to identify simulation datasets that are ordered in time. For an Enzo dataset this is easy with the UUID values. For other datatypes, I'm not so certain it would be quite so direct. There could be some kind of similarity measure based on dataset parameters, or perhaps keep it simpler and do some inference based on file paths.

- What are the important parameters of a simulation to record? Because SQLite stores data in an efficient binary format, it wouldn't be too difficult or unreasonable to store everything in an Enzo restart text file, for example, or similarly for other datatypes. There's no reason why the database can't be heterogeneous, with one table per datatype, each with data field labels of its own.

- What are the questions we'd like to be able to query? I can think of:
  * Which simulations are in this simulation's lineage?
  * Similarly, what sets of time-ordered datasets do I have?
  * What datasets are similar to this current simulation, or similar according to some set of parameters I am specifying? The similarity could be along the usual set of things: redshift, box size, resolution. (I've sketched such a query in a P.S. below.)
  * What are the differences between two simulations/datasets in my collection?

- Another thing would be to add a step where the dataset is searched for on disk, to see if it's still available. I could see this as an optional, default==False step, due to the preponderance of sluggish Lustre disk systems out there.

Thanks for the comments!

-- 
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
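P.S. Here's a rough, untested sketch of the similarity query I mention above, against the same invented per-simulation table as before:

    import sqlite3

    def find_similar(conn, table, redshift, box_size, z_tol=0.1, frac_tol=0.05):
        # Datasets within z_tol in redshift (absolute, since z can be 0)
        # and within frac_tol fractionally in box size.
        return conn.execute(
            'SELECT dataset_uuid, path FROM "%s" '
            'WHERE ABS(redshift - ?) <= ? '
            'AND ABS(box_size - ?) <= ? * box_size' % table,
            (redshift, z_tol, box_size, frac_tol)).fetchall()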
Hi Stephen,

On Mon, Apr 4, 2011 at 5:00 PM, Stephen Skory <s@skory.us> wrote:
Hi Matt & others,
I've done some more thinking about this stuff and I have some questions/thoughts I'd like your input on.
I would like to set down some solid ideas of what we would like to get out of this. This will allow us to design the system better from the get-go. Here's what I've got:
- The ability to identify simulation datasets that are ordered in time. For an Enzo dataset this is easy with the UUID values. For other datatypes, I'm not so certain it would be quite so direct. There could be some kind of similarity measure based on dataset parameters, or perhaps keep it simpler and do some inference based on file paths.
Yes, I agree. The goal should be to identify outputs from a given "simulation run," ordered in time. For Enzo this will be more straightforward than for other codes, because each output includes the UUIDs of both the current and the previous output. (My recollection is that this is why UUIDs were added in the first place.) I would say that inferring based on file paths is fine for other simulation types, as well as for Enzo datasets that were created before the UUIDs were added.
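To make that concrete, here's a rough (untested) sketch of walking the UUID chain, assuming each record carries its own UUID and its predecessor's:

    def order_by_lineage(records):
        # records: dicts with "uuid" and "previous_uuid" keys, where the
        # first output of the run has previous_uuid == None.  Assumes a
        # single unbranched chain; restarts that fork would need more care.
        by_previous = dict((r["previous_uuid"], r) for r in records)
        chain, current = [], by_previous.get(None)
        while current is not None:
            chain.append(current)
            current = by_previous.get(current["uuid"])
        return chain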
- What are the important parameters of a simulation to record? Because SQLite stores data in an efficient binary format, it wouldn't be too difficult or unreasonable to store everything in an Enzo restart text file, for example, or similarly for other datatypes. There's no reason why the database can't be heterogeneous, with one table per datatype, each with data field labels of its own.
This is where I become less certain of things. My initial feeling was that we'd want a time indicator (redshift, time, or both), a full path to the dataset, and its position in a graph of simulation outputs. It makes sense to include additional parameters, but as you note, it may end up that the only mechanism we have at the moment for doing so is a set of heterogeneous table formats. My initial hope for this database was twofold:

1) Provide a list of all "known" datasets in Reason (which we could get from parameter_files.csv).
2) Provide a shareable database of simulations with parameters (i.e., manual and simple publishing of datasets on shared filesystems).

Both of these kind of tie together. On a shared filesystem, like Kraken, one could imagine opening a dataset with:

load("db://sskory/SOME_SIMULATION_UUID/RedshiftOutput0030")

Anyway, I don't think we need to re-implement *full* parameter scraping, but it would not add so much overhead as to be undesirable.
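As a strawman -- the URL scheme, the table layout, and the question of where a user's database file lives on a shared filesystem are all hypothetical -- the resolution might look like:

    import sqlite3

    def resolve_db_path(url, db_file):
        # Turn db://user/SIM_UUID/OUTPUT_NAME into a filesystem path by
        # looking the output up in that user's database.  How we locate
        # db_file for a given user is left open here.
        user, sim_uuid, output = url[len("db://"):].split("/")
        conn = sqlite3.connect(db_file)
        row = conn.execute(
            'SELECT path FROM "sim_%s" WHERE path LIKE ?' % sim_uuid,
            ("%" + output,)).fetchone()
        return row[0] if row else None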
- What are the questions we'd like to be able to query? I can think of:
  * Which simulations are in this simulation's lineage?
  * Similarly, what sets of time-ordered datasets do I have?
  * What datasets are similar to this current simulation, or similar according to some set of parameters I am specifying? The similarity could be along the usual set of things: redshift, box size, resolution.
  * What are the differences between two simulations/datasets in my collection?
This is exactly right, and I completely agree. I don't want to completely reinvent the VO (it's done a good job of inventing itself), but instead to apply a simple layer of querying and comparison; it can be pretty DIY, I think.
- Another thing would be to add a step where the dataset is searched for on disk, to see if it's still available. I could see this as an optional, default==False step, due to the preponderance of sluggish Lustre disk systems out there.
This is where this differs from parameter_files.csv. That system acts as a FIFO of the last, say, 200 datasets. When a dataset is loaded, its unique hash/ID is looked up, and if it's found in the .csv, its entry is updated to point to the new location. Rather than providing an explicit update mechanism, it's done implicitly. I don't see why this couldn't work the same way here.

-Matt
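P.S. A rough, untested sketch of that implicit update, with the same invented column names as earlier in the thread:

    import sqlite3

    def record_load(conn, table, dataset_uuid, path):
        # Called whenever a dataset is loaded: if the UUID is already
        # known, refresh its path (the files may have moved); otherwise
        # add a new row.
        cur = conn.execute(
            'UPDATE "%s" SET path = ? WHERE dataset_uuid = ?' % table,
            (path, dataset_uuid))
        if cur.rowcount == 0:
            conn.execute(
                'INSERT INTO "%s" (dataset_uuid, path) VALUES (?, ?)' % table,
                (dataset_uuid, path))
        conn.commit()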