(Summary: skip to the bottom and help decide what records should be in a minimal simulation database. Please take the time to contribute to this discussion; it likely affects you, even if you don't think it does!)

Currently, in order to support pickling data objects in external-to-the-.yt-file formats, yt keeps a record of the most recent N (where N is usually 500-1000) parameter files it has personally seen. This is stored in a CSV file, typically in ~/.yt/parameter_files.csv . The reasons for using CSV are pretty reactionary: there were for a while difficulties with getting sqlite everywhere, shelve was unreliable and difficult to sort unless you used a database backend, and I wanted a single file. So we *already* have a simulation output database -- it's just in CSV. If you check, you'll find you have one.

Very briefly, let me motivate why we have the .csv file (this is also covered in the method paper) and how it is modified. The idea is that, before we had UUIDs in parameter files, it was challenging to identify a parameter file uniquely. For instance, one occasionally needs to know "Did I mean *this* DD0070, or *that* DD0070?" So using the CurrentTimeIdentifier (falling back, for other codes, on the ST_CTIME, which is not immensely reliable) plus a couple other pieces of information -- the path and a few bits of metadata -- a hash was constructed for each parameter file. Every time you load a parameter file in yt, it looks in this .csv and, if necessary, updates the *path* for a given *hash*.

Since the time of implementation, UUIDs have been inserted into Enzo; I have been in contact with a couple of other code developers, and there is a remote, outside possibility that such a rider could be added to some of their codes. I am not holding my breath, so we will continue falling back on this hash, which is for all intents and purposes likely to be Universally Unique, although it may not follow the code and so may not serve as a consistent Identifier.

However, as shows up once in a while on the mailing lists, the .csv file can be problematic. My approach was conservative, which leads to lots of small bits of IO on that file. (Not cool.) Updating requires rewriting the entire file. (Not cool.) If you have multiple processes going, sometimes the file gets corrupted. (Not cool.) Plus, it's just not a terribly easily-queryable format.

Recently, inspired by something Tom talked about at the Enzo Spring Workshop, I put into Enzo the ability to write out a set of information about a simulation output at the same time the data itself is written. The format for this is SQLite3, which is simple, easy to install, and -- most importantly for my near-obsessive goals -- has been installed with yt for a couple of months now. This last weekend, I was able to insert the first pass at a transition to using this format of database in yt, in my fork:

https://bitbucket.org/MatthewTurk/yt/changeset/1b685b6bfbf2#chg-yt/utilities...

In this commit I also add an ORM called "peewee" that abstracts all of the SQL away; this way we can use a SQLite database without having any raw SQL strings in yt. (A rough sketch of what that looks like follows the list below.) The main changes boil down to:

 * No more CSV; everything is sqlite, which handles multi-process locking and concurrency.
 * yt reads from the same database that Enzo writes out, and that other codes could choose to write out if they wanted.
 * Uses UUIDs instead of hashes. (This could potentially break existing pickles, but I plan to provide a migration strategy.)
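To give a flavor of what the peewee side looks like, here is a rough, untested sketch -- not the code from the fork -- of a model covering a subset of the fields proposed further down, plus a query against it. The class name, database path, and example values are invented, and I'm writing against the public peewee API, so the details in the actual changeset will differ:

    import time
    import uuid

    from peewee import SqliteDatabase, Model, CharField, FloatField

    # Hypothetical location; in practice this would live under ~/.yt/
    db = SqliteDatabase("parameter_files.db")

    class OutputRecord(Model):
        # Field names follow the proposed schema below; the class name is made up.
        dset_uuid = CharField(primary_key=True)
        output_type = CharField()
        pf_path = CharField()
        creation_time = FloatField()
        last_seen_time = FloatField()
        simulation_uuid = CharField()
        redshift = FloatField()
        time = FloatField()

        class Meta:
            database = db

    db.connect()
    db.create_tables([OutputRecord], safe=True)

    # Registering (or re-seeing) an output -- note: no raw SQL strings anywhere.
    now = time.time()
    rec, created = OutputRecord.get_or_create(
        dset_uuid=str(uuid.uuid4()),
        defaults=dict(output_type="enzo", pf_path="/data/sim/DD0070/DD0070",
                      creation_time=now, last_seen_time=now,
                      simulation_uuid=str(uuid.uuid4()),
                      redshift=0.5, time=1.0))

    # The kind of query this buys us: all outputs from one simulation,
    # sorted by redshift, without touching the file system.
    outputs = (OutputRecord
               .select()
               .where(OutputRecord.simulation_uuid == rec.simulation_uuid)
               .order_by(OutputRecord.redshift))
    for o in outputs:
        print(o.pf_path, o.redshift)

The Reason "file open" idea mentioned below would then just be a select() over this table, ordered however the user likes.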
There are some big advantages to having an output database -- one could select all outputs based on the simulation they derived from (if such a field is available, as it is for Enzo), one could call load() with just a UUID or hash, and we can provide a Reason "file open" GUI that doesn't require touching the file system explicitly. (The idea there is that Reason would pop up a grid of all available parameter files, with their times, redshifts, etc., and you would choose from it.) Plus, *other* utilities like Stranger, Jacques, etc. could use it. And long term, Tom's vision of a universal simulation database becomes a *lot* more tractable if we have a firm starting point.

However, before this can be accepted into mainline, there are three things that need to be decided. Here are the currently-included fields:

  dset_uuid
  output_type
  pf_path
  creation_time
  last_seen_time
  simulation_uuid
  redshift
  time
  topgrid0
  topgrid1
  topgrid2

1) Thumbs up or thumbs down on moving to SQLite from CSV?
2) Should any additional fields be *added*?
3) Should any of these fields be *removed*?

(I have opinions on this, but I would like to hear from others first.) I think this is the wrong place to put everything there is to know about a simulation; the parameter file exists for that, and every field is an additional overhead of space, complexity, etc. But it would be nice if the things that people *commonly* want to query on or *sort* on were in here.

What else?

-Matt