Hi Stephen,

I prefer solution number 1 because solution number 2 requires knowledge of how long things are taking to know when it's time to write out the database.  My feeling is that the time for calculating halo/halo relationships can change a lot based on the size of the simulation, the number of cores, and a variety of other things.  I think it's best not to get into engineering a solution that solves for all of those variables.

I think the most important thing here is the ability of this process to be restarted with minimal effort from the user.  If there is a way for the code to take in all of the files intended to be used, but to have knowledge of which ones have already been done, that would be the best.


On Wed, Mar 21, 2012 at 11:38 AM, Stephen Skory <s@skory.us> wrote:
Hi all,

after chatting with Britton on IRC a few days ago, I pushed some
changes that keep the SQLite I/O on the root task only. Previously
only the O was on the root task, but all tasks did the I. This change
was done to hopefully A) speed things up with fewer tasks reading off
disk and B) reduce memory usage with fopen()s and such. In my limited
testing I saw a small increase in speed on 26 data dumps (something
like 3m50s to 3m35s) excluding/precomputing the halo finding step. But
this was on a machine with a good disk and there was no chance of
running out of memory.

The point of this email is as follows. After Britton had his problems,
I re-acquainted myself with the merger tree code, and I realized there
is a bit of a problem with the way it works. In brief, in order to
reduce the amount of SQLite interaction on disk, which is slow, the
results of the merger tree (namely the halo->halo relationships) are
not written to disk until the very end. They are kept in memory up to that
point. This means that if the merger tree process is killed before the
information is saved to disk, everything is lost.

As I see it, there are a couple solutions to this.

1. When the halo relationships are saved, what actually happens is the
existing halo database is read in, and a new one is written out, and
in the process the just computed halo relationships are inserted into
the new database. This is done because SELECT (on old) and then INSERT
(on new) is orders of magnitude faster than UPDATE (on old) on databases.
I could change things such that this process is done after every new
set of halo relationships is found between pairs of data dumps. Then,
if the merger tree is killed prior to completion, not all work is lost.

2. Add a TTL parameter to MergerTree(). When the runtime of the merger
tree approaches this number, it will stop what it's doing, and write
out what it has so far.
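
To make the SELECT (on old) then INSERT (on new) idea in option 1 concrete, here is a minimal sketch using Python's sqlite3 module. It is not the actual merger tree code: the table name halos, its columns, and the function rebuild_with_links() are all made up for illustration, and the real database schema is different.

```python
import os
import sqlite3

def rebuild_with_links(old_path, new_path, links):
    """Copy every row of the old database into a fresh one, filling in
    child_id from links (dict: halo id -> child halo id) along the way.

    The 'halos' table and its columns are a hypothetical schema.
    """
    if os.path.exists(new_path):
        os.remove(new_path)  # discard a half-written file from a killed run
    old = sqlite3.connect(old_path)
    new = sqlite3.connect(new_path)
    try:
        new.execute(
            "CREATE TABLE halos (id INTEGER PRIMARY KEY, dump TEXT, "
            "mass REAL, child_id INTEGER)"
        )
        rows = old.execute("SELECT id, dump, mass, child_id FROM halos")
        # Stream rows from old to new, overriding child_id where a new
        # relationship was just computed.
        merged = (
            (hid, dump, mass, links.get(hid, child))
            for hid, dump, mass, child in rows
        )
        new.executemany("INSERT INTO halos VALUES (?, ?, ?, ?)", merged)
        new.commit()
    finally:
        old.close()
        new.close()
    # Swap the finished file into place only once it is complete.
    os.replace(new_path, old_path)
```

Calling something like this after each pair of data dumps means a killed job loses at most the pair it was working on, at the cost of rewriting the database once per pair.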

In both cases, restarts would just check to see what work has been
done already, and continue on from there.
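
A rough sketch of what that restart check could look like follows. It is an assumption-laden illustration, not the MergerTree() interface: the done_pairs bookkeeping table, the run_merger_tree() function, and the link_pair callable (a stand-in for the expensive halo/halo matching) are all hypothetical. It records each finished pair right away, as in option 1, and takes an optional ttl in seconds that simply stops before starting another pair, which is a simplified version of option 2.

```python
import sqlite3
import time

def run_merger_tree(db_path, dumps, link_pair, ttl=None):
    """Process consecutive dump pairs, skipping pairs already recorded.

    Returns True if every pair is done, False if the ttl budget ran out.
    All names here are hypothetical, not part of the real merger tree code.
    """
    start = time.monotonic()
    con = sqlite3.connect(db_path)
    try:
        con.execute(
            "CREATE TABLE IF NOT EXISTS done_pairs "
            "(parent TEXT, child TEXT, PRIMARY KEY (parent, child))"
        )
        con.commit()
        # Work finished by an earlier (possibly killed) run.
        done = set(con.execute("SELECT parent, child FROM done_pairs"))
        for pair in zip(dumps[:-1], dumps[1:]):
            if pair in done:
                continue
            # Stop cleanly once the time budget is used up.
            if ttl is not None and time.monotonic() - start >= ttl:
                return False
            link_pair(*pair)
            # Record the pair only after its work has succeeded.
            con.execute("INSERT INTO done_pairs VALUES (?, ?)", pair)
            con.commit()
        return True
    finally:
        con.close()
```

With this layout the user passes the full list of data dumps every time, and a rerun picks up at the first pair that was not recorded.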

For those of you who care, which do you think is a better solution? #1
is a bit less work for a user, but #2 is likely faster by some
disk-dependent amount.

Stephen Skory
510.621.3687 (google voice)
yt-dev mailing list