[AstroPy] Questions re: Markov Chain Monte Carlo

Tue Jun 4 05:23:34 EDT 2013

Hi Eric,

I know very little about MCMC, but my first thought is that the idea of rounding the parameter values sounds a little dodgy. On the other hand, some quick googling suggests that doing MCMC in discrete (as opposed to continuous) parameter spaces -- which is more or less what you're proposing -- isn't unknown, and there are even some codes specialized for that, e.g.

http://www.hydrol-earth-syst-sci.net/15/3701/2011/hess-15-3701-2011.html

It might be that using emcee (which assumes a continuous parameter space) might be somewhat less efficient than an approach built to account for discrete parameter spaces, but that may not matter much.

> So I'm imagining something like this, but I wonder if others have suggestions or alternatives:
> 
> - Each time the Monte Carlo calculation makes a step in parameter space and chooses a set of parameters (i.e. a position vector in parameter space), do the following: 
> 	- Check to see if the directory corresponding to that set of parameters exists; if it doesn't, then create it and proceed with the calculation. 
> 	- If it does exist already, check to see if the final output image exists; if so, then use it.  If not, then presumably another process is working on creating the image; sleep for some amount of time and keep checking back until the image appears. (The 'emcee' MCMC implementation uses multiple subprocesses that explore parameter space simultaneously, so it would be possible for two processes to arrive at the same place in parameter space at the same time.)  One problem here is that if some process fails to finish a calculation, then other processes arriving at that space will get stuck there indefinitely while they wait; so maybe the wait needs to be finite, but if so, I'm not sure what gets returned after that finite wait; maybe just an error. 

Just to confirm something: it sounds like you're saying that the result of starting with the same set of parameters (Mdisk, Rdisk, T, etc.) will be the exact same output FITS image; but somehow different iterations of the MC process, using the same output model FITS image, can give you different chi^2 values? Which doesn't really make sense to me -- it implies that a given set of parameters (location in the parameter space) does not uniquely determine the chi^2 value.

Assuming that's not actually true -- i.e., assuming that same set of parameters --> same output model FITS image --> same chi^2 -- I think a simpler approach would be to cache the chi^2 values in a data structure which is indexed by the parameter values. Checking the cache might itself be faster than checking the disk for the existence of a given directory, since you avoid any disk I/O delays; and this way you avoid having the recalculate chi^2 for a given model image once you've done it the first time. One way of doing this in Python would be a dictionary indexed by tuples of the parameter values. So the above stage of your algorithm would look something like this:

- Each time the Monte Carlo calculation makes a step in parameter space and chooses a set of parameters (i.e. a position vector in parameter space), do the following: 
	1. Check to see if there is already an entry in the chi^2 cache data structure corresponding to those [possibly rounded] parameter values, e.g. 
	if (Mdisk, Rdisk, T, ...) in chi2ValuesDict:
		# do step 2 below
	else:
		# do step 3 below
	2. If it's already there, just take the corresponding chi^2 value from the dictionary:
	currentChi2 = chi2ValuesDict[(Mdisk, Rdisk, T, ...)]
	3. If it's *not* already done, then generate a new model. You can get a crude (though not
	perfect) locking approach by first placing a dummy value in the chi^2 cache:
		A. chi2ValuesDict[(Mdisk, Rdisk, T, ...)] = 0  # (from this point on, any other
		emcee subprocess which queries the dictionary will get the message that an entry
		already exists -- or will -- for these parameter values)
		B. Do the full directory setup, image calculation, and chi^2 calculation
		C. Store the new chi^2 in the dictionary, and return it to the emcee process:
		chi2ValuesDict[(Mdisk, Rdisk, T, ...)] = chi2_from_new_calculation
		currentChi2 = chi2_from_new_calculation

This *does* require that the dictionary chi2ValuesDict  be a *shared* data structure that all the emcee subprocesses can read and write to; I don't know if emcee permits that sort of thing. If not, you can probably still save *some* time by having it be a single file on disk -- i.e., you still avoid wasting time re-calculating chi^2 values for identical model images.

(Naturally, this really only works if you *do* round the parameters, because otherwise the odds that any given step in the MC calculation will pick the exact same set of parameter values is (just about) infinitesimal.)

cheers,

Peter

=============================================================
Peter Erwin                   Max-Planck-Insitute for Extraterrestrial 
erwin at mpe.mpg.de              Physics, Giessenbachstrasse
tel. +49 (0)176 2481 7713     85748 Garching, Germany
fax  +49 (0)89 30000 3495     http://www.mpe.mpg.de/~erwin