[SciPy-User] issue estimating the multivariate Polya distribution: why?

Sun Mar 4 00:54:47 EST 2012

On Sat, Mar 3, 2012 at 5:51 PM,  <josef.pktd at gmail.com> wrote:
> On Sat, Mar 3, 2012 at 6:24 PM,  <josef.pktd at gmail.com> wrote:
>> On Sat, Mar 3, 2012 at 4:05 PM, Emanuele Olivetti
>> The probability that you want to calculate is very low
>>
>>>>> np.exp(-51.737395965)
>> 3.3941765165211696e-23
>>
>> For larger values it seems to work fine, but it deteriorates fast when
>> the loglikelihood drops below -15 or so (with the versions I have
>> installed).
>
> just an observation
>
> with iterations=1e7 I get much better numbers, which are still way
> off. But I don't see why this should matter much, since you are
> simulating alpha and not low probability events. (unless lot's of tiny
> errors add up in different ways)
>
> Josef

I'm a little out of my field here, so take this with a grain of salt.

I think Josef's observation is the key; the problem is the number of
samples in the MC is too low. This distribution seems very, very
skewed; if you plot the actual values in the second case
(specifically, exp(logp_Hs)) - some of it underflows, obviously, but
if you plot it in linear scale, it appears to be dominated by 1 or 2
large "outlier" values. The final mean value is largely dependent on
only those outliers. The MC just would require *a lot* more samples to
get a few realizations that would pull up the mean to more accurately
match the analytic prediction.

Try this test: compute the MC mean as the number of samples increase;
for example (this will take a few minutes to compute - the spacing in
the iteration number is overdone but when you plot it, you get some
rough approximation to the scatter and expectation value for the MC
estimate)

X = array([100,0,0])
alpha = array([1,10,10])
n_iter = 10**linspace(3,6,161)
test_logmeans = np.zeros(n_iter.shape)
for n in range(n_iter.shape[0]):
    test_logmeans[n] = log_multivariate_polya_montecarlo(X, alpha,
int(n_iter[n]))

If you plot test_logmeans, it clearly shows a negative bias (relative
to the analytic prediction) that decreases as the sample size
increases.

Cheers,
Aronne