[Numpy-discussion] Multivariate hypergeometric distribution?

Mon Jul 2 21:35:14 EDT 2012

On Mon, Jul 2, 2012 at 8:08 PM,  <josef.pktd at gmail.com> wrote:
> On Mon, Jul 2, 2012 at 4:16 PM, Fernando Perez <fperez.net at gmail.com> wrote:
>> Hi all,
>>
>> in recent work with a colleague, the need came up for a multivariate
>> hypergeometric sampler; I had a look in the numpy code and saw we have
>> the bivariate version, but not the multivariate one.
>>
>> I had a look at the code in scipy.stats.distributions, and it doesn't
>> look too difficult to add a proper multivariate hypergeometric by
>> extending the bivariate code, with one important caveat: the hard part
>> is the implementation of the actual discrete hypergeometric sampler,
>> which lives inside of numpy/random/mtrand/distributions.c:
>>
>> https://github.com/numpy/numpy/blob/master/numpy/random/mtrand/distributions.c#L743
>>
>> That code is hand-written C, and it only works for the bivariate case
>> right now.  It doesn't look terribly difficult to extend, but it will
>> certainly take a bit of care and testing to ensure all edge cases are
>> handled correctly.
>
> My only foray into this
>
> http://projects.scipy.org/numpy/ticket/921
> http://projects.scipy.org/numpy/ticket/923
>
> This looks difficult to add without a good reference and clear
> description of the algorithm.
>
>>
>> Does anyone happen to have that implemented lying around, in a form
>> that would be easy to merge to add this capability to numpy?
>
> not me, I have never even heard of multivariate hypergeometric distribution.
>
>
> maybe http://hal.inria.fr/docs/00/11/00/56/PDF/perm.pdf  p.11
> with some properties http://www.math.uah.edu/stat/urn/MultiHypergeometric.html
>
> I've seen one other algorithm, that seems to need N (number of draws
> in hypergeom) random variables for one multivariate hypergeometric
> random draw, which seems slow to me.
>
> But maybe someone has it lying around.

Now I have a pure num/sci/python version around.

A bit more than an hour, so no guarantees, but freq and pmf look close enough.

Josef

>
> Josef
>
>>
>> Thanks,
>>
>> f
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
-------------- next part --------------
# -*- coding: utf-8 -*-
"""

Created on Mon Jul 02 20:23:08 2012

Author: Josef Perktold
"""

import numpy as np

n_balls = [5, 3, 4]
n_sample = 5
size = 100000

p = len(n_balls)
n_all = sum(n_balls)
rvs = np.zeros((size, p), int)

n_bad = n_all
n_remain = n_sample * np.ones(size, int)
for ii in range(p-1):
    n_good = n_balls[ii]
    n_bad = n_bad - n_good
    rvs_ii = rvs[:,ii]
    mask = n_remain >= 1
    need = mask.sum()
    rvs_ii[mask] = np.random.hypergeometric(n_good, n_bad, n_remain[mask], size=need)
    rvs[:,ii] = rvs_ii
    n_remain = np.maximum(n_remain - rvs_ii, 0) 

rvs[:, -1] = n_sample - rvs.sum(1)
#print rvs
print rvs.mean(0) * n_all / n_sample
u, idx = np.unique(rvs.view([('', int)]*3), return_inverse=True)
u_arr = u.view(int).reshape(len(u), -1)
count = np.bincount(idx) 
freq = count * 1. / len(idx)

from scipy.misc import comb

def pmf(x, n_balls):
    x = np.asarray(x)
    n_balls = np.asarray(n_balls)
    #p = len(n_balls)
    ret = np.product(comb(n_balls, x)) / comb(n_balls.sum(), x.sum())
    return ret

print
print freq
for x,fr in zip(u_arr, freq):
    th = pmf(x, n_balls)
    print x, np.round(th, 5),  fr, np.round(fr - th, 10)