sampling from frequency distribution / histogram without replacement
duncan smith
duncan at invalid.invalid
Fri Jan 18 21:29:54 EST 2019
On 14/01/2019 20:11, duncan smith wrote:
> Hello,
> Just checking to see if anyone has attacked this problem before
> for cases where the population size is unfeasibly large, i.e. the number
> of categories is manageable, but the sum of the frequencies, N,
> precludes simple solutions such as creating a list, shuffling it and
> using the first n items to populate the sample (frequency distribution /
> histogram).
>
> I note that numpy.random.hypergeometric will allow me to generate a
> sample when I only have two categories, and that I could probably
> implement some kind of iterative / partitioning approach calling this
> repeatedly. But before I do I thought I'd ask if anyone has tackled this
> before. Can't find much on the web. Cheers.
>
> Duncan
>
After much tinkering I came up with the following:
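from itertools import chain

import numpy as np


def hypgeom_variate(freqs, n):
    # recursive partitioning approach: split the categories in half,
    # draw how many of the m items fall in the left half, and repeat
    sample = [0] * len(freqs)
    # cumsum[j] is the total frequency of categories 0 .. j-1
    cumsum = np.cumsum(list(chain([0], freqs)))
    if n > cumsum[-1]:
        raise ValueError('n cannot be greater than population size')
    hypergeometric = np.random.hypergeometric
    # work list of (first category, end category, cumsum[first],
    # cumsum[end], items to allocate); it grows while we iterate over it.
    # Left empty when n == 0, which avoids asking for a zero-size draw.
    argslist = [(0, len(freqs), 0, cumsum[-1], n)] if n else []
    for i, k, ci, ck, m in argslist:
        if k == i + 1:
            # a single category is left, so it takes all m items
            sample[i] = m
        else:
            j = (i + k) // 2
            cj = cumsum[j]
            x = hypergeometric(cj - ci, ck - cj, m, 1)[0]
            y = m - x
            if x:
                argslist.append((i, j, ci, cj, x))
            if y:
                argslist.append((j, k, cj, ck, y))
    return sample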
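
For comparison, the plainer iterative scheme mentioned in the original question (one hypergeometric draw per category, conditioning on what is left of the population and of the sample) might look something like the sketch below. It is only a sketch: the name hypgeom_variate_sequential is made up for illustration, it assumes non-negative integer frequencies, and it relies on the same legacy np.random.hypergeometric call used in the post.

```python
import numpy as np


def hypgeom_variate_sequential(freqs, n):
    # Hypothetical helper, not from the original post: draw a sample of
    # size n without replacement, allocating one category at a time.
    total = sum(freqs)      # population size not yet considered
    if n < 0 or n > total:
        raise ValueError('n must be between 0 and the population size')
    sample = [0] * len(freqs)
    remaining = n           # items still to be allocated
    for i, f in enumerate(freqs):
        if remaining == 0:
            break
        if f == 0:
            continue        # an empty category can never be drawn
        rest = total - f    # population in the categories after i
        if rest == 0:
            # only this category is left, so it supplies everything
            x = remaining
        else:
            # how many of the remaining draws land in category i
            x = int(np.random.hypergeometric(f, rest, remaining))
        sample[i] = x
        remaining -= x
        total = rest
    return sample


# Example call: a few small categories next to one very large one.
freqs = [5, 0, 3, 10**7]
s = hypgeom_variate_sequential(freqs, 1000)
```

This makes at most one draw per category, so it should be adequate while the number of categories stays manageable; the partitioning version above can stop early on whole halves that receive no items.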
Cheers.
Duncan