[Numpy-discussion] sample without replacement
Robert Kern
robert.kern at gmail.com
Tue Dec 21 14:53:11 EST 2010
On Mon, Dec 20, 2010 at 10:28, Alan G Isaac <alan.isaac at gmail.com> wrote:
> I want to sample *without* replacement from a vector
> (as with Python's random.sample). I don't see a direct
> replacement for this, and I don't want to carry two
> PRNG's around. Is the best way something like this?
>
> permutation(myvector)[:samplesize]
For one of my personal projects, I copied over the mtrand package and
added a method to RandomState for doing this kind of thing using
reservoir sampling.
http://en.wikipedia.org/wiki/Reservoir_sampling
def subset_reservoir(self, long nselected, long ntotal, object size=None):
""" Sample a given number integers from the set [0, ntotal) without
replacement using a reservoir algorithm.
Parameters
----------
nselected : int
The number of integers to sample.
ntotal : int
The size of the set to sample from.
size : int, sequence of ints, or None
The number of subsets to sample or a shape tuple. An axis of the
length nselected will be appended to a shape.
Returns
-------
out : ndarray
The sampled subsets. The order of the items is not necessarily
random. Use a slice from the result of permutation() if you need the
order of the items to be randomized.
"""
cdef long total_size, length, i, j, u
cdef cnp.ndarray[cnp.int_t, ndim=2] out
if size is None:
shape = (nselected,)
total_size = nselected
length = 1
elif isinstance(size, int):
shape = (size, nselected)
total_size = size * nselected
length = size
else:
shape = size + (nselected,)
length = 1
for i from 0 <= i < len(size):
length *= size[i]
total_size = length * nselected
out = np.empty((length, nselected), dtype=int)
for i from 0 <= i < length:
for j from 0 <= j < nselected:
out[i,j] = j
for j from nselected <= j < ntotal:
u = <long>rk_interval(j+1, self.internal_state)
if u < nselected:
out[i,u] = j
return out.reshape(shape)
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
More information about the NumPy-Discussion
mailing list