I know this question came up on the mailing list some time ago (19/09/2008), and the conclusion was that yes, you can do it more or less efficiently in pure python; the trick is to use two different methods. If your sample is more than, say, a quarter the size of the set you're drawing from, you permute the set and take the first few. If your sample is smaller, you draw with replacement, then redraw the duplicates, and repeat until there aren't any more duplicates. Since you only do this when your sample is much smaller than the population you don't need to repeat many times. Here's the code I posted to the previous discussion (not tested this time around) with comments: ''' def choose_without_replacement(m,n,repeats=None): """Choose n nonnegative integers less than m without replacement Returns an array of shape n, or (n,repeats). """ if repeats is None: r = 1 else: r = repeats if n>m: raise ValueError, "Cannot find %d nonnegative integers less than %d" % (n,m) if n>m/2: res = np.sort(np.random.rand(m,r).argsort(axis=0)[:n,:],axis=0) else: res = np.random.random_integers(m,size=(n,r)) while True: res = np.sort(res,axis=0) w = np.nonzero(np.diff(res,axis=0)==0) nr = len(w[0]) if nr==0: break res[w] = np.random.random_integers(m,size=nr) if repeats is None: return res[:,0] else: return res For really large values of repeats it does too much sorting; I didn't have the energy to make it pull all the ones with repeats to the beginning so that only they need to be re-sorted the next time through. Still, the expected number of trips through the while loop grows only logarithmically with repeats, so it shouldn't be too bad. ''' Anne On 20 December 2010 12:13, John Salvatier <jsalvati@u.washington.edu> wrote:
I think this is not possible to do efficiently with just numpy. If you want to do this efficiently, I wrote a no-replacement sampler in Cython some time ago (below). I hearby release it to the public domain.
'''
Created on Oct 24, 2009 http://stackoverflow.com/questions/311703/algorithm-for-sampling-without-rep... @author: johnsalvatier
'''
from __future__ import division
import numpy
def random_no_replace(sampleSize, populationSize, numSamples):
samples = numpy.zeros((numSamples, sampleSize),dtype=int)
# Use Knuth's variable names
cdef int n = sampleSize
cdef int N = populationSize
cdef i = 0
cdef int t = 0 # total input records dealt with
cdef int m = 0 # number of items selected so far
cdef double u
while i < numSamples:
t = 0
m = 0
while m < n :
u = numpy.random.uniform() # call a uniform(0,1) random number generator
if (N - t)*u >= n - m :
t += 1
else:
samples[i,m] = t
t += 1
m += 1
i += 1
return samples
On Mon, Dec 20, 2010 at 8:28 AM, Alan G Isaac <alan.isaac@gmail.com> wrote:
I want to sample *without* replacement from a vector (as with Python's random.sample). I don't see a direct replacement for this, and I don't want to carry two PRNG's around. Is the best way something like this?
permutation(myvector)[:samplesize]
Thanks, Alan Isaac _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion