Mailman 3 sample without replacement - NumPy-Discussion

sample without replacement

Alan G Isaac

20 Dec 2010 20 Dec '10

4:28 p.m.

I want to sample *without* replacement from a vector (as with Python's random.sample). I don't see a direct replacement for this, and I don't want to carry two PRNG's around. Is the best way something like this? permutation(myvector)[:samplesize] Thanks, Alan Isaac

Show replies by date

John Salvatier

20 Dec 20 Dec

5:13 p.m.

I think this is not possible to do efficiently with just numpy. If you want to do this efficiently, I wrote a no-replacement sampler in Cython some time ago (below). I hearby release it to the public domain. ''' Created on Oct 24, 2009 http://stackoverflow.com/questions/311703/algorithm-for-sampling-without-rep... @author: johnsalvatier ''' from __future__ import division import numpy def random_no_replace(sampleSize, populationSize, numSamples): samples = numpy.zeros((numSamples, sampleSize),dtype=int) # Use Knuth's variable names cdef int n = sampleSize cdef int N = populationSize cdef i = 0 cdef int t = 0 # total input records dealt with cdef int m = 0 # number of items selected so far cdef double u while i < numSamples: t = 0 m = 0 while m < n : u = numpy.random.uniform() # call a uniform(0,1) random number generator if (N - t)*u >= n - m : t += 1 else: samples[i,m] = t t += 1 m += 1 i += 1 return samples On Mon, Dec 20, 2010 at 8:28 AM, Alan G Isaac wrote:

...

I want to sample *without* replacement from a vector (as with Python's random.sample). I don't see a direct replacement for this, and I don't want to carry two PRNG's around. Is the best way something like this?

permutation(myvector)[:samplesize]

Thanks, Alan Isaac _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Anne Archibald

21 Dec 21 Dec

3:39 p.m.

I know this question came up on the mailing list some time ago (19/09/2008), and the conclusion was that yes, you can do it more or less efficiently in pure python; the trick is to use two different methods. If your sample is more than, say, a quarter the size of the set you're drawing from, you permute the set and take the first few. If your sample is smaller, you draw with replacement, then redraw the duplicates, and repeat until there aren't any more duplicates. Since you only do this when your sample is much smaller than the population you don't need to repeat many times. Here's the code I posted to the previous discussion (not tested this time around) with comments: ''' def choose_without_replacement(m,n,repeats=None): """Choose n nonnegative integers less than m without replacement Returns an array of shape n, or (n,repeats). """ if repeats is None: r = 1 else: r = repeats if n>m: raise ValueError, "Cannot find %d nonnegative integers less than %d" % (n,m) if n>m/2: res = np.sort(np.random.rand(m,r).argsort(axis=0)[:n,:],axis=0) else: res = np.random.random_integers(m,size=(n,r)) while True: res = np.sort(res,axis=0) w = np.nonzero(np.diff(res,axis=0)==0) nr = len(w[0]) if nr==0: break res[w] = np.random.random_integers(m,size=nr) if repeats is None: return res[:,0] else: return res For really large values of repeats it does too much sorting; I didn't have the energy to make it pull all the ones with repeats to the beginning so that only they need to be re-sorted the next time through. Still, the expected number of trips through the while loop grows only logarithmically with repeats, so it shouldn't be too bad. ''' Anne On 20 December 2010 12:13, John Salvatier wrote:

...

I think this is not possible to do efficiently with just numpy. If you want to do this efficiently, I wrote a no-replacement sampler in Cython some time ago (below). I hearby release it to the public domain.

'''

Created on Oct 24, 2009 http://stackoverflow.com/questions/311703/algorithm-for-sampling-without-rep... @author: johnsalvatier

'''

from __future__ import division

import numpy

def random_no_replace(sampleSize, populationSize, numSamples):

   samples = numpy.zeros((numSamples, sampleSize),dtype=int)

   # Use Knuth's variable names

   cdef int n = sampleSize

   cdef int N = populationSize

   cdef i = 0

   cdef int t = 0 # total input records dealt with

   cdef int m = 0 # number of items selected so far

   cdef double u

   while i < numSamples:

   t = 0

   m = 0

   while m < n :

   u = numpy.random.uniform() # call a uniform(0,1) random number generator

   if (N - t)*u >= n - m :

   t += 1

   else:

   samples[i,m] = t

   t += 1

   m += 1

   i += 1

   return samples

On Mon, Dec 20, 2010 at 8:28 AM, Alan G Isaac wrote:

...
I want to sample *without* replacement from a vector (as with Python's random.sample). I don't see a direct replacement for this, and I don't want to carry two PRNG's around. Is the best way something like this?

permutation(myvector)[:samplesize]

Thanks, Alan Isaac _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Alan G Isaac

7 p.m.

New subject: bincount and generators

bincount does not currently allow a generator as an argument. I'm wondering if it is considered too costly to extend it to allow this. (Motivation: I'm counting based on an attribute of a large number of objects, and I don't need a list of the data.) Thanks, Alan Isaac

Sturla Molden

7:33 p.m.

We often need to generate more than one such sample from an array, e.g. for permutation tests. If we shuffle an array x of size N and use x[:M] as a random sample "without replacement", we just need to put them back randomly to get the next sample (cf. Fisher-Yates shuffle). That way we get O(M) amortized complexity for each sample of size M. Only the first sample will have complexity O(N). Sturla

...

I know this question came up on the mailing list some time ago (19/09/2008), and the conclusion was that yes, you can do it more or less efficiently in pure python; the trick is to use two different methods. If your sample is more than, say, a quarter the size of the set you're drawing from, you permute the set and take the first few. If your sample is smaller, you draw with replacement, then redraw the duplicates, and repeat until there aren't any more duplicates. Since you only do this when your sample is much smaller than the population you don't need to repeat many times.

Here's the code I posted to the previous discussion (not tested this time around) with comments:

''' def choose_without_replacement(m,n,repeats=None): """Choose n nonnegative integers less than m without replacement

Returns an array of shape n, or (n,repeats). """ if repeats is None: r = 1 else: r = repeats if n>m: raise ValueError, "Cannot find %d nonnegative integers less than %d" % (n,m) if n>m/2: res = np.sort(np.random.rand(m,r).argsort(axis=0)[:n,:],axis=0) else: res = np.random.random_integers(m,size=(n,r)) while True: res = np.sort(res,axis=0) w = np.nonzero(np.diff(res,axis=0)==0) nr = len(w[0]) if nr==0: break res[w] = np.random.random_integers(m,size=nr)

if repeats is None: return res[:,0] else: return res

For really large values of repeats it does too much sorting; I didn't have the energy to make it pull all the ones with repeats to the beginning so that only they need to be re-sorted the next time through. Still, the expected number of trips through the while loop grows only logarithmically with repeats, so it shouldn't be too bad. '''

Anne

On 20 December 2010 12:13, John Salvatier wrote:

...
I think this is not possible to do efficiently with just numpy. If you want to do this efficiently, I wrote a no-replacement sampler in Cython some time ago (below). I hearby release it to the public domain.

'''

Created on Oct 24, 2009 http://stackoverflow.com/questions/311703/algorithm-for-sampling-without-rep... @author: johnsalvatier

'''

from __future__ import division

import numpy

def random_no_replace(sampleSize, populationSize, numSamples):

   samples = numpy.zeros((numSamples, sampleSize),dtype=int)

   # Use Knuth's variable names

   cdef int n = sampleSize

   cdef int N = populationSize

   cdef i = 0

   cdef int t = 0 # total input records dealt with

   cdef int m = 0 # number of items selected so far

   cdef double u

   while i < numSamples:

   t = 0

   m = 0

   while m < n :

   u = numpy.random.uniform() # call a uniform(0,1) random number generator

   if (N - t)*u >= n - m :

   t += 1

   else:

   samples[i,m] = t

   t += 1

   m += 1

   i += 1

   return samples

On Mon, Dec 20, 2010 at 8:28 AM, Alan G Isaac wrote:

...
I want to sample *without* replacement from a vector (as with Python's random.sample). I don't see a direct replacement for this, and I don't want to carry two PRNG's around. Is the best way something like this?

permutation(myvector)[:samplesize]

Thanks, Alan Isaac _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

josef.pktd＠gmail.com

2:41 a.m.

On Mon, Dec 20, 2010 at 11:28 AM, Alan G Isaac wrote:

...

I want to sample *without* replacement from a vector (as with Python's random.sample). I don't see a direct replacement for this, and I don't want to carry two PRNG's around. Is the best way something like this?

permutation(myvector)[:samplesize]

python has it in random sample( population, k) Return a k length list of unique elements chosen from the population sequence. Used for random sampling without replacement. New in version 2.3 Josef

...

Thanks, Alan Isaac _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Alan G Isaac

3:19 a.m.

On 12/20/2010 9:41 PM, josef.pktd@gmail.com wrote:

...

python has it in random

sample( population, k)

Yes, I mentioned this in my original post: http://www.mail-archive.com/numpy-discussion@scipy.org/msg29324.html But good simulation practice is perhaps to seed a simulation specific random number generator (not just rely on a global), and I don't want to pass around two different instances. So I want to get this functionality from numpy.random. Which reminds me of another question. numpy.random.RandomState accepts an int array as a seed: what is the *intended* use? Thanks, Alan

josef.pktd＠gmail.com

3:49 a.m.

On Mon, Dec 20, 2010 at 10:19 PM, Alan G Isaac wrote:

...

On 12/20/2010 9:41 PM, josef.pktd@gmail.com wrote:

...
python has it in random

sample( population, k)

Yes, I mentioned this in my original post: http://www.mail-archive.com/numpy-discussion@scipy.org/msg29324.html

But good simulation practice is perhaps to seed a simulation specific random number generator (not just rely on a global), and I don't want to pass around two different instances. So I want to get this functionality from numpy.random.

Sorry, I was reading to fast, and I might be tired. What's the difference between a numpy Random and a python random.Random instance of separate states of the random number generators? Josef

...

Which reminds me of another question. numpy.random.RandomState accepts an int array as a seed: what is the *intended* use?

Thanks, Alan _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Alan G Isaac

1:25 p.m.

On 12/20/2010 10:49 PM, josef.pktd@gmail.com wrote:

...

What's the difference between a numpy Random and a python random.Random instance of separate states of the random number generators?

Sorry, I don't understand the question. The difference for my use is that a np.RandomState instance provides access to a different set of methods, which unfortunately does not include an equivalent to random.Random's sample method but which does include others I need. Would it be appropriate to request that an analog to random.sample be added to numpy.random? (It might sample only a range, since producing indexes would provide the base functionality.) Or is this functionality absent intentionally? Alan

Robert Kern

7:53 p.m.

On Mon, Dec 20, 2010 at 10:28, Alan G Isaac wrote:

...

I want to sample *without* replacement from a vector (as with Python's random.sample). I don't see a direct replacement for this, and I don't want to carry two PRNG's around. Is the best way something like this?

permutation(myvector)[:samplesize]

For one of my personal projects, I copied over the mtrand package and added a method to RandomState for doing this kind of thing using reservoir sampling. http://en.wikipedia.org/wiki/Reservoir_sampling def subset_reservoir(self, long nselected, long ntotal, object size=None): """ Sample a given number integers from the set [0, ntotal) without replacement using a reservoir algorithm. Parameters ---------- nselected : int The number of integers to sample. ntotal : int The size of the set to sample from. size : int, sequence of ints, or None The number of subsets to sample or a shape tuple. An axis of the length nselected will be appended to a shape. Returns ------- out : ndarray The sampled subsets. The order of the items is not necessarily random. Use a slice from the result of permutation() if you need the order of the items to be randomized. """ cdef long total_size, length, i, j, u cdef cnp.ndarray[cnp.int_t, ndim=2] out if size is None: shape = (nselected,) total_size = nselected length = 1 elif isinstance(size, int): shape = (size, nselected) total_size = size * nselected length = size else: shape = size + (nselected,) length = 1 for i from 0 <= i < len(size): length *= size[i] total_size = length * nselected out = np.empty((length, nselected), dtype=int) for i from 0 <= i < length: for j from 0 <= j < nselected: out[i,j] = j for j from nselected <= j < ntotal: u = <long>rk_interval(j+1, self.internal_state) if u < nselected: out[i,u] = j return out.reshape(shape) -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

4868

Age (days ago)

4869

Last active (days ago)

List overview

Download

9 comments

6 participants

participants (6)

Alan G Isaac
Anne Archibald
John Salvatier
josef.pktd＠gmail.com
Robert Kern
Sturla Molden

sample without replacement

Alan G Isaac

John Salvatier

Anne Archibald

Alan G Isaac

Sturla Molden

josef.pktd＠gmail.com

Alan G Isaac

josef.pktd＠gmail.com

Alan G Isaac

Robert Kern

tags

participants (6)