[Numpy-discussion] Generating random samples without repeats

Fri Sep 19 05:08:20 EDT 2008

Robert Kern <robert.kern <at> gmail.com> writes:
> On Thu, Sep 18, 2008 at 16:55, Paul Moore <pf_moore <at> yahoo.co.uk> wrote:
> > I want to generate a series of random samples, to do simulations based
> > on them. Essentially, I want to be able to produce a SAMPLESIZE * N
> > matrix, where each row of N values consists of either
> >
> > 1. Integers between 1 and M (simulating N rolls of an M-sided die), or
> > 2. A sample of N numbers between 1 and M without repeats (simulating
> >    deals of N cards from an M-card deck).
> >
> > Example (1) is easy: numpy.random.random_integers(1, M, (SAMPLESIZE, N))
> >
> > But I can't find an obvious equivalent for (2). Am I missing something
> > glaringly obvious? I'm using numpy - is there maybe something in scipy I
> > should be looking at?
> 
> numpy.array([(numpy.random.permutation(M) + 1)[:N]
>     for i in range(SAMPLESIZE)])
> 

Thanks.

However, this takes over 70s and peaks at around 400M of memory use, whereas the 
equivalent for (1)

numpy.random.random_integers(1,M,(SAMPLESIZE,N))

takes less than half a second and negligible working memory. Both end up 
allocating an array of the same size, but your suggestion also consumes 
temporary working memory. I suspect, but can't prove, that the time taken 
comes from memory allocations rather than computation.
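
One way to avoid the per-row Python loop is to draw one random key per 
(row, card) and argsort each row, which gives an independent permutation per 
row in a single vectorised call. This is only a sketch under assumptions: the 
function name and the example sizes are hypothetical, it uses 
numpy.random.default_rng (newer numpy releases), and it still allocates 
SAMPLESIZE * M temporaries, so it may not help the memory peak.

```python
import numpy as np

def sample_without_repeats(samplesize, n, m, rng=None):
    """Return a (samplesize, n) array; each row holds n distinct integers in 1..m."""
    if n > m:
        raise ValueError("cannot draw more distinct values than the deck holds")
    rng = np.random.default_rng() if rng is None else rng
    # One uniform random key per (row, card). Sorting the keys along each row
    # yields an independent random permutation of 0..m-1 for that row.
    keys = rng.random((samplesize, m))
    order = np.argsort(keys, axis=1)
    # Keep the first n positions of each permutation and shift to 1..m.
    return order[:, :n] + 1

# Hypothetical sizes: 1000 deals of 5 cards from a 52-card deck.
hands = sample_without_repeats(1000, 5, 52, rng=np.random.default_rng(0))
```

If the temporaries are too large, the same function could be called on 
chunks of rows and the results concatenated.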

As a one-off cost initialising my data, it's not a disaster, but I anticipate 
using idioms like this later in my calculations as well, where the costs could 
hurt more.

If I'm going to need to write C code, are there any good examples of this? (I 
guess the source for numpy.random is a good place to start).
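
Before dropping to C, it may be worth prototyping the algorithm one would 
port. A partial Fisher-Yates shuffle only needs N swap steps per row rather 
than a full permutation of M items. The sketch below is Python, not C, and 
applies each swap step to all rows at once; the function name and sizes are 
hypothetical, and it assumes numpy.random.default_rng is available.

```python
import numpy as np

def deal_partial_shuffle(samplesize, n, m, rng=None):
    """Deal n distinct cards from 1..m for each of samplesize rows."""
    if n > m:
        raise ValueError("cannot deal more cards than the deck holds")
    rng = np.random.default_rng() if rng is None else rng
    # One ordered deck per row.
    decks = np.tile(np.arange(1, m + 1), (samplesize, 1))
    rows = np.arange(samplesize)
    for k in range(n):
        # For every row, pick a swap partner uniformly from positions k..m-1.
        j = rng.integers(k, m, size=samplesize)
        picked = decks[rows, j].copy()
        decks[rows, j] = decks[rows, k]
        decks[rows, k] = picked
    # The first n columns now hold a uniform sample without repeats.
    return decks[:, :n]

deals = deal_partial_shuffle(1000, 5, 52, rng=np.random.default_rng(1))
```

The Python loop here runs N times instead of SAMPLESIZE times, which should 
matter when SAMPLESIZE is much larger than N. In a C version the same swap 
step would sit inside a loop over rows.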

Paul