[Numpy-discussion] extracting a random subset of a vector

Tue Aug 31 10:41:11 EDT 2004

Curzio Basso wrote:

> import numarray as NA
> import numarray.random_array as RA
> 
> N = 1000
> M = 100
> full = NA.arange(N)
> subset = full[RA.permutation(N)][:M]
> 
> ---------------------------------------------------------
> 
> However, it's quite slow (at least with N~40k), 

you can speed it up a tiny bit by subsetting the permutation array first:
subset = full[ RA.permutation(N)[:M] ]

> and from the hotshot 
> output it looks like it's the indexing, not the permutation, which takes 
> time.

not from my tests:

import numarray.random_array as RA
import numarray as NA
import time

N = 1000000
M = 100
full = NA.arange(N)

start = time.clock()
P = RA.permutation(N)
print "permutation took %F seconds"%(time.clock() - start)
start = time.clock()
subset = full[P[:M]]
print "subsetting took %F seconds"%(time.clock() - start)

which results in:
permutation took 1.640000 seconds
subsetting took 0.000000 seconds

so it's the permutation that takes the time, as I suspected. What would 
really speed this up is a random_array.non_repeat_randint() function, 
written in C. That way you wouldn't have to permute all N values 
when you really just need M of them.

Does anyone else think this would be a useful function? I can't imagine 
it would be that hard to write.
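
To make the idea concrete, here is a rough pure-Python sketch (not the C 
function itself) of drawing M distinct indices without permuting all N 
values. It uses a partial Fisher-Yates shuffle and only remembers the 
slots it has touched. The name sample_without_replacement and the rng 
parameter are made up for illustration, and it relies on the standard 
random module rather than anything in numarray:

```python
import random

def sample_without_replacement(n, m, rng=random):
    """Return m distinct integers from range(n) via a partial Fisher-Yates shuffle.

    Hypothetical helper for illustration; not part of numarray.
    """
    if not 0 <= m <= n:
        raise ValueError("need 0 <= m <= n")
    # Sparse view of the shuffled array: slot k holds k unless recorded here.
    # Only touched slots are stored, so memory is O(m) instead of O(n).
    swapped = {}
    result = []
    for i in range(m):
        # Pick a slot from the part of the array that is not yet fixed.
        j = rng.randrange(i, n)
        value_j = swapped.get(j, j)
        value_i = swapped.get(i, i)
        # Conceptually swap slots i and j; slot i is never read again,
        # so only slot j needs to be recorded.
        swapped[j] = value_i
        result.append(value_j)
    return result
```

The loop runs M times regardless of N, so the cost should scale with the 
size of the subset rather than the size of the full array. With m == n 
it degenerates into an ordinary full shuffle.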

If M << N, then you could probably write a little function in Python 
that called randint and removed the repeats. If M is only a little 
smaller than N, this would be slow, because most draws would be repeats.
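
Something along these lines, as a minimal sketch only: it draws with the 
standard library's random.randrange instead of random_array.randint, and 
the function name sample_by_rejection is invented for this example:

```python
import random

def sample_by_rejection(n, m, rng=random):
    """Return m distinct integers from range(n) by redrawing on repeats.

    Hypothetical helper for illustration; reasonable only when m is
    much smaller than n.
    """
    if not 0 <= m <= n:
        raise ValueError("need 0 <= m <= n")
    seen = set()      # values already drawn, for fast repeat checks
    result = []       # values in the order they were first drawn
    while len(result) < m:
        k = rng.randrange(n)
        if k not in seen:   # throw away repeats and draw again
            seen.add(k)
            result.append(k)
    return result
```

As M approaches N, almost every draw hits a value that is already in the 
set, which is where the slowdown comes from.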

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

NOAA/OR&R/HAZMAT         (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov