extracting a random subset of a vector
Hi all, I have an optimization problem. I currently use the following code to select a random subset of a rank-1 array:

```python
import numarray as NA
import numarray.random_array as RA

N = 1000
M = 100
full = NA.arange(N)
subset = full[RA.permutation(N)][:M]
```

However, it's quite slow (at least with N ~ 40k), and from the hotshot output it looks like it's the indexing, not the permutation, that takes the time. Does anyone have a suggestion for a faster pure-Python implementation?

thanks
Curzio Basso wrote:
you can speed it up a tiny bit by subsetting the permutation array first:

```python
subset = full[RA.permutation(N)[:M]]
```
not from my tests:

```python
import numarray.random_array as RA
import numarray as NA
import time

N = 1000000
M = 100
full = NA.arange(N)

start = time.clock()
P = RA.permutation(N)
print "permutation took %F seconds" % (time.clock() - start)

start = time.clock()
subset = full[P[:M]]
print "subsetting took %F seconds" % (time.clock() - start)
```

which results in:

```
permutation took 1.640000 seconds
subsetting took 0.000000 seconds
```

so it's the permutation that takes the time, as I suspected.

What would really speed this up is a random_array.non-repeat-randint() function, written in C. That way you wouldn't have to permute the entire N values when you really just need M of them. Does anyone else think this would be a useful function? I can't imagine it would be that hard to write.

If M <<< N, then you could probably write a little function in Python that called randint and removed the repeats. If M is only a little smaller than N, this would be slow.

-Chris -- Christopher Barker, Ph.D. Oceanographer NOAA/OR&R/HAZMAT (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
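For what it's worth, here is a minimal sketch of the "call randint and drop the repeats" idea described above, assuming numarray.random_array.randint(low, high, shape) returns integers in [low, high) as in Numeric's RandomArray; it is only an illustration of the approach, not a tested replacement for permutation():

```python
# Hedged sketch of the "randint + drop repeats" approach, useful when M
# is much smaller than N.  Assumes RA.randint(low, high, shape) exists and
# returns integers in [low, high); not part of numarray itself.
import numarray as NA
import numarray.random_array as RA

def sample_no_repeats(N, M):
    """Return an array of M distinct indices drawn from range(N)."""
    chosen = {}                                  # dict used as a set
    while len(chosen) < M:
        # Draw a fresh batch of candidates and keep the unseen ones.
        for idx in RA.randint(0, N, (M - len(chosen),)):
            chosen[int(idx)] = True
    return NA.array(chosen.keys())

full = NA.arange(1000000)
subset = full[sample_no_repeats(len(full), 100)]
```

As noted above, this only pays off when M <<< N; as M approaches N the loop re-draws more and more already-seen indices.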
Chris Barker wrote:
A question about the method: isn't it a bit risky to use clock() for timing performance? The usual argument is that the CPU allocates time to different processes, and the allocation can vary. That's why I used the profiler. Anyway, I performed another test with the profiler, on the example you also used, and I also found that most of the time is spent in permutation() (2.264 of 2.273 seconds).

regards
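For reference, here is a small sketch of the kind of profiler run described above, using the standard-library profile module (hotshot can be driven similarly); the sizes are only illustrative:

```python
# Hedged sketch: profile the original subset extraction to see where the
# time goes.  Uses the stdlib "profile" module; the per-function breakdown
# shows how much of the total is spent inside RA.permutation().
import profile
import numarray as NA
import numarray.random_array as RA

N = 1000000
M = 100
full = NA.arange(N)

def pick_subset():
    return full[RA.permutation(N)][:M]

if __name__ == '__main__':
    profile.run('pick_subset()')
```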
Curzio Basso wrote:
that's why I use time.clock() rather than time.time().
That's why I used the profiler.
For order of magnitude estimates, any of these works fine. -Chris -- Christopher Barker, Ph.D. Oceanographer NOAA/OR&R/HAZMAT (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
[Replying to the list instead of just Jp. Sorry, Jp! Mail readers will be the death of me.] Jp Calderone wrote:
FWIW, the idiom recommended by Tim Peters is the following:

```python
import time
import sys

if sys.platform == 'win32':
    now = time.clock
else:
    now = time.time
```

and then using now() to get the current time.
Jp
-- Robert Kern rkern@ucsd.edu "In the fields of hell where the grass grows high Are the graves of dreams allowed to die." -- Richard Harter
Robert Kern wrote: [...] than time.time()) on Win32.
Ok, now I'm really confused... From the docs of the 'time' module, the clock function "returns the current processor time as a floating point number expressed in seconds." AFAIK, the processor time is not the time spent in the process calling the function. Or is it? Anyway, the docs also say "this is the function to use for benchmarking Python or timing algorithms", so if processor time is good enough, then use time.clock() and not time.time(), regardless of the system, right?
Curzio Basso wrote:
I think that the documentation is wrong. C.f. http://groups.google.com/groups?selm=mailman.1475.1092179147.5135.python-lis...

And the relevant snippet from timeit.py:

```python
if sys.platform == "win32":
    # On Windows, the best timer is time.clock()
    default_timer = time.clock
else:
    # On most other platforms the best timer is time.time()
    default_timer = time.time
```

I will note from personal experience that on Macs, time.clock is especially bad for benchmarking.

-- Robert Kern rkern@ucsd.edu "In the fields of hell where the grass grows high Are the graves of dreams allowed to die." -- Richard Harter
Robert Kern wrote:
Well, this is what I have in my timing code:

```python
# Basic timing functionality.
# If possible (Unix), use the resource module instead of time.clock().
import time

try:
    import resource

    def clock():
        """clock() -> floating point number

        Return the CPU time in seconds (user time only, system time is
        ignored) since the start of the process.  This is done via a call
        to resource.getrusage, so it avoids the wraparound problems in
        time.clock()."""
        return resource.getrusage(resource.RUSAGE_SELF)[0]

except ImportError:
    clock = time.clock
```

I'm not about to argue with Tim Peters, so I may well be off-base here. But by using resource, I think I can get the CPU time actually allocated to my own process by the kernel (not wall-clock time), without the wraparound problems inherent in time.clock (which make it useless for timing long-running code).

Best, f
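A small usage sketch along the same lines (self-contained, falling back to time.clock where the resource module is unavailable); the workload is arbitrary and only there to have something to time:

```python
# Hedged usage sketch: CPU-time a snippet with a resource-based clock(),
# falling back to time.clock() on platforms without the resource module.
import time
try:
    import resource
    def clock():
        return resource.getrusage(resource.RUSAGE_SELF)[0]   # user CPU time
except ImportError:
    clock = time.clock

t0 = clock()
total = sum(xrange(1000000))              # arbitrary CPU-bound workload
print "CPU time: %.3f seconds" % (clock() - t0)
```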
Robert Kern wrote:
FWIW, the idiom recommended by Tim Peters is the following:
Thanks. Yet another reason that having the implementation determined by the underlying C library is a pain! Why not just have time() and clock() return the same thing under win32? And does Windows really have no way to get what a Unix clock() gives you?

-Chris -- Christopher Barker, Ph.D. Oceanographer NOAA/OR&R/HAZMAT (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Tue, 31 Aug 2004, Curzio Basso wrote:
Here's a slightly faster version. It's about 3x faster than Chris Barker's version (4x faster than your original version) for N=1000000, M=100:

```python
import numarray as NA
import numarray.random_array as RA
from math import sqrt

N = 1000000
M = 100
full = NA.arange(N)

r = RA.random(N)
thresh = (M + 3*sqrt(M)) / N
subset = NA.compress(r < thresh, full)
while len(subset) < M:                    # rarely executed
    thresh = thresh + 3*sqrt(M)/N
    subset = NA.compress(r < thresh, full)
subset = subset[RA.permutation(len(subset))[:M]]
```

By the way, I also find that most of the time gets spent in the permutation computation. That's why this is faster -- it gets to do a smaller permutation.

Rick
Participants (6):

- Chris Barker
- Curzio Basso
- Fernando Perez
- Jp Calderone
- Rick White
- Robert Kern