
Dear List,

I'm trying to speed up a piece of code that selects a subsample based on some criteria.

Setup: I have two samples, raw and cut. cut is a pure subset of raw: all elements in cut are also in raw, and cut is derived from raw by applying some cuts. Now I would like to select a random subsample of raw and find out how many are also in cut. In other words, some of those random events pass the cuts, others don't. So in principle I have

    randomSample = np.random.random_integers(0, len(raw)-1, size=sampleSize)
    random_that_pass1 = [r for r in raw[randomSample] if r in cut]

This is fine (I hope), but slow. I have seen searchsorted mentioned as a possible way to speed this up. Now it gets complicated. I'm creating a boolean array that contains True wherever a raw event is in cut:

    raw_sorted = np.sort(raw)
    cut_sorted = np.sort(cut)
    passed = np.searchsorted(raw_sorted, cut_sorted)
    raw_bool = np.zeros(len(raw), dtype='bool')
    raw_bool[passed] = True

Now I create a second boolean array that is set to True at the random values. The events I care about are the ones that pass the cuts and are selected by the random selection:

    sample_bool = np.zeros(len(raw), dtype='bool')
    sample_bool[randomSample] = True
    random_that_pass2 = raw[np.logical_and(raw_bool, sample_bool)]

The problem comes in now: random_that_pass2 and random_that_pass1 have different lengths! Sometimes one is longer, sometimes the other. I am completely at a loss to explain this. I tend to trust the slow selection leading to random_that_pass1, because it's only two lines, but I don't understand where the other selection could fail.

Unfortunately, the samples that give me trouble are 2.2 MB, so maybe a bit large to mail around, but I can put them somewhere if needed.

Thank you for your help,
Cheers,
Jan
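[Editor's note: a small, self-contained sketch of the two selection methods, with invented data; only the variable names follow the post. Two observations, not confirmed in the thread: the indices returned by searchsorted refer to raw_sorted but are then used to mask the unsorted raw, and repeated indices in randomSample (sampling is with replacement) collapse into a single True in a boolean mask while the list comprehension counts each repeat. The sketch isolates the second effect by using a correct membership mask, np.isin.]

    # Minimal reproduction of the two methods; data are invented.
    import numpy as np

    rng = np.random.RandomState(0)
    raw = rng.randint(0, 50, size=100)
    cut = raw[raw % 3 == 0]            # a value-based "cut": a subset of raw
    sampleSize = 30

    # Indices sampled WITH replacement, as random_integers does.
    randomSample = rng.randint(0, len(raw), size=sampleSize)

    # Method 1: list comprehension -- every sampled element is counted,
    # including repeats when randomSample contains the same index twice.
    pass1 = [r for r in raw[randomSample] if r in cut]

    # Method 2: boolean masks -- repeated indices collapse into a single
    # True, so duplicates are counted only once.
    sample_bool = np.zeros(len(raw), dtype=bool)
    sample_bool[randomSample] = True
    cut_bool = np.isin(raw, cut)       # value membership (np.in1d in older NumPy)
    pass2 = raw[cut_bool & sample_bool]

    # Even with a correct membership mask the lengths can differ,
    # because method 1 counts multiplicity and method 2 does not.
    print(len(pass1), len(pass2))

Both methods agree on which values pass; they disagree only on how often a repeatedly sampled index is counted.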

On Mon, Jan 25, 2010 at 1:38 PM, Jan Strube <curiousjan@gmail.com> wrote:
Dear List,
I'm trying to speed up a piece of code that selects a subsample based on some criteria.

Setup: I have two samples, raw and cut. cut is a pure subset of raw: all elements in cut are also in raw, and cut is derived from raw by applying some cuts. Now I would like to select a random subsample of raw and find out how many are also in cut. In other words, some of those random events pass the cuts, others don't. So in principle I have

    randomSample = np.random.random_integers(0, len(raw)-1, size=sampleSize)
    random_that_pass1 = [r for r in raw[randomSample] if r in cut]

This is fine (I hope), but slow.
You could construct raw2 and cut2 where each element placed in cut2 is removed from raw2:

    idx = np.random.rand(n_in_cut2) > 0.5  # for example
    raw2 = raw[~idx]
    cut2 = raw[idx]

If you concatenate raw2 and cut2, you get raw (but reordered):

    raw3 = np.concatenate((raw2, cut2), axis=0)

Any element in the subsample with an index of len(raw2) or greater is in cut. That makes counting fast.

There is a setup cost, so I guess it all depends on how many subsamples you need from one cut. Not sure any of this works, just an idea.
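[Editor's note: one runnable reading of that reordering idea, as a sketch. The partition step uses np.isin to decide cut membership (an assumption; the reply leaves it open), and the data are invented.]

    import numpy as np

    rng = np.random.RandomState(1)
    raw = rng.randint(0, 1000, size=10000)
    cut = raw[raw < 400]                  # some cut keeping a subset of raw

    # One-time setup: partition raw so failing elements come first,
    # passing elements last.
    in_cut = np.isin(raw, cut)
    raw2 = raw[~in_cut]                   # elements NOT in cut
    cut2 = raw[in_cut]                    # elements in cut
    raw3 = np.concatenate((raw2, cut2), axis=0)

    # Per subsample, counting passes is a single vectorized comparison:
    # any sampled index at or beyond len(raw2) lands in the cut region.
    sampleSize = 500
    randomSample = rng.randint(0, len(raw3), size=sampleSize)
    n_pass = (randomSample >= len(raw2)).sum()

The partition and concatenation are paid once; each subsequent subsample costs only one comparison over sampleSize indices, which is where the idea wins if many subsamples are drawn from one cut.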

On Mon, Jan 25, 2010 at 5:16 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
On Mon, Jan 25, 2010 at 1:38 PM, Jan Strube <curiousjan@gmail.com> wrote:
Dear List,
I'm trying to speed up a piece of code that selects a subsample based on some criteria.

Setup: I have two samples, raw and cut. cut is a pure subset of raw: all elements in cut are also in raw, and cut is derived from raw by applying some cuts. Now I would like to select a random subsample of raw and find out how many are also in cut. In other words, some of those random events pass the cuts, others don't. So in principle I have

    randomSample = np.random.random_integers(0, len(raw)-1, size=sampleSize)
    random_that_pass1 = [r for r in raw[randomSample] if r in cut]

This is fine (I hope), but slow.
You could construct raw2 and cut2 where each element placed in cut2 is removed from raw2:
    idx = np.random.rand(n_in_cut2) > 0.5  # for example
    raw2 = raw[~idx]
    cut2 = raw[idx]
If you concatenate raw2 and cut2 you get raw (but reordered):
    raw3 = np.concatenate((raw2, cut2), axis=0)
Any element in the subsample with an index of len(raw2) or greater is in cut. That makes counting fast.
There is a setup cost, so I guess it all depends on how many subsamples you need from one cut.
Not sure any of this works, just an idea.

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
np.in1d or np.intersect1d in arraysetops should also work; pure Python, but well constructed and tested for performance.

Josef
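[Editor's note: a minimal sketch of the arraysetops route suggested above; variable names follow the original post and the arrays are invented. The code uses np.isin, the newer spelling of in1d.]

    import numpy as np

    rng = np.random.RandomState(2)
    raw = rng.randint(0, 500, size=5000)
    cut = raw[raw % 7 == 0]

    sampleSize = 200
    randomSample = rng.randint(0, len(raw), size=sampleSize)
    sampled = raw[randomSample]

    # np.isin returns a boolean array: True where a sampled value is in
    # cut.  Summing it counts every sampled element, repeats included,
    # matching the list-comprehension version -- but vectorized.
    n_pass = np.isin(sampled, cut).sum()

This replaces the O(len(sample) * len(cut)) membership loop with a set-based operation, which is the speedup the question was after.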
participants (3)
-
Jan Strube
-
josef.pktd@gmail.com
-
Keith Goodman