efficient use of numpy.where() and .any()
Greetings:
In some of my code, I need to use large matrix of random numbers that meet specific criteria (i.e., some random numbers need to be removed and replaces with new ones).
I have been working with .any() and .where() to facilitate this process. In the code below, .any() is used in a while loop to test for the presence of matrix elements that do not meet criteria. After that, .where() is used to obtain the tuple of indices of elements that do not meet criteria. I then use iteration over the tuple of indices and replace each 'bad' random number one at a time.
Here's an example:
a=numpy.random.normal(0,1,(3,3)) a
array([[ 0.79653228, 2.28751484, 1.15989261], [0.7549573 , 2.35133816, 0.22551564], [ 0.85666713, 1.17484243, 1.18306248]])
while (a<0).any() or (a>1).any():
ind=numpy.where(a<0) for aa in xrange(len(ind[0])): a[ind[0][aa],ind[1][aa]]=numpy.random.normal(0,1) ind=numpy.where(a>1) for aa in xrange(len(ind[0])): a[ind[0][aa],ind[1][aa]]=numpy.random.normal(0,1)
a
array([[ 0.79653228, 0.99298488, 0.24903299], [ 0.10884186, 0.10139654, 0.22551564], [ 0.85666713, 0.76554597, 0.38383126]])
My main question: is there a more efficient approach that I could be using? I would ideally like to be able to remove the two for loops.
An ancillary question: I don't quite see why the test in the while loop above works fine, however, the following formation does not.
while (0<a<1).any():
ind=numpy.where(a<0) for aa in xrange(len(ind[0])): a[ind[0][aa],ind[1][aa]]=numpy.random.normal(0,1) ind=numpy.where(a>1) for aa in xrange(len(ind[0])): a[ind[0][aa],ind[1][aa]]=numpy.random.normal(0,1) print type(ind)
Traceback (most recent call last): File "<pyshell#71>", line 1, in <module> while (0<a<1).any(): ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Can someone clarify this?
Thanks,
Mark
On Monday 23 April 2007 10:37:57 Mark.Miller wrote:
Greetings:
In some of my code, I need to use large matrix of random numbers that meet specific criteria (i.e., some random numbers need to be removed and replaces with new ones).
I have been working with .any() and .where() to facilitate this process.
Have you tried nonzero() ?
a[a<0] = numpy.random.normal(0,1)
will put a random number from the normal distribution where your initial a is negative. No Python loops needed, no Python temps.
Traceback (most recent call last): File "<pyshell#71>", line 1, in <module> while (0<a<1).any():
The double condition (0<a<1) is not legit. You should try logical.and(a>0,a<1) or (a>0) & (a<1)
Note the () around each condition in case #2.
Oh, I pressed "send" too early. Just an addition: numpy.where creates a new array from some condition. If you only want to change elements of an existing array that satisfies a given condition, indexing is far more efficient: no temporary is created. Hence the suggestion of a[a<0]
Excellent suggestions...just a few comments:
Pierre GM wrote:
On Monday 23 April 2007 10:37:57 Mark.Miller wrote:
Greetings:
In some of my code, I need to use large matrix of random numbers that meet specific criteria (i.e., some random numbers need to be removed and replaces with new ones).
I have been working with .any() and .where() to facilitate this process.
Have you tried nonzero() ?
Nonzero isn't quite what I'm after, as the tests are more complicated than what I illustrated in my example.
a[a<0] = numpy.random.normal(0,1)
This is a neat construct that I didn't realize was possible. However, it has the undesirable (in my case) effect of placing a single new random number in each locations where a<0. While this could work, I ideally need a different random number chosen for each replaced value. Does that make sense?
will put a random number from the normal distribution where your initial a is negative. No Python loops needed, no Python temps.
Traceback (most recent call last): File "<pyshell#71>", line 1, in <module> while (0<a<1).any():
The double condition (0<a<1) is not legit. You should try logical.and(a>0,a<1) or (a>0) & (a<1)
Note the () around each condition in case #2.
This makes perfect sense. Thanks for pointing it out to me. It should easily do the trick.
Any and all additional suggestions are greatly appreciated,
Mark
Numpydiscussion mailing list Numpydiscussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpydiscussion
Mark.Miller wrote:
Pierre GM wrote:
a[a<0] = numpy.random.normal(0,1)
This is a neat construct that I didn't realize was possible. However, it has the undesirable (in my case) effect of placing a single new random number in each locations where a<0. While this could work, I ideally need a different random number chosen for each replaced value. Does that make sense?
Certainly. How about this?
mask = (a<0) a[mask] = numpy.random.normal(0, 1, size=mask.sum())
On 23/04/07, Pierre GM pgmdevlist@gmail.com wrote:
Have you tried nonzero() ?
a[a<0] = numpy.random.normal(0,1)
will put a random number from the normal distribution where your initial a is negative. No Python loops needed, no Python temps.
When you say "no python temps" I guess you mean, no temporary *variables*? If I understand correctly, this allocates a temporary boolean array to hold the result of "a<0".
The double condition (0<a<1) is not legit. You should try logical.and(a>0,a<1) or (a>0) & (a<1)
Note the () around each condition in case #2.
This is an unfortunate limitation that comes from the fact that we can't override the behaviour of python's logical operations. a<b<c does the right thing for python scalars, but it does it by being expanded to (approximately) "a<b and b<c", and "and" doesn't do the right thing for arrays. The best we can do is override the bitwise operators for boolean arrays. This is a shame as I often want to select array elements that fall into a given range, and creating three temporary arrays instead of one is unpleasant.
Anne
Have you tried nonzero() ?
Nonzero isn't quite what I'm after, as the tests are more complicated than what I illustrated in my example.
Tests such as (a<0)&(b>1) will give you arrays of booleans. The nonzero give you where the two conditions are met (viz, where the results is True, or 1)
a[a<0] = numpy.random.normal(0,1)
This is a neat construct that I didn't realize was possible. However, it has the undesirable (in my case) effect of placing a single new random number in each locations where a<0. While this could work, I ideally need a different random number chosen for each replaced value. Does that make sense?
Completely. So what about the slightly more complicated: test = a<0 if test .any() a[test] = numpy.random.normal(0,1,size=len(test.nonzero()[0]))
test.nonzero() outputs a tuple of indices. test.nonzero()[0] gives you the indices along the first axis, len(test.nonzero()[0]) the number of elements where the condition is met. You could also get the same result with something like test = a<0 if test .any() a[test] = numpy.random.normal(0,1,size=test.sum())
Actually, the following simpler solution works as well:
a[a<0] = numpy.random.normal(0,1,size=a.size)
When you say "no python temps" I guess you mean, no temporary *variables*? If I understand correctly, this allocates a temporary boolean array to hold the result of "a<0".
Indeed, hence my precising "no *python* temps". There still be a tmp created at one point or another (if I'm not mistaken)
This is an unfortunate limitation that comes from the fact that we can't override the behaviour of python's logical operations. a<b<c does the right thing for python scalars, but it does it by being expanded to (approximately) "a<b and b<c", and "and" doesn't do the right thing for arrays.
Note that in addition of the bitwise operators, you can use the "logical_" functions. OK, you'll still end up w/ temporaries, but I wonder whether there couldn't be some tricks to bypass that...
On 23/04/07, Pierre GM pgmdevlist@gmail.com wrote:
Note that in addition of the bitwise operators, you can use the "logical_" functions. OK, you'll still end up w/ temporaries, but I wonder whether there couldn't be some tricks to bypass that...
If you're really determined not to make many temps, you can use their output arguments and do it all inplace on the first temporary. The few times I've rewritten code that way it hasn't made an appreciable positive difference in the speed, and it was sometimes significantly slower, perhaps because of the increased memory footprint (the temps were longerlived than when I did it by hand). malloc() is really really cheap, and garbage collection is also extremely cheap for huge arrays.
Anne
El dl 23 de 04 del 2007 a les 12:47 0400, en/na Anne Archibald va escriure:
On 23/04/07, Pierre GM pgmdevlist@gmail.com wrote:
Note that in addition of the bitwise operators, you can use the "logical_" functions. OK, you'll still end up w/ temporaries, but I wonder whether there couldn't be some tricks to bypass that...
If you're really determined not to make many temps, you can use their output arguments and do it all inplace on the first temporary. The few times I've rewritten code that way it hasn't made an appreciable positive difference in the speed, and it was sometimes significantly slower, perhaps because of the increased memory footprint (the temps were longerlived than when I did it by hand). malloc() is really really cheap, and garbage collection is also extremely cheap for huge arrays.
Or you can use numexpr. Numexpr doesn't need temporaries (provided that all the arrays in the expresion are contiguous). It only uses a small buffer for carrying out computations whose size is, when compared with large matrices, negligible. It is also usally faster than the pure NumPy approach:
# Pure NumPy
Timer("c=(i>2)  ((f**2>3) & ~(i*f<2))", "import numpy;
i=numpy.arange(10**6); f=numpy.random.rand(10**6)").repeat(3, 10) [0.55586099624633789, 0.55565214157104492, 0.556427001953125]
# Using Numexpr
Timer("c=numexpr.evaluate('(i>2)  ((f**2>3) & ~(i*f<2))')", "import
tables; import tables.numexpr as numexpr; import numpy; i=numpy.arange(10**6); f=numpy.random.rand(10**6)").repeat(3, 10) [0.25569510459899902, 0.25547599792480469, 0.25554895401000977]
I've used here the numexpr instance that comes with PyTables, but you can use also the one in the scipy's sandbox as it also supports booleans since some months ago.
Cheers,
participants (5)

Anne Archibald

Francesc Altet

Mark.Miller

Pierre GM

Robert Kern