[Numpy-discussion] How to set array values based on a condition?

Sun Mar 23 17:55:38 EDT 2008

Damian Eads wrote:
> Anne Archibald wrote:
>> On 23/03/2008, Damian Eads <eads at soe.ucsc.edu> wrote:
>>> Hi,
>>>
>>>  I am working on a memory-intensive experiment with very large arrays so
>>>  I must be careful when allocating memory. Numpy already supports a
>>>  number of in-place operations (+=, *=) making the task much more
>>>  manageable. However, it is not obvious to me out I set values based on a
>>>  very simple condition.
>>>
>>>  The expression
>>>
>>>    y[y<0]=-1
>>>
>>>  generates a binary index mask y>=0 of the same size as the array y,
>>>  which is problematic when y is quite large.
>>>
>>>  I was wondering if there was anything like a set_where(A, cmp, B,
>>>  setval, [optional elseval]) function where cmp would be a comparison
>>>  operator expressed as a string.
>>>
>>>  The code below illustrates what I want to do. Admittedly, it needs to be
>>>  cleaned up but it's a proof of concept. Does numpy provide any functions
>>>  that support the functionality of the code below?
>> That's a good question, but I'm pretty sure it doesn't, apart from
>> numpy.clip(). The way I'd try to solve that problem would be with the
>> dreaded for loop. Don't iterate over single elements, but if you have
>> a gargantuan array, working in chunks of ten thousand (or whatever)
>> won't have too much overhead:
>>
>> block = 100000
>> for n in arange(0,len(y),block):
>>     yc = y[n:n+block]
>>     yc[yc<0] = -1
>>
>> It's a bit of a pain, but working with arrays that nearly fill RAM
>> *is* a pain, as I'm sure you are all too aware by now.
>>
>> You might look into numexpr, this is the sort of thing it does (though
>> I've never used it and can't say whether it can do this).
>>
>> Anne
>> _______________________________________________
>> Numpy-discussion mailing list
>> Numpy-discussion at scipy.org
>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
> 
> Hi Anne,
> 
> Since the thing I want to do is a common case, I figured that if I were 
> to take the blocked-based approach, I'd write a helper function to do 
> the blocking for me. Here it is:
> 
> import numpy
> import types
> 
> def block_cond(*args):
>      """
>      block_cond(X1, ..., XN, cond_fun, val_fun, [else_fun])
> 
>      Breaks the 1-D arrays X1 to XN into properly aligned chunks. The
>      cond_fun is a function that takes in the chunks of each array
>      returns an index or mask array. For each chunk c
> 
>         C=cond_fun(X1[c], ..., XN[c])
> 
>      The val_fun takes the masked or indexed chunks, and returns the
>      values each element should be set to
> 
>         V=cond_fun(X1[c][C], ..., XN[c][C])
> 
>      Finally, the first array's elements
> 
>         X1[c][C]=V
>      """
>      blksize = 100000
>      if len(args) < 3:
>          raise ValueError("Nothing to do.")
> 
>      if type(args[-3]) == types.FunctionType:
>          elsefn = args[-1]
>          valfn = args[-2]
>          condfn = args[-3]
>          qargs = args[:-3]
>      else:
>          elsefn = None
>          valfn = args[-1]
>          condfn = args[-2]
>          qargs = args[:-2]
> 
>      # Grab the length of the first array.
>      num = qargs[0].size
>      shp = qargs[0].shape
> 
>      # Check that rest of the arguments are all arrays of the same size.
>      for i in xrange(0, len(qargs)):
>          if type(qargs[i]) != _array_type:
>              raise TypeError("Argument %i must be an array." % i)
>          if qargs[i].size != num:
>              raise TypeError("Array argument %i differs in size from the 
> previous arrays." % i)
>          if qargs[i].shape != shp:
>              raise TypeError("Array argument %i differs in shape from 
> the previous arrays." % i)
> 
>      for a in xrange(0, num, blksize):
>          b = min(a + blksize, num)
>          fargs = [qarg[a:b] for qarg in qargs]
>          c = apply(condfn, fargs)
>          #print c
>          v = apply(valfn, [farg[c] for farg in fargs])
>          #print v
>          slc = qargs[0][a:b]
>          slc[c] = v
>          if elsefn is not None:
>              ev = apply(elsefn, [numpy.array(arg[a:b])[~c] for arg in 
> qargs])
>              slc[~c] = ev
> 
> -----------------------------
> 
> Let's try running it,
> 
> In [96]: y=numpy.random.rand(10000000)
> 
> In [97]: x=y.copy()
> 
> In [98]: %time x[:] = x<=0.5
> CPU times: user 0.39 s, sys: 0.01 s, total: 0.40 s
> Wall time: 0.66 s
> 
> In [100]: %time setwhere.block_cond(y, lambda y: y <= 0.5, lambda y: 1, 
> lambda y: 0)
> CPU times: user 1.70 s, sys: 0.10 s, total: 1.80 s
> Wall time: 2.28 s
> 
> The inefficient copying approach is almost 4 times faster than the 
> blocking approach. Ideas about what I'm doing wrong?
> 
> Would others find a proper C-based numpy implementation of the set_where 
> function useful? I'd offer to implement it.
> 
> Damian

If I try it with the scipy.weave implementation I showed in my first 
posting of this thread, I get a factor of 3 speed up over the 
memory-inefficient copy approach and a factor of 10 speed up over the 
block-based approach.

In [105]: y=numpy.random.rand(10000000)

In [106]: %time setwhere.set_where(y, "<=", 0.5, 1, 0)
CPU times: user 0.15 s, sys: 0.00 s, total: 0.15 s
Wall time: 0.21 s

This suggests a C implementation might be worth it.

Damian