[Numpy-discussion] Decision tree-like algorithm on numpy arrays

Thu May 6 11:07:34 EDT 2010

Hi Martin,

On Thursday 06 May 2010 08:50:33 Martin Raspaud wrote:
> Hi all,
> 
> I have an old c-extension I want to remove from my code to the benefit of
>  numpy, but it looks kind of tricky to me.
> 
> Here is the thing:
> I have a number of arrays of the same shape.
> On these arrays, I run a sequence of tests, leading to a kind of decision
>  tree. In the end, based on these tests, I get a number of result arrays
>  where, based on the tests, each element gets a value.
> 
> The way to do this in an efficient way with numpy is quite unclear to me.
> My first thought would be:
> 
> result_array1 = np.where(some_test_on(array1),
>                          np.where(some_test_on(array2),
>                                   1,
>                                   2),
>                          np.where(some_test_on(array3, array4),
>                                   np.where(some_test_on(array5),
>                                            3,
>                                            4),
>                                   4))
> 
> result_array2 = np.where(some_test_on(array1),
>                          np.where(some_test_on(array2),
>                                   True,
>                                   True),
>                          np.where(some_test_on(array3, array4),
>                                   np.where(some_test_on(array5),
>                                            True,
>                                            False),
>                                   True))
> 
> etc... but that means running the same tests several times, which is not
> acceptable if the tests are lengthy.

I'd say the performance problem is less about running the same tests several
times and more about how NumPy deals with temporaries (i.e. it is a memory
access problem).  You may want to try numexpr in order to speed up this sort
of computation.  For example, the following code:

#------------------------------------------------------------------------
import numpy as np
import numexpr as ne
# if you have PyTables installed but not numexpr, try this instead
#from tables import numexpr as ne
from time import time

N = int(1e7)  # the array size must be an integer, not a float

array1 = np.random.random(N)
array2 = np.random.random(N)
array3 = np.random.random(N)
array4 = np.random.random(N)
array5 = np.random.random(N)

# Plain NumPy: every comparison and every where() builds a full-size temporary
t0 = time()
result_array1 = np.where(array1 > 0.5,
                         np.where(array2 < 0.5, 1, 2),
                         np.where((array3 > .2) & (array4 < .1),
                                  np.where(array5 >= .1, 3, 4),
                                  4))
t = round(time() - t0, 3)
print("result_array1:", result_array1, t)

# numexpr: evaluates the expression in chunks, avoiding full-size temporaries
t0 = time()
result_array2 = ne.evaluate("""where(array1 > 0.5,
                                     where(array2 < 0.5, 1, 2),
                                     where((array3 > .2) & (array4 < .1),
                                           where(array5 >= .1, 3, 4),
                                           4))""")
t = round(time() - t0, 3)
print("result_array2:", result_array2, t)
assert np.allclose(result_array1, result_array2)
#------------------------------------------------------------------------

produces this output on my machine (the last number on each line is the
elapsed time in seconds):

result_array1: [4 2 4 ..., 1 3 4] 1.819
result_array2: [4 2 4 ..., 1 3 4] 0.308

which is roughly a 6x speed-up.  I suppose this should be pretty close to what
you can get with C.
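
As for running the same tests several times: one option is to evaluate each
test once, keep the boolean masks, and build every result array from them.
Below is a minimal sketch of that idea.  It assumes the thresholds from the
example above stand in for your real tests; the mask names (t1, t2, t34, t5)
and the small array size are made up for illustration, and it says nothing
about how this compares in speed with your C extension.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # small size, for illustration only
array1, array2, array3, array4, array5 = rng.random((5, n))

# Evaluate each test exactly once and keep the boolean masks
# (the mask names are hypothetical).
t1 = array1 > 0.5
t2 = array2 < 0.5
t34 = (array3 > 0.2) & (array4 < 0.1)
t5 = array5 >= 0.1

# First result: same decision tree as the nested np.where() calls.
# In the "not t1" branch the value is 3 only when both t34 and t5 hold.
result_array1 = np.where(t1,
                         np.where(t2, 1, 2),
                         np.where(t34 & t5, 3, 4))

# Second result: False only on the leaf reached by
# (not t1) and t34 and (not t5); True on every other leaf.
result_array2 = ~(~t1 & t34 & ~t5)
```

The masks are one byte per element, so reusing them is cheap compared with
re-running a lengthy test for every result array.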

-- 
Francesc Alted