[Numpy-discussion] Decision tree-like algorithm on numpy arrays
Francesc Alted
faltet at pytables.org
Thu May 6 11:07:34 EDT 2010
Hi Martin,
A Thursday 06 May 2010 08:50:33 Martin Raspaud escrigué:
> Hi all,
>
> I have an old c-extension I want to remove from my code to the benefit of
> numpy, but it looks kind of tricky to me.
>
> Here is the thing:
> I have a number of arrays of the same shape.
> On these arrays, I run a sequence of tests, leading to a kind of decision
> tree. In the end, based on these tests, I get a number of result arrays
> where, based on the tests, each element gets a value.
>
> The way to do this in an efficient way with numpy is quite unclear to me.
> My first thought would be:
>
> result_array1 = np.where(some_test_on(array1),
> np.where(some_test_on(array2),
> 1,
> 2),
> np.where(some_test_on(array3, array4),
> np.where(some_test_on(array5),
> 3,
> 4),
> 4))
>
> result_array2 = np.where(some_test_on(array1),
> np.where(some_test_on(array2),
> True,
> True),
> np.where(some_test_on(array3, array4),
> np.where(some_test_on(array5),
> True,
> False),
> True))
>
> etc... but that means running the same tests several times, which is not
> acceptable if the tests are lengthy.
The problem with performance, rather than being running the same tests several
times, I'd say that it is more how NumPy deals with temporaries (i.e. it is a
memory access problem). You may want to try numexpr in order to speed-up this
sort of computations. For example, the next code:
#------------------------------------------------------------------------
import numpy as np
import numexpr as ne
# if you don't have numexpr installed, but PyTables, try this instead
#from tables import numexpr as ne
from time import time
N = 1e7
array1 = np.random.random(N)
array2 = np.random.random(N)
array3 = np.random.random(N)
array4 = np.random.random(N)
array5 = np.random.random(N)
t0 = time()
result_array1 = np.where(array1 > 0.5,
np.where(array2 < 0.5, 1, 2),
np.where(((array3 >.2) & (array4 < .1)),
np.where(array5 >= .1, 3, 4), 4))
t = round(time() - t0, 3)
print "result_array1:", result_array1, t
t0 = time()
result_array2 = ne.evaluate("""where(array1 > 0.5,
where(array2 < 0.5, 1, 2),
where(((array3 >.2) & (array4 < .1)),
where(array5 >= .1, 3, 4), 4))""")
t = round(time() - t0, 3)
print "result_array2:", result_array2, t
assert np.allclose(result_array1, result_array2)
#------------------------------------------------------------------------
and the output for my machine:
result_array1: [4 2 4 ..., 1 3 4] 1.819
result_array2: [4 2 4 ..., 1 3 4] 0.308
which is a 6x speed-up. I suppose this should be pretty close of what you can
get with C.
--
Francesc Alted
More information about the NumPy-Discussion
mailing list