Decision tree-like algorithm on numpy arrays

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Hi all,
I have an old c-extension I want to remove from my code to the benefit of numpy, but it looks kind of tricky to me.
Here is the thing: I have a number of arrays of the same shape. On these arrays, I run a sequence of tests, leading to a kind of decision tree. In the end, based on these tests, I get a number of result arrays where, based on the tests, each element gets a value.
The way to do this in an efficient way with numpy is quite unclear to me. My first thought would be:
result_array1 = np.where(some_test_on(array1), np.where(some_test_on(array2), 1, 2), np.where(some_test_on(array3, array4), np.where(some_test_on(array5), 3, 4), 4))
result_array2 = np.where(some_test_on(array1), np.where(some_test_on(array2), True, True), np.where(some_test_on(array3, array4), np.where(some_test_on(array5), True, False), True))
etc... but that means running the same tests several times, which is not acceptable if the tests are lengthy.
In order to avoid this problem I could also have some mask based on each test: mask1 = some_test_on(array1) mask2 = some_test_on(array2[mask1]) mask3 = some_test_on(array3[!mask1], array4[!mask1]) mask4 = some_test_on(array5[!mask1][mask3])
result_array1[mask1][mask2] = 1 result_array1[mask1][!mask2] = 2 result_array1[!mask1][mask3][mask4] = 3 result_array1[!mask1][mask3][!mask4] = 4 result_array1[!mask1][!mask3] = 4
result_array2[mask1][mask2] = True result_array2[mask1][!mask2] = True result_array2[!mask1][mask3][mask4] = True result_array2[!mask1][mask3][!mask4] = False result_array2[!mask1][!mask3] = True
etc... but that looks a bit clumsy to me...
The way it was done in the C-extension was to run the decision tree on each element sequentially, but I have the feeling that would not be very efficient with numpy (although I know I can't beat pure C code, I would like to have comparable times).
Does any of you wise people have an opinion on this ?
Thanks, Martin

Hi Martin,
A Thursday 06 May 2010 08:50:33 Martin Raspaud escrigué:
Hi all,
I have an old c-extension I want to remove from my code to the benefit of numpy, but it looks kind of tricky to me.
Here is the thing: I have a number of arrays of the same shape. On these arrays, I run a sequence of tests, leading to a kind of decision tree. In the end, based on these tests, I get a number of result arrays where, based on the tests, each element gets a value.
The way to do this in an efficient way with numpy is quite unclear to me. My first thought would be:
result_array1 = np.where(some_test_on(array1), np.where(some_test_on(array2), 1, 2), np.where(some_test_on(array3, array4), np.where(some_test_on(array5), 3, 4), 4))
result_array2 = np.where(some_test_on(array1), np.where(some_test_on(array2), True, True), np.where(some_test_on(array3, array4), np.where(some_test_on(array5), True, False), True))
etc... but that means running the same tests several times, which is not acceptable if the tests are lengthy.
The problem with performance, rather than being running the same tests several times, I'd say that it is more how NumPy deals with temporaries (i.e. it is a memory access problem). You may want to try numexpr in order to speed-up this sort of computations. For example, the next code:
#------------------------------------------------------------------------ import numpy as np import numexpr as ne # if you don't have numexpr installed, but PyTables, try this instead #from tables import numexpr as ne from time import time
N = 1e7
array1 = np.random.random(N) array2 = np.random.random(N) array3 = np.random.random(N) array4 = np.random.random(N) array5 = np.random.random(N)
t0 = time() result_array1 = np.where(array1 > 0.5, np.where(array2 < 0.5, 1, 2), np.where(((array3 >.2) & (array4 < .1)), np.where(array5 >= .1, 3, 4), 4)) t = round(time() - t0, 3) print "result_array1:", result_array1, t
t0 = time() result_array2 = ne.evaluate("""where(array1 > 0.5, where(array2 < 0.5, 1, 2), where(((array3 >.2) & (array4 < .1)), where(array5 >= .1, 3, 4), 4))""") t = round(time() - t0, 3) print "result_array2:", result_array2, t assert np.allclose(result_array1, result_array2) #------------------------------------------------------------------------
and the output for my machine:
result_array1: [4 2 4 ..., 1 3 4] 1.819 result_array2: [4 2 4 ..., 1 3 4] 0.308
which is a 6x speed-up. I suppose this should be pretty close of what you can get with C.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Francesc Alted skrev:
Hi Martin,
[...]
and the output for my machine:
result_array1: [4 2 4 ..., 1 3 4] 1.819 result_array2: [4 2 4 ..., 1 3 4] 0.308
which is a 6x speed-up. I suppose this should be pretty close of what you can get with C.
Hi Francesc, Thanks a lot for the idea ! This looks nice, I wasn't aware of numexpr.
I guess it's no problem to run "evaluate" on user defined functions ?
Thanks, Martin

A Friday 07 May 2010 08:18:44 Martin Raspaud escrigué:
Francesc Alted skrev:
Hi Martin,
[...]
and the output for my machine:
result_array1: [4 2 4 ..., 1 3 4] 1.819 result_array2: [4 2 4 ..., 1 3 4] 0.308
which is a 6x speed-up. I suppose this should be pretty close of what you can get with C.
Hi Francesc, Thanks a lot for the idea ! This looks nice, I wasn't aware of numexpr.
I guess it's no problem to run "evaluate" on user defined functions ?
No problem. You only have to express your functions in terms of numexpr expressions. Look at this simple example:
#------------------------------------------------------------------ import numpy as np import numexpr as ne
N = 1e7 array1 = np.random.random(N) array2 = np.random.random(N) array3 = np.random.random(N)
# An user-defined function def some_test_on(arr, value): return arr > value
result_array1 = np.where(some_test_on(array1, 0.6), np.where(some_test_on(array2, 0.5), 1, 2), np.where(some_test_on(array3, 0.3), 3, 4))
print "result_array1:", result_array1
# The same user-defined function than above, # but return a numexpr expression instead def ne_some_test_on(arr, value): return "(" + arr + " > %s" % value + ")"
expr1 = "where("+ne_some_test_on("array2", 0.5) + ", 1, 2)" expr2 = "where("+ne_some_test_on("array3", 0.3) + ", 3, 4)" expr = "where("+ne_some_test_on("array1", 0.6) + ", %s, %s)" % (expr1, expr2)
result_array2 = ne.evaluate(expr) print "result_array2:", result_array2
assert np.allclose(result_array1, result_array2) #------------------------------------------------------------------
and the output:
result_array1: [2 3 4 ..., 3 1 3] result_array2: [2 3 4 ..., 3 1 3]
participants (2)
-
Francesc Alted
-
Martin Raspaud