[Numpy-discussion] Compare NumPy arrays with threshold and return the differences

Wed May 17 13:35:34 EDT 2017

On 5/17/17 10:50 AM, Nissim Derdiger wrote:
> Hi,
> In my script, I need to compare big NumPy arrays (2D or 3D), and return
> a list of all cells with difference bigger than a defined threshold.
> The comparison itself can easily be done with the "allclose" function,
> like this:
> Threshold = 0.1
> if (np.allclose(Arr1, Arr2, Threshold, equal_nan=True)):
>     print('Same')
> But this comparison does not return *_which_* cells are not the same.
>  
> The easiest (yet naive) way to know which cells are not the same is to
> use simple for loops, like this:
> def CheckWhichCellsAreNotEqualInArrays(Arr1,Arr2,Threshold):
>    if not Arr1.shape == Arr2.shape:
>        return ['Arrays size not the same']

I think you have been exposed to too much Matlab :-) Why the [] around
the string? The pythonic way to react to unexpected conditions is to
raise an exception:

         raise ValueError('arrays size not the same')
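
A minimal sketch of how that might look in practice. The helper name check_same_shape and the wording of the message are only illustrative, not taken from the original script:

```python
import numpy as np

def check_same_shape(arr1, arr2):
    """Raise ValueError if the two arrays do not have the same shape.

    Hypothetical helper, shown only to illustrate the exception-based style.
    """
    if arr1.shape != arr2.shape:
        raise ValueError(
            'array shapes differ: {} vs {}'.format(arr1.shape, arr2.shape))

# The caller decides what to do about a mismatch.
try:
    check_same_shape(np.zeros((2, 3)), np.zeros((3, 2)))
except ValueError as err:
    print('comparison skipped:', err)
```

With this style the function's normal return value is always the list of differences, and the caller never has to inspect it for an error string.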

>    Dimensions = Arr1.shape 
>    Diff = []
>    for i in range(Dimensions [0]):
>        for j in range(Dimensions [1]):
>            if not np.allclose(Arr1[i][j], Arr2[i][j], Threshold,
> equal_nan=True):
>                Diff.append(',' + str(i) + ',' + str(j) + ',' +
> str(Arr1[i,j]) + ','
>                + str(Arr2[i,j]) + ',' + str(Threshold) + ',Fail\n')

Here you are also doing something very unusual. Why do you concatenate
all those strings? It would be more efficient to return the indices of
the array elements that meet the condition and print them out in a
second step.

>    return Diff
> (and same for 3D arrays - with 1 more for loop)
> This way is very slow when the arrays are big and full of non-equal cells.
>  
> Is there a fast, straightforward way, in case they are not the same, to
> get a list of the unequal cells? Maybe some built-in function in
> NumPy itself?

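import numpy as np
threshold = 0.1
a = np.random.randn(100, 100)
b = np.random.randn(100, 100)

# indices of the elements whose absolute difference exceeds the threshold
ids = np.nonzero(np.abs(a - b) > threshold)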

gives you a tuple of index arrays, one per dimension, locating the
element pairs that satisfy your condition.  If you want to print them:

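matcha = a[ids]
matchb = b[ids]

# one row of indices per matching element, shape (n_matches, ndim)
idt = np.vstack(ids).T

# iterate over idt, not ids: ids holds one index array per dimension
for i, ai, bi in zip(idt, matcha, matchb):
    c = ','.join(str(x) for x in i)
    print('{},{},{},{},Fail'.format(c, ai, bi, threshold))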

works for 2D and 3D (or nD) arrays.
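
One caveat: any comparison with NaN is False, so np.abs(a - b) > threshold never reports a pair that contains a NaN, even when only one of the two values is NaN. If the equal_nan=True behaviour of the original allclose call matters, a variant based on np.isclose may be closer to what is wanted. The sketch below assumes the threshold is meant as an absolute tolerance (note that the third positional argument of np.allclose is rtol, the relative tolerance); the small arrays are made up for illustration:

```python
import numpy as np

threshold = 0.1
a = np.array([[1.0, np.nan], [3.0, np.nan]])
b = np.array([[1.05, np.nan], [3.5, 4.0]])

# rtol=0 makes this a purely absolute test: |a - b| <= threshold.
# equal_nan=True treats NaN at the same position in both arrays as equal,
# while NaN against a number still counts as a mismatch.
mismatch = ~np.isclose(a, b, rtol=0, atol=threshold, equal_nan=True)

ids = np.nonzero(mismatch)
idt = np.vstack(ids).T  # one row of indices per mismatching element

for i, ai, bi in zip(idt, a[ids], b[ids]):
    c = ','.join(str(x) for x in i)
    print('{},{},{},{},Fail'.format(c, ai, bi, threshold))
```

Here only the elements at (1, 0) and (1, 1) are reported: the first differs by 0.5, the second pairs a NaN with a number.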

However, if you have many elements matching your condition this is going
to be slow and not very useful to look at. Maybe you can think about a
different way to visualize this result.
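
One possibility, sketched under the assumption that a short summary is enough and that the arrays contain no NaN values, is to report how many elements differ and where the largest difference is, instead of listing every pair. The function name summarize_differences and the keys of the returned dict are hypothetical:

```python
import numpy as np

def summarize_differences(a, b, threshold):
    """Summarize where |a - b| exceeds threshold (illustrative helper).

    Assumes a and b have the same shape and contain no NaN values.
    """
    diff = np.abs(a - b)
    mask = diff > threshold
    # index of the single largest absolute difference, as an nD tuple
    worst = np.unravel_index(np.argmax(diff), diff.shape)
    return {
        'count': int(mask.sum()),        # number of differing elements
        'fraction': float(mask.mean()),  # share of differing elements
        'max_diff': float(diff[worst]),
        'max_index': tuple(int(i) for i in worst),
    }

a = np.zeros((2, 3))
b = np.array([[0.0, 0.2, 0.0], [0.0, 0.0, 0.5]])
print(summarize_differences(a, b, 0.1))
```

The boolean mask itself can also be kept and inspected or plotted later, which scales better than printing one line per element.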

Cheers,
Dan