Is there any interest in a nan-ignoring version of einsum a la nansum, nanprod, etc? Any idea how difficult it would be to implement? - Chris
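[For context: the existing nan* reductions work by substituting a neutral value in a copy before reducing, so a nan-ignoring einsum could be emulated the same way. Below is a minimal sketch of that idea; `naneinsum_sketch` is a hypothetical helper name, not an existing NumPy function, and it only covers the sum-of-products case where replacing NaN with 0 is safe.]

```python
import numpy as np

def naneinsum_sketch(subscripts, *operands):
    """Hypothetical nan-ignoring einsum: zero-fill NaNs, then call einsum.

    Like np.nansum, this substitutes values in a copy rather than handling
    NaNs at a lower level, so any term containing a NaN simply contributes
    0 to each sum-of-products.
    """
    cleaned = [np.where(np.isnan(op), 0.0, op) for op in operands]
    return np.einsum(subscripts, *cleaned)

# Example: a dot product that skips positions where either input is NaN.
a = np.array([1.0, np.nan, 3.0])
b = np.array([2.0, 5.0, np.nan])
print(naneinsum_sketch("i,i->", a, b))  # 2.0 -- only the first pair contributes
```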
Hi,

Looks like I was a little confused. It appears that the nan* versions of functions in numpy just substitute the NaNs in a copy of the original array, so they are really just convenience methods. I was imagining that they were optimized and handled the NaNs at a lower level. It looks like the "bottleneck" package tries to do this for nansum, nanprod, etc., but I don't know whether it is able to take advantage of SSE or not.

Anyway, maybe if I elaborate someone can offer suggestions. My application is: given a large N-dimensional array and a selected (N-1)-dimensional slice of it, compute the Pearson correlation of that slice against every other (N-1)-dimensional slice of the array, omitting NaNs from the calculation.

Ignoring NaNs for a moment, here is the slow and obvious way to do it:

```python
from scipy.stats import pearsonr
import numpy as np

def corrs(data, index, dim):
    seed = data.take(index, dim)
    res = np.zeros(data.shape[dim])
    for i in range(data.shape[dim]):
        res[i] = pearsonr(seed, data.take(i, dim))[0]
    return res
```

Doing all the math by hand and using einsum, there is an extremely fast (though fairly cryptic) way of doing this:

```python
def corrs2(data, index, axis):
    seed = data.take([index], axis=axis)
    sdims = list(range(seed.ndim))
    ddims = list(range(data.ndim))
    sample_axes = tuple(i for i in ddims if i != axis)

    seed_mean = np.einsum(seed, sdims, []) / seed.size
    data_mean = np.einsum(data, ddims, [axis]) / seed.size
    # restore the dimensions collapsed by einsum so data_mean broadcasts
    data_mean.shape = tuple(data.shape[i] if i == axis else 1 for i in ddims)

    seed_dev = np.einsum(seed - seed_mean, sdims, sdims)

    numerator = np.einsum(seed_dev, ddims, data, ddims, [axis])
    numerator -= np.einsum(seed_dev, ddims, data_mean, ddims, [axis])

    denominator = np.einsum(data, ddims, data, ddims, [axis])
    denominator += -2.0 * np.einsum(data, ddims, data_mean, ddims, [axis])
    denominator += np.sum(data_mean**2, axis=sample_axes) * seed.size
    denominator *= np.einsum(seed_dev**2, sdims, [axis])
    denominator = np.sqrt(denominator)

    return np.clip(numerator / denominator, -1.0, 1.0)
```

It also doesn't need to make a copy of the array.

Re-introducing the requirement to handle NaNs, though, I couldn't find any option besides making a mask array and introducing it explicitly into the calculations. That's why I was imagining an optimized "naneinsum." Are there any existing ways of doing sums and products on numpy arrays that have a fast way of handling NaNs? Or is a mask array the best thing to hope for?

I think that for this problem I could transpose and reshape the N-dimensional array down to two dimensions without making an array copy, which might make it easier to interface with an optimizing package that doesn't fully support multidimensional arrays.

I am fairly new to making numpy/python fast, and coming from MATLAB I am very impressed, though there is a bewildering number of options when it comes to trying to optimize.

Thanks,
Chris

On Tue, Apr 19, 2016 at 11:12 AM, Chris Barber <barberchris01@gmail.com> wrote:
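[On the mask-array question above: one explicit-mask formulation, reduced to the 2-D case with one (N-1)-dimensional slice flattened per row, might look like the sketch below. `nan_corrs_2d` is an illustrative name, not an existing routine; it uses pairwise deletion, i.e. each correlation counts only positions where both the seed row and the other row are non-NaN.]

```python
import numpy as np

def nan_corrs_2d(data2d, index):
    """Pearson correlation of row `index` against every row of `data2d`,
    counting only positions where both values are non-NaN."""
    seed = data2d[index]                            # shape (M,)
    valid = ~np.isnan(data2d) & ~np.isnan(seed)     # (N, M) mask per pair
    x = np.where(valid, data2d, 0.0)                # zero-filled rows
    s = np.where(valid, seed, 0.0)                  # seed, masked per row
    n = valid.sum(axis=1)                           # valid pairs per row

    sx, ss = x.sum(axis=1), s.sum(axis=1)
    cov = (x * s).sum(axis=1) - sx * ss / n
    var_x = (x * x).sum(axis=1) - sx**2 / n
    var_s = (s * s).sum(axis=1) - ss**2 / n

    # Rows with no overlap (n == 0) or zero variance come out as NaN/inf here.
    return np.clip(cov / np.sqrt(var_x * var_s), -1.0, 1.0)
```

[The element-wise products could equally be written as einsum calls; the point is only that the mask has to enter both the sums and the per-pair counts.]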
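[And on the transpose-and-reshape idea: np.moveaxis returns a view, but the subsequent reshape is only copy-free when the resulting memory layout is compatible (e.g. moving axis 0 of a C-contiguous array); otherwise NumPy quietly copies at the reshape step. A small sketch, where `slices_as_rows` is just an illustrative helper name:]

```python
import numpy as np

def slices_as_rows(data, axis):
    """Collapse an N-d array to 2-D with one (N-1)-d slice per row.

    np.moveaxis returns a view; the reshape may still make a copy if the
    moved layout is not contiguous.
    """
    moved = np.moveaxis(data, axis, 0)
    return moved.reshape(data.shape[axis], -1)

# Example shapes:
# data = np.random.rand(4, 5, 6)
# rows = slices_as_rows(data, 1)   # shape (5, 24), one row per slice along axis 1
```

[A 2-D layout like this would feed directly into a masked routine such as the one sketched above, or into a package that handles NaNs quickly but does not fully support multidimensional arrays.]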