Is there any interest in a nan-ignoring version of einsum a la nansum, nanprod, etc? Any idea how difficult it would be to implement? - Chris
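[For context: the existing nan* reductions work by substituting a neutral value in a copy before reducing, so a nan-ignoring einsum could be emulated the same way. Below is a minimal sketch of that idea; `naneinsum_sketch` is a hypothetical helper name, not an existing NumPy function, and it only covers the sum-of-products case where replacing NaN with 0 is safe.]

```python
import numpy as np

def naneinsum_sketch(subscripts, *operands):
    """Hypothetical nan-ignoring einsum: zero-fill NaNs, then call einsum.

    Like np.nansum, this substitutes values in a copy rather than handling
    NaNs at a lower level, so any term containing a NaN simply contributes
    0 to each sum-of-products.
    """
    cleaned = [np.where(np.isnan(op), 0.0, op) for op in operands]
    return np.einsum(subscripts, *cleaned)

# Example: a dot product that skips positions where either input is NaN.
a = np.array([1.0, np.nan, 3.0])
b = np.array([2.0, 5.0, np.nan])
print(naneinsum_sketch("i,i->", a, b))  # 2.0 -- only the first pair contributes
```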
Hi,

Looks like I was a little confused. It appears that the nan* versions of functions in numpy just substitute the NaNs in a copy of the original array, so they are really just convenience methods. I was imagining that they were optimized and handled the NaNs at a lower level. It looks like the "bottleneck" package tries to do this for nansum, nanprod, etc., but I don't know whether it is able to take advantage of SSE or not.

Anyway, maybe if I elaborate someone can offer suggestions. My application is: given a large N-dimensional array and a selected (N-1)-dimensional slice of it, compute the Pearson correlation of that slice against every other (N-1)-dimensional slice of the array, omitting NaNs from the calculation.

Ignoring NaNs for a moment, here is the slow and obvious way to do it:

```python
from scipy.stats import pearsonr
import numpy as np

def corrs(data, index, dim):
    seed = data.take(index, dim)
    res = np.zeros(data.shape[dim])
    for i in range(data.shape[dim]):
        res[i] = pearsonr(seed, data.take(i, dim))[0]
    return res
```

Doing all the math by hand and using einsum, there is an extremely fast (though fairly cryptic) way of doing this:

```python
def corrs2(data, index, axis):
    seed = data.take([index], axis=axis)
    sdims = list(range(seed.ndim))
    ddims = list(range(data.ndim))
    sample_axes = tuple(i for i in ddims if i != axis)

    seed_mean = np.einsum(seed, sdims, []) / seed.size
    data_mean = np.einsum(data, ddims, [axis]) / seed.size
    # restore the dimensions collapsed by einsum so data_mean broadcasts
    data_mean.shape = tuple(data.shape[i] if i == axis else 1 for i in ddims)

    seed_dev = np.einsum(seed - seed_mean, sdims, sdims)

    numerator = np.einsum(seed_dev, ddims, data, ddims, [axis])
    numerator -= np.einsum(seed_dev, ddims, data_mean, ddims, [axis])

    denominator = np.einsum(data, ddims, data, ddims, [axis])
    denominator += -2.0 * np.einsum(data, ddims, data_mean, ddims, [axis])
    denominator += np.sum(data_mean**2, axis=sample_axes) * seed.size
    denominator *= np.einsum(seed_dev**2, sdims, [axis])
    denominator = np.sqrt(denominator)

    return np.clip(numerator / denominator, -1.0, 1.0)
```

It also doesn't need to make a copy of the array.

Re-introducing the requirement to handle NaNs, though, I couldn't find any option besides making a mask array and introducing it explicitly into the calculations. That's why I was imagining an optimized "naneinsum." Are there any existing ways of doing sums and products on numpy arrays that have a fast way of handling NaNs? Or is a mask array the best thing to hope for?

I think that for this problem I could transpose and reshape the N-dimensional array down to two dimensions without making an array copy, which might make it easier to interface with an optimizing package that doesn't fully support multidimensional arrays.

I am fairly new to making numpy/python fast, and coming from MATLAB I am very impressed, though there is a bewildering number of options when it comes to trying to optimize.

Thanks,
Chris

On Tue, Apr 19, 2016 at 11:12 AM, Chris Barber <barberchris01@gmail.com> wrote:
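[On the mask-array question above: one explicit-mask formulation, reduced to the 2-D case with one (N-1)-dimensional slice flattened per row, might look like the sketch below. `nan_corrs_2d` is an illustrative name, not an existing routine; it uses pairwise deletion, i.e. each correlation counts only positions where both the seed row and the other row are non-NaN.]

```python
import numpy as np

def nan_corrs_2d(data2d, index):
    """Pearson correlation of row `index` against every row of `data2d`,
    counting only positions where both values are non-NaN."""
    seed = data2d[index]                            # shape (M,)
    valid = ~np.isnan(data2d) & ~np.isnan(seed)     # (N, M) mask per pair
    x = np.where(valid, data2d, 0.0)                # zero-filled rows
    s = np.where(valid, seed, 0.0)                  # seed, masked per row
    n = valid.sum(axis=1)                           # valid pairs per row

    sx, ss = x.sum(axis=1), s.sum(axis=1)
    cov = (x * s).sum(axis=1) - sx * ss / n
    var_x = (x * x).sum(axis=1) - sx**2 / n
    var_s = (s * s).sum(axis=1) - ss**2 / n

    # Rows with no overlap (n == 0) or zero variance come out as NaN/inf here.
    return np.clip(cov / np.sqrt(var_x * var_s), -1.0, 1.0)
```

[The element-wise products could equally be written as einsum calls; the point is only that the mask has to enter both the sums and the per-pair counts.]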
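[And on the transpose-and-reshape idea: np.moveaxis returns a view, but the subsequent reshape is only copy-free when the resulting memory layout is compatible (e.g. moving axis 0 of a C-contiguous array); otherwise NumPy quietly copies at the reshape step. A small sketch, where `slices_as_rows` is just an illustrative helper name:]

```python
import numpy as np

def slices_as_rows(data, axis):
    """Collapse an N-d array to 2-D with one (N-1)-d slice per row.

    np.moveaxis returns a view; the reshape may still make a copy if the
    moved layout is not contiguous.
    """
    moved = np.moveaxis(data, axis, 0)
    return moved.reshape(data.shape[axis], -1)

# Example shapes:
# data = np.random.rand(4, 5, 6)
# rows = slices_as_rows(data, 1)   # shape (5, 24), one row per slice along axis 1
```

[A 2-D layout like this would feed directly into a masked routine such as the one sketched above, or into a package that handles NaNs quickly but does not fully support multidimensional arrays.]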