[Numpy-discussion] nan version of einsum

Chris Barber barberchris01 at gmail.com
Wed Apr 27 18:36:53 EDT 2016


Looks like I was a little confused.  It appears that the nan* versions of
functions in numpy just substitute the NaNs in a copy of the original array
and so are just convenience methods.  I was imagining that they were
optimized and handling the NaNs at a lower level.  It looks like the
"bottleneck" package tries to do this for nansum, nanprod, etc, but I don't
know if it is able to take advantage of SSE or not.

Anyways, maybe if I elaborate someone can offer suggestions.

My application is: given a large N-dimensional array, and a selected
(N-1)-dimensional slice of it, compute the Pearson correlation of that
slice versus all other (N-1)-dimensional slices of the array, omitting
NaN's from the calculation.

Ignoring NaN's for a moment, here is the slow and obvious way to do it:

    from scipy.stats import pearsonr
    import numpy as np
    def corrs(data,index,dim):
        seed = data.take(index,dim)
        res = np.zeros(data.shape[dim])
        for i in range(0,data.shape[dim]):
            res[i] = pearsonr(seed, data.take(i,dim))[0]
        return res

Doing all the math by hand and using einsum, there is an extremely fast
(though fairly cryptic) way of doing this:

    def corrs2(data, index, axis):
        seed = data.take([index], axis=axis)

        sdims = range(0,seed.ndim)
        ddims = range(0,data.ndim)
        sample_axes = np.array([i for i in ddims if i != axis])

        seed_mean = np.einsum(seed, sdims, []) / seed.size
        data_mean = np.einsum(data, ddims, [axis]) / seed.size
        data_mean.shape = tuple(data.shape[i] if i == axis else 1 for i in
ddims) # restore dims after einsum

        seed_dev = np.einsum(seed-seed_mean, sdims, sdims)
        numerator = np.einsum(seed_dev, ddims, data, ddims, [axis])
        numerator -= np.einsum(seed_dev, ddims, data_mean, ddims, [axis])

        denominator = np.einsum(data, ddims, data, ddims, [axis])
        denominator += -2.0*np.einsum(data, ddims, data_mean, ddims, [axis])
        denominator += np.sum(data_mean**2, axis=sample_axes) * seed.size
        denominator *= np.einsum(seed_dev**2, sdims, [axis])
        denominator = np.sqrt(denominator)

        return np.clip(numerator / denominator, -1.0, 1.0)

It also doesn't need to make a copy of the array.  Re-introducing the
requirement to handle NaNs, though, I couldn't find any option besides
making a mask array and introducing that explicitly into the calculations.
That's why I was imagining an optimized "naneinsum."

Are there any existing ways of doing sums and products on numpy arrays that
have a fast way of handling NaNs?  Or is a mask array the best thing to
hope for?  I think that for this problem that I could transpose and reshape
the N-dimensional array down to 2-dimensions without making an array copy,
which might make it easier to interface with some optimizing package that
doesn't support multidimensional arrays fully.

I am fairly new to making numpy/python fast, and coming from MATLAB am very
impressed, though there are a bewlidering number of options when it comes
to trying to optimize.


On Tue, Apr 19, 2016 at 11:12 AM, Chris Barber <barberchris01 at gmail.com>

> Is there any interest in a nan-ignoring version of einsum a la nansum,
> nanprod, etc?  Any idea how difficult it would be to implement?
> - Chris
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20160427/f1ed8b54/attachment.html>

More information about the NumPy-Discussion mailing list