On Mon, Nov 22, 2010 at 11:03 AM, Keith Goodman <kwgoodman@gmail.com> wrote:
On Sun, Nov 21, 2010 at 5:56 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Sun, Nov 21, 2010 at 19:49, Keith Goodman <kwgoodman@gmail.com> wrote:
But this sample gives a difference:
>>> a = np.random.rand(100)
>>> a.var()
0.080232196646619805
>>> var(a)
0.080232196646619791
As you know, I'm trying to make a drop-in replacement for scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in part. Either that, or suck it up and store the damn mean.
The difference is less than machine epsilon (about 2.2e-16 for float64). Quite possibly, the one-pass version is even closer to the true value than the two-pass version.
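One way to check which is closer (a rough sketch; var_onepass and var_twopass below are stand-ins for the two formulas under discussion, not the actual implementations) is to compare both against a reference computed at higher precision:

    import numpy as np

    def var_onepass(a):
        # textbook one-pass formula: E[x^2] - E[x]^2
        return (a * a).mean() - a.mean() ** 2

    def var_twopass(a):
        return ((a - a.mean()) ** 2).mean()

    a = np.random.rand(100)
    # reference value computed in extended precision
    ref = var_twopass(a.astype(np.longdouble))
    print(abs(var_onepass(a) - ref))
    print(abs(var_twopass(a) - ref))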
I wrote 3 cython prototype implementations of nanstd for 1d float64 arrays:
>>> a = np.random.rand(1000000)

# numpy; doesn't take care of NaNs
>>> a.std()
0.28852169850186793

# cython of np.sqrt(((arr - arr.mean())**2).mean())
>>> nanstd_twopass(a, ddof=0)
0.28852169850186798

# cython of np.sqrt((arr*arr).mean() - arr.mean()**2)
>>> nanstd_simple(a, ddof=0)
0.28852169850187437

# http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_alg...
>>> nanstd_online(a, ddof=0)
0.28852169850187243

# My target, scipy version
>>> from scipy.stats import nanstd
>>> nanstd(a, bias=True)
0.28852169850186798
Timing:
>>> timeit nanstd(a, bias=True)
10 loops, best of 3: 27.8 ms per loop
>>> timeit a.std()
100 loops, best of 3: 11.5 ms per loop
>>> timeit nanstd_twopass(a, ddof=0)
100 loops, best of 3: 3.24 ms per loop
>>> timeit nanstd_simple(a, ddof=0)
1000 loops, best of 3: 1.6 ms per loop
>>> timeit nanstd_online(a, ddof=0)
100 loops, best of 3: 10.8 ms per loop
nanstd_simple is the fastest, but I assume the formula it uses is no good for general use?
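The E[x^2] - E[x]^2 formula loses precision to catastrophic cancellation when the mean is large relative to the spread. A quick plain-NumPy illustration of the failure mode (not the Cython code below):

    import numpy as np

    a = np.random.rand(1000000) + 1e8  # large mean, tiny relative spread
    # One-pass textbook formula: the two big terms cancel, leaving noise.
    print((a * a).mean() - a.mean() ** 2)
    # Two-pass formula: stays close to the true variance, 1/12.
    print(((a - a.mean()) ** 2).mean())

With the mean shifted to 1e8, the true variance (1/12) is smaller than the rounding error of the two ~1e16 terms being subtracted, so the one-pass result is essentially noise and can even come out negative.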
I think I'll go with nanstd_twopass. It will most closely match numpy/scipy, is more robust than nanstd_simple, and is the second fastest.
Here's the code. Improvements welcomed.
@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_simple(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, a2sum = 0, ai
    for i in range(a0):
        ai = a[i]
        if ai == ai:  # False only for NaN
            asum += ai
            a2sum += ai * ai
            count += 1
    if count > 0:
        asum = asum * asum
        return sqrt((a2sum - asum / count) / (count - ddof))
    else:
        return np.float64(NAN)
@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_online(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], n = 0
    cdef np.float64_t mean = 0, M2 = 0, delta, x
    for i in range(a0):
        x = a[i]
        if x == x:  # False only for NaN
            n += 1
            delta = x - mean
            mean = mean + delta / n
            M2 = M2 + delta * (x - mean)  # Welford update
    if n > 0:
        return np.float64(sqrt(M2 / (n - ddof)))
    else:
        return np.float64(NAN)
@cython.boundscheck(False)
@cython.wraparound(False)
def nanstd_twopass(np.ndarray[np.float64_t, ndim=1] a, int ddof):
    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
    cdef Py_ssize_t i
    cdef int a0 = a.shape[0], count = 0
    cdef np.float64_t asum = 0, a2sum = 0, amean, ai, da
    for i in range(a0):
        ai = a[i]
        if ai == ai:  # False only for NaN
            asum += ai
            count += 1
    if count > 0:
        amean = asum / count
        asum = 0
        for i in range(a0):
            ai = a[i]
            if ai == ai:
                da = ai - amean
                asum += da  # residual sum, ~0; used as correction below
                a2sum += (da * da)
        asum = asum * asum
        return sqrt((a2sum - asum / count) / (count - ddof))
    else:
        return np.float64(NAN)
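A quick way to exercise the NaN handling (a sketch; it assumes the three prototypes above have been compiled and imported):

    import numpy as np
    from scipy.stats import nanstd

    a = np.random.rand(1000000)
    a[::100] = np.nan  # sprinkle in some NaNs
    print(nanstd(a, bias=True))       # scipy reference
    print(nanstd_twopass(a, ddof=0))
    print(nanstd_simple(a, ddof=0))
    print(nanstd_online(a, ddof=0))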
I wonder how the results would change if the size of the array were larger than the processor cache? I still can't wrap my head around the idea that a two-pass algorithm would be faster than a single-pass one. Is this just a big-O thing where one algorithm is faster than the other depending on the size of the problem?

Ben Root
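P.S. One way to probe the cache question empirically (a sketch; it assumes the compiled prototypes above are importable from __main__) is to sweep the array size across the cache boundary and time each version:

    import timeit
    import numpy as np

    # float64 is 8 bytes/element, so 10**6 elements is already ~8 MB,
    # around typical cache sizes.  Sweep from well below to well above.
    for n in (10**4, 10**5, 10**6, 10**7):
        a = np.random.rand(n)
        for fn in ('nanstd_simple', 'nanstd_twopass', 'nanstd_online'):
            t = timeit.timeit('%s(a, ddof=0)' % fn,
                              setup='from __main__ import a, %s' % fn,
                              number=10)
            print('%8d  %-16s %.4f s per call' % (n, fn, t / 10))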