On Mon, Nov 22, 2010 at 12:07 PM, Benjamin Root <ben.root@ou.edu> wrote:
On Mon, Nov 22, 2010 at 11:03 AM, Keith Goodman <kwgoodman@gmail.com> wrote:
On Sun, Nov 21, 2010 at 5:56 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Sun, Nov 21, 2010 at 19:49, Keith Goodman <kwgoodman@gmail.com> wrote:
But this sample gives a difference:
a = np.random.rand(100) a.var() 0.080232196646619805 var(a) 0.080232196646619791
As you know, I'm trying to make a drop-in replacement for scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in part. Either that, or suck it up and store the damn mean.
The difference is less than eps. Quite possibly, the one-pass version is even closer to the true value than the two-pass version.
I wrote 3 cython prototype implementations of nanstd for 1d float64 arrays:
a = np.random.rand(1000000)
# numpy; doesn't take care of NaNs
a.std() 0.28852169850186793
# cython of np.sqrt(((arr - arr.mean())**2).mean())
nanstd_twopass(a, ddof=0) 0.28852169850186798
# cython of np.sqrt((arr*arr).mean() - arr.mean()**2)
nanstd_simple(a, ddof=0) 0.28852169850187437
# http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_alg...
nanstd_online(a, ddof=0) 0.28852169850187243
# My target, scipy version
from scipy.stats import nanstd nanstd(a, bias=True) 0.28852169850186798
Timing:
timeit nanstd(a, bias=True) 10 loops, best of 3: 27.8 ms per loop timeit a.std() 100 loops, best of 3: 11.5 ms per loop timeit nanstd_twopass(a, ddof=0) 100 loops, best of 3: 3.24 ms per loop timeit nanstd_simple(a, ddof=0) 1000 loops, best of 3: 1.6 ms per loop timeit nanstd_online(a, ddof=0) 100 loops, best of 3: 10.8 ms per loop
nanstd_simple is the fastest but I assume the algorithm is no good for general use?
I think I'll go with nanstd_twopass. It will most closely match numpy/scipy, is more robust than nanstd_simple, and is the second fastest.
Here's the code. Improvements welcomed.
@cython.boundscheck(False) @cython.wraparound(False) def nanstd_simple(np.ndarray[np.float64_t, ndim=1] a, int ddof): "nanstd of 1d numpy array with dtype=np.float64 along axis=0." cdef Py_ssize_t i cdef int a0 = a.shape[0], count = 0 cdef np.float64_t asum = 0, a2sum=0, ai for i in range(a0): ai = a[i] if ai == ai: asum += ai a2sum += ai * ai count += 1 if count > 0: asum = asum * asum return sqrt((a2sum - asum / count) / (count - ddof)) else: return np.float64(NAN)
@cython.boundscheck(False) @cython.wraparound(False) def nanstd_online(np.ndarray[np.float64_t, ndim=1] a, int ddof): "nanstd of 1d numpy array with dtype=np.float64 along axis=0." cdef Py_ssize_t i cdef int a0 = a.shape[0], n = 0 cdef np.float64_t mean = 0, M2 = 0, delta, x for i in range(a0): x = a[i] if x == x: n += 1 delta = x - mean mean = mean + delta / n M2 = M2 + delta * (x - mean) if n > 0: return np.float64(sqrt(M2 / (n - ddof))) else: return np.float64(NAN)
@cython.boundscheck(False) @cython.wraparound(False) def nanstd_twopass(np.ndarray[np.float64_t, ndim=1] a, int ddof): "nanstd of 1d numpy array with dtype=np.float64 along axis=0." cdef Py_ssize_t i cdef int a0 = a.shape[0], count = 0 cdef np.float64_t asum = 0, a2sum=0, amean, ai, da for i in range(a0): ai = a[i] if ai == ai: asum += ai count += 1 if count > 0: amean = asum / count asum = 0 for i in range(a0): ai = a[i] if ai == ai: da = ai - amean asum += da a2sum += (da * da) asum = asum * asum return sqrt((a2sum - asum / count) / (count - ddof)) else: return np.float64(NAN)
I wonder how the results would change if the size of the array was larger than the processor cache? I still can't seem to wrap my head around the idea that a two-pass algorithm would be faster than a single-pass. Is this just a big-O thing where sometimes one algorithm will be faster than the other based on the size of the problem?
nanstd_online requires too many divisions according to the Wikipedia article, and is useful mainly if the array doesn't fit in memory. Two pass would provide precision that we would expect in numpy, but I don't know if anyone ever tested the NIST problems for basic statistics. Josef
Ben Root
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion