[Numpy-discussion] Does np.std() make two passes through the data?

josef.pktd at gmail.com
Mon Nov 22 12:13:44 EST 2010


On Mon, Nov 22, 2010 at 12:07 PM, Benjamin Root <ben.root at ou.edu> wrote:
> On Mon, Nov 22, 2010 at 11:03 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>
>> On Sun, Nov 21, 2010 at 5:56 PM, Robert Kern <robert.kern at gmail.com>
>> wrote:
>> > On Sun, Nov 21, 2010 at 19:49, Keith Goodman <kwgoodman at gmail.com>
>> > wrote:
>> >
>> >> But this sample gives a difference:
>> >>
>> >>>> a = np.random.rand(100)
>> >>>> a.var()
>> >>   0.080232196646619805
>> >>>> var(a)
>> >>   0.080232196646619791
>> >>
>> >> As you know, I'm trying to make a drop-in replacement for
>> >> scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in
>> >> part. Either that, or suck it up and store the damn mean.
>> >
>> > The difference is less than eps. Quite possibly, the one-pass version
>> > is even closer to the true value than the two-pass version.
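>> > A quick check, plugging the two numbers above into float64 (they
>> > differ by about one ulp):
>> >
>> >>> x, y = 0.080232196646619805, 0.080232196646619791
>> >>> abs(x - y) <= np.finfo(np.float64).eps * x
>> >   True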
>>
>> I wrote 3 cython prototype implementations of nanstd for 1d float64
>> arrays:
>>
>> >> a = np.random.rand(1000000)
>>
>> # numpy; doesn't take care of NaNs
>> >> a.std()
>>   0.28852169850186793
>>
>> # cython of np.sqrt(((arr - arr.mean())**2).mean())
>> >> nanstd_twopass(a, ddof=0)
>>   0.28852169850186798
>>
>> # cython of np.sqrt((arr*arr).mean() - arr.mean()**2)
>> >> nanstd_simple(a, ddof=0)
>>   0.28852169850187437
>>
>> # http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
>> >> nanstd_online(a, ddof=0)
>>   0.28852169850187243
>>
>> # My target, scipy version
>> >> from scipy.stats import nanstd
>> >> nanstd(a, bias=True)
>>   0.28852169850186798
>>
>> Timing:
>>
>> >> timeit nanstd(a, bias=True)
>> 10 loops, best of 3: 27.8 ms per loop
>> >> timeit a.std()
>> 100 loops, best of 3: 11.5 ms per loop
>> >> timeit nanstd_twopass(a, ddof=0)
>> 100 loops, best of 3: 3.24 ms per loop
>> >> timeit nanstd_simple(a, ddof=0)
>> 1000 loops, best of 3: 1.6 ms per loop
>> >> timeit nanstd_online(a, ddof=0)
>> 100 loops, best of 3: 10.8 ms per loop
>>
>> nanstd_simple is the fastest, but I assume the algorithm is no good
>> for general use?
>>
>> I think I'll go with nanstd_twopass. It will most closely match
>> numpy/scipy, is more robust than nanstd_simple, and is the second
>> fastest.
>>
>> Here's the code. Improvements welcomed.
>>
>> # imports the functions below need
>> import numpy as np
>> cimport numpy as np
>> cimport cython
>> from libc.math cimport sqrt, NAN
>>
>> @cython.boundscheck(False)
>> @cython.wraparound(False)
>> def nanstd_simple(np.ndarray[np.float64_t, ndim=1] a, int ddof):
>>    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
>>    cdef Py_ssize_t i
>>    cdef int a0 = a.shape[0], count = 0
>>    cdef np.float64_t asum = 0, a2sum=0, ai
>>    for i in range(a0):
>>        ai = a[i]
>>        if ai == ai:  # NaN != NaN, so this test skips NaNs
>>            asum += ai
>>            a2sum += ai * ai
>>            count += 1
>>    if count > 0:
>>        # one-pass E[x^2] - E[x]^2 formula; fast but subject to
>>        # catastrophic cancellation when the mean is large
>>        asum = asum * asum
>>        return sqrt((a2sum - asum / count) / (count - ddof))
>>    else:
>>        return np.float64(NAN)
>>
>> @cython.boundscheck(False)
>> @cython.wraparound(False)
>> def nanstd_online(np.ndarray[np.float64_t, ndim=1] a, int ddof):
>>    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
>>    cdef Py_ssize_t i
>>    cdef int a0 = a.shape[0], n = 0
>>    cdef np.float64_t mean = 0, M2 = 0, delta, x
>>    for i in range(a0):
>>        x = a[i]
>>        if x == x:
>>            n += 1
>>            # Welford's update: running mean and running sum of
>>            # squared deviations (M2); one division per element
>>            delta = x - mean
>>            mean = mean + delta / n
>>            M2 = M2 + delta * (x - mean)
>>    if n > 0:
>>        return np.float64(sqrt(M2 / (n - ddof)))
>>    else:
>>        return np.float64(NAN)
>>
>> @cython.boundscheck(False)
>> @cython.wraparound(False)
>> def nanstd_twopass(np.ndarray[np.float64_t, ndim=1] a, int ddof):
>>    "nanstd of 1d numpy array with dtype=np.float64 along axis=0."
>>    cdef Py_ssize_t i
>>    cdef int a0 = a.shape[0], count = 0
>>    cdef np.float64_t asum = 0, a2sum=0, amean, ai, da
>>    for i in range(a0):
>>        ai = a[i]
>>        if ai == ai:
>>            asum += ai
>>            count += 1
>>    if count > 0:
>>        amean = asum / count
>>        asum = 0
>>        for i in range(a0):
>>            ai = a[i]
>>            if ai == ai:
>>                da = ai - amean
>>                asum += da
>>                a2sum += (da * da)
>>        # corrected two-pass formula: asum accumulates the rounding
>>        # error in amean, and the asum**2/count term removes it
>>        asum = asum * asum
>>        return sqrt((a2sum - asum / count) / (count - ddof))
>>    else:
>>        return np.float64(NAN)
>
> I wonder how the results would change if the size of the array were
> larger than the processor cache?  I still can't seem to wrap my head
> around the idea that a two-pass algorithm would be faster than a
> single-pass one.  Is this just a big-O thing where sometimes one
> algorithm is faster than the other depending on the size of the
> problem?

nanstd_online requires too many divisions according to the Wikipedia
article (one per element, and a division presumably costs more than a
second cheap pass of adds and multiplies), and is mainly useful if the
array doesn't fit in memory.
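
A compromise, if the data has to be processed in pieces, is the
pairwise combine step from the same Wikipedia article: the accurate
two-pass formula inside each chunk, one combine per chunk. Just a
sketch in plain python (combine and nanstd_chunked are my names, not
existing functions):

import numpy as np

def combine(n1, mean1, M2_1, n2, mean2, M2_2):
    # merge two partial (count, mean, sum of squared deviations)
    # summaries into one
    n = n1 + n2
    delta = mean2 - mean1
    mean = mean1 + delta * n2 / n
    M2 = M2_1 + M2_2 + delta * delta * n1 * n2 / n
    return n, mean, M2

def nanstd_chunked(chunks, ddof=0):
    # chunks: any iterable of 1d float64 arrays, e.g. read from disk,
    # so the full array never has to be in memory at once
    n, mean, M2 = 0, 0.0, 0.0
    for chunk in chunks:
        x = chunk[~np.isnan(chunk)]   # drop NaNs, like the nanstd_*
        if x.size:
            m = x.mean()
            n, mean, M2 = combine(n, mean, M2,
                                  x.size, m, ((x - m) ** 2).sum())
    if n > 0:
        return np.sqrt(M2 / (n - ddof))
    return np.nan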

Two-pass would provide the precision that we expect from numpy, but I
don't know if anyone has ever tested the NIST reference problems for
basic statistics against it.
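
The NIST sets stress exactly the weak spot of nanstd_simple: a huge
mean with a tiny variance. A quick sketch of the failure mode, using
the same two formulas in plain numpy:

import numpy as np

a = np.random.rand(1000000) + 1e8   # mean ~1e8, std ~0.29

# one-pass E[x^2] - E[x]^2, as in nanstd_simple: at this scale the
# subtraction cancels catastrophically; the result is typically
# garbage, and can go negative, in which case sqrt returns nan
print(np.sqrt((a * a).mean() - a.mean() ** 2))

# two-pass, as in nanstd_twopass: unaffected by the shift
print(np.sqrt(((a - a.mean()) ** 2).mean()))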

Josef

>
> Ben Root
>


