[Numpy-discussion] Does np.std() make two passes through the data?

Sun Nov 21 20:49:54 EST 2010

On Sun, Nov 21, 2010 at 4:18 PM,  <josef.pktd at gmail.com> wrote:
> On Sun, Nov 21, 2010 at 6:43 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>> Does np.std() make two passes through the data?
>>
>> Numpy:
>>
>>>> arr = np.random.rand(10)
>>>> arr.std()
>>   0.3008736260967052
>>
>> Looks like an algorithm that makes one pass through the data (one for
>> loop) wouldn't match arr.std():
>>
>>>> np.sqrt((arr*arr).mean() - arr.mean()**2)
>>   0.30087362609670526
>>
>> But a slower two-pass algorithm would match arr.std():
>>
>>>> np.sqrt(((arr - arr.mean())**2).mean())
>>   0.3008736260967052
>>
>> Is there a way to get the same result as arr.std() in one pass (cython
>> for loop) of the data?
>
> reference several times pointed to on the list is the wikipedia page, e.g.
> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm

Unfortunately it doesn't give the same result as numpy's two pass algorithm.

>From wikipedia:

def var(arr):
    n = 0
    mean = 0
    M2 = 0
    for x in arr:
        n = n + 1
        delta = x - mean
        mean = mean + delta / n
        M2 = M2 + delta * (x - mean)
    return M2 / n

This set of random samples gives matching variance:

>> a = np.random.rand(100)
>> a.var()
   0.07867478939716277
>> var(a)
   0.07867478939716277

But this sample gives a difference:

>> a = np.random.rand(100)
>> a.var()
   0.080232196646619805
>> var(a)
   0.080232196646619791

As you know, I'm trying to make a drop-in replacement for
scipy.stats.nanstd. Maybe I'll have to add an asterisk to the drop-in
part. Either that, or suck it up and store the damn mean.