[Numpy-discussion] What should be the result in some statistics corner cases?

Charles R Harris charlesr.harris at gmail.com
Mon Jul 15 11:47:50 EDT 2013


On Mon, Jul 15, 2013 at 8:58 AM, Charles R Harris
<charlesr.harris at gmail.com> wrote:

>
>
> On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg <
> sebastian at sipsolutions.net> wrote:
>
>> On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
>> >
>> >
>> > On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris
>> > <charlesr.harris at gmail.com> wrote:
>> >
>>
>> <snip>
>>
>> >
>> >                 For nansum, I would expect 0 even in the case of all
>> >                 nans. The point of these functions is to simply ignore
>> >                 nans, correct? So I would aim for this behaviour:
>> >                 nanfunc(x) behaves the same as func(x[~isnan(x)])
>> >
>> >
>> >         Agreed, although that changes current behavior. What about the
>> >         other cases?
>> >
>> >
>> >
>> > Looks like there isn't much interest in the topic, so I'll just go
>> > ahead with the following choices:
>> >
>> > Non-NaN case
>> >
>> > 1) Empty array -> ValueError
>> >
>> > The current behavior with stats is an accident, i.e., the nan arises
>> > from 0/0. I like to think that in this case the result is any number,
>> > rather than not a number, so *the* value is simply not defined. So in
>> > this case raise a ValueError for empty array.
>> >
>> To be honest, I don't mind the current behaviour much: sum([]) = 0 and
>> len([]) = 0, so it is in a way well defined. At least I am not sure
>> that I would prefer always raising an error. I am a bit worried that
>> just changing it might break code out there, such as plotting code where
>> it makes perfect sense to plot a NaN (i.e. nothing), but if that is the
>> case it would probably become visible fast.
>>
>> > 2) ddof >= n -> ValueError
>> >
>> > If the number of elements, n, is not zero and ddof >= n, raise a
>> > ValueError for the ddof value.
>> >
>> Makes sense to me, especially for ddof > n. Just returning nan in all
>> cases for backward compatibility would be fine with me too.
>>
>
> Currently, if ddof > n it returns a negative number for the variance. The
> NaN only arises when ddof == 0 and n == 0, leading to 0/0 (NaN for floats,
> a zero division error for integers).
>
>
>>
>> > NaN case
>> >
>> > 1) Empty array -> ValueError
>> > 2) Empty slice -> NaN
>> > 3) For slice ddof >= n -> NaN
>> >
>> Personally I would somewhat prefer it if 1) and 2) at least defaulted
>> to the same thing. But I don't use the nanfuncs anyway. I was wondering
>> about adding an option for the user to pick the fill value (e.g. if it
>> is None, maybe the default, raise a ValueError). We could also allow this
>> for normal reductions without an identity, but I am not sure if it is
>> useful there.
>>
>
> In the NaN case some slices may be empty, others not. My reasoning is that
> that is going to be data dependent, not operator error, but if the array is
> empty the writer of the code should deal with that.
>
>
In the case of nanvar and nanstd, it might make more sense to handle ddof
as follows:

1) if ddof is >= axis size, raise ValueError
2) if ddof is >= number of values after removing NaNs, return NaN

The first would be consistent with the non-nan case, the second accounts
for the variable nature of data containing NaNs.
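
As a rough illustration of the two rules above, here is a minimal sketch in
plain Python for a single 1-D slice along the reduction axis. The helper name
nanvar_1d is hypothetical and this is not the NumPy implementation; it only
shows where the ValueError and the NaN would come from under this proposal.

```python
import math


def nanvar_1d(values, ddof=0):
    """Hypothetical helper: variance of one 1-D slice, ignoring NaNs."""
    n_axis = len(values)
    # Rule 1: ddof >= axis size is treated as a caller error,
    # the same as in the non-NaN case (this also covers an empty slice).
    if ddof >= n_axis:
        raise ValueError("ddof must be smaller than the axis size")

    kept = [v for v in values if not math.isnan(v)]
    # Rule 2: too few values survive NaN removal. That depends on the
    # data rather than on the caller, so return NaN instead of raising.
    if ddof >= len(kept):
        return math.nan

    mean = sum(kept) / len(kept)
    return sum((v - mean) ** 2 for v in kept) / (len(kept) - ddof)


nan = float("nan")
print(nanvar_1d([1.0, 2.0, 3.0, nan], ddof=1))  # three values remain
print(nanvar_1d([nan, nan, 5.0], ddof=1))       # one value remains -> nan
```

With ddof=1, a slice such as [1.0, 2.0] passed with ddof=2 would raise, while
[nan, nan, 5.0] would quietly give NaN because only the data, not the call,
is at fault.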

Chuck