NAN handling in the statistics module

Bug #33084 reports that the statistics library calculates median and other stats wrongly if the data contains NANs. Worse, the result depends on the initial placement of the NAN:

py> from statistics import median
py> NAN = float('nan')
py> median([NAN, 1, 2, 3, 4])
2
py> median([1, 2, 3, 4, NAN])
3

See the bug report for more detail:

https://bugs.python.org/issue33084

The caller can always filter NANs out of their own data, but following the lead of some other stats packages, I propose a standard way for the statistics module to do so. I hope this will be uncontroversial (he says, optimistically...) but just in case, here is some prior art:

(1) Nearly all R stats functions take a "na.rm" argument which defaults to False; if True, NA and NAN values will be stripped.

(2) The scipy.stats.ttest_ind function takes a "nan_policy" argument which specifies what to do if a NAN is seen in the data.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.h...

(3) At least some Matlab functions, such as mean(), take an optional flag that determines whether to ignore NANs or include them.

https://au.mathworks.com/help/matlab/ref/mean.html#bt5b82t-1-nanflag

I propose adding a "nan_policy" keyword-only parameter to the relevant statistics functions (mean, median, variance etc), and defining the following policies:

IGNORE: quietly ignore all NANs
FAIL: raise an exception if any NAN is seen in the data
PASS: pass NANs through unchanged (the default)
RETURN: return a NAN if any NAN is seen in the data
WARN: ignore all NANs but raise a warning if one is seen

PASS is equivalent to saying that you, the caller, have taken full responsibility for filtering out NANs and there's no need for the function to slow down processing by doing so again. Either that, or you want the current implementation-dependent behaviour.

FAIL is equivalent to treating all NANs as "signalling NANs". The presence of a NAN is an error.

RETURN is equivalent to "NAN poisoning" -- the presence of a NAN in a calculation causes it to return a NAN, allowing NANs to propagate through multiple calculations.

IGNORE and WARN are the same, except IGNORE is silent and WARN raises a warning.

Questions:

- does anyone have any serious objections to this?
- what do you think of the names for the policies?
- are there any additional policies that you would like to see? (if so, please give use-cases)
- are you happy with the default?

Bike-shed away!

-- Steve
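P.S. To make the policies concrete, here is a rough sketch of how the parameter might be wired in behind the scenes. The enum and helper below are placeholders for illustration, not part of the proposal, and only float NANs are handled (Decimal NANs would need extra care):

# Rough sketch only -- names and structure are illustrative, not a spec.
import warnings
from enum import Enum
from math import isnan

class NanPolicy(Enum):
    IGNORE = "ignore"
    FAIL = "fail"
    PASS = "pass"
    RETURN = "return"
    WARN = "warn"

def _handle_nans(data, nan_policy=NanPolicy.PASS):
    """Apply a NAN policy to data (a list of numbers).

    Returns (data, poisoned); if poisoned is True, the statistics
    function should short-circuit and return float('nan').
    """
    if nan_policy is NanPolicy.PASS:
        return data, False
    has_nan = any(isinstance(x, float) and isnan(x) for x in data)
    if not has_nan:
        return data, False
    if nan_policy is NanPolicy.FAIL:
        raise ValueError("NAN found in data")
    if nan_policy is NanPolicy.RETURN:
        return data, True
    if nan_policy is NanPolicy.WARN:
        warnings.warn("NAN found in data and ignored")
    # IGNORE and WARN both strip the NANs before the calculation.
    return [x for x in data if not (isinstance(x, float) and isnan(x))], False

A function like median() would then call _handle_nans() on its data first, returning float('nan') immediately if the RETURN policy fired.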

Would these policies be named as strings or with an enum? Following Pandas, we'd probably support both. I won't bikeshed the names, but they seem to cover desired behaviors.

On Sun, Jan 6, 2019, 7:28 PM Steven D'Aprano <steve@pearwood.info> wrote:
Bug #33084 reports that the statistics library calculates median and other stats wrongly if the data contains NANs. Worse, the result depends on the initial placement of the NAN:
py> from statistics import median
py> NAN = float('nan')
py> median([NAN, 1, 2, 3, 4])
2
py> median([1, 2, 3, 4, NAN])
3
See the bug report for more detail:
https://bugs.python.org/issue33084
The caller can always filter NANs out of their own data, but following the lead of some other stats packages, I propose a standard way for the statistics module to do so. I hope this will be uncontroversial (he says, optimistically...) but just in case, here is some prior art:
(1) Nearly all R stats functions take a "na.rm" argument which defaults to False; if True, NA and NAN values will be stripped.
(2) The scipy.stats.ttest_ind function takes a "nan_policy" argument which specifies what to do if a NAN is seen in the data.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.h...
(3) At least some Matlab functions, such as mean(), take an optional flag that determines whether to ignore NANs or include them.
https://au.mathworks.com/help/matlab/ref/mean.html#bt5b82t-1-nanflag
I propose adding a "nan_policy" keyword-only parameter to the relevant statistics functions (mean, median, variance etc), and defining the following policies:
IGNORE: quietly ignore all NANs
FAIL: raise an exception if any NAN is seen in the data
PASS: pass NANs through unchanged (the default)
RETURN: return a NAN if any NAN is seen in the data
WARN: ignore all NANs but raise a warning if one is seen
PASS is equivalent to saying that you, the caller, have taken full responsibility for filtering out NANs and there's no need for the function to slow down processing by doing so again. Either that, or you want the current implementation-dependent behaviour.
FAIL is equivalent to treating all NANs as "signalling NANs". The presence of a NAN is an error.
RETURN is equivalent to "NAN poisoning" -- the presence of a NAN in a calculation causes it to return a NAN, allowing NANs to propagate through multiple calculations.
IGNORE and WARN are the same, except IGNORE is silent and WARN raises a warning.
Questions:
- does anyone have any serious objections to this?
- what do you think of the names for the policies?
- are there any additional policies that you would like to see? (if so, please give use-cases)
- are you happy with the default?
Bike-shed away!
-- Steve

On Sun, Jan 06, 2019 at 07:46:03PM -0500, David Mertz wrote:
Would these policies be named as strings or with an enum? Following Pandas, we'd probably support both.
Sure, I can support both.
I won't bikeshed the names, but they seem to cover desired behaviors.
Good to hear. -- Steve

I have to say though that the existing behavior of `statistics.median[_low|_high|]` is SURPRISING if not outright wrong. It is the behavior in existing Python, but it is very strange.

The implementation simply does whatever `sorted()` does, which is an implementation detail. In particular, NaNs, being neither less than nor greater than any floating point number, just stay where they are during sorting. But that's a particular feature of TimSort. Yes, we are guaranteed that sorts are stable; and we have rules about which things can and cannot be compared for inequality at all. But beyond that, I do not think Python ever promised that NaNs would remain in the same positions after sorting if some other position was stable under a different sorting algorithm.

So in the incredibly unlikely event I invent a DavidSort that behaves better than TimSort, is stable, and compares only the same Python objects as current CPython, a future version could use this algorithm without breaking promises... even if NaNs sometimes sorted differently than in TimSort. For that matter, some new implementation could use my not-nearly-as-good DavidSort, and while being slower, would still be compliant.

Relying on that for the result of `median()` feels strange to me. It feels strange as the default behavior, but that's the status quo. But it feels even stranger that there are not at least options to deal with NaNs in more of the signaling or poisoning ways that every other numeric library does (a sketch of the kind of pre-check a caller must do today follows at the end of this message).

On Sun, Jan 6, 2019 at 7:28 PM Steven D'Aprano <steve@pearwood.info> wrote:
Bug #33084 reports that the statistics library calculates median and other stats wrongly if the data contains NANs. Worse, the result depends on the initial placement of the NAN:
py> from statistics import median
py> NAN = float('nan')
py> median([NAN, 1, 2, 3, 4])
2
py> median([1, 2, 3, 4, NAN])
3
See the bug report for more detail:
https://bugs.python.org/issue33084
The caller can always filter NANs out of their own data, but following the lead of some other stats packages, I propose a standard way for the statistics module to do so. I hope this will be uncontroversial (he says, optimistically...) but just in case, here is some prior art:
(1) Nearly all R stats functions take a "na.rm" argument which defaults to False; if True, NA and NAN values will be stripped.
(2) The scipy.stats.ttest_ind function takes a "nan_policy" argument which specifies what to do if a NAN is seen in the data.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.h...
(3) At least some Matlab functions, such as mean(), take an optional flag that determines whether to ignore NANs or include them.
https://au.mathworks.com/help/matlab/ref/mean.html#bt5b82t-1-nanflag
I propose adding a "nan_policy" keyword-only parameter to the relevant statistics functions (mean, median, variance etc), and defining the following policies:
IGNORE: quietly ignore all NANs
FAIL: raise an exception if any NAN is seen in the data
PASS: pass NANs through unchanged (the default)
RETURN: return a NAN if any NAN is seen in the data
WARN: ignore all NANs but raise a warning if one is seen
PASS is equivalent to saying that you, the caller, have taken full responsibility for filtering out NANs and there's no need for the function to slow down processing by doing so again. Either that, or you want the current implementation-dependent behaviour.
FAIL is equivalent to treating all NANs as "signalling NANs". The presence of a NAN is an error.
RETURN is equivalent to "NAN poisoning" -- the presence of a NAN in a calculation causes it to return a NAN, allowing NANs to propagate through multiple calculations.
IGNORE and WARN are the same, except IGNORE is silent and WARN raises a warning.
Questions:
- does anyone have any serious objections to this?
- what do you think of the names for the policies?
- are there any additional policies that you would like to see? (if so, please give use-cases)
- are you happy with the default?
Bike-shed away!
-- Steve
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
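P.S. For what it's worth, here is the kind of pre-scan a caller has to bolt on today to get either the signaling or the poisoning behaviour. A rough sketch only; the helper names are invented for illustration:

from math import isnan
from statistics import median

def _has_nan(data):
    return any(isinstance(x, float) and isnan(x) for x in data)

def median_signaling(data):
    # raise if a NAN is present ("signaling" behaviour)
    data = list(data)
    if _has_nan(data):
        raise ValueError("NAN in data")
    return median(data)

def median_poisoning(data):
    # return a NAN if a NAN is present ("poisoning" behaviour)
    data = list(data)
    if _has_nan(data):
        return float('nan')
    return median(data)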

[David Mertz <mertz@gnosis.cx>]
I have to say though that the existing behavior of `statistics.median[_low|_high|]` is SURPRISING if not outright wrong. It is the behavior in existing Python, but it is very strange.
The implementation simply does whatever `sorted()` does, which is an implementation detail. In particular, NaN's being neither less than nor greater than any floating point number, just stay where they are during sorting.
I expect you inferred that from staring at a handful of examples, but it's an illusion. Python's sort uses only __lt__ comparisons, and if those don't implement a total ordering then _nothing_ is defined about sort's result (beyond that it's some permutation of the original list).

There's nothing special about NaNs in this. For example, if you sort a list of sets, then "<" means subset inclusion, which doesn't define a total ordering among sets in general either (unless for every pair of sets in a specific list, one is a proper subset of the other - in which case the list of sets will be sorted in order of increasing cardinality).
But that's a particular feature of TimSort. Yes, we are guaranteed that sorts are stable; and we have rules about which things can and cannot be compared for inequality at all. But beyond that, I do not think Python ever promised that NaNs would remain in the same positions after sorting
We don't promise it, and it's not true. For example,
import math
nan = math.nan
xs = [0, 1, 2, 4, nan, 5, 3]
sorted(xs)
[0, 1, 2, 3, 4, nan, 5]
The NaN happened to move "one place to the right" there. There's no point to analyzing "why" - it's purely an accident deriving from the pattern of __lt__ outcomes the internals happened to invoke. FYI, it goes like so:

is 1 < 0? No, so the first two are already sorted.
is 2 < 1? No, so the first three are already sorted.
is 4 < 2? No, so the first four are already sorted.
is nan < 4? No, so the first five are already sorted.
is 5 < nan? No, so the first six are already sorted.
is 3 < 5? Yes!

At that point a binary insertion is used to move 3 into place. And none of timsort's "fancy" parts even come into play for lists so small. The patterns of comparisons the fancy parts invoke can be much more involved.

At no point does the algorithm have any idea that there are NaNs in the list - it only looks at boolean __lt__ outcomes (you can watch this directly; see the wrapper sketch after the examples below). So, certainly, if you want median to be predictable in the presence of NaNs, sort's behavior in the presence of NaNs can't be relied on in any respect.
sorted([6, 5, nan, 4, 3, 2, 1])
[1, 2, 3, 4, 5, 6, nan]
sorted([9, 9, 9, 9, 9, 9, nan, 1, 2, 3, 4, 5, 6])
[9, 9, 9, 9, 9, 9, nan, 1, 2, 3, 4, 5, 6]
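If you want to watch the __lt__ questions the sort asks, a tiny wrapper does the trick. The Cmp class below is purely for demonstration - it is not anything in the library:

class Cmp:
    def __init__(self, value):
        self.value = value
    def __lt__(self, other):
        result = self.value < other.value  # any comparison with a NaN is False
        print("is", self.value, "<", other.value, "?", result)
        return result
    def __repr__(self):
        return repr(self.value)

nan = float('nan')
xs = [Cmp(x) for x in [0, 1, 2, 4, nan, 5, 3]]
print(sorted(xs))  # prints each __lt__ question as the sort asks it

Run that and you should see the sequence of questions described above, followed by the binary-insertion comparisons that move 3 into place.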

On Sun, Jan 6, 2019 at 4:27 PM Steven D'Aprano <steve@pearwood.info> wrote:
I propose adding a "nan_policy" keyword-only parameter to the relevant statistics functions (mean, median, variance etc), and defining the following policies:
IGNORE: quietly ignore all NANs
FAIL: raise an exception if any NAN is seen in the data
PASS: pass NANs through unchanged (the default)
RETURN: return a NAN if any NAN is seen in the data
WARN: ignore all NANs but raise a warning if one is seen
I don't think PASS should be the default behavior, and I'm not sure it would be productive to actually implement all of these options.

For reference, NumPy and pandas (the two most popular packages for data analytics in Python) support two of these modes:
- RETURN (numpy.mean() and skipna=False for pandas)
- IGNORE (numpy.nanmean() and skipna=True for pandas)

RETURN is the default behavior for NumPy; IGNORE is the default for pandas.

I'm pretty sure RETURN is the right default behavior for Python's standard library and anything else should be considered a bug. It safely propagates NaNs, along the lines of IEEE float behavior.

I'm not sure what the use cases are for PASS, FAIL, or WARN, none of which are supported by NumPy or pandas:

- PASS is a license to return silently incorrect results, in return for very marginal performance benefits. This seems at odds with the intended focus of the statistics module on correctness over speed. Returning incorrect statistics should not be considered a feature that needs to be maintained.

- FAIL would make sense if statistics functions could introduce *new* NaN values. But as far as I can tell, statistics functions already raise StatisticsError in these cases (e.g., if zero data points are provided). If users are concerned about accidentally propagating NaNs, they should be encouraged to check for NaNs at the entry points of their code.

- WARN is even less useful than FAIL. Seriously, who likes warnings? NumPy uses this approach for array operations that produce NaNs (e.g., when dividing by zero), because *some* but not all results may be valid. But statistics functions return scalars.

I'm not even entirely sure it makes sense to add the IGNORE option, or at least to add it only for NaN. None is also a reasonable sentinel for a missing value in Python, and user-defined types (e.g., pandas.NaT) also fall in this category. It seems a little strange to single NaN out in particular.
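Concretely, the two modes look like this in NumPy and pandas (assuming both libraries are installed; values shown in the comments):

import numpy as np
import pandas as pd

data = [1.0, 2.0, np.nan, 4.0]

np.mean(data)         # nan       -- RETURN: the NaN propagates
np.nanmean(data)      # 2.333...  -- IGNORE: the NaN is skipped

s = pd.Series(data)
s.mean()              # 2.333...  -- IGNORE is the pandas default (skipna=True)
s.mean(skipna=False)  # nan       -- RETURN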

On Sun, Jan 06, 2019 at 07:40:32PM -0800, Stephan Hoyer wrote:
On Sun, Jan 6, 2019 at 4:27 PM Steven D'Aprano <steve@pearwood.info> wrote:
I propose adding a "nan_policy" keyword-only parameter to the relevant statistics functions (mean, median, variance etc), and defining the following policies:
IGNORE: quietly ignore all NANs
FAIL: raise an exception if any NAN is seen in the data
PASS: pass NANs through unchanged (the default)
RETURN: return a NAN if any NAN is seen in the data
WARN: ignore all NANs but raise a warning if one is seen
I don't think PASS should be the default behavior, and I'm not sure it would be productive to actually implement all of these options.
I'm not wedded to the idea that the default ought to be the current behaviour. If there is a strong argument for one of the others, I'm listening.
For reference, NumPy and pandas (the two most popular packages for data analytics in Python) support two of these modes:
- RETURN (numpy.mean() and skipna=False for pandas)
- IGNORE (numpy.nanmean() and skipna=True for pandas)
RETURN is the default behavior for NumPy; IGNORE is the default for pandas.
I'm pretty sure RETURN is the right default behavior for Python's standard library and anything else should be considered a bug. It safely propagates NaNs, along the lines of IEEE float behavior.
How would you answer those who say that the right behaviour is not to propagate unwanted NANs, but to fail fast and raise an exception?
I'm not sure what the use cases are for PASS, FAIL, or WARN, none of which are supported by NumPy or pandas:
- PASS is a license to return silently incorrect results, in return for very marginal performance benefits.
By my (very rough) preliminary testing, the cost of checking for NANs doubles the cost of calculating the median, and increases the cost of calculating the mean() by 25%.

I'm not trying to compete with statistics libraries written in C for speed, but that doesn't mean I don't care about performance at all. The statistics library is already slower than I like and I don't want to slow it down further for the common case (numeric data with no NANs) for the sake of the uncommon case (data with NANs).

But I hear you about the "return silently incorrect results" part. Fortunately, I think that only applies to sort-based functions like median(). mean() etc ought to propagate NANs with any reasonable implementation, but I'm reluctant to make that a guarantee in case I come up with some unreasonable implementation :-)
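(For the curious, the kind of pre-scan under discussion looks something like the sketch below - a rough illustration, not my actual benchmark code:)

from math import isnan
from statistics import median
from timeit import timeit

data = [float(i) for i in range(10001)]

def median_checked(data):
    # scan for NANs before handing off to the real median()
    if any(isinstance(x, float) and isnan(x) for x in data):
        raise ValueError("NAN in data")
    return median(data)

print(timeit(lambda: median(data), number=100))          # baseline
print(timeit(lambda: median_checked(data), number=100))  # with the extra scan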
This seems at odds with the intended focus of the statistics module on correctness over speed. Returning incorrect statistics should not be considered a feature that needs to be maintained.
It is only incorrect because the data violates the documented requirement that it be *numeric data*, and the undocumented requirement that the numbers have a total order. (So complex numbers are out.) I admit that the docs could be improved, but there are no guarantees made about NANs. This doesn't mean I don't want to improve the situation! Far from it, hence this discussion.
- FAIL would make sense if statistics functions could introduce *new* NaN values. But as far as I can tell, statistics functions already raise StatisticsError in these cases (e.g., if zero data points are provided). If users are concerned about accidentally propagating NaNs, they should be encouraged to check for NaNs at the entry points of their code.
As far as I can tell, there are two kinds of people when it comes to NANs: those who think that signalling NANs are a waste of time and NANs should always propagate, and those who hate NANs and wish that they would always signal (raise an exception). I'm not going to get into an argument about who is right or who is wrong.
- WARN is even less useful than FAIL. Seriously, who likes warnings?
Me :-)
NumPy uses this approach for array operations that produce NaNs (e.g., when dividing by zero), because *some* but not all results may be valid. But statistics functions return scalars.
I'm not even entirely sure it makes sense to add the IGNORE option, or at least to add it only for NaN. None is also a reasonable sentinel for a missing value in Python, and user defined types (e.g., pandas.NaT) also fall in this category. It seems a little strange to single NaN out in particular.
I am considering adding support for a dedicated "missing" value, whether it is None or a special sentinel. But one thing at a time. Ignoring NANs is moderately common in other statistics libraries, and although I personally feel that NANs shouldn't be used for missing values, I know many people do so. -- Steve

On Sun, Jan 6, 2019 at 11:06 PM Steven D'Aprano <steve@pearwood.info> wrote:
I'm not wedded to the idea that the default ought to be the current behaviour. If there is a strong argument for one of the others, I'm listening.
"Errors should never pass silently"? Silently returning nonsensical results is hard to defend as a default behavior IMO :-)
How would you answer those who say that the right behaviour is not to propagate unwanted NANs, but to fail fast and raise an exception?
Both seem defensible a priori, but every other mathematical operation in Python propagates NaNs instead of raising an exception. Is there something unusual about median that would justify giving it unusual behavior?

-n

--
Nathaniel J. Smith -- https://vorpus.org

(By the way, I'm not outright disagreeing with you, I'm trying to weigh up the pros and cons of your position. You've given me a lot to think about. More below.) On Sun, Jan 06, 2019 at 11:31:30PM -0800, Nathaniel Smith wrote:
On Sun, Jan 6, 2019 at 11:06 PM Steven D'Aprano <steve@pearwood.info> wrote:
I'm not wedded to the idea that the default ought to be the current behaviour. If there is a strong argument for one of the others, I'm listening.
"Errors should never pass silently"? Silently returning nonsensical results is hard to defend as a default behavior IMO :-)
If you violate the assumptions of the function, just about everything can in principle return nonsensical results. True, most of the time you have to work hard at it:

class MyList(list):
    def __len__(self):
        return random.randint(0, sys.maxsize)

but it isn't unreasonable to document the assumptions of a function, and if the caller violates those assumptions, Garbage In Garbage Out applies.

E.g. bisect requires that your list is sorted in ascending order. If it isn't, the results you get are nonsensical.

py> data = [8, 6, 4, 2, 0]
py> bisect.bisect(data, 1)
0

That's not a bug in bisect, that's a bug in the caller's code, and it isn't bisect's responsibility to fix it.

Although it could be documented better, that's the current situation with NANs and median(). Data with NANs don't have a total ordering, and total ordering is the unstated assumption behind the idea of a median or middle value. So all bets are off.
How would you answer those who say that the right behaviour is not to propagate unwanted NANs, but to fail fast and raise an exception?
Both seem defensible a priori, but every other mathematical operation in Python propagates NaNs instead of raising an exception. Is there something unusual about median that would justify giving it unusual behavior?
Well, not everything...

py> NAN/0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: float division by zero

There may be others. But I'm not sure that "everything else does it" is a strong justification. It is *a* justification, since consistency is good, but consistency does not necessarily outweigh other concerns.

One possible argument for making PASS the default, even if that means implementation-dependent behaviour with NANs, is that in the absence of a clear preference for FAIL or RETURN, at least PASS is backwards compatible. You might shoot yourself in the foot, but at least you know it's the same foot you shot yourself in using the previous version *wink*

-- Steve

On Monday, January 7, 2019 at 3:16:07 AM UTC-5, Steven D'Aprano wrote:
(By the way, I'm not outright disagreeing with you, I'm trying to weigh up the pros and cons of your position. You've given me a lot to think about. More below.)
On Sun, Jan 06, 2019 at 11:31:30PM -0800, Nathaniel Smith wrote:
On Sun, Jan 6, 2019 at 11:06 PM Steven D'Aprano <st...@pearwood.info> wrote:
I'm not wedded to the idea that the default ought to be the current behaviour. If there is a strong argument for one of the others, I'm listening.
"Errors should never pass silently"? Silently returning nonsensical results is hard to defend as a default behavior IMO :-)
If you violate the assumptions of the function, just about everything can in principle return nonsensical results. True, most of the time you have to work hard at it:
class MyList(list):
    def __len__(self):
        return random.randint(0, sys.maxsize)
but it isn't unreasonable to document the assumptions of a function, and if the caller violates those assumptions, Garbage In Garbage Out applies.
I'm with Antoine, Nathaniel, David, and Chris: it is unreasonable to silently return nonsensical results even if you've documented it. Documenting it only makes it worse because it's like an "I told you so" when people finally figure out what's wrong and go to file the bug.
E.g. bisect requires that your list is sorted in ascending order. If it isn't, the results you get are nonsensical.
py> data = [8, 6, 4, 2, 0]
py> bisect.bisect(data, 1)
0
That's not a bug in bisect, that's a bug in the caller's code, and it isn't bisect's responsibility to fix it.
Although it could be documented better, that's the current situation with NANs and median(). Data with NANs don't have a total ordering, and total ordering is the unstated assumption behind the idea of a median or middle value. So all bets are off.
How would you answer those who say that the right behaviour is not to propagate unwanted NANs, but to fail fast and raise an exception?
Both seem defensible a priori, but every other mathematical operation in Python propagates NaNs instead of raising an exception. Is there something unusual about median that would justify giving it unusual behavior?
Well, not everything...
py> NAN/0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: float division by zero
There may be others. But I'm not sure that "everything else does it" is a strong justification. It is *a* justification, since consistency is good, but consistency does not necessarily outweigh other concerns.
One possible argument for making PASS the default, even if that means implementation-dependent behaviour with NANs, is that in the absence of a clear preference for FAIL or RETURN, at least PASS is backwards compatible.
You might shoot yourself in the foot, but at least you know it's the same foot you shot yourself in using the previous version *wink*
-- Steve

On Sun, 6 Jan 2019 19:40:32 -0800 Stephan Hoyer <shoyer@gmail.com> wrote:
On Sun, Jan 6, 2019 at 4:27 PM Steven D'Aprano <steve@pearwood.info> wrote:
I propose adding a "nan_policy" keyword-only parameter to the relevant statistics functions (mean, median, variance etc), and defining the following policies:
IGNORE: quietly ignore all NANs
FAIL: raise an exception if any NAN is seen in the data
PASS: pass NANs through unchanged (the default)
RETURN: return a NAN if any NAN is seen in the data
WARN: ignore all NANs but raise a warning if one is seen
I don't think PASS should be the default behavior, and I'm not sure it would be productive to actually implement all of these options.
For reference, NumPy and pandas (the two most popular packages for data analytics in Python) support two of these modes:
- RETURN (numpy.mean() and skipna=False for pandas)
- IGNORE (numpy.nanmean() and skipna=True for pandas)
RETURN is the default behavior for NumPy; IGNORE is the default for pandas.
I agree with Stephan that RETURN and IGNORE are the only useful modes of operation here.

Regards

Antoine.

On Mon, Jan 07, 2019 at 11:27:22AM +1100, Steven D'Aprano wrote: [...]
I propose adding a "nan_policy" keyword-only parameter to the relevant statistics functions (mean, median, variance etc), and defining the following policies:
I asked some heavy users of statistics software (not just Python users) what behaviour they would find useful, and as I feared, I got no conclusive answer. So far, the answers seem to be almost evenly split into four camps:

- don't do anything, it is the caller's responsibility to filter NANs;
- raise an immediate error;
- return a NAN;
- treat them as missing data.

(Currently it is a small sample size, so I don't expect the answers will stay evenly split if more people answer.)

On consideration of all the views expressed (thank you to everyone who commented), I'm now inclined to default to returning a NAN (which happens to be the current behaviour of mean etc, but not median except by accident) even if it impacts performance.

-- Steve
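P.S. Roughly speaking, and as a sketch only (not the implementation), "default to returning a NAN" would mean something along these lines for median:

from math import isnan
import statistics

def nan_aware_median(data):
    # return a NAN if any float NAN is present, otherwise defer to median()
    data = list(data)
    if any(isinstance(x, float) and isnan(x) for x in data):
        return float('nan')
    return statistics.median(data)

mean() and friends already tend to propagate NANs arithmetically, so the explicit check mainly matters for the sort-based functions.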

On Wed, 9 Jan 2019 at 05:20, Steven D'Aprano <steve@pearwood.info> wrote:
On Mon, Jan 07, 2019 at 11:27:22AM +1100, Steven D'Aprano wrote:
[...]
I propose adding a "nan_policy" keyword-only parameter to the relevant statistics functions (mean, median, variance etc), and defining the following policies:
I asked some heavy users of statistics software (not just Python users) what behaviour they would find useful, and as I feared, I got no conclusive answer. So far, the answers seem to be almost evenly split into four camps:
- don't do anything, it is the caller's responsibility to filter NANs;
- raise an immediate error;
- return a NAN;
- treat them as missing data.
I would prefer to raise an exception on a nan. It's much easier to debug an exception than a nan.

Take a look at the Julia docs for their statistics module:
https://docs.julialang.org/en/v1/stdlib/Statistics/index.html

In Julia they have defined an explicit "missing" value. With that you can explicitly distinguish between a calculation error and missing data. The obvious Python equivalent would be None.
On consideration of all the views expressed, thank you to everyone who commented, I'm now inclined to default to returning a NAN (which happens to be the current behaviour of mean etc, but not median except by accident) even if it impacts performance.
Whichever way you go with this it might make sense to provide helper functions for users to deal with nans, e.g.:

xbar = mean(without_nans(data))
xbar = mode(replace_nans_with_None(data))

-- Oscar
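P.S. Those helpers could be as simple as the sketches below (only float nans are handled; the names match the examples above):

from math import isnan

def _is_nan(x):
    return isinstance(x, float) and isnan(x)

def without_nans(data):
    # yield the items of data, silently dropping float nans
    return (x for x in data if not _is_nan(x))

def replace_nans_with_None(data):
    # yield the items of data with float nans replaced by None
    return (None if _is_nan(x) else x for x in data)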
participants (8)
- Antoine Pitrou
- David Mertz
- Nathaniel Smith
- Neil Girdhar
- Oscar Benjamin
- Stephan Hoyer
- Steven D'Aprano
- Tim Peters