Re: [Python-ideas] NAN handling in the statistics module
On Sun, Jan 06, 2019 at 10:52:47PM -0500, David Mertz wrote:
Playing with Tim's examples, this suggests that statistics.median() is simply outright WRONG. I can think of absolutely no way to characterize these as reasonable results:
Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 09:50:42) In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5]) Out[4]: 1 In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4]) Out[5]: nan
The second is possibly correct if one thinks that the median of a list containing NAN should return NAN -- but its only correct by accident, not design. As I wrote on the bug tracker: "I agree that the current implementation-dependent behaviour when there are NANs in the data is troublesome." The only reason why I don't call it a bug is that median() makes no promises about NANs at all, any more than it makes promises about the median of a list of sets or any other values which don't define a total order. help(median) says: Return the median (middle value) of numeric data. By definition, data containing Not A Number values isn't numeric :-) I'm not opposed to documenting this better. Patches welcome :-) There are at least three correct behaviours in the face of data containing NANs: propogate a NAN result, fail fast with an exception, or treat NANs as missing data that can be ignored. Only the caller can decide which is the right policy for their data set. Aside: the IEEE-754 standard provides both signalling and quiet NANs. It is hard and unreliable to generate signalling float NANs in Python, but we can do it with Decimal: py> from statistics import median py> from decimal import Decimal py> median([1, 3, 4, Decimal("sNAN"), 2]) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib/python3.5/statistics.py", line 349, in median data = sorted(data) decimal.InvalidOperation: [<class 'decimal.InvalidOperation'>] In principle, one ought to be able to construct float signalling NANs too, but unfortunately that's platform dependent: https://mail.python.org/pipermail/python-dev/2018-November/155713.html Back to the topic on hand: I agree that median() does "the wrong thing" when NANs are involved, but there is no one "right thing" that we can do in its place. People disagree as to whether NANs should propogate, or raise, or be treated as missing data, and I see good arguments for all three. -- Steve
On Mon, Jan 7, 2019 at 1:27 AM Steven D'Aprano <steve@pearwood.info> wrote:
In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5]) Out[4]: 1 In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4]) Out[5]: nan
The second is possibly correct if one thinks that the median of a list containing NAN should return NAN -- but its only correct by accident, not design.
Exactly... in the second example, the nan just happens to wind up "in the middle" of the sorted() list. The fact that is the return value has nothing to do propagating the nan (if it did, I think it would be a reasonable answer). I contrived the examples to get these... the first answer which is the "most wrong number" is also selected for the same reason than a nan is "near the middle."
I'm not opposed to documenting this better. Patches welcome :-)
I'll provide a suggested batch on the bug. It will simply be a wholly different implementation of median and friends.
There are at least three correct behaviours in the face of data containing NANs: propogate a NAN result, fail fast with an exception, or treat NANs as missing data that can be ignored. Only the caller can decide which is the right policy for their data set.
I'm not sure that raising right away is necessary as an option. That feels like something a user could catch at the end when they get a NaN result. But those seem reasonable as three options. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On Mon, Jan 07, 2019 at 01:34:47AM -0500, David Mertz wrote:
I'm not opposed to documenting this better. Patches welcome :-)
I'll provide a suggested batch on the bug. It will simply be a wholly different implementation of median and friends.
I ask for a documentation patch and you start talking about a whole new implementation. Huh. A new implementation with precisely the same behaviour is a waste of time, so I presume you're planning to change the behaviour. How about if you start off by explaining what the new semantics are? -- Steve
Happy New Year (off topic). Based on a quick review of the python docs, the bug report, PEP 450 and this thread, I suggest 1. More carefully draw attention to the NaN feature, in the documentation for existing Python versions. 2. Consider revising statistics.py so that it raises an exception, when passed NaN data. https://www.python.org/dev/peps/pep-0450/#rationale says <quote> The proposed statistics module is motivated by the "batteries included" philosophy towards the Python standard library. Raymond Hettinger and other senior developers have requested a quality statistics library that falls somewhere in between high-end statistics libraries and ad hoc code. Statistical functions such as mean, standard deviation and others are obvious and useful batteries, familiar to any Secondary School student. </quote> The PEP makes no mention of NaN. Was it in error, in not stating that NaN data is admissable? Is NaN part of the "batteries familar to any Secondary School student?". https://docs.python.org/3/library/statistics.html says <quote> This module provides functions for calculating mathematical statistics of numeric (Real-valued) data. </quote> Some people regard NaN as not being a real-valued number. (Hint: There's a clue in the name: Not A Number.) Note that statistics.py already raises StatisticsError, when it regards the data as flawed. Finally, I suggest that we might learn from == Fix some special cases in Fractions? https://mail.python.org/pipermail/python-ideas/2018-August/053083.html == I'll put a brief summary of my message into the bug tracker for this issue. -- Jonathan
On Mon, Jan 07, 2019 at 02:01:34PM +0000, Jonathan Fine wrote:
Finally, I suggest that we might learn from == Fix some special cases in Fractions? https://mail.python.org/pipermail/python-ideas/2018-August/053083.html ==
I remember that thread from August, and I've just re-read the entire thing now, and I don't see the relevance. Can you explain why you think it is relevant to this thread? -- Steve
On Mon, Jan 7, 2019 at 6:50 AM Steven D'Aprano <steve@pearwood.info> wrote:
I'll provide a suggested batch on the bug. It will simply be a wholly different implementation of median and friends.
I ask for a documentation patch and you start talking about a whole new implementation. Huh. A new implementation with precisely the same behaviour is a waste of time, so I presume you're planning to change the behaviour. How about if you start off by explaining what the new semantics are?
I think it would be counter-productive to document the bug (as something other than a bug). Picking what is a completely arbitrary element in face of a non-total order can never be "correct" behavior, and is never worth preserving for compatibility. I think the use of statistics.median against partially ordered elements is simply rare enough that no one tripped against it, or at least no one reported it before. Notice that the code itself pretty much recognizes the bug in this comment: # FIXME: investigate ways to calculate medians without sorting? Quickselect? So it seems like the original author knew the implementation was wrong. But you're right, the new behavior needs to be decided. Propagating NaNs is reasonable. Filtering out NaN's is reasonable. Those are the default behaviors of NumPy and Pandas, respectively: np.median([1,2,3,nan]) # -> nan pd.Series([1,2,3,nan]).median() # -> 2.0 (Yes, of course there are ways in each to get the other behavior). Other non-Python tools similarly suggest one of those behaviors, but really nothing else. So yeah, what I was suggesting as a patch was an implementation that had PROPAGATE and IGNORE semantics. I don't have a real opinion about which should be the default, but the current behavior should simply not exist at all. As I think about it, warnings and exceptions are really too complex an API for this module. It's not hard to manually check for NaNs and generate those in your own code. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On Mon, Jan 07, 2019 at 10:05:19AM -0500, David Mertz wrote:
On Mon, Jan 7, 2019 at 6:50 AM Steven D'Aprano <steve@pearwood.info> wrote:
I'll provide a suggested batch on the bug. It will simply be a wholly different implementation of median and friends.
I ask for a documentation patch and you start talking about a whole new implementation. Huh. A new implementation with precisely the same behaviour is a waste of time, so I presume you're planning to change the behaviour. How about if you start off by explaining what the new semantics are?
I think it would be counter-productive to document the bug (as something other than a bug).
Its not a bug in median(), because median requires the data implement a total order. Although that isn't explicitly documented, it is common sense: if the data cannot be sorted into smallest-to-largest order, how can you decide which value is in the middle? What is explicitly documented is that median requires numeric data, and NANs aren't numbers. So the only bug here is the caller's failure to filter out NANs. If you pass it garbage data, you get garbage results. Nevertheless, it is a perfectly reasonable thing to want to use data which may or may not contain NANs, and I want to enhance the statistics module to make it easier for the caller to handle NANs in whichever way they see fit. This is a new feature, not a bug fix.
Picking what is a completely arbitrary element in face of a non-total order can never be "correct" behavior, and is never worth preserving for compatibility.
If you truly believe that, then you should also believe that both list.sort() and the bisect module are buggy, for precisely the same reason. Perhaps you ought to raise a couple of bug reports, and see if you can get Tim and Raymond to agree that sorting and bisect should do something other than what they already do in the face of data that doesn't define a total order.
I think the use of statistics.median against partially ordered elements is simply rare enough that no one tripped against it, or at least no one reported it before.
I'm sure it is rare. Nevertheless, I still want to make it easier for people to deal with this case.
Notice that the code itself pretty much recognizes the bug in this comment:
# FIXME: investigate ways to calculate medians without sorting? Quickselect?
I doubt Quickselect will be immune to the problem of NANs. It too relies on comparisons, and while I don't know for sure that it requires a total order, I'd be surprised if it doesn't. Quickselect is basically a variant of Quicksort that only partially sorts the data.
So it seems like the original author knew the implementation was wrong.
That's not why I put that comment in. Sorting is O(N log N) on average, and Quickselect can be O(N) on average. In principle, Quickselect or a similar selection algorithm could be faster than sorting. [...]
It's not hard to manually check for NaNs and generate those in your own code.
That is correct, but by that logic, we don't need to support *any* form of NAN handling at all. It is easy (if inefficent) for the caller to pre-filter their data. I want to make it easier and more convenient and avoid having to iterate over the data twice if it isn't necessary. -- Steve
On Mon, Jan 7, 2019 at 8:39 AM Steven D'Aprano <steve@pearwood.info> wrote:
Its not a bug in median(), because median requires the data implement a total order. Although that isn't explicitly documented, it is common sense: if the data cannot be sorted into smallest-to-largest order, how can you decide which value is in the middle?
What is explicitly documented is that median requires numeric data, and NANs aren't numbers. So the only bug here is the caller's failure to filter out NANs. If you pass it garbage data, you get garbage results.
Nevertheless, it is a perfectly reasonable thing to want to use data which may or may not contain NANs, and I want to enhance the statistics module to make it easier for the caller to handle NANs in whichever way they see fit. This is a new feature, not a bug fix.
So then you are arguing that making reasonable treatment of NANs the default is not breaking backwards compatibility (because previously the data was considered wrong). This sounds like a good idea to me. Presumably the NANs are inserted into the data explicitly in order to signal missing data -- this seems more plausible to me (given the typical use case for the statistics module) than that they would be the result of a computation like Inf/Inf. (While propagating NANs makes sense for the fundamental arithmetical and mathematical functions, given that we have chosen not to raise an error when encountering them, I think other stdlib libraries are not beholden to that behavior.) -- --Guido van Rossum (python.org/~guido)
On Mon, Jan 7, 2019, 11:38 AM Steven D'Aprano <steve@pearwood.info wrote:
Its not a bug in median(), because median requires the data implement a total order. Although that isn't explicitly documented, it is common sense: if the data cannot be sorted into smallest-to-largest order, how can you decide which value is in the middle?
I can see reason that median per-se requires a total order. Yes, the implementation chosen (and many reasonable and obvious implementations) make that assumption. But here is a perfectly reasonable definition of median: * A median is an element of a collection such that 1/2 of all elements of the collection are less than it. Depending on how you interpret median, this element might also not be in the original collection, but be some newly generated value that has that property. E.g. statistics.median([1,2,3,4]) == 2.5. Under a partial ordering, a median may not be unique. Even under a total ordering this is true if some subset of elements form an equivalence class. But under partial ordering, the non-uniqueness can get much weirder. What is explicitly documented is that median requires numeric data, and
NANs aren't numbers. So the only bug here is the caller's failure to filter out NANs. If you pass it garbage data, you get garbage results.
OK, then we should either raise an exception or propagate the NaN if that is the intended meaning of the function. And obviously document that such is the assumption. NaN's *are* explicitly in the floating-point domain, so it's fuzzy whether they are numeric or not, notwithstanding the name. I'm very happy to push NaN-filtering to users (as NumPy does, although it provides alternate functions for many reductions that incorporate this... the basic ones always propagate NaNs though).
Nevertheless, it is a perfectly reasonable thing to want to use data which may or may not contain NANs, and I want to enhance the statistics module to make it easier for the caller to handle NANs in whichever way they see fit. This is a new feature, not a bug fix.
I disagree about bug vs. feature. The old behavior is simply and unambiguously wrong, but was not previously noticed. Obviously, the bug does not affect most uses, which is why it was not noticed.
If you truly believe that, then you should also believe that both list.sort() and the bisect module are buggy, for precisely the same reason.
I cannot perceive any close connection between the correct behavior of statistics.mean() and that of list.sort() or bisect. I know the concrete implementation of the former uses the latter, but the answers for what is RIGHT feel completely independent to me. I doubt Quickselect will be immune to the problem of NANs. It too relies
on comparisons, and while I don't know for sure that it requires a total order, I'd be surprised if it doesn't. Quickselect is basically a variant of Quicksort that only partially sorts the data.
Yes, I was thinking of trying to tweak Quickselect to handle NaNs during the process. I.e. probably terminate and propagate the NaN early, as soon as one is encountered. That might save much of the work if a NaN is encountered early and most comparisons and moves can be avoided. Of course, I'm sure there is a worst case where almost all the work is done before a NaN check is performed in some constructed example.
On Mon, Jan 7, 2019 at 12:19 PM David Mertz <mertz@gnosis.cx> wrote:
Under a partial ordering, a median may not be unique. Even under a total ordering this is true if some subset of elements form an equivalence class. But under partial ordering, the non-uniqueness can get much weirder.
I'm sure with more thought, weirder things can be thought of. But just as a quick example, it would be easy to write classes such that: a < b < c < a In such a case (or expand for an odd number of distinct things), it would be reasonable to call ANY element of [a, b, c] a median. That's funny, but it is not imprecise. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On 2019-01-07 16:34, Steven D'Aprano wrote:
On Mon, Jan 07, 2019 at 10:05:19AM -0500, David Mertz wrote: [snip]
It's not hard to manually check for NaNs and generate those in your own code.
That is correct, but by that logic, we don't need to support *any* form of NAN handling at all. It is easy (if inefficent) for the caller to pre-filter their data. I want to make it easier and more convenient and avoid having to iterate over the data twice if it isn't necessary.
Could the functions optionally accept a callback that will be called when a NaN is first seen? If the callback returns False, NaNs are suppressed, otherwise they are retained and the function returns NaN (or whatever). The callback would give the user a chance to raise a warning or an exception, if desired.
This callback idea feels way over-engineered for this module. It would absolutely make sense in a more specialized numeric or statistical library. But `statistics` feels to me like it should be only simple and basic operations, with very few knobs attached. On Mon, Jan 7, 2019, 2:36 PM MRAB <python@mrabarnett.plus.com wrote:
On 2019-01-07 16:34, Steven D'Aprano wrote:
On Mon, Jan 07, 2019 at 10:05:19AM -0500, David Mertz wrote: [snip]
It's not hard to manually check for NaNs and generate those in your own code.
That is correct, but by that logic, we don't need to support *any* form of NAN handling at all. It is easy (if inefficent) for the caller to pre-filter their data. I want to make it easier and more convenient and avoid having to iterate over the data twice if it isn't necessary.
Could the functions optionally accept a callback that will be called when a NaN is first seen?
If the callback returns False, NaNs are suppressed, otherwise they are retained and the function returns NaN (or whatever).
The callback would give the user a chance to raise a warning or an exception, if desired. _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
On Mon, Jan 07, 2019 at 07:35:45PM +0000, MRAB wrote:
Could the functions optionally accept a callback that will be called when a NaN is first seen?
If the callback returns False, NaNs are suppressed, otherwise they are retained and the function returns NaN (or whatever).
That's an interesting API which I shall have to think about.
The callback would give the user a chance to raise a warning or an exception, if desired.
One practical annoyance of this API is that you cannot include raise from a lambda, so people desiring "fail fast" semantics can't do this: result = mean(data, callback=lambda: raise Exception) They have to pre-declare the callback using def. -- Steve
I'd like to see internal consistency across the central-tendency statistics in the presence of NaNs. What happens now: mean: the code appears to guarantee that a NaN will be returned if a NaN is in the input. median: as recently detailed, just about anything can happen, depending on how undefined behaviors in .sort() interact. mode: while NaN != NaN at the Python level, internally dicts use an identity shortcut so that, effectively, "is" takes precedence over `__eq__`. So a given NaN object will be recognized as repeated if it appears more than once, but distinct NaN objects remain distinct: So, e.g.,
from math import inf, nan import statistics statistics.mode([2, 2, nan, nan, nan]) nan
That's NOT "NaN-in, NaN-out", it's "a single NaN object is the object that appeared most often". Make those 3 distinct NaN objects (inf - inf results) instead, and the mode changes:
statistics.mode([2, 2, inf - inf, inf - inf, inf - inf]) 2
Since the current behavior of `mean()` is the only one that's sane, that should probably become the default for all of them (NaN in -> NaN out). "NaN in -> exception" and "pretend NaNs in the input don't exist" are the other possibly useful behaviors. About median speed, I wouldn't worry. Long ago I tried many variations of QuickSelect, and it required very large inputs for a Python-coded QuickSelect to run faster than a straightforward .sort()+index. It's bound to be worse now: - Current Python .sort() is significantly faster on one-type lists because it figures out the single type-specific comparison routine needed once at the start, instead of enduring N log N full-blown PyObject_RichCompareBool calls. - And the current .sort() can be very much faster than older ones on data with significant order. In the limit, .sort()+index will run faster than any QuickSelect variant on already-sorted or already-reverse-sorted data. QuickSelect variants aren't adaptive in any sense, except that a "fat pivot" version (3-way partition, into < pivot, == pivot, and > pivot regions) is very effective on data with many equal values. In Python 3.7.2, for randomly ordered random\-ish floats I find that median() is significantly faster than mean() even on lists with millions of elements, despite that the former sorts and the latter doesn't.
On Tue, Jan 8, 2019 at 11:57 PM Tim Peters <tim.peters@gmail.com> wrote:
I'd like to see internal consistency across the central-tendency statistics in the presence of NaNs. What happens now:
I think consistent NaN-poisoning would be excellent behavior. It will always make sense for median (and its variants).
statistics.mode([2, 2, nan, nan, nan]) nan
statistics.mode([2, 2, inf - inf, inf - inf, inf - inf]) 2
But in the mode case, I'm not sure we should ALWAYS treat a NaN as poisoning the result. If NaN means "missing value" then sometimes it could change things, and we shouldn't guess. But what if it cannot? >>> statistics.mode([9, 9, 9, 9, nan1, nan2, nan3]) No matter what missing value we take those nans to maybe-possibly represent, 9 is still the most common element. This is only true when the most common thing occurs at least as often as the 2nd most common thing PLUS the number of all NaNs. But in that case, 9 really is the mode. We have one example of non-poisoning NaN in basic operations: >>> nan**0 1.0 So if the NaN "cannot possibly change the answer" then its reasonable to produce a non-NaN answer IMO. Except we don't really get that with 0**nan or 0*nan already... so a NaN-poisoning mode wouldn't actually offend my sensibilities that much. :-). I guess you could argue that NaN "could be inf". In that case 0*nan being nan makes sense. But this still feels hard to slightly odd: >>> 0**inf 0.0 >>> 0**nan nan I guess it's supported by: >>> 0**-1 ZeroDivisionError: 0.0 cannot be raised to a negative power A *missing value* could be a negative one. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
[David Mertz <mertz@gnosis.cx>]
I think consistent NaN-poisoning would be excellent behavior. It will always make sense for median (and its variants).
statistics.mode([2, 2, nan, nan, nan]) nan statistics.mode([2, 2, inf - inf, inf - inf, inf - inf]) 2
But in the mode case, I'm not sure we should ALWAYS treat a NaN as poisoning the result.
I am: I thought about the following but didn't write about it because it's too strained to be of actual sane use ;-)
If NaN means "missing value" then sometimes it could change things, ?and we shouldn't guess. But what if it cannot?
>>> statistics.mode([9, 9, 9, 9, nan1, nan2, nan3])
No matter what missing value we take those nans to maybe-possibly represent, 9 is still the most common element. This is only true when the most common thing occurs at least as often as the 2nd most common thing PLUS the number of all NaNs. But in that case, 9 really is the mode.
See "too strained" above. It's equally true that, e.g., the _median_ of your list above: [9, 9, 9, 9, nan1, nan2, nan3] is also 9 regardless of what values are plugged in for the nans. That may be easier to realize at first with a simpler list, like [5, 5, nan] It sounds essentially useless to me, just theoretically possible to make a mess of implementations to cater to. "The right" (obvious, unsurprising, useful, easy to implement, easy to understand) non-exceptional behavior in the presence of NaNs is to pretend they weren't in the list to begin with. But I'd rather ;people ask for that _if_ that's what they want.
I've just read statistics.py, and found something that might be usefully considered along with the NaN question.
median([1]) 1 median([1, 1]) 1.0
To record this, and associated behaviour involving Fraction, I've added: Division by 2 in statistics.median: https://bugs.python.org/issue35698 -- Jonathan
participants (6)
-
David Mertz
-
Guido van Rossum
-
Jonathan Fine
-
MRAB
-
Steven D'Aprano
-
Tim Peters