Hello, I think it would be nice to introduce an avg function for lists as a built-in in Python 3. To get the average of a list, I need to use some library (e.g. numpy). In my opinion, if I can get the sum of a list, I should be able to get the avg in the same way. For example [python3]:

>>> l = [5, 9, 7]
>>> import numpy as np
>>> print(np.mean(l))
7.0
>>> sum(l) / len(l)
7.0
>>> avg(l)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
NameError: name 'avg' is not defined

Cordialement/Regards Kemal DIRI

Just use `from statistics import mean as avg` (see https://docs.python.org/3/library/statistics.html#statistics.mean). Please provide some justification for why you think it's desirable to make `avg` a builtin, considering that doing so is a backwards-incompatible change due to the more than likely name clash.

On Thu, Dec 26, 2019 at 10:52 AM Kemal Diri <kemal.diri@sewan.fr> wrote:
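For what it's worth, that one-line import already gives exactly the behavior the proposal asks for; a quick sketch:

```python
# The stdlib already provides the requested function; only the name differs.
from statistics import mean as avg

l = [5, 9, 7]
print(avg(l))            # same value as sum(l) / len(l)
print(sum(l) / len(l))
```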


_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/NQB6VU... Code of Conduct: http://python.org/psf/codeofconduct/

-- Sebastian Kreft

Thank you Sebastian for your contribution. Maybe I wasn't clear. My idea is being able to use an avg function without importing any library. The reasons to propose this change are basically:

* If I can do sum(list) and len(list), it would be better to also be able to do avg(list) (since I already know the sum and len of my list);
* No need to import a library for this basic operation (even if it costs nothing), so I don't consume a line.

Cordialement/Regards Kemal DIRI


So why only mean and not median, which is better for statistics? :D

Seriously, if you want it as a builtin, add it to PYTHONSTARTUP: https://docs.python.org/3/using/cmdline.html#envvar-PYTHONSTARTUP

    from statistics import mean

This came up in discussion here before, maybe a year ago, I think. There was a decision not to change the implementation, but that seemed like a mistake (and the discussion was about broader things).

Anyway, I propose that the obviously broken version of `statistics.median()` be replaced with a better implementation.

Python 3.8.0 (default, Nov 6 2019, 21:49:08)
>>> import numpy as np
>>> import pandas as pd
>>> import statistics
>>> nan = float('nan')
>>> items1 = [nan, 1, 2, 3, 4]
>>> items2 = [1, 2, 3, 4, nan]
>>> statistics.median(items1)
2
>>> statistics.median(items2)
3
>>> np.median(items1)
nan
>>> np.median(items2)
nan
>>> pd.Series(items1).median()
2.5
>>> pd.Series(items2).median()
2.5
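The order sensitivity above follows directly from how NaN compares: every comparison against NaN is False, so sorted() cannot place it correctly and median() picks a different "middle" element depending on where the NaN sits. A small sketch:

```python
import statistics

nan = float('nan')

# NaN compares False against everything, including itself, which breaks
# the total order that sorted() assumes.
print(nan < 1, 1 < nan, nan == nan)          # False False False

# As a result, median() gives a different answer for the same multiset,
# depending only on the position of the NaN in the input.
print(statistics.median([nan, 1, 2, 3, 4]))  # 2
print(statistics.median([1, 2, 3, 4, nan]))  # 3
```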

The NumPy and Pandas answers are both "reasonable" under slightly different philosophies of how to handle bad values. I think raising an exception for NaNs would also be reasonable enough. The one thing that is NOT reasonable is returning different answers for median depending on the order of the elements.

On Thu, Dec 26, 2019 at 10:10 AM Marco Sulla via Python-ideas <python-ideas@python.org> wrote:


-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

Oh yeah, I should have thrown in these cases:

>>> statistics.median([1, 2, nan, 3, 4])
nan
>>> statistics.median([2, 3, 4, nan, 1])
4

I admit that this issue *never* hits me in particular since I use NumPy and Pandas widely, and every time I do statistics it is from those packages (or statsmodels, etc), not the standard library. But users of the standard library shouldn't get such peculiar behavior (even if it's not that hard to understand *why* it behaves as it does). On Thu, Dec 26, 2019 at 10:31 AM David Mertz <mertz@gnosis.cx> wrote:


On 12/26/19 10:31 AM, David Mertz wrote:


The one thing that is NOT reasonable is returning different answers for median depending on the order of the elements.

Getting garbage answers for garbage input isn't THAT unreasonable. Perhaps it could be argued that detecting common garbage input and rejecting it (perhaps with an exception) would make more sense.

Note that the statistics module documentation hints at the issue: median requires the sequence to be orderable, and nan isn't orderable. Since the statistics module seems to be designed to handle types other than floats, detecting nan values is extra expensive, so I think it can be excused for not checking.

-- Richard Damon

Well, *I* know the implementation. And I know about NaN being neither less than nor greater than anything else (even itself). And I know the basic working of Timsort.

But a lot of other folks, especially beginners or casual users, don't know all that. They do know that fractional numbers are a thing one is likely to want a median of (whether or not they know IEEE-754 intimately). And they may or may not know that not-a-number is a float, but it's not that hard to arrive at one by a computation.

Even if the documentation vaguely hints at the behavior, it's a source of likely surprise. The fact that the median might be the very largest of a bunch of numbers (with at least one NaN in the collection) is surely not desired behavior, even if explainable. Or e.g. two sets that compare as equal can have different medians according to the statistics module. I can construct that example if needed.

On Thu, Dec 26, 2019, 12:27 PM Richard Damon

Note that the statistics module documentation implies the issue, as median implies that it requires the sequence to be orderable, and nan isn't orderable. Since the statistics module seems to be designed to handle types other than floats, detecting nan values is extra expensive, so I think it can be excused for not checking.

Well, some days ago I didn't know about the `statistics` module, so I wrote my own median implementation, which I improved with the help of a private discussion:

```
import math


def median(it, member=False, sort_fn=sorted, **kwargs):
    try:
        len_it = len(it)
    except TypeError:
        # Generator, iterator et similia
        it = tuple(it)
        len_it = len(it)

    if len_it == 0:
        raise ValueError("Iterable is empty")

    if sort_fn is None:
        # Don't sort. Coder must be careful to pass an already sorted iterable
        sorted_it = it
    else:
        sorted_it = sort_fn(it, **kwargs)

    index = len_it // 2

    if isEven(len_it):
        res1 = sorted_it[index]
        res2 = sorted_it[index - 1]

        # Guard with isinstance so non-float members don't crash isnan()
        if isinstance(res1, float) and math.isnan(res1):
            return res2

        if isinstance(res2, float) and math.isnan(res2):
            return res1

        if member:
            # To remove bias
            if isEven(index):
                return min(res1, res2)
            else:
                return max(res1, res2)

        res = (res1 + res2) / 2
    else:
        res = sorted_it[index]

    return res


def isEven(num):
    return num % 2 == 0
```

As you can see, with `sort_fn` you can pass another sort function, maybe the pandas one (even if I do not recommend it, pandas is slow). Or you can pass None and sort the iterable beforehand. Maybe you have already sorted the iterable, so there's no reason to sort it again.

Furthermore, if the iterable has even length and the elements are not numbers, you can calculate the median in a predictable way by choosing member=True. It will return one of the two central arguments, in an unbiased way. So you don't need median_high() or median_low() in these cases.

Finally, if the iterable has even length and one of the two central values is NaN, the other value is returned. The function returns NaN only if both are NaNs.

On 12/26/19 1:38 PM, Marco Sulla via Python-ideas wrote:


Note, this function still has issues with NaN values, unless you use a sort function different from sorted: sorted doesn't check that items are strictly comparable, it just assumes that < provides a total order, which NaN values break. sorted can basically produce an unsorted result if there are items like NaN that break that assumption.

-- Richard Damon

Richard Damon wrote:

Note, this function still has issues with NaN values, unless you use a sort function different from sorted

Well, yes. Indeed I wrote that you can change the `sort_fn` parameter. Anyway, IMHO the best part of this function is that you no longer need the biased median_high() and median_low() functions of `statistics`.

As I was saying, the issue is that statistics.median can deal with many types, and to have it special-case nan would be awkward. The user could also have done something like use None values (but this does give an error). Perhaps the place the test could be done is the built-in function sorted, which median uses, as all the arguments about confusing casual users with median also apply to the use of sorted, or most other operations that use sorted internally. sorted also naturally needs to examine all the elements, while median doesn't; it just needs to get the sorted list and look at the middle value(s).

The biggest part of the error comes from the fact that IEEE floating point has a strong requirement on how to handle the NaN value, and that isn't so intuitive to the casual user. The answers basically are the following:

1) Not be IEEE compliant, which would bring complaints from everyone who wants to do serious work with math.
2) Not consider floats to be sortable, which seems to be a bit of throwing the baby out with the bath water.
3) Add tests to places like sorted (slowing things down some) to catch this case, and deal with it (but you then need to decide HOW you want to deal with it).
4) Ignore the problem, and let NaNs generate some weird results.

Python seems to have generally chosen 4, and some more advanced packages have chosen 3. The question is whether adding the cost of having sorted test all the values for not being properly comparable, affecting every sort done, provides enough benefit to be worth it.

Note that NaN values are somewhat rare in most programs; I think they can only come about by explicitly requesting them (like float("nan")) or perhaps with some of the more advanced math packages, and users of those should probably understand NaN values (Python throws errors for most simple math that might otherwise generate a NaN, like 0/0 or sqrt(-1)). That means that if people don't know what a NaN is, they probably won't be dealing with them.
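Option 3 can be sketched without touching sorted() itself, for example with a key function that rejects NaNs up front (checked_sorted is a hypothetical helper, not anything in the stdlib):

```python
import math


def checked_sorted(iterable):
    """Like sorted(), but raises ValueError on float NaN values, which would
    otherwise silently break the total-order assumption sorting relies on."""
    def check(x):
        # sorted() computes the key for every element before comparing,
        # so a NaN anywhere in the input is caught up front.
        if isinstance(x, float) and math.isnan(x):
            raise ValueError("cannot meaningfully sort data containing NaN")
        return x
    return sorted(iterable, key=check)


print(checked_sorted([3, 1, 2]))   # [1, 2, 3]
try:
    checked_sorted([1.0, float('nan'), 2.0])
except ValueError as e:
    print(e)
```

The per-element key call is exactly the efficiency cost Richard describes: every sort pays for the check, whether or not NaNs are present.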

-- Richard Damon

On Dec 26, 2019, at 10:58, Richard Damon <Richard@damon-family.org> wrote:

Note, that NaN values are somewhat rare in most programs, I think they can only come about by explicitly requesting them (like float("nan") ) or perhaps with some of the more advanced math packages

You can get them easily just from math itself. Or, once you can get infinite values, you can easily get nan values with just basic arithmetic:

>>> 1e1000 - 1e1000
nan
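The same chain can be checked programmatically; a small sketch:

```python
import math

# A literal too large for an IEEE double silently overflows to inf...
x = 1e1000
print(x)                  # inf

# ...and inf - inf is nan, per IEEE-754, so a newbie can reach nan
# without ever typing float("nan").
print(math.isnan(x - x))  # True
```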

On 12/26/19 2:10 PM, Andrew Barnert via Python-ideas wrote:


I guess I didn't try hard enough to get a NaN. But once the newbie has hit infinities, NO answer is right. The number could have been 1e1000 - 1e999 (and thus should be big), or 1e999 - 1e1000 (and thus should be very negative), or 1e1000 - 1e1000 (and thus should be zero), which is why we get a NaN here.

If you are really worried about a median with values like this confusing someone, then we should handle the issue MUCH earlier, maybe even trapping the overflow with an error message unless taken out of 'newbie' mode.

-- Richard Damon

On Dec 26, 2019, at 12:36, Richard Damon <Richard@damon-family.org> wrote:

On 12/26/19 2:10 PM, Andrew Barnert via Python-ideas wrote:


I guess I didn't try hard enough to get a NaN. But once the newbie has hit infinities, NO answer is right.

I don’t think that’s true. Surely the median of (-inf, 1, 2, 3, inf, inf, inf) is well defined and can only be 3? The only case where it’s a problem is when all the values are infinite and exactly half of them are positive, in which case the median has to be halfway between -inf and inf. But even then, the only reasonable answers are nan or an exception.

The number could have been 1e1000 - 1e999 (and thus should be big) or 1e999 - 1e1000 (and thus should be very negative) or 1e1000 - 1e1000 (and thus should be zero), which is why we get a NaN here.

Well, here both numbers are clearly 1e1000, and the right answer is 0. The problem is that (in systems where float is IEEE double) that number can’t be represented as a float in the first place, so Python approximates it with inf, so you (inaccurately, but predictably and understandably) get nan instead of 0. It’s like a very extreme case of “float rounding error”. If you have actual infinite values instead, then nan or an exception is the only appropriate answer in the first place, because subtraction is undefined. (Assuming you’re taking the floats as an approximate model of the affinely extended reals. If you’re taking them as a model of themselves, then it is well defined, as nan.)

If you are really worried about a median with values like this confusing someone, then we should handle the issue MUCH earlier, maybe even trapping the overflow with an error message unless taken out of 'newbie' mode.

This amounts to an argument that in ‘newbie’ mode there should be no inf or nan values in float in the first place, and anything that returns one should instead raise an OverflowError or MathDomainError. Which is actually what many functions actually do, but I don’t think anyone has tried to divide existing functions into ‘newbie’ mode and ‘float programmer mode’ functions, so trying to do the same with new higher-level functions on top of them is probably a mug’s game. (You can use Decimal with an appropriate context to get that kind of behavior, but I don’t think any newbie would know how to even begin doing that…) Plus, as mentioned at the top, taking a median with some infinite values usually makes perfectly good sense that a newbie can understand. It’s not the same as taking a median with some nan values, which behaves in a way that only makes sense if you think through how sorting works.
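The Decimal behavior alluded to above can be sketched like this: the default decimal context *traps* overflow and invalid operations, so arithmetic raises instead of quietly producing inf or nan.

```python
import decimal
from decimal import Decimal

# With the default context, Overflow and InvalidOperation are trapped.
try:
    Decimal("9e999999") * 10   # exceeds the default Emax of 999999
except decimal.Overflow:
    print("Overflow raised instead of returning Infinity")

try:
    Decimal("Infinity") - Decimal("Infinity")
except decimal.InvalidOperation:
    print("InvalidOperation raised instead of returning NaN")
```

Untrapping those signals (via a custom context) restores the IEEE-style silent inf/nan behavior, which is roughly the 'newbie mode' vs 'float programmer mode' split being discussed.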

On Thu, Dec 26, 2019 at 02:23:42PM -0800, Andrew Barnert via Python-ideas wrote:

I don’t think that’s true. Surely the median of (-inf, 1, 2, 3, inf, inf, inf) is well defined and can only be 3?

It's well-defined, but probably not good statistics. I'm not sure what measurement you are making that gives you infinities; they would surely be outliers and suspect data :-)

But for what it's worth, median() is fine in the presence of infinities:

py> statistics.median([INF, 3, INF, -INF, 1, INF, 2])
3

I can't take credit for this. Thanks to IEEE-754 semantics, median() ought to do the right thing for any number or combination of positive or negative infinities and finite numbers. Including the case Andrew mentioned where your data consists of all infinities, exactly half of which are positive and half negative:

py> statistics.median([INF, -INF])
nan

which is correct IEEE-754 semantics for ∞ - ∞.

The number could have been 1e1000 - 1e999 (and thus should be big) or 1e999 - 1e1000 (and thus should be very negative) or 1e1000 - 1e1000 (and thus should be zero), which is why we get a NaN here.

Well, here both numbers are clearly 1e1000, and the right answer is 0.

median() does the right thing with Decimals.

py> statistics.median([Decimal("-1e1000"), Decimal("1e1000")])
Decimal('0E+1000')

The problem is that (in systems where float is IEEE double) that number can’t be represented as a float in the first place, so Python approximates it with inf, so you (inaccurately, but predictably and understandably) get nan instead of 0. It’s like a very extreme case of “float rounding error”.

More like float overflow. 1e1000 overflows to infinity, and the rest follows from that. [...]

This amounts to an argument that in ‘newbie’ mode there should be no inf or nan values in float in the first place, and anything that returns one should instead raise an OverflowError or MathDomainError.

I don't think the distinction here is between "newbies" and "experts"; there are plenty of experts who dislike NaNs, and I'm sure I wasn't the only newbie who fell in love with NaNs when I first learned about them way back in the late 1980s using Apple's HyperTalk.

-- Steven

On 12/26/19 5:23 PM, Andrew Barnert via Python-ideas wrote:

I don’t think that’s true. Surely the median of (-inf, 1, 2, 3, inf, inf, inf) is well defined and can only be 3? The only case where it’s a problem is when all the values are infinite and exactly half of them are positive, in which case the median has to be halfway between -inf and inf. But even then, the only reasonable answers are nan or an exception.

But you seem to assume that a program to compute the median is likely the only function of that program. The fact that the programmer has overflowed values and then did some math with the overflow values starts to lead to all types of strangeness (and you don't even need to get to infinities to get strangeness with math; for many values of x, we can have x being equal to x+1).

The number could have been 1e1000 - 1e999 (and thus should be big) or 1e999 - 1e1000 (and thus should be very negative) or 1e1000 - 1e1000 (and thus should be zero), which is why we get a NaN here.

Well, here both numbers are clearly 1e1000, and the right answer is 0. The problem is that (in systems where float is IEEE double) that number can’t be represented as a float in the first place, so Python approximates it with inf, so you (inaccurately, but predictably and understandably) get nan instead of 0. It’s like a very extreme case of “float rounding error”.

If you have actual infinite values instead, then nan or an exception is the only appropriate answer in the first place, because subtraction is undefined. (Assuming you’re taking the floats as an approximate model of the affinely extended reals. If you’re taking them as a model of themselves, then it is well defined, as nan.)

And that is my point, IF we have decided that we need to protect the newbie, then at the point we have converted 1e1000 to inf we have put him on the path of problems. Fixing just median is like taking a leaky boat and bailing ONE bucket of water out of it.


As I have been saying, fixing *median* is the wrong spot to fix it, as there are many similar traps in the system. If we really want to protect the newbie from this sort of error, and not treat it as a teachable moment, then we need to make a more fundamental change. One option is to make floating-point math safer by default, and require some special statement added to the program to enable the extra features.

One problem is that you can't totally protect from these issues: as long as you use floats, the value of numbers will not always be precise and round-off errors will accumulate, but perhaps making overflow 'noisy' by signalling would catch some of the more confusing parts (and Python already does some of this, like 0/0 being an error, not a NaN). The cost of this would be a bit of efficiency, as there would need to be some test for whether we are in simple or advanced mode, or a check for the overflow at each operation, and people who know what they are doing and WANT support for full IEEE mode would need to add something to their program (or environment, maybe).

-- Richard Damon

Maybe we can just change the function signature:

    statistics.median(it, do_wrong_ass_thing_with_nans=False)

:-)

But yes, the problem is really with sorted(). However, the implementation of statistics.median() doesn't HAVE TO use sorted(); that's just one convenient way to do it. There IS NO right answer for `sorted([nan, 1, 2, 3])`. However, there is a very plausibly right answer for `statistics.median([nan, 1, 2, 3])` ... or rather, both 'nan' and '2' are plausible (one approach is what NumPy does, the other is what Pandas does).

On 12/26/19 3:14 PM, David Mertz wrote:

Maybe we can just change the function signature:

    statistics.median(it, do_wrong_ass_thing_with_nans=False)

:-)

But yes, the problem is really with sorted(). However, the implementation of statistics.median() doesn't HAVE TO use sorted(), that's just one convenient way to do it.

Yes, median could do the sort some other way, and in fact the code for median makes a comment to investigate doing it some other way; the fact that median doesn't actually need the full list sorted says as much.

There IS NO right answer for `sorted([nan, 1, 2, 3])`. However, there is a very plausibly right answer for `statistics.median([nan, 1, 2, 3])` ... or rather, both 'nan' and '2' are plausible (one approach is what Numpy does, the other is what Pandas does).

Other possible answers would be 1.5 or 2.5 (if the sorting method ended up putting NaNs at the bottom or top of the order, based on the definition of the median as the value which half the values are greater than (or less than) it. In one sense if there isn't *A* right answer, NO answer is right. As was pointed out, the statistics module specifically doesn't claim to replace more powerful packages, like Numpy, so expecting it to handle this level of nuance is beyond its specification. -- Richard Damon
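The position-dependence being described is easy to check directly. The results below were observed with CPython's current sort-based implementation; since every comparison against a NaN is false, where the NaN happens to land decides the answer, so these exact values are an implementation accident of Timsort, not documented behavior:

```python
import statistics

nan = float("nan")

# The same four values, with the NaN in different positions:
print(sorted([nan, 1, 2, 3]))  # NaN stays put at the front
print(sorted([1, 2, 3, nan]))  # ...or at the back

# median() averages the two middle elements of the "sorted" list,
# so the answer shifts with the NaN's position:
print(statistics.median([nan, 1, 2, 3]))  # 1.5
print(statistics.median([1, 2, 3, nan]))  # 2.5
```

Neither 1.5 nor 2.5 is the median of {1, 2, 3} by any statistical definition, which is the "arbitrary value" problem in a nutshell.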

On Thu, Dec 26, 2019 at 4:12 PM Richard Damon <Richard@damon-family.org> wrote:

As was pointed out, the statistics module specifically doesn't claim to replace more powerful packages, like Numpy, so expecting it to handle this level of nuance is beyond its specification.

Not being flat-out crazy in its answer isn't extreme nuance. The reason the comment is in the code is because this discussion happened before on this same list. Returning "the largest non-NaN value" I consider flat-out crazy as a possibility (all the more so since doing that is sensitive to exactly where the NaNs occur, in a way that requires a good understanding of Timsort to predict). See my possible solution in this thread.

I’m going to strongly support David Mertz’s point here: it is a well known axiom of computing that returning an *incorrect* result is a very bad thing. What the correct result of the median of a sequence of floats that contains some NaNs should be is up for debate. As David points out, there are (at least) three reasonable answers:

- NaN
- the median that ignores the NaNs
- an Exception

But having a result that essentially treats NaN as having an arbitrary value, depending on where it is in the sequence, is not correct by any definition. And while it’s nice that median supports duck typing, floats are probably the most (certainly the second most) common type used with it — some special-case code is called for if needed.

Python is not a high-performance computational language. If you are crunching enough numbers that a NaN check is going to make a difference, you probably should be using numpy or some other higher-performance stats lib. So I think this should be addressed, and it's well worth a performance hit to do so.

Which does bring us to how to do it. I think there are two possible approaches:

1) do checks in the median function itself -- I'm going to suggest that that's the place to do it -- and indeed, it's probably a good idea to review the statistics module for other places where NaNs will cause problems.

2) in the sort function(s) -- it would be nice if sort did something "smarter" with NaNs, but if that smarter thing is something other than an Exception, median() should still check for NaNs; if they are all at the end or the beginning, you will at least get consistent results, but still not really meaningful ones -- I think NaNs should either be treated as missing values, or not allowed at all.

NOTE: a while back someone was suggesting that the sort function(s) check for simple C datatypes in the keys -- and if they are, e.g., floats or ints, use fast C code for the sorting -- he claimed it would make sorting much faster for these common cases. I don't know what came of that, but if it is implemented, then adding NaN handling there might make some sense.

-CHB
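As a sketch of approach (1), here is one way a NaN check could be layered on top of the existing function. This is a hypothetical wrapper, not the statistics module's API; the `nan_policy` name and its values are borrowed loosely from scipy's convention:

```python
import math
from statistics import median as _median

def median(data, *, nan_policy="raise"):
    """Hypothetical NaN-aware wrapper around statistics.median.

    nan_policy: 'raise'     -> refuse NaN input with a ValueError
                'omit'      -> treat NaNs as missing data (Pandas-like)
                'propagate' -> return NaN if any input is NaN (NumPy-like)
    """
    data = list(data)
    has_nan = any(isinstance(x, float) and math.isnan(x) for x in data)
    if has_nan:
        if nan_policy == "raise":
            raise ValueError("median: data contains NaN")
        if nan_policy == "propagate":
            return float("nan")
        if nan_policy != "omit":
            raise ValueError(f"unknown nan_policy: {nan_policy!r}")
        data = [x for x in data if not (isinstance(x, float) and math.isnan(x))]
    return _median(data)
```

With 'omit', `median([nan, 1, 2, 3])` becomes `median([1, 2, 3]) == 2`; the cost is one extra pass over the data, which is the performance hit being weighed in this thread.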

On 12/27/19 9:15 PM, Christopher Barker wrote:

I’m going to strongly support David Mertz’s point here:

It is a well known axiom of computing that returning an *incorrect* result is a very bad thing.

There is also an axiom that you can only expect valid results if you meet the operation's pre-conditions. Sometimes, being totally defensive in checking for 'bad' inputs costs you too much performance. The stated requirement on the statistics module is that you feed it 'numbers', and a NaN is by definition Not a Number. The median function also implies that its inputs have the property of having an order, as well as being able to be added (and if you can't add them, then you need to use median_lower or median_upper).

I will also point out that really the median function is a small part of the problem; all the x-tile based functions have the same issue, and fundamentally it is a problem with sorted(). Has anyone tried to implement a version of these that checks for inputs that don't have a total order, and seen what the performance impact is? Testing for NaNs isn't trivial, as elsewhere it was pointed out that how you check depends on the type of the number you have (Decimals being different from floats). To be really complete, you want to actually detect that you have some elements that don't form a total order with the other elements.

In many ways, half fixing the issue makes it worse, as improving reliability can lead you to forget that you need to take care, so the remaining issues catch you harder.

-- Richard Damon

On Fri, Dec 27, 2019 at 8:14 PM Richard Damon <Richard@damon-family.org> wrote:

It is a well known axiom of computing that returning an *incorrect* result is a very bad thing.

There is also an axiom that you can only expect valid results if you meet the operations pre-conditions.

sure.

Sometimes, being totally defensive in checking for 'bad' inputs costs you too much performance.

it can, yes, there are no hard rules about anything.

The stated requirement on the statistics module is you feed it 'numbers', and a NaN is by definition Not a Number.

Sure, but NaN IS a part of the Python float type, and it can and will show up once in a while. That is not the same as expecting median (or any other function in the module) to work with some arbitrary other type. Practicality beats purity -- floats WILL be widely used with the module; they should be accommodated.

Here is the text in the docs:

"""
This module provides functions for calculating mathematical statistics of numeric (Real-valued) data.
<snip>
Unless explicitly noted, these functions support int, float, Decimal and Fraction. Behaviour with other types (whether in the numeric tower or not) is currently unsupported.
"""

So this is pretty clear: the module is designed to work with int, float, Decimal and Fraction -- so I think it's worth some effort to support those well. And both float and Decimal have a NaN of some sort.

The Median function also implies that its inputs have the property of having an order, as well as being able to be added (and if you can't add them, then you need to use median_lower or median_upper)

sure, but those are expected of the types above anyway. It doesn't seem to me that this package is designed to work with ANY type that has order and supports the operations needed by the function. For instance, lists of numbers can be compared, so:

In [69]: stuff = [[1, 2, 3],
    ...:          [4, 6, 1],
    ...:          [8, 9],
    ...:          [4, 5, 6, 7, 8, 9],
    ...:          [5.0]
    ...:          ]

In [70]: statistics.median(stuff)
Out[70]: [4, 6, 1]

The fact that that worked is just a coincidence; it is not an important part of the design that it does.

I will also point out that really the median function is a small part of the problem, all the x-tile based functions have the same issue,

absolutely, and pretty much all of them, though many will just give you NaN as a result, which is much better than an arbitrary result.

and fundamentally it is a problem with sorted().

I don't think so. In fact, sorted() is explicitly designed to work with any type that is consistently ordered (I'm not sure if they HAVE to be "totally ordered"), and floats with NaNs are not that. As the statistics module is specifically designed to work with numbers (and essentially only with numbers), it's the appropriate place to put an accommodation for NaNs.

Not to mention that while sorted() could be adapted to do something more consistent with NaNs, since it is a general-purpose function it's hard to know what the behavior should be -- raise an Exception? remove them? put them all at the front or back? Which makes sense depends entirely on the application. The statistics module, on the other hand, is for statistics, so while there are still multiple options, it's a lot easier to pick one and go for it.

Has anyone tried to implement a version of these that checks for inputs that don't have a total order and seen what the performance impact is?

I think checking for total order is a red herring -- there really is a reason to specifically deal with NaNs in floats (and decimals), not ANY type that may not be totally ordered.

Testing for NaNs isn't trivial, as elsewhere it was pointed out that how you check is based on the type of the number you have (Decimals being different from floats).

yes, that is unfortunate.
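For reference, one plausible way to paper over that per-type difference. `is_nan` here is a hypothetical helper for illustration, not an existing statistics function; Decimal carries its own `is_nan()` method, and `math.isnan()` covers the float side:

```python
import math
from decimal import Decimal
from fractions import Fraction

def is_nan(x):
    """Type-aware NaN test for the numeric types statistics supports."""
    if isinstance(x, Decimal):
        # Decimal has its own test; it also catches signaling NaNs,
        # which float(x) would refuse to convert.
        return x.is_nan()
    try:
        return math.isnan(x)  # floats; ints and Fractions are never NaN
    except TypeError:
        return False
```

The Decimal branch matters: `math.isnan(Decimal("sNaN"))` raises rather than returning True, which is exactly the "Decimals being different from floats" wrinkle.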

To be really complete, you want to actually detect that you have some elements that don't form a total order with the other elements.

as above, being "complete" isn't necessary here. And even if you were complete, what would you DO with a general non-totally-ordered type? Raising an Exception would be OK, but any other option (like treating it as a missing value) wouldn't make sense if you don't know more about the type.

In many ways, half fixing the issue makes it worse, as improving reliability can lead you to forget that you need to take care, so the remaining issues catch you harder.

I can't think of an example of this -- it's a fine principle, but if we were to make the entire statistics module do something reasonable with NaNs, what exactly would that hide?

In short: practicality beats purity:

- The fact that the module is designed to work with all the standard number types doesn't mean it has to work with ANY type that supports the operations needed for a given function.

- NaNs are part of the Python float and Decimal implementations -- so they WILL show up once in a while. It would be good to handle them.

- NaNs can also be very helpful for indicating missing values -- this can actually be very handy for statistics calculations. So it could be a nice feature to add -- that NaN means "missing value".

-CHB

--
Christopher Barker, PhD

Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython

On 12/28/19 1:14 AM, Christopher Barker wrote:

On Fri, Dec 27, 2019 at 8:14 PM Richard Damon <Richard@damon-family.org> wrote:

> It is a well known axiom of computing that returning an *incorrect*
> result is a very bad thing.

There is also an axiom that you can only expect valid results if you meet the operations pre-conditions.

sure.

Sometimes, being totally defensive in checking for 'bad' inputs costs you too much performance.

it can, yes, there are no hard rules about anything.

The stated requirement on the statistics module is you feed it 'numbers', and a NaN is by definition Not a Number.

Sure, but NaN IS a part of the Python float type, and it can and will show up once in a while. That is not the same as expecting median (or any other function in the module) to work with some arbitrary other type. Practicality beats purity -- floats WILL be widely used with the module; they should be accommodated.

Here is the text in the docs: """ This module provides functions for calculating mathematical statistics of numeric (Real-valued) data. <snip> Unless explicitly noted, these functions support int, float, Decimal and Fraction. Behaviour with other types (whether in the numeric tower or not) is currently unsupported. """

So this is pretty clear - the module is designed to work with int, float, Decimal and Fraction -- so I think it's worth some effort to well-support those. And both float and Decimal have a NaN of some sort.

But the documentation that you reference says it works with NUMBERS, and NaNs are explicitly NOT A NUMBER, so the statistics module specifically hasn't made a claim that it will work with them. Also, the general section says "unless explicitly noted", and median makes an explicit reference to types that support order but not addition needing to use a different function, which implies that a data type that IS ordered and supports addition is usable with median (presumably some user-defined number class that acts close enough to other number classes that the average of a and b is a value between them).

The Median function also implies that its inputs have the property of having an order, as well as being able to be added (and if you can't add them, then you need to use median_lower or median_upper)

sure, but those are expected of the types above anyway. It doesn't seem to me that this package is designed to work with ANY type that has order and supports the operations needed by the function. For instance, lists of numbers can be compared, so:

In [69]: stuff = [[1, 2, 3],
    ...:          [4, 6, 1],
    ...:          [8, 9],
    ...:          [4, 5, 6, 7, 8, 9],
    ...:          [5.0]
    ...:          ]

In [70]: statistics.median(stuff)
Out[70]: [4, 6, 1]

The fact that that worked is just a coincidence, it is not an important part of the design that it does.

I will also point out that really the median function is a small part of the problem, all the x-tile based functions have the same issue,

absolutely, and pretty much all of them, though many will just give you NaN as a result, which is much better than an arbitrary result.

By x-tile based functions I mean things like median (the 50th percentile) and the upper and lower quartiles (the 25th and 75th percentiles); the one provided in the module is quantiles(), which can actually compute any even grouping.

and fundamentally it is a problem with sorted().

I don't think so. In fact, sorted() is explicitly designed to work with any type that is consistently ordered (I'm not sure if they HAVE to be "totally ordered"), and floats with NaNs are not that.

Most sort routines require that the data at least define a partial order, and effectively define a total order by considering that if a < b is false and b < a is false then a and b form an equivalency class whose internal order we don't care about. Sets, for instance, with < being subset-of, have a consistent ordering but don't form a consistent equivalency class and thus don't sort properly.

Actually, since lists don't support addition in the manner requested, median isn't appropriate for the earlier list example, but perhaps it has meaning with median_lower(). For example, if the lists represented hierarchical references in a document (Chapter 1, Section 2, Paragraph 3) then the median might have a meaning.
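The set example is easy to verify: for two incomparable sets, none of the three trichotomy conditions holds, which is exactly the situation a comparison sort cannot handle:

```python
a, b = {1}, {2}

# Neither is a subset of the other, and they aren't equal,
# so all three comparisons are False: no total order.
print(a < b, a == b, b < a)  # False False False
```

NaN behaves the same way against any float, which is why it slips through sorted() silently.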

As the statistics module is specifically designed to work with numbers (and essentially only with numbers) it's the appropriate place to put an accommodation for NaNs. Not to mention that while sorted() could be adapted to do something more consistent with NaNs, since it is a general purpose function, it's hard to know what the behavior should be -- raise an Exception? remove them? put them all at the front or back? Which makes sense depends entirely on the application. The statistics module, on the other hand, is for statistics, so while there are still multiple options, it's a lot easier to pick one and go for it.

Has anyone tried to implement a version of these that checks for inputs that don't have a total order and seen what the performance impact is?

I think checking for total order is a red herring -- there really is a reason to specifically deal with NaNs in floats (and decimals), not ANY type that may not be total ordered.

Part of the problem is that fixing one issue doesn't come close to fixing the whole problem. The stated problem is that newbie/casual programmers get confused that some things don't work when you get NaNs in the mix. Why is it ok to say that [3, 1, 4, nan, 2] is a 'sorted' array, but that 4 can't be the median of that array?

I am not saying that we can't fix the problem (though I think I have made a reasonable argument that it isn't a problem that MUST be fixed), but rather that a change to just median is the wrong spot to make this fix. The real problem is that naive programmers, when they do something wrong and get NaNs into their data, get confused by some of the strange answers that they can get. The fact that NaNs don't order is one of the points of confusion, and rather than try to fix, one by one, the various operations that are defined on sorted data, why not go to the core issue and deal with the base sorting operations?

Perhaps even better would be a math mode (perhaps unfortunately needing to be the default, since we are trying to help beginners) that isn't fully IEEE compliant but throws exceptions on the errors that get us into the territory that causes the confusion. This might not actually impact that many programs in real life, as how many programs actually need to be generating NaNs as a result of calculations?

Testing for NaNs isn't trivial, as elsewhere it was pointed out that how you check is based on the type of the number you have (Decimals being different from floats).

yes, that is unfortunate.

To be really complete, you want to actually detect that you have some elements that don't form a total order with the other elements.

as above, being "complete" isn't necessary here. And even if you were complete, what would you DO with a general non-totally-ordered type? Raising an Exception would be OK, but any other option (like treating it as a missing value) wouldn't make sense if you don't know more about the type.

The only answer that I see that makes sense is raising an Exception (likely in sorted). You also probably can't be totally thorough in checking, as completely checking transitivity would be an N**3 operation, which is way too slow. Likely you would settle for testing that each pair you check has exactly one of a < b, a == b, b < a being true.
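A sketch of that cheaper per-pair test (hypothetical code, checking only adjacent elements after a sort rather than all triples):

```python
def checked_sort(xs):
    """Sort, then verify each adjacent pair satisfies trichotomy:
    exactly one of a < b, a == b, b < a must hold."""
    s = sorted(xs)
    for a, b in zip(s, s[1:]):
        if (a < b) + (a == b) + (b < a) != 1:
            raise ValueError(f"no total order between {a!r} and {b!r}")
    return s
```

With a NaN in the data this raises, since all three comparisons against a NaN are false; the extra pass costs O(n) comparisons on top of the sort, which is the performance question being asked above.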

In many ways, half fixing the issue makes it worse, as improving reliability can lead you to forget that you need to take care, so the remaining issues catch you harder.

I can't think of an example of this -- it's a fine principle, but if we were to make the entire statistics module do something reasonable with NaNs -- what exactly other issues would that hide?

In short: practicality beats purity:

- The fact that the module is designed to work with all the standard number types doesn't mean it has to work with ANY type that supports the operations needed for a given function.

- NaNs are part of the Python float and Decimal implementations -- so they WILL show up once in a while. It would be good to handle them.

- NaNs can also be very helpful to indicate missing values -- this can actually be very handy for statistics calculations. So it could be a nice feature to add -- that NaN means "missing value"

ASSUMING that NaNs represent missing data is just ONE possible interpretation. Making it THE interpretation in a low-level package seems wrong. It is also an interpretation that is easy to create with a simple helper function that takes one sequence and returns another with all the NaNs removed. Other interpretations of what a NaN should mean aren't as easy to implement outside the operation, so if you are going to pick an interpretation, it should probably be one that is hard to handle externally.

Also, Python, because it is dynamically typed, would in my mind be better served by using something like None to indicate missing data in the array. NaN was chosen in some languages as the missing value because they couldn't handle mixed-type data arrays.
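The helper described is essentially a one-liner; something like this (`skipnan` is a made-up name for illustration) makes the "NaN means missing" policy an explicit, local choice rather than a module-wide one:

```python
import math
from statistics import median

def skipnan(iterable):
    """Return a list with float NaNs removed (the 'missing data' reading)."""
    return [x for x in iterable if not (isinstance(x, float) and math.isnan(x))]

print(median(skipnan([float("nan"), 1, 2, 3])))  # 2
```

A None-based variant would be just as short, filtering on `x is not None` instead.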

-CHB

-- Christopher Barker, PhD

-- Richard Damon

This is sophistry. NaN is an instance of the abstract type numbers.Number and the concrete type float. IEEE-754 defines NaN as a collection of required values in any floating point type.

I know the acronym suggests otherwise in a too-cute way, but NaN is archetypically a number in a computer science sense (though not in a pure math way, of course).

Likewise, YAML (YAML Ain't Markup Language) is a markup language. And GNU (GNU's Not Unix) is a Unix system.

On 12/28/19 8:40 PM, David Mertz wrote:

This is sophistry. NaN is an instance of the abstract type numbers.Number and the concrete type float. IEEE-754 defines NaN as a collection of required values in any floating point type.

I know the acronym suggests otherwise in a too-cute way, but NaN is archetypically a number in a computer science sense (but not in a pure math way, of course).

Likewise, YAML (YAML Ain't Markup Language) is a markup language. And GNU (GNU's Not Unix) is a Unix system.

NaN may be an instance of the abstract type Number, but it isn't a mathematical number. It exists as that because it is a value of the type float (and was then adopted in some others). It is a floating point value because, in the field of the real numbers and the approximations made in the floating point types, not all operations are closed under the full domain of values; and because computer hardware needs to produce SOME answer, the standard defined a value (or actually a set of values) to represent these non-number answers. Yes, NaN really is a value that represents Not a Number. Note that the statistics module doesn't say it works with Numbers, but with numeric data.

Also, YAML is NOT a markup language, but a data serialization format. It in fact started as Yet Another Markup Language, but when they realized that what they were designing wasn't really a markup language, but a data-oriented (not document formatting) language, they changed the name to be more descriptive.

Also, GNU isn't Unix, as Unix is a trademarked name for a specific set of operating systems; GNU produced a mostly work-alike operating system. It's sort of like saying Scotties isn't Kleenex (both brands of facial tissues).

-- Richard Damon

An idea I haven't seen anyone suggest: Since testing for nan can be expensive, maybe it would make sense to provide a statistics.median_unsafe(), or perhaps median_fast(), method with the current implementation (for situations where a slow down isn't acceptable), and update the median() function to include the most reliable/comprehensive testing for nan? If this were desirable it might also make sense to provide other, similar misbehaving *_unsafe or *_fast methods as well. On Sat, Dec 28, 2019, 9:35 PM Richard Damon <Richard@damon-family.org> wrote:

On 12/28/19 8:40 PM, David Mertz wrote:

This is sophistry. NaN is an instance of the abstract type numbers.Number and the concrete type float. IEEE-754 defines NaN as collection of required values in any floating point type.

I know the acronym suggests otherwise in a too-cute way, but NaN is archetypically a number in a computer science sense (but not in a pure math way, of course).

Likewise, YAML (YAML Ain't Markup Language) is a markup language. And GNU (GNU's Not Unix) is a Unix system.

NaN may be an instance of the abstract type Number, but it isn't a mathematical number. It exists as that because it is a value of the type float (and then adopted in some others). It is a floating point value because in the Field of the Real Numbers, and the approximation made in the floating point types, not all operations are closed under the full domain of values, and because a computer hardware system needs to produce SOME answer, the standard defined a value (or actually a set of values) to represent these Non-Number answers. Yes, NaN really is a value that represents Not A Number.

Note, that the statistics module doesn't say it works with Numbers, but with numeric data.

Also, YAML is NOT a markup language, but a data serialization format. It in fact started as Yet Another Markup Language, but when they realized that what they were designing wasn't really a markup language, but a data-based (not document-formatting) language, they changed it to be more descriptive.

Also GNU isn't Unix, as Unix is a trademarked name for a specific set of operating systems. GNU produced Linux, which is a mostly work-alike operating system. It's sort of like saying Scotties isn't a Kleenex (both brands of facial tissues).

--

Richard Damon

On Sat, Dec 28, 2019, 9:36 PM Richard Damon <Richard@damon-family.org> wrote:

NaN may be an instance of the abstract type Number, but it isn't a mathematical number.

Yes, floating point numbers are not pure-math Reals. Not even Rationals. They are a CS construct, very useful for computer programs, that very approximately resembles the Rationals. One of the necessary details in that approximation is that NaN is a special "computer number." You get it as a result of some operations where other answers aren't possible in the system. Inf is the same way. It's not a Rational. It's also not omega-zero in set theory. It's just a thing the system does to be "pretty good" more of the time. They are all "numbers" ... or, if you insist: no floating point number is a "number" (not a closed field, operations not associative, blah blah).

> Also, YAML is NOT a markup language, but a data serialization format. It in fact started as Yet Another Markup Language, but when they realized that what they were designing wasn't really a markup language...

I communicated with Clark Evans in 2001 and wrote about the format around then. Both funny acronyms are equally old. YAML is a way of *marking up* structured data. I create YAML much more often in my text editor by hand than by serializing in-memory objects. You are free to have a personal definition of the loose concept "markup language" that excludes YAML.

> Also GNU isn't Unix, as Unix is a trademarked name for a specific set of operating systems. GNU produced Linux, which is a mostly work-alike operating system. It's sort of like saying Scotties isn't a Kleenex (both brands of facial tissues).

Yes, it's like arguing at great length that Scotties isn't Kleenex. Or that I don't make a Xerox copy on my Ricoh brand printer. Pedantically you are correct. FWIW, I am unusual in almost never using genericized brand names, which is part of why I sound funny to everyone else :-).

On 12/28/19 10:05 PM, David Mertz wrote:

On Sat, Dec 28, 2019, 9:36 PM Richard Damon <Richard@damon-family.org <mailto:Richard@damon-family.org>> wrote:

> NaN may be an instance of the abstract type Number, but it isn't a mathematical number.

Yes, floating point numbers are not pure-math Reals. Not even Rationals. They are a CS construct that is very useful for computer programs that very approximately resemble Rationals.

One of the necessary details in that approximation is that NaN is a special "computer number." You get it as a result of some operations where other answers aren't possible in the system. Inf is the same way. It's not a Rational. It's also not omega-zero in set theory. It's just a thing the system does to be "pretty good" more of the time. They are all "numbers."

... or if you insist: no floating point number is a "number." (not a closed field, operations not associative, blah blah)

Every value of the type float, except NaN and perhaps +inf and -inf (depending on which version of the Real Number Line you use), IS actually a representation of a Real Number (so I don't understand in what way you can say they aren't "numbers"). And yes, they happen to also be Rationals.

Yes, the type float doesn't create a precise field in itself, but it is a close model of a reasonable subset of the actual Real Numbers. If you use that approximation, there is a lot you can say about using the float type (in fact the whole statistics module is based on this sort of correspondence). This approximation only works if you realize and define that the float value NaN isn't a "Real Number" (and +inf and -inf are just "Extended Real Numbers"). Break that definition and you lose all the mathematical support for understanding floating point numbers (or you have to keep saying "float numbers except NaN").

I suppose this is part of the root of the trickiness of IEEE floating point: this numeric type has a value that isn't a number (and the rules of IEEE handle it that way).

--

Richard Damon

On Sat, Dec 28, 2019 at 10:31 PM Richard Damon <Richard@damon-family.org> wrote:

Every value of the type float, except NaN and perhaps +inf and -inf (depending on which version of the Real Number Line you use) IS actually a representation of a Real Number (so I don't understand in what way you can say they aren't "numbers"). And yes, they happen to also be Rationals.

This is not subtle, and it is not something you do not understand. C'mon.

Tell me three Real Numbers (or Rational Numbers) such that:

(a + b) + c != a + (b + c)

You will not succeed in that endeavor because of the nature of "numbers" ... and yes, because of the usual meaning given to the symbol '+', etc.

OK... now I'll tell you three floating point "numbers" that follow that pattern:

a = float("0.1")
b = float("0.2")
c = float("0.3")

We can conveniently write the constructors for these in Python as `0.1` and so on. These objects that occupy computer memory are not "numbers" in the sense of obeying the axioms of arithmetic. But they kinda-sorta resemble numbers, and computers can do things with them quickly, so we pretend.
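David's three floats can be checked directly in any Python session; this is standard binary64 behavior, not anything specific to his setup:

```python
# Associativity fails for these three floats, so float '+' does not
# satisfy the field axioms that real-number addition does.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # 0.1 + 0.2 rounds to 0.30000000000000004 first
right = a + (b + c)   # 0.2 + 0.3 rounds to exactly 0.5 first

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

The grouping changes which intermediate result gets rounded, and the two rounding errors do not cancel.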

On Sun, Dec 29, 2019 at 2:44 PM David Mertz <mertz@gnosis.cx> wrote:

On Sat, Dec 28, 2019 at 10:31 PM Richard Damon <Richard@damon-family.org> wrote:

Every value of the type float, except NaN and perhaps +inf and -inf (depending on which version of the Real Number Line you use) IS actually a representation of a Real Number (so I don't understand in what way you can say they aren't "numbers"). And yes, they happen to also be Rationals.

This is not subtle, and it is not something you do not understand. C'mon.

Tell me three Real Numbers (or Rational Numbers) such that:

(a + b) + c != a + (b + c)

You will not succeed in that endeavor because of the nature of "numbers."

... and yes, because of the usual meaning given to the symbol '+', etc.

It's not about numbers, it's about the nature of addition in the real numbers. Since floats are a subset of real numbers, it is entirely possible for the result of any operation to be outside of the subset representable as floats.

OK... now I'll tell you three floating point "numbers" that follow that pattern:

a = float("0.1")
b = float("0.2")
c = float("0.3")

We can conveniently write the constructors for these in Python as `0.1` and so on. These objects that occupy computer memory are not "numbers" in the sense of obeying the axioms of arithmetic. But they kinda-sorta resemble numbers, and computers can do things with them quickly, so we pretend.

They really truly ARE numbers. They are, in fact, these numbers:

a = 3602879701896397 / 36028797018963968
b = 3602879701896397 / 18014398509481984
c = 5404319552844595 / 18014398509481984

When you perform addition on these, the result must repeatedly be rounded to the available resolution. This is not about numbers, it is about limited subsets of numbers.

ChrisA
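The exact rational values Chris lists can be recovered with `float.as_integer_ratio()`, and `fractions.Fraction` shows that arithmetic on those exact values really is associative; only the per-operation rounding of float '+' breaks it:

```python
from fractions import Fraction

# Every finite float is exactly a dyadic rational; as_integer_ratio()
# recovers the exact numerator and denominator behind each literal.
print((0.1).as_integer_ratio())  # (3602879701896397, 36028797018963968)
print((0.2).as_integer_ratio())  # (3602879701896397, 18014398509481984)
print((0.3).as_integer_ratio())  # (5404319552844595, 18014398509481984)

# Exact Fraction arithmetic on those same values IS associative; the
# float results differ only because each float '+' rounds to 53 bits.
a, b, c = Fraction(0.1), Fraction(0.2), Fraction(0.3)
print((a + b) + c == a + (b + c))  # True
```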

On Sat, Dec 28, 2019, 11:02 PM Chris Angelico <rosuav@gmail.com> wrote:

They really truly ARE numbers. They are, in fact, these numbers:

a = 3602879701896397 / 36028797018963968
b = 3602879701896397 / 18014398509481984
c = 5404319552844595 / 18014398509481984

When you perform addition on these, the result must repeatedly be rounded to the available resolution. This is not about numbers, it is about limited subsets of numbers.

They really truly are NOT numbers. Numbers are not substantial things in themselves. They are defined by forming a field under two operators, '+' and '*'.

Often numbers are constructed, in a somewhat arbitrary way, from sets. A successor operation is defined in terms of subset, then we prove that that construction obeys the Peano axioms. Rationals can be constructed by defining a ratio or division operation, and that can be shown to create a field. Reals can be defined as Cauchy sequences of infinitely many Rationals, and they also define a field of weirdly convoluted sets.

You can construct these numbers in other ways. For example, you can create a FINITE field on bit-patterns of 32-bits, or 64-bits, or 42-bits, or whatever. However, the bit patterns used for IEEE-754 "numbers" are not ones that form a field. They are NOT NUMBERS under the meaning given to the operations '+' and '*' by computer chips and programming languages.

Now yes, you are free to say that NOTIONALLY you want to think of the bit pattern created by float("0.1") as standing for the Rational number 3602879701896397 / 36028797018963968. And that loose way of thinking is useful in the following way:

When you do the computer operation '+' on two bit patterns, you get a new bit pattern whose notional "number" is within a reasonably small delta of the number you'd get if you did an operation on Rational numbers. In fact, the maximum functional form of this delta is provided in the IEEE-754 spec.

But the bit patterns making up floating point "numbers" are not NUMBERS, and the operations performed on them do not make a FINITE field, let alone an isomorphism with the infinite field of Rational numbers.

On 12/28/19 11:44 PM, David Mertz wrote:

On Sat, Dec 28, 2019, 11:02 PM Chris Angelico <rosuav@gmail.com <mailto:rosuav@gmail.com>> wrote:

They really truly ARE numbers. They are, in fact, these numbers:

a = 3602879701896397 / 36028797018963968 b = 3602879701896397 / 18014398509481984 c = 5404319552844595 / 18014398509481984

When you perform addition on these, the result must repeatedly be rounded to the available resolution. This is not about numbers, it is about limited subsets of numbers.

They really truly are NOT numbers. Numbers are not substantial things in themselves. They are defined by forming a field under two operators, '+' and '*'.

Often numbers are constructed, in a somewhat arbitrary way, from sets. A successor operation is defined in terms of subset, then we prove that that construction obeys the Peano axioms. Rationals can be constructed by defining a ratio or division operation, and that can be shown to create a field. Reals can be defined as Cauchy sequences of infinitely many Rationals, and they also define a field of weirdly convoluted sets.

You can construct these numbers in other ways. For example, you can create a FINITE field on bit-patterns of 32-bits, or 64-bits, or 42-bits, or whatever. However, the bit patterns used for IEEE-754 "numbers" are not ones that form a field. They are NOT NUMBERS under the meaning given to the operations '+' and '*' by computer chips and programming languages.

Now yes, you are free to say that NOTIONALLY you want to think of the bit pattern created by float("0.1") as standing for the Rational number 3602879701896397 / 36028797018963968. And that loose way of thinking is useful in the following way:

When you do the computer operation '+' on two bit patterns, you get a new bit pattern whose notional "number" is within a reasonably small delta of the number you'd get if you did an operation on Rational numbers. In fact, the maximum functional form of this delta is provided in the IEEE-754 spec.

But the bit patterns making up floating point "numbers" are not NUMBERS, and the operations performed on them do not make a FINITE field, let alone an isomorphism with the infinite field of Rational numbers.

But practicality beats purity, and practically, bit patterns CAN represent numbers. If you want to argue that floats are not numbers, then we can't use the statistics package, as we have no numbers to perform the statistics on.

Also, Numbers DO have a precise practical meaning. While we can start by defining numbers as sets, and that provides a useful mathematical basis, we can also define the Natural numbers by simple counting. Two is a concrete concept; it IS substantial. And from that concrete start we can get concrete definitions for the Rationals (which are actually enough to get to floats), and from the concrete definition of the Rationals we can define the Reals by various methods where the numbers have REAL meanings, not just abstract concepts.

You seem to understand Pure Math, but not the Applied Mathematics of computers. The Applied Mathematics of Computing is based on the concept of finite approximation, which is something that Pure Math, like the type that builds up the Number line starting with Set Theory, doesn't handle well. In Pure Math, something that is just mostly true, and can be proved to not always be true, is considered False; but to applied math, it can be good enough.

--

Richard Damon

On 2019-12-28 21:11, Richard Damon wrote: <not addressing me>

You seem to understand Pure Math, but not the Applied Mathematics of computers. The Applied Mathematics of Computing is based on the concept of finite approximation, which is something that Pure Math, like the type that builds up the Number line starting with Set Theory doesn't handle approximation well. In Pure Math, something that is just mostly true, and can be proved to not always be true, is considered False, but to applied math, it can be good enough.

But that is the problem. "The applied mathematics of computing" is floating point, and in floating point, NaN is a number (despite its name). You can't have your cake and eat it too. You can take the pure mathematical view that NaN isn't a number, or you can take the applied view that computers work with things (like floats) that aren't actually numbers but are similar to them. But what you seem to be trying to say is that computers work with things that aren't actually numbers (like floats), but that even so, NaN isn't one of those things, and that position just strikes me as senseless. The things that computers work with are floats, and NaN is a float, so in any relevant sense it is a number; it is an instance of a numerical type.

With regard to the larger issue of what statistics.median should do, I think the simplest solution is to just put an explicit warning in the docs that says "Results may be meaningless if your data contain NaN". I think it's fine to adopt a GIGO approach, but the problem is that the current docs aren't explicit enough about what counts as garbage, especially for the naive user. We don't have to worry about whether it's logically sufficient to say "the data must be orderable" or "the data must be numbers". The only case of practical relevance is the specific case of the single value NaN, so just put a specific note in the docs to specifically warn people about that specific value.

--

Brendan Barnwell
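The misbehavior being discussed is easy to reproduce on CPython, where `median()` simply sorts the data: every comparison against NaN is False, so the sort leaves the NaN wherever the comparison pattern happens to put it, and the "middle" element depends on input order:

```python
import math
import statistics

nan = float("nan")

# Same three values, different input order, different "median":
print(statistics.median([2.0, nan, 1.0]))  # nan
print(statistics.median([nan, 2.0, 1.0]))  # 1.0 -- NaN silently ignored,
                                           # and 1.0 is not the median of
                                           # {1.0, 2.0} by any definition
print(math.isnan(statistics.median([2.0, nan, 1.0])))  # True
```

Exact results are a property of the current CPython sort, not a documented guarantee, which is precisely the GIGO problem: the output is arbitrary rather than an error.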

On Sat, Dec 28, 2019 at 9:42 PM Brendan Barnwell <brenbarn@brenbarn.net> wrote:

But that is the problem. "The applied mathematics of computing" is floating point, and in floating point, NaN is a number (despite its name).

careful here -- that may just re-ignite the argument :-(

computers work with are floats, and NaN is a float, so in any relevant sense it is <snip> an instance of a numerical type.

Stick with that -- we hopefully can all agree to it.

One of the things the docs specifically say is that the statistics module is designed to work with floats and Decimal, and either of these can, and occasionally will, have a NaN value. And those values may have gotten there without the user knowing it, and many users will not understand their implications.

In the case of mean(), that's not too bad -- they'll get a NaN as a result, go WTF?, and then hopefully do a bit of research to figure out what happened. But for median (and friends), they will get an arbitrary, incorrect result. That is not good.

I think the simplest solution is to just put an explicit warning in the docs that says "Results may be meaningless if your data contain NaN".

Absolutely -- behavior around NaN should be well documented. Ideally we'd add a bit more explanation than that -- but SOMETHING would be good. And that can be done far more easily than actually changing the module (and back-ported to older versions of the docs).

A next step might be to provide a couple of NaN-handling utility functions: maybe an is_nan() function, or a filter_nan function, or an all_finite() function. These are not as trivial as they seem at first, so it would be good to provide them.

Then we'd ideally update the code itself to better handle NaNs -- which is a bigger lift. But yes -- AT LEAST UPDATE THE DOCS!

- CHB

--

Christopher Barker, PhD
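The helpers named above could be sketched like this. The names and semantics are hypothetical (nothing like this exists in the statistics module), and the Decimal branch hints at why they are "not as trivial as they seem": Decimal has its own NaN machinery, including signaling NaNs that `float()` conversion would reject:

```python
import math
from decimal import Decimal

def is_nan(x):
    """Hypothetical helper: NaN test covering float, Decimal, int, etc."""
    if isinstance(x, Decimal):
        return x.is_nan()  # also catches Decimal("sNaN"), the signaling NaN
    try:
        return math.isnan(x)
    except TypeError:
        return False  # ints, Fractions, etc. can never be NaN

def filter_nan(data):
    """Hypothetical helper: drop NaN values, keep everything else."""
    return [x for x in data if not is_nan(x)]

def all_finite(data):
    """Hypothetical helper: True when no value is a NaN or an infinity."""
    return not any(is_nan(x) or x in (float("inf"), float("-inf"))
                   for x in data)
```

A caller could then write `statistics.median(filter_nan(data))`, or guard with `if not all_finite(data): raise ValueError(...)`, depending on which policy they want.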

On 12/29/19 12:20 AM, Brendan Barnwell wrote:

On 2019-12-28 21:11, Richard Damon wrote: <not addressing me>

You seem to understand Pure Math, but not the Applied Mathematics of computers. The Applied Mathematics of Computing is based on the concept of finite approximation, which is something that Pure Math, like the type that builds up the Number line starting with Set Theory doesn't handle approximation well. In Pure Math, something that is just mostly true, and can be proved to not always be true, is considered False, but to applied math, it can be good enough.

But that is the problem. "The applied mathematics of computing" is floating point, and in floating point, NaN is a number (despite its name). You can't have your cake and eat it too. You can take the pure mathematical view that NaN isn't a number, or you can take the applied view that computers work with things (like floats) that aren't actually numbers but are similar to them. But what you seem to be trying to say is that computers work with things that aren't actually numbers (like floats), but that even so, NaN isn't one of those things, and that position just strikes me as senseless. The things that computers work with are floats, and NaN is a float, so in any relevant sense it is a number; it is an instance of a numerical type.

With regard to the larger issue of what statistics.median should do, I think the simplest solution is to just put an explicit warning in the docs that says "Results may be meaningless if your data contain NaN". I think it's fine to adopt a GIGO approach, but the problem is that the current docs aren't explicit enough about what counts as garbage, especially for the naive user. We don't have to worry about whether it's logically sufficient to say "the data must be orderable" or "the data must be numbers". The only case of practical relevance is the specific case of the single value NaN, so just put a specific note in the docs to specifically warn people about that specific value.

Per the IEEE standard for floating point, which I think would be the controlling document here, we have that Floating Point Values (or Representations, which is what every bit pattern is) are divided into 3 major classes:

1) Finite Numbers, which for binary floating point are subdivided into Normalized and Denormalized Numbers; these are our representations for the set of the Real Numbers we can express.

2) Infinities, which extend the representation into the Extended Real Numbers.

3) Not A Numbers, which are the result of exceptions in computations when we don't get a numeric result.

Thus, in IEEE-754, NaN is NOT a Number (and the infinities might or might not be numbers too, depending on which definition of number you are using).

Also, in the applied mathematics dealing with computing, NaN is NOT thought of as a "Number", but just a "Value"; it has none of the properties of a Number, so to treat it as such is a mistake. I suppose that is part of the heart of the trickiness of IEEE floating point: not every value it represents is really a number.

Note that the number values obey, to the limits of precision, the normal axioms of mathematics (and sometimes the confusion is that it is ONLY to the limits of precision, so the basic axioms don't hold to the last bit, but only within the limits of precision, so they match 'close enough'). This is a proper Engineering equivalence, even if not a precise Pure Mathematical one.

I would have no problem with the idea that the statistics module, or even more general Python documentation, include a warning about NaN values (and there are actually a whole lot of values that are NaNs) behaving in unexpected manners.

--

Richard Damon
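The three classes, and the fact that "a whole lot of values" are NaNs, can be seen by constructing binary64 bit patterns directly. `float_from_bits` is a local helper written for this sketch, not a stdlib function:

```python
import math
import struct

def float_from_bits(bits):
    """Reinterpret a 64-bit integer as an IEEE-754 binary64 value."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

# Exponent field all ones + zero significand -> an infinity (class 2);
# exponent field all ones + ANY nonzero significand -> a NaN (class 3).
inf = float_from_bits(0x7FF0000000000000)
nan1 = float_from_bits(0x7FF8000000000000)  # the usual quiet NaN
nan2 = float_from_bits(0x7FF8000000000001)  # a different NaN bit pattern
nan3 = float_from_bits(0xFFF0DEADBEEF0001)  # yet another, sign bit set

print(math.isinf(inf))                 # True
print(math.isnan(nan1), math.isnan(nan2), math.isnan(nan3))  # True True True
```

There are 2^53 - 2 distinct bit patterns in that third class, every one of them a NaN.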

On Sat, Dec 28, 2019 at 09:20:49PM -0800, Brendan Barnwell wrote:

The things that computers work with are floats, and NaN is a float, so in any relevant sense it is a number; it is an instance of a numerical type.

You seem to be assuming that `x is an instance of type float (or Decimal)` implies `x is a number`. That seems reasonable in the general case (e.g. we would rightly expect that every instance of a type "Car" represents an actual car) but in the specific case of floats (and Decimal) the assumption breaks down.

The writers of the IEEE-754 standard didn't choose the name "Not A Number" at random. The name tells you their intention: NANs represent values or quantities which are not numbers. It seems particularly perverse to insist that even when a value is explicitly described as Not A Number, it's actually a number.

Especially since it fails quite a few commonsense tests for whether or not something is a number:

- Do NANs appear anywhere on the real number line or complex plane?

- Can we successfully measure (say) the length of a piece of string and get a result of NAN inches?

- Can we successfully count the number of (say) spoons on a table and get a result of NAN?

- Do NANs obey, even approximately, the usual properties of numbers?

The answer in all four cases is No. If something doesn't quack like a duck, doesn't swim like a duck, and doesn't walk like a duck, and is explicitly called Not A Duck, would we insist that it's actually a duck?

One of the earliest computer companies to implement IEEE-754 was Apple, in their "Standard Apple Numerics Environment" or SANE. The SANE manual states:

"When a SANE operation cannot produce a meaningful result, the operation delivers a special bit pattern called a NaN (Not-a-Number)."

So NANs are placeholders applying where there is no meaningful result. One of the operations where SANE would return a NAN result was when trying to convert a string to a number. In Python terms, the SANE equivalent of:

float("Hello World")

would return a NAN.

David Goldberg's "What Every Computer Scientist Should Know About Floating-Point Arithmetic" contrasts two different styles of numeric programming:

(1) One where every bit-pattern is a number (example: IBM System/370);

(2) One where some bit-patterns don't represent numbers, but represent "special quantities such as INDEFINITE and INFINITY" (e.g. the CDC 6600 and the VAX).

Goldberg says:

"The IEEE standard continues in this tradition and has NaNs (Not a Number) and infinities. Without any special quantities, there is no good way to handle exceptional situations like taking the square root of a negative number, other than aborting computation."

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

so it should be clear that Goldberg considers IEEE-754 to belong to (2), where some bit-patterns are special quantities, not numbers. He then goes on to contrast the situation on System/370:

"Since every bit pattern represents a valid number [on System/370], the return value of square root must be some floating-point number. In the case of System/370 FORTRAN, √(−4) = 2 is returned."

with the situation in IEEE-754 arithmetic, which has "special value[s] called NaN":

"In IEEE arithmetic, a NaN is returned in this situation."

I think that in order to be precise we need to distinguish between at least three cases:

- instances of the abstract type Number;

- instances of concrete types such as float or Decimal;

- and instances whose value represents a number.

(Aside: recall that in Python, the object model is that every object has a type, an identity and a value.)

NANs, it is true, are instances of both the abstract type Number and the concrete type float or Decimal; but their values don't represent numbers.

--

Steven
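The type-versus-value distinction drawn above shows up directly in Python: NaN passes both type-based tests, but its value fails even the most basic numeric property, and unlike SANE, Python's float() faults on a bad string rather than returning a NAN:

```python
import math
import numbers

nan = float("nan")

# NaN passes both type-based tests:
print(isinstance(nan, numbers.Number))  # True  -- abstract type Number
print(isinstance(nan, float))           # True  -- concrete type float

# ...but its *value* does not behave like a number: it is the one float
# value that compares unequal to itself.
print(nan == nan)       # False
print(math.isnan(nan))  # True

# And unlike SANE, Python chooses to raise for a bad string conversion:
try:
    float("Hello World")
except ValueError:
    print("float('Hello World') raises ValueError")
```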

On Mon, Dec 30, 2019 at 5:48 AM Steven D'Aprano <steve@pearwood.info> wrote:

On Sat, Dec 28, 2019 at 09:20:49PM -0800, Brendan Barnwell wrote:

The things that computers work with are floats, and NaN is a float, so in any relevant sense it is a number; it is an instance of a numerical type.

You seem to be assuming that `x is an instance of type float (or Decimal)` implies `x is a number`. That seems reasonable in the general case (e.g. we would rightly expect that every instance of a type "Car" represents an actual car) but in the specific case of floats (and Decimal) the assumption breaks down.

The writers of the IEEE-754 standard didn't choose the name "Not A Number" at random. The name tells you their intention: NANs represent values or quantities which are not numbers. It seems particularly perverse to insist that even when a value is explicitly described as Not A Number, it's actually a number.

Especially since it fails quite a few commonsense tests for whether or not something is a number:

- Do NANs appear anywhere on the real number line or complex plane?

- Can we successfully measure (say) the length of a piece of string and get a result of NAN inches?

- Can we successfully count the number of (say) spoons on a table and get a result of NAN?

- Do NANs obey, even approximately, the usual properties of numbers?

The answer in all four cases is No. If something doesn't quack like a duck, doesn't swim like a duck, and doesn't walk like a duck, and is explicitly called Not A Duck, would we insist that it's actually a duck?

Be careful: This kind of logic and intuition doesn't always hold true even for things that we actually DO call numbers. The counting numbers follow logical intuition, but you can't count the number of spoons on a table and get a result of "negative five" or "the square root of two" or "3 + 2i". Extending the concept of "numbers" to include zero, negatives, laterals (complexes), etc is something that various different cultures had to come to terms with. (The ancient Romans, for instance, didn't have any way to represent zero - which means it was extremely difficult for their C programs to indicate successful termination. This led directly to the empire's downfall.)

More useful would be to look at the useful operations and invariants that can be maintained, but that doesn't work too well for finite subsets of numbers. Very few operations are closed for, say, "sixteen bit integers". Nor for "rational numbers with denominators that are powers of two between -1024 and 1024". There is no easy way to define "number" in a way that is both mathematically pure AND practically useful.

ChrisA
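The non-closure of "sixteen bit integers" is easy to demonstrate from Python with ctypes, which exposes C's fixed-width types (a sketch; Python's own int is unbounded, so the wraparound has to be borrowed from C):

```python
import ctypes

# "Sixteen-bit integers" are not closed under '+': the true sum of
# these two values does not exist in the type, so the result wraps.
a = ctypes.c_int16(32000)
b = ctypes.c_int16(1000)
wrapped = ctypes.c_int16(a.value + b.value)

print(a.value + b.value)  # 33000  -- Python's unbounded int
print(wrapped.value)      # -32536 -- the int16 "answer" (33000 - 65536)
```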

On 12/29/19 4:30 PM, Chris Angelico wrote:

On Mon, Dec 30, 2019 at 5:48 AM Steven D'Aprano <steve@pearwood.info> wrote:

On Sat, Dec 28, 2019 at 09:20:49PM -0800, Brendan Barnwell wrote:

The things that computers work with are floats, and NaN is a float, so in any relevant sense it is a number; it is an instance of a numerical type.

You seem to be assuming that `x is an instance of type float (or Decimal)` implies `x is a number`. That seems reasonable in the general case (e.g. we would rightly expect that every instance of a type "Car" represents an actual car) but in the specific case of floats (and Decimal) the assumption breaks down.

The writers of the IEEE-754 standard didn't choose the name "Not A Number" at random. The name tells you their intention: NANs represent values or quantities which are not numbers. It seems particularly perverse to insist that even when a value is explicitly described as Not A Number, it's actually a number.

Especially since it fails quite a few commonsense tests for whether or not something is a number:

- Do NANs appear anywhere on the real number line or complex plane?

- Can we successfully measure (say) the length of a piece of string and get a result of NAN inches?

- Can we successfully count the number of (say) spoons on a table and get a result of NAN?

- Do NANs obey, even approximately, the usual properties of numbers?

The answer in all four cases is No. If something doesn't quack like a duck, doesn't swim like a duck, and doesn't walk like a duck, and is explicitly called Not A Duck, would we insist that it's actually a duck?

Be careful: This kind of logic and intuition doesn't always hold true even for things that we actually DO call numbers. The counting numbers follow logical intuition, but you can't count the number of spoons on a table and get a result of "negative five" or "the square root of two" or "3 + 2i". Extending the concept of "numbers" to include zero, negatives, laterals (complexes), etc is something that various different cultures had to come to terms with. (The ancient Romans, for instance, didn't have any way to represent zero - which means it was extremely difficult for their C programs to indicate successful termination. This led directly to the empire's downfall.)

More useful would be to look at the useful operations and invariants that can be maintained, but that doesn't work too well for finite subsets of numbers. Very few operations are closed for, say, "sixteen bit integers". Nor for "rational numbers with denominators that are powers of two between -1024 and 1024". There is no easy way to define "number" in a way that is both mathematically pure AND practically useful.

ChrisA

The first and fourth tests are reasonable tests for a real number (expressed as a float). The second does limit itself to positive reals, but is otherwise a reasonable test of something being numberish. The third is more a test of the Natural Numbers (or the Non-Negative Whole Numbers), which would map to the 'unsigned' type in some languages, or to a part of the range of integers.

Since Practicality trumps Purity, we don't need (or really want) a mathematically pure definition, and since it is admitted that the floating point representation is just an approximation of the Real Numbers, a pure definition probably doesn't work anyway (Pure Mathematics doesn't deal with approximations well). We have a mathematical definition of the ideal Real Numbers, using any of a number of ways to build it, and then we say that floats are an approximation to that system.

The question remains: what in the Real Numbers do you propose that the IEEE NaN value approximates? It really doesn't, so it isn't a 'number', but just a value, used to represent things that aren't numerical answers when an operation needs to generate SOME answer. The alternative to having NaN values in the representation would be for the operations that currently generate it to fault out and either kill the program or at least generate an exception.

--

Richard Damon
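Python actually exhibits both designs described above. Some float operations fault (raise an exception) rather than producing a NaN, while others quietly yield the special value:

```python
import math

# Some operations fault instead of returning NaN:
faults = []
try:
    math.sqrt(-1.0)
except ValueError:
    faults.append("math.sqrt(-1.0)")
try:
    0.0 / 0.0
except ZeroDivisionError:
    faults.append("0.0 / 0.0")
print(faults)  # ['math.sqrt(-1.0)', '0.0 / 0.0']

# ...while others quietly produce the not-a-number value:
result = float("inf") - float("inf")
print(result, math.isnan(result))  # nan True
```

So the "fault out" alternative isn't hypothetical: CPython already picks it for some operations and the NaN answer for others.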

On Mon, Dec 30, 2019 at 9:01 AM Richard Damon <Richard@damon-family.org> wrote:

On 12/29/19 4:30 PM, Chris Angelico wrote:

On Mon, Dec 30, 2019 at 5:48 AM Steven D'Aprano <steve@pearwood.info> wrote:

On Sat, Dec 28, 2019 at 09:20:49PM -0800, Brendan Barnwell wrote:

Especially since it fails quite a few commonsense tests for whether or not something is a number:

- Do NANs appear anywhere on the real number line or complex plane?

- Can we successfully measure (say) the length of a piece of string and get a result of NAN inches?

- Can we successfully count the number of (say) spoons on a table and get a result of NAN?

- Do NANs obey, even approximately, the usual properties of numbers?

Be careful: This kind of logic and intuition doesn't always hold true even for things that we actually DO call numbers.

The first and fourth tests are reasonable tests for a real number (expressed as a float).

There's nothing about "expressed as a float" in the first test; it basically just uses the definition of "real number line" and "complex plane" to define them. Actually performing the test is not exactly practical, but it's valid. I don't know what you mean by the fourth test being a test for a real number.

The second does limit itself to positive reals, but is otherwise a reasonable test of something being numberish.

Have you ever successfully measured the length of a piece of string as being precisely pi inches? Does that mean that pi is not a positive real number?

The third is more a test of the Natural Numbers (or the Non-Negative Whole Numbers), which would map to the 'unsigned' type in some language, or a part of the range of integers.

No, the "unsigned" type in most languages is a finite subset of unsigned integers. To be fair, a 64-bit unsigned number is going to include every number of spoons you'll ever see on any table, but that doesn't equate it to the naturals. Also, zero creates a lot of strange paradoxes. There are zero spoons on this table, AND there are zero forks AND zero knives. Furthermore, there are zero woolly mammoths, zero cans of Heinz Baked Beans, and zero chunks of red Kryptonite. How can you count the number of Infinite Improbability Drives sitting on the table, if you don't even know what an Infinite Improbability Drive looks like? Intuition simply can't answer all of these questions. You have to start mathematically or by performing operations on numbers ("if I have three spoons and I remove one, I have 3 - 1 == 2 spoons"), at which point you are taking a completely different approach.

Since Practicality trumps Purity, we don't need (or really want) a mathematically pure definition, and since it is admitted that the floating point representation is just an approximation of the Real Numbers, a pure definition probably doesn't work anyway (Pure Mathematics doesn't deal with approximations well).

We have a mathematical definition of the ideal Real Numbers, using any of a number of ways to build it, and then we say that floats are an approximation to that system.

But since they are an approximation, the rules for reals do not hold. For instance, addition of two positive real numbers always results in another real number that is larger than each of the originals. Not so for floats. (Although the (infinite) sum of every power of two can be shown to be -1, so take of that what you will.)
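That failure is easy to demonstrate in Python (a sketch; `math.ulp` needs Python 3.9+):

```python
import math

big = 1e16

# Adding a positive real to a positive real "must" give a larger
# result, but 1.0 is below the spacing of floats at this magnitude:
print(big + 1.0 == big)   # True: the addition is absorbed
print(big + 2.0 == big)   # False: 2.0 matches the spacing here

# The spacing (unit in the last place) at 1e16 is exactly 2.0:
print(math.ulp(big))   # 2.0
```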

The question remains: what in the Real Numbers do you propose that the NaN IEEE value approximates? It really doesn't, so it isn't a 'number', but just a value, used to represent things that aren't numerical answers when an operation needs to generate SOME answer.

The equally valid question also remains: what in the real numbers does float("inf") represent? That equally isn't a number; it is a value. It is used to represent things that aren't numerical answers.

The alternative to having NaN values in the representation would be for the operations that currently generate it to fault out and either kill the program or at least generate an exception.

Also true of infinity. Are you going to object to that? Floats are NOT pure mathematical concepts of numbers. They are a disgustingly practical approximation to them, with many compromises and many unintuitive behaviours, but incredibly useful ones.

ChrisA

On 12/29/19 5:41 PM, Chris Angelico wrote:

On 12/29/19 4:30 PM, Chris Angelico wrote:

On Mon, Dec 30, 2019 at 5:48 AM Steven D'Aprano <steve@pearwood.info> wrote:

On Sat, Dec 28, 2019 at 09:20:49PM -0800, Brendan Barnwell wrote:

Especially since it fails quite a few commonsense tests for whether or not something is a number:

- Do NANs appear anywhere on the real number line or complex plane?

- Can we successfully measure (say) the length of a piece of string and get a result of NAN inches?

- Can we successfully count the number of (say) spoons on a table and get a result of NAN?

- Do NANs obey, even approximately, the usual properties of numbers?

Be careful: This kind of logic and intuition doesn't always hold true even for things that we actually DO call numbers.

On Mon, Dec 30, 2019 at 9:01 AM Richard Damon <Richard@damon-family.org> wrote:

The first and fourth tests are reasonable tests for a real number (expressed as a float).

There's nothing about "expressed as a float" in the first test; it basically just uses the definition of "real number line" and "complex plane" to define them. Actually performing the test is not exactly practical, but it's valid. I don't know what you mean by the fourth test being a test for a real number.

When using the 'usual properties of numbers' we needed to condition our testing on the limited precision of the float system. These usual properties are things like a + b = b + a, a + b - b = a, (a + b) + c = a + (b + c), and the like. In all of these, the equality may not be exact, but should only be tested within the precision of the floating point system. That also means that using extreme differences in magnitude can cause a total loss of precision, but that is all explainable by the limited precision (as opposed to some strangeness in operation). There is a large set of operations that we consider 'normal' to perform on real numbers, and the floating point values that fall in the finite number range do well at these, when you consider limits of precision and range.

Yes, the first test isn't something that the computer is going to be able to answer, but it actually IS a very simple question for the programmer/analyst to answer by inspection. All the finite numbers naturally meet it. The infinities meet it if we consider the Extended Real Number Line (and if not, we see that they are NOT numbers; for the 'usual properties of numbers', the infinities don't fare as well as normal numbers, so their being an edge case of being a number makes sense). NaNs clearly fail this test (unless you want to show me how they pass it). I don't see what is so impractical about that decision.
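Those "within the precision" invariants can be sketched in Python, using `math.isclose` as the tolerant equality:

```python
import math

a, b = 0.1, 0.3

# Commutativity holds exactly for IEEE addition:
print(a + b == b + a)   # True

# a + b - b == a only holds within precision, not exactly:
print(a + b - b)                    # 0.10000000000000003
print(a + b - b == a)               # False
print(math.isclose(a + b - b, a))   # True

# Extreme magnitude differences cause total loss of precision:
print(1e20 + 1.0 - 1e20)   # 0.0, not 1.0
```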

The second does limit itself to positive reals, but is otherwise a reasonable test of something being numberish. Have you ever successfully measured the length of a piece of string as being precisely pi inches? Does that mean that pi is not a positive real number?

The third is more a test of the Natural Numbers (or the Non-Negative Whole Numbers), which would map to the 'unsigned' type in some language, or a part of the range of integers. No, the "unsigned" type in most languages is a finite subset of unsigned integers. To be fair, a 64-bit unsigned number is going to include every number of spoons you'll ever see on any table, but that doesn't equate it to the naturals. Also, zero creates a lot of strange paradoxes. There are zero spoons on this table, AND there are zero forks AND zero knives. Furthermore, there are zero woolly mammoths, zero cans of Heinz Baked Beans, and zero chunks of red Kryptonite. How can you count the number of Infinite Improbability Drives sitting on the table, if you don't even know what an Infinite Improbability Drive looks like? Intuition simply can't answer all of these questions. You have to start mathematically or by performing operations on numbers ("if I have three spoons and I remove one, I have 3 - 1 == 2 spoons"), at which point you are taking a completely different approach.

But that is a precision/range limit, which is part of the approximation. Also, the test is to map from the computer value to the mathematical, not whether every mathematical value maps to the computer. We acknowledge that the computer model is finite, and thus there is a range limit.

Since Practicality trumps Purity, we don't need (or really want) a mathematically pure definition, and since it is admitted that the floating point representation is just an approximation of the Real Numbers, a pure definition probably doesn't work anyway (Pure Mathematics doesn't deal with approximations well).

We have a mathematical definition of the ideal Real Numbers, using any of a number of ways to build it, and then we say that floats are an approximation to that system. But since they are an approximation, the rules for reals do not hold. For instance, addition of two positive real numbers always results in another real number that is larger than each of the originals. Not so for floats. (Although the (infinite) sum of every power of two can be shown to be -1, so take of that what you will.)

The rules for the Reals DO hold WITHIN the approximation. This again is practical beating purity. Again, pure theory getting in the way of practicality. First, 'Pi' is not a possible floating point value, so we don't need to make a string of Pi inches (but I can make one by wrapping it around a 1" diameter dowel). Second, the test again is really a mental exercise to make numbers something 'real' to the mind, so is really a thought experiment. One issue with this with double-precision floating point numbers is that they are more precise than most physical things we can practically measure, which sort of says they are a good enough approximation for most things.

The question remains, what in the Real Numbers do you propose that the NaN IEEE value approximate? It really doesn't, so isn't a 'number', but just a value, used to represent things that aren't numerical answers when a operation needs to generate SOME answer. The equally valid question also remains: what in the real numbers does float("inf") represent? That equally isn't a number, it is a value. It is used to represent things that aren't numerical answers.

inf is a member of the Extended Real Numbers in mathematics. Yes, if you are looking at the rules of the regular Real Numbers, you need to exclude the values of the infinities as being 'a number', and that is acknowledged by the standard. If you are willing to adjust the rules you expect to those under the Extended Real Numbers, then you can consider them as numbers. I have no problem with the fact that the infinities are in many respects not numbers, but in others they can be.
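Python's float follows that extended-real reading for the infinities; a quick sketch:

```python
import math

inf = float("inf")

# Identities that hold in the Extended Real Number Line:
print(inf + 1.0 == inf)   # True
print(1.0 / inf)          # 0.0
print(-inf < 0.0 < inf)   # True: the infinities participate in ordering

# But some properties are lost in the extension:
print(math.isnan(inf - inf))   # True: no consistent value exists
```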

The alternative to having NaN values in the representation would be for the operations that currently generate it to fault out and either kill the program or at least generate an exception. Also true of infinity. Are you going to object to that?

Floats are NOT pure mathematical concepts of numbers. They are a disgustingly practical approximation to them, with many compromises and many unintuitive behaviours, but incredibly useful ones.

And I suppose if you aren't willing to deal with them as APPROXIMATIONS, then you need to not use computers for most things. I suppose you could use symbolic mathematics packages, which never have numbers as anything other than exact values (and thus sqrt(2) stays as sqrt(2) and not 1.414...). When you do accept them as approximations, you can't demand from them what isn't in the approximation. Floats were NEVER said to be an exact representation of the mathematical concepts. It is people who ignore the difference, and the effect of the approximations, that get themselves into trouble. Perhaps that is part of the natural education of a programmer: they learn that computers are practical devices, not theoretical ones. Maybe that is why 'Programming' falls in the domain of Engineering instead of the Sciences.

ChrisA

-- Richard Damon

On Dec 29, 2019, at 13:33, Chris Angelico <rosuav@gmail.com> wrote:

More useful would be to look at the useful operations and invariants that can be maintained, but that doesn't work too well for finite subsets of numbers. Very few operations are closed for, say, "sixteen bit integers". Nor for "rational numbers with denominators that are powers of two between -1024 and 1024". There is no easy way to define "number" in a way that is both mathematically pure AND practically useful.

There are a whole lot of operations that are closed over 16-bit integers or IEEE doubles. Every possible total function from uint16 x uint16 -> uint16 or binary64 x binary64 -> binary64 is a closed operation, and every pair of such functions defines a valid magma-over-magma structure. The problem isn’t that there aren’t enough such operations, it’s that none of them meet the usual “number” axioms without being trivial in some important way. In particular, IEEE addition isn’t associative, invertible, etc., so IEEE binary64 isn’t even a semiring, much less a field, it’s just a magma-over-magma. But it is, out of all of the possible magma-over-magma structures on those values, the one that most closely approximates—in a well-defined and useful, if very complicated, way—the rationals. That’s not a structure you’d ever study mathematically for its own sake, but it’s a structure you can study mathematically if it’s useful—as it is for the sake of making sense of computer programs. (In the case of uint16, that actually does form a useful structure on its own if you consider them as Z/65536Z, but maybe it’s still worth considering them as an approximation of N instead.)
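Both halves of that claim are easy to sketch in Python: uint16 addition is closed because it is really arithmetic in Z/65536Z, while IEEE binary64 addition is closed but not associative:

```python
def add_u16(a: int, b: int) -> int:
    """Addition in Z/65536Z: always lands back in 0..65535."""
    return (a + b) % 65536

print(add_u16(65535, 1))   # 0: wraps around, stays in range

# IEEE binary64 addition is a closed operation too, but it is not
# associative, so the floats don't even form a semiring:
lhs = (1e16 + 1.0) + 1.0   # each 1.0 is absorbed: stays 1e16
rhs = 1e16 + (1.0 + 1.0)   # 2.0 is representable at this scale
print(lhs == rhs)          # False
```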

On Sun, Dec 29, 2019, 5:20 PM Andrew Barnert via Python-ideas

But it is, out of all of the possible magma-over-magma structures on those values, the one that most closely approximates—in a well-defined and useful, if very complicated, way—the rationals.

I'm sort of convinced that Posits better approximate the behavior of rationals than do IEEE-754 floats:

https://en.m.wikipedia.org/wiki/Unum_(number_format) But that's a small footnote. Until or unless this system enters widespread use, traditional floats are our system. And anyway, all the philosophy of mathematics stuff I wrote before would apply to Posits equally, just with somewhat different distributions of errors.

On Dec 29, 2019, at 15:19, David Mertz <mertz@gnosis.cx> wrote:

On Sun, Dec 29, 2019, 5:20 PM Andrew Barnert via Python-ideas

But it is, out of all of the possible magma-over-magma structures on those values, the one that most closely approximates—in a well-defined and useful, if very complicated, way—the rationals.

I'm sort of convinced that Posits better approximate the behavior of rationals than do IEEE-754 floats:

I played with whatever previous version of the concept is available in Julia, and it seemed cool. But I’m pretty sure they’re still not a field, and they don’t approximate the rationals (or any other field) more closely than IEEE floats do in the same well-defined way; rather, they’re the closest approximation in an equally well-defined but very different way, one that’s hopefully usually more useful in practice, but apparently even harder to define.

On Sun, Dec 29, 2019 at 7:35 PM Andrew Barnert <abarnert@yahoo.com> wrote:

On Dec 29, 2019, at 15:19, David Mertz <mertz@gnosis.cx> wrote:On Sun, Dec 29, 2019, 5:20 PM Andrew Barnert via Python-ideas

But it is, out of all of the possible magma-over-magma structures on those

values, the one that most closely approximates—in a well-defined and useful, if very complicated, way—the rationals.

I'm sort of convinced that Posits better approximate the behavior of rationals than do IEEE-754 floats:

https://en.m.wikipedia.org/wiki/Unum_(number_format)

I played with whatever previous version of the concept is available in Julia, and it seemed cool. But I’m pretty sure they’re still not a field, and they don’t approximate the rationals (or any other field) more closely than IEEE floats do in the same well-defined way; rather, they’re the closest approximation in an equally well-defined but very different way, one that’s hopefully usually more useful in practice, but apparently even harder to define.

Of course posits are not a field. At a high level it's just a strategy to distribute deltas differently over the notional approximations that bit-patterns give of Rational numbers (not sure why Richard keeps insisting it's about Reals, not Rationals... albeit there is no difference that means anything to this discussion). My understanding is that there is less "stair-stepping" in the size of an ULP in posits than in IEEE-754, basically; that seems roughly desirable, but not something I need to evangelize about. I did have a really fun private conversation of zillions of emails with Uncle Timmy about this a few years back. He can speak for himself, and knows way more than I do. But I think he's of a similar overall opinion.

On Sun, Dec 29, 2019 at 07:43:28PM -0500, David Mertz wrote:

the notional approximations that bit-patterns give of Rational numbers (not sure why Richard keeps insisting it's about Reals, not Rationals... albeit there is no difference that means anything to this discussion).

Excluding NANs and INFs, every float is a rational number, but that's forced on us by the implementation. The intention is for floats to model (even if imperfectly) the Reals ℝ, not the Rationals ℚ. If floats were intended to model the Rationals, trig functions would be useless, since almost no sines or cosines are rational numbers. https://en.wikipedia.org/wiki/Niven's_theorem

Likewise the sqrt() function, and most powers, logs and exp() functions would also be useless. Very, very few logarithms, roots and exponents would return Rationals. This is why there are no Fraction.sin, Fraction.log, Fraction.sqrt etc methods, but if there were, they would have to return a float or a decimal or other equivalent non-Rational number.

The bottom line is: floats model Reals, not Rationals.

-- Steven
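The standard library reflects that division of labour; a sketch with `fractions.Fraction`:

```python
from fractions import Fraction
import math

half = Fraction(1, 2)

# Fraction is exact and closed under the field operations:
print(half + Fraction(1, 3))   # 5/6

# But there is no Fraction.sin or Fraction.sqrt: those results are
# almost never rational, so math.sin coerces its argument to float
# and returns a float approximating the (irrational) real answer:
print(math.sin(half))       # ≈ 0.4794...
print(type(math.sqrt(2)))   # <class 'float'>
```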

On 30/12/19 12:19 pm, David Mertz wrote:

Is it just me, or is that Wikipedia article abysmally written? I feel like I gained a negative amount of knowledge from reading it. -- Greg

On Mon, Dec 30, 2019 at 08:30:41AM +1100, Chris Angelico wrote:

Especially since it fails quite a few commonsense tests for whether or not something is a number: [...] The answer in all four cases is No. If something doesn't quack like a duck, doesn't swim like a duck, and doesn't walk like a duck, and is explicitly called Not A Duck, would we insist that it's actually a duck?

Be careful: This kind of logic and intuition doesn't always hold true even for things that we actually DO call numbers. The counting numbers follow logical intuition, but you can't count the number of spoons on a table and get a result of "negative five" or "the square root of two" or "3 + 2i".

That's because none of those examples are counting numbers :-)

My set of "commonsense tests" weren't intended to be an exhaustive or bulletproof set of tests for numberness. They were intended to be simple, obvious and useful tests: if it quacks, swims and walks like a duck, it's probably a duck. The silent Legless Burrowing Duck being a rare exception.[1]

We could extend those duck tests appropriately to cover more exotic kinds of numbers:

- could your credit card have a balance of NAN dollars or euros?

- might the amount of spin of a space craft be NAN? http://danceswithcode.net/engineeringnotes/quaternions/quaternions.html https://thatsmaths.com/2018/10/04/the-many-modern-uses-of-quaternions/

but the answers will still be "No". NANs are artifacts of the system of computation in use; they don't represent a physical measurement. They represent (e.g.) an "out of bounds" error: 1e10000 - 1e9000 returns NAN, or some kind of "impossible value" or "there is no answer" placeholder. (These interpretations aren't intended to be exhaustive.)

If your bank statement reported your balance as "NAN", that would be a clear programming error, not the actual balance. If your bogometer measured the bogosity of something as NAN microlenarts, that would indicate a failure of the bogometer, not an actual measurement. (Any similarity between the microlenart and the creator of systemd is purely coincidental.) http://itre.cis.upenn.edu/~myl/languagelog/archives/003515.html

[1] Not actually a real duck. But it could be.

-- Steven
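The 1e10000 - 1e9000 example works exactly as described at a Python prompt: both literals overflow binary64 to infinity, and the subtraction then has no answer:

```python
import math

# Float literals too large for binary64 quietly become infinity:
x = 1e10000
y = 1e9000
print(x, y)   # inf inf

# inf - inf has no sensible numerical answer, so IEEE says NaN:
print(x - y)              # nan
print(math.isnan(x - y))  # True
```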

On Mon, Dec 30, 2019 at 11:47 AM Steven D'Aprano <steve@pearwood.info> wrote:

On Mon, Dec 30, 2019 at 08:30:41AM +1100, Chris Angelico wrote:

Especially since it fails quite a few commonsense tests for whether or not something is a number: [...] The answer in all four cases is No. If something doesn't quack like a duck, doesn't swim like a duck, and doesn't walk like a duck, and is explicitly called Not A Duck, would we insist that it's actually a duck?

Be careful: This kind of logic and intuition doesn't always hold true even for things that we actually DO call numbers. The counting numbers follow logical intuition, but you can't count the number of spoons on a table and get a result of "negative five" or "the square root of two" or "3 + 2i".

That's because none of those examples are counting numbers :-)

My set of "commonsense tests" weren't intended to be an exhaustive or bulletproof set of tests for numberness. They were intended to be simple, obvious and useful tests: if it quacks, swims and walks like a duck, it's probably a duck. The silent Legless Burrowing Duck being a rare exception.[1]

Exactly my point! Counting numbers follow logical intuition; but you attested that you could use logical intuition to figure out if something is a "number". Not a "counting number". Logical intuition does NOT explain all the behaviours of non-counting numbers, and you can't say "oh this is illogical ergo it's not a number". The logic of logical intuition is illogical. :)

ChrisA

On Dec 29, 2019, at 18:20, Chris Angelico <rosuav@gmail.com> wrote:

On Mon, Dec 30, 2019 at 11:47 AM Steven D'Aprano <steve@pearwood.info> wrote:

On Mon, Dec 30, 2019 at 08:30:41AM +1100, Chris Angelico wrote:

Especially since it fails quite a few commonsense tests for whether or not something is a number: [...] The answer in all four cases is No. If something doesn't quack like a duck, doesn't swim like a duck, and doesn't walk like a duck, and is explicitly called Not A Duck, would we insist that it's actually a duck?

Be careful: This kind of logic and intuition doesn't always hold true even for things that we actually DO call numbers. The counting numbers follow logical intuition, but you can't count the number of spoons on a table and get a result of "negative five" or "the square root of two" or "3 + 2i".

That's because none of those examples are counting numbers :-)

My set of "commonsense tests" weren't intended to be an exhaustive or bulletproof set of tests for numberness. They were intended to be simple, obvious and useful tests: if it quacks, swims and walks like a duck, it's probably a duck. The silent Legless Burrowing Duck being a rare exception.[1]

Exactly my point! Counting numbers follow logical intuition; but you attested that you could use logical intuition to figure out if something is a "number". Not a "counting number". Logical intuition does NOT explain all the behaviours of non-counting numbers, and you can't say "oh this is illogical ergo it's not a number". The logic of logical intuition is illogical. :)

Counting numbers are intuitively numbers. So are measures. And yet, they’re different. Which one is the “one true numbers”? Who cares? Medieval mathematicians did spend thousands of pages trying to resolve that question, but it’s a lot more productive to just accept that the intuitive notion of “number” is vague and instead come up with systematic ways to define and compare and contrast and relate different algebras (not just those two). Are complex numbers numbers? Sure, if you want. Or no, if you prefer. But they’re still not real numbers, much less natural numbers. That’s obvious, and nearly useless. What you really want to know is which properties of the reals also hold of the complex numbers, and that’s a lot less obvious and a lot more useful. And the same is true for IEEE binary64. You can say they’re not numbers, or that they are, or that some of them are and some of them aren’t, but they’re not the rationals (or the reals or the affinely extended reals or a subalgebra of any of the above); what you really want to know is which properties of the rationals hold under what approximation regime for the IEEE binary64s.

On Mon, Dec 30, 2019 at 1:40 PM Andrew Barnert <abarnert@yahoo.com> wrote:

On Dec 29, 2019, at 18:20, Chris Angelico <rosuav@gmail.com> wrote:

On Mon, Dec 30, 2019 at 11:47 AM Steven D'Aprano <steve@pearwood.info> wrote:

On Mon, Dec 30, 2019 at 08:30:41AM +1100, Chris Angelico wrote:

That's because none of those examples are counting numbers :-)

My set of "commonsense tests" weren't intended to be an exhaustive or bulletproof set of tests for numberness. They were intended to be simple, obvious and useful tests: if it quacks, swims and walks like a duck, it's probably a duck. The silent Legless Burrowing Duck being a rare exception.[1]

Exactly my point! Counting numbers follow logical intuition; but you attested that you could use logical intuition to figure out if something is a "number". Not a "counting number". Logical intuition does NOT explain all the behaviours of non-counting numbers, and you can't say "oh this is illogical ergo it's not a number". The logic of logical intuition is illogical. :)

Counting numbers are intuitively numbers. So are measures. And yet, they’re different. Which one is the “one true numbers”? Who cares? Medieval mathematicians did spend thousands of pages trying to resolve that question, but it’s a lot more productive to just accept that the intuitive notion of “number” is vague and instead come up with systematic ways to define and compare and contrast and relate different algebras (not just those two).

That's what I said. You cannot use intuition to define numbers unless you're willing to restrict it to counting numbers. Therefore, if intuition cannot define numbers, intuition cannot be used to exclude non-numbers.

Are complex numbers numbers? Sure, if you want. Or no, if you prefer. But they’re still not real numbers, much less natural numbers. That’s obvious, and nearly useless. What you really want to know is which properties of the reals also hold of the complex numbers, and that’s a lot less obvious and a lot more useful.

And the same is true for IEEE binary64. You can say they’re not numbers, or that they are, or that some of them are and some of them aren’t, but they’re not the rationals (or the reals or the affinely extended reals or a subalgebra of any of the above); what you really want to know is which properties of the rationals hold under what approximation regime for the IEEE binary64s.

And the same is true of the debate about float("nan"). You cannot use intuition to figure out whether this is a number or not, because intuition has ALREADY failed us. "Commonsense tests" such as Steven put forward are not a valid way to debate the edge cases, because they fail on what we would consider clear cases.

ChrisA

On Mon, Dec 30, 2019 at 1:40 PM Andrew Barnert <abarnert@yahoo.com> wrote:

On Dec 29, 2019, at 18:20, Chris Angelico <rosuav@gmail.com> wrote:

On Mon, Dec 30, 2019 at 11:47 AM Steven D'Aprano <steve@pearwood.info> wrote:

On Mon, Dec 30, 2019 at 08:30:41AM +1100, Chris Angelico wrote:

Especially since it fails quite a few commonsense tests for whether or not something is a number: [...] The answer in all four cases is No. If something doesn't quack like a duck, doesn't swim like a duck, and doesn't walk like a duck, and is explicitly called Not A Duck, would we insist that it's actually a duck? Be careful: This kind of logic and intuition doesn't always hold true even for things that we actually DO call numbers. The counting numbers follow logical intuition, but you can't count the number of spoons on a table and get a result of "negative five" or "the square root of two" or "3 + 2i". That's because none of those examples are counting numbers :-)

My set of "commonsense tests" weren't intended to be an exhaustive or bulletproof set of tests for numberness. They were intended to be simple, obvious and useful tests: if it quacks, swims and walks like a duck, it's probably a duck. The silent Legless Burrowing Duck being a rare exception.[1] Exactly my point! Counting numbers follow logical intuition; but you attested that you could use logical intuition to figure out if something is a "number". Not a "counting number". Logical intuition does NOT explain all the behaviours of non-counting numbers, and you can't say "oh this is illogical ergo it's not a number". The logic of logical intuition is illogical. :) Counting numbers are intuitively numbers. So are measures. And yet, they’re different. Which one is the “one true numbers”? Who cares? Medieval mathematicians did spend thousands of pages trying to resolve that question, but it’s a lot more productive to just accept that the intuitive notion of “number” is vague and instead come up with systematic ways to define and compare and contrast and relate different algebras (not just those two).

That's what I said. You cannot use intuition to define numbers unless you're willing to restrict it to counting numbers. Therefore, if intuition cannot define numbers, intuition cannot be used to exclude non-numbers.

On 12/29/19 9:46 PM, Chris Angelico wrote:

I don't know about you, but I can intuitively think of real numbers as measures on a linear scale. Given the definition of an inch, I can think of a distance of x inches, where x is some real number that I know. I may be physically limited in the scale and precision that I can physically create, but can imagine what it represents.

Are complex numbers numbers? Sure, if you want. Or no, if you prefer. But they’re still not real numbers, much less natural numbers. That’s obvious, and nearly useless. What you really want to know is which properties of the reals also hold of the complex numbers, and that’s a lot less obvious and a lot more useful.

And the same is true for IEEE binary64. You can say they’re not numbers, or that they are, or that some of them are and some of them aren’t, but they’re not the rationals (or the reals or the affinely extended reals or a subalgebra of any of the above); what you really want to know is which properties of the rationals hold under what approximation regime for the IEEE binary64s.

And the same is true of the debate about float("nan"). You cannot use intuition to figure out whether this is a number or not, because intuition has ALREADY failed us. "Commonsense tests" such as Steven put forward are not a valid way to debate the edge cases, because they fail on what we would consider clear cases.

ChrisA

Maybe YOUR intuition has failed to handle practical real numbers. Mine handles it fine. Maybe that's the difference between an Engineer and a mathematician. I will also add that since the Standard that defines them is fairly clear that they are not numbers, for you to call them numbers puts the onus on YOU to show that they should be.

Now, I suppose that if you look at things differently, we could call them 'Numbers' in the sense that the first group of values, the finite numbers, are numbers in the Real Numbers, but the infinities aren't. If you move your number system to the Extended Real Numbers then we can add the infinities (though in extending the number system, we lose some properties); we could further extend the number system to some NaN Number System that includes the NaNs, but in doing that we lose even more properties. The one issue is that this new invented number system, being an invented rare bird, can't be assumed when most people say 'Number'. The proper term that most people use is a floating point value or representation.

-- Richard Damon

On Dec 29, 2019, at 18:50, Chris Angelico <rosuav@gmail.com> wrote:

On Mon, Dec 30, 2019 at 1:40 PM Andrew Barnert <abarnert@yahoo.com> wrote:

On Dec 29, 2019, at 18:20, Chris Angelico <rosuav@gmail.com> wrote:

Counting numbers are intuitively numbers. So are measures. And yet, they’re different. Which one is the “one true numbers”? Who cares? Medieval mathematicians did spend thousands of pages trying to resolve that question, but it’s a lot more productive to just accept that the intuitive notion of “number” is vague and instead come up with systematic ways to define and compare and contrast and relate different algebras (not just those two).

That's what I said. You cannot use intuition to define numbers unless you're willing to restrict it to counting numbers.

I was agreeing with you in general, but I think the truth is even stronger than you’re claiming. You can also use intuition to define numbers if you’re willing to restrict it to measures. In fact, Steven is doing so in at least half his arguments. And anyone who tries to make their intuition rigorous by starting off with “numbers are elements of fields that…” is using the measures intuition, not the counting intuition. And that’s the real problem: even counting numbers are not “the numbers”; we really do have conflicting intuitions, and there really is no way around that.

And the same is true of the debate about float("nan"). You cannot use intuition to figure out whether this is a number or not, because intuition has ALREADY failed us. "Commonsense tests" such as Steven put forward are not a valid way to debate the edge cases, because they fail on what we would consider clear cases.

Agreed. There actually is an important place for common sense tests, but that place is in coming up with new systems, not in deciding which system is “right”. Obviously complex numbers are numbers. Obviously numbers can be ordered. Complex numbers don’t have a natural order. Which one of those intuitions was wrong? Neither; they’re both right, and therefore we just found a new way to distinguish between two useful classes of “number” structures. We’re farther than ever from knowing which things are “really numbers”, but who cares?

On Mon, Dec 30, 2019 at 3:52 PM Andrew Barnert <abarnert@yahoo.com> wrote:

On Dec 29, 2019, at 18:50, Chris Angelico <rosuav@gmail.com> wrote:

On Mon, Dec 30, 2019 at 1:40 PM Andrew Barnert <abarnert@yahoo.com> wrote:

On Dec 29, 2019, at 18:20, Chris Angelico <rosuav@gmail.com> wrote:

Counting numbers are intuitively numbers. So are measures. And yet, they’re different. Which one is the “one true numbers”? Who cares? Medieval mathematicians did spend thousands of pages trying to resolve that question, but it’s a lot more productive to just accept that the intuitive notion of “number” is vague and instead come up with systematic ways to define and compare and contrast and relate different algebras (not just those two).

That's what I said. You cannot use intuition to define numbers unless you're willing to restrict it to counting numbers.

I was agreeing with you in general, but I think the truth is even stronger than you’re claiming.

You can also use intuition to define numbers if you’re willing to restrict it to measures. In fact, Steven is doing so in at least half his arguments. And anyone who tries to make their intuition rigorous by starting off with “numbers are elements of fields that…” is using the measures intuition, not the counting intuition. And that’s the real problem: even counting numbers are not “the numbers”; we really do have conflicting intuitions, and there really is no way around that.

Gotcha. And yeah, it's hard to come up with any sort of intuition about numbers that doesn't break down *somewhere*. I mean, what does "equal" mean? Is it true that a 100mm stick is "equal" in length to another 100mm stick? And if you cut a stick 100mm long, then cut another stick the same length as it, and keep taking the most recent stick and cutting another stick the same length, are they all equal? (If you say "yes", then I encourage you to try it.) Similarly, is it valid to say that the infinite sum 1+2+4+8+16.... is "equal" to -1?

And the same is true of the debate about float("nan"). You cannot use intuition to figure out whether this is a number or not, because intuition has ALREADY failed us. "Commonsense tests" such as Steven put forward are not a valid way to debate the edge cases, because they fail on what we would consider clear cases.

Agreed.

There actually is an important place for common sense tests, but that place is in coming up with new systems, not in deciding which system is “right”. Obviously complex numbers are numbers. Obviously numbers can be ordered. Complex numbers don’t have a natural order. Which one of those intuitions was wrong? Neither; they’re both right, and therefore we just found a new way to distinguish between two useful classes of “number” structures. We’re farther than ever from knowing which things are “really numbers”, but who cares?

3blue1brown discussed this as regards the term "vector". What IS a vector? Is it an ordered tuple of numbers representing N-dimensional coordinates? A point in N-dimensional space? An arrow pointing from somewhere to somewhere else? Or is it, instead, "a thing that you can scale by a unitless value, and which you can add to another vector to produce a vector"?

But the question is: What should the statistics module do with a nan? This can only be answered by understanding what "nan" means in a statistical context, which is only tangentially related to the question of whether nan is a number. And I think this thread has proven that there is no single obvious meaning for a nan. It could mean missing data. It could mean out-of-bounds or illogical data. It could mean something different entirely.

ChrisA
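For concreteness, the kind of absurd result under discussion is easy to demonstrate. A minimal sketch; the exact orderings below are an artifact of CPython's Timsort, where every comparison against a NaN returns False, so they could differ on another implementation:

```python
import math
import statistics

nan = float("nan")

# Every comparison involving NaN is False, so the "sorted" list silently
# depends on where the NaN started out -- this list comes back unchanged:
print(sorted([2.0, nan, 1.0]))             # [2.0, nan, 1.0]

# median() sorts and takes the middle element, so the result depends on
# the *input order*, not just the data:
print(statistics.median([nan, 1.0, 2.0]))  # 1.0
print(statistics.median([1.0, nan, 2.0]))  # nan
```

The same two datasets, differing only in element order, yield a numeric median in one case and NaN in the other.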

On Mon, Dec 30, 2019 at 3:52 PM Andrew Barnert <abarnert@yahoo.com> wrote:

On Dec 29, 2019, at 18:50, Chris Angelico <rosuav@gmail.com> wrote:

On Mon, Dec 30, 2019 at 1:40 PM Andrew Barnert <abarnert@yahoo.com> wrote:

On Dec 29, 2019, at 18:20, Chris Angelico <rosuav@gmail.com> wrote:

Counting numbers are intuitively numbers. So are measures. And yet, they’re different. Which one is the “one true numbers”? Who cares? Medieval mathematicians did spend thousands of pages trying to resolve that question, but it’s a lot more productive to just accept that the intuitive notion of “number” is vague and instead come up with systematic ways to define and compare and contrast and relate different algebras (not just those two).

That's what I said. You cannot use intuition to define numbers unless you're willing to restrict it to counting numbers.

I was agreeing with you in general, but I think the truth is even stronger than you’re claiming.

You can also use intuition to define numbers if you’re willing to restrict it to measures. In fact, Steven is doing so in at least half his arguments. And anyone who tries to make their intuition rigorous by starting off with “numbers are elements of fields that…” is using the measures intuition, not the counting intuition. And that’s the real problem: even counting numbers are not “the numbers”; we really do have conflicting intuitions, and there really is no way around that.

On 12/30/19 2:10 AM, Chris Angelico wrote:

Gotcha. And yeah, it's hard to come up with any sort of intuition about numbers that doesn't break down *somewhere*. I mean, what does "equal" mean? Is it true that a 100mm stick is "equal" in length to another 100mm stick? And if you cut a stick 100mm long, then cut another stick the same length as it, and keep taking the most recent stick and cutting another stick the same length, are they all equal? (If you say "yes", then I encourage you to try it.) Similarly, is it valid to say that the infinite sum 1+2+4+8+16.... is "equal" to -1?

Perhaps the intuitive understanding of numbers includes a finite but perhaps arbitrary precision, but if anything that makes it a better test for what is a number in the case of the float type, since it is a finite approximation. In the case of comparing two 100mm sticks, conceptually it is simple: align one end of them together, then look at the other end and see which is longer. Intuitively simple. Yes, if we actually try to do this, we are constrained by the limits of precision of our senses, perhaps augmented by tools, and by the quantization of reality. Perhaps it is better to say that the mathematical concept of 'Real Numbers' is what is off, and no such thing can actually exist.

And the same is true of the debate about float("nan"). You cannot use intuition to figure out whether this is a number or not, because intuition has ALREADY failed us. "Commonsense tests" such as Steven put forward are not a valid way to debate the edge cases, because they fail on what we would consider clear cases. Agreed.

There actually is an important place for common sense tests, but that place is in coming up with new systems, not in deciding which system is “right”. Obviously complex numbers are numbers. Obviously numbers can be ordered. Complex numbers don’t have a natural order. Which one of those intuitions was wrong? Neither; they’re both right, and therefore we just found a new way to distinguish between two useful classes of “number” structures. We’re farther than ever from knowing which things are “really numbers”, but who cares?

3blue1brown discussed this as regards the term "vector". What IS a vector? Is it an ordered tuple of numbers representing N-dimensional coordinates? A point in N-dimensional space? An arrow pointing from somewhere to somewhere else? Or is it, instead, "a thing that you can scale by a unitless value, and which you can add to another vector to produce a vector"?

But the question is: What should the statistics module do with a nan? This can only be answered by understanding what "nan" means in a statistical context, which is only tangentially related to the question of whether nan is a number. And I think this thread has proven that there is no single obvious meaning for a nan. It could mean missing data. It could mean out-of-bounds or illogical data. It could mean something different entirely.

ChrisA

My first answer is that to statistics in general, NaN is meaningless. Statistics as a science deals in the domain of Real Numbers (as in numbers that really exist); it is a science to help us measure and understand reality from those measurements. As such, a value like 'NaN' has no meaning.

Historically, based on the history of using languages like Fortran, some statistical packages have used it as a convenient indicator for no data. Originally, to handle this case, an application would use some absurdly out-of-bounds value for this, to allow the package to filter them out later. Due to language constraints, you had to use some floating point value as this marker. When the IEEE format came out, and had this strange value, it became a somewhat natural choice: not being a number, it couldn't really be part of the statistics. This did break the ability to use the generation of NaNs as a way to detect calculation errors, but statisticians were used to having to do analysis ahead of time to avoid the avoidable losses of precision that NaN caught when they hit the extremes, so losing its diagnostic value wasn't that bad, and you could still test for them being generated before adding them to the list to confirm things.

Personally, I don't think the simple statistics package needs to support this usage: if you really want to use it, you really should understand the meaning of the NaN and how to program with it, and can then either handle it yourself before passing to the statistics package, or use a higher grade package that is built with that interpretation built in. That also likely gives better control over how the values are computed. -- Richard Damon

On Mon, Dec 30, 2019 at 06:10:37PM +1100, Chris Angelico wrote:

But the question is: What should the statistics module do with a nan? This can only be answered by understanding what "nan" means in a statistical context, which is only tangentially related to the question of whether nan is a number. And I think this thread has proven that there is no single obvious meaning for a nan. It could mean missing data. It could mean out-of-bounds or illogical data. It could mean something different entirely.

Indeed, NANs can mean missing data, or the results of an invalid calculation, or some other error condition, and the way you treat them should depend on the interpretation you are giving them. What NANs cannot represent is any sort of valid statistical observation or data. Ask yourself: what sort of measurement or observation or survey question would give a NAN as a valid answer?

(Technically, one might use a NAN as a form of nominal data, but that would be an exceedingly contrived thing to do: "What is your favourite IEEE-754 feature?". There are much more natural ways to encode nominal data than to use actual NANs. Strings, for example.)

-- Steven

"*Die ganzen Zahlen hat der liebe Gott gemacht, alles andere ist Menschenwerk*" ("God made the integers, all else is the work of man"). –Leopold Kronecker

.... of course, Kronecker was wrong, and Cantor was right. But the quote is an excellent dis. :-)

On Sun, Dec 29, 2019 at 9:41 PM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:

On Dec 29, 2019, at 18:20, Chris Angelico <rosuav@gmail.com> wrote:

On Mon, Dec 30, 2019 at 11:47 AM Steven D'Aprano <steve@pearwood.info>

wrote:

On Mon, Dec 30, 2019 at 08:30:41AM +1100, Chris Angelico wrote:

Especially since it fails quite a few commonsense tests for whether or not something is a number: [...] The answer in all four cases is No. If something doesn't quack like a duck, doesn't swim like a duck, and doesn't walk like a duck, and is explicitly called Not A Duck, would we insist that it's actually a

duck?

That's because none of those examples are counting numbers :-)

Exactly my point! Counting numbers follow logical intuition; but you attested that you could use logical intuition to figure out if something is a "number". Not a "counting number". Logical intuition does NOT explain all the behaviours of non-counting numbers, and you can't say "oh this is illogical ergo it's not a number". The logic of logical intuition is illogical. :)

Are complex numbers numbers? Sure, if you want. Or no, if you prefer. But they’re still not real numbers, much less natural numbers. That’s obvious, and nearly useless. What you really want to know is which properties of the reals also hold of the complex numbers, and that’s a lot less obvious and a lot more useful.

And the same is true for IEEE binary64. You can say they’re not numbers, or that they are, or that some of them are and some of them aren’t, but they’re not the rationals (or the reals or the affinely extended reals or a subalgebra of any of the above); what you really want to know is which properties of the rationals hold under what approximation regime for the IEEE binary64s.

On Sun, Dec 29, 2019, 12:14 AM Richard Damon <Richard@damon-family.org> wrote:

But practicality beats purity, and practically, bit pattern CAN represent numbers. If you want to argue that floats are not numbers, than we can't use the statistics package, as we can't have any numbers to perform the statistics on.

I definitely DO NOT want to argue that. You DID argue that, and I'm showing the reductio ad absurdum of your claim. In particular, you proposed to disregard the IEEE-754 spec in regards to those bit patterns that make up NaNs. In contrast, I argue for practicality rather than purity. In practical programming terms, a floating point number is no more and no less than something where 'isinstance(x, float)' ... i.e. including NaNs.

OMG! This is fun and all, but:

On Sat, Dec 28, 2019 at 9:11 PM Richard Damon <Richard@damon-family.org> wrote:

... practicality beats purity.

And practically, everyone in this thread understands what a float is, and what a NaN is and is not.

Richard: I am honestly confused about what you think we should do. Sure, you can justify why the statistics module doesn’t currently handle NaNs well, but that doesn’t address the question of what it should do.

As far as I can tell, the only reasons for the current approach are ease of implementation and performance. Which are fine reasons, and why it was done that way in the first place. But there seems to be (mostly) a consensus that it would be good to better handle NaNs in the statistics module.

I think the thing to do is decide what we want NaNs to mean: should they be interpreted as missing values or, essentially, errors. You’ve made a good case that None is the “right” thing to use for missing values — and could be used with int and other types. So yes, if the statistics module were to grow support for missing values, that could be the way to do it. Which means that NaNs should either raise an exception or return NaN as a result. Those are options that are better than the current state.

Nevertheless, I think there is a practical argument for NaN-as-missing-value. Granted, it is widely used in other systems because it can be stored in a float data type, and that is not required for Python. But it is widely used, so is familiar to many. But if we don’t go that route, it would be good to provide NaN-filtering routines in the statistics module — as the related thread shows, NaN detection is not trivial.

Frankly, I’m also confused as to why folks seem to think this is an issue to be addressed in the sort() functions — those are way too general and low level to be expected to solve this. And it would be a much heavier lift to make a change that central to Python anyway.

-CHB

-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

On 12/29/19 1:16 AM, Christopher Barker wrote:

OMG! This is fun and all, but:

On Sat, Dec 28, 2019 at 9:11 PM Richard Damon <Richard@damon-family.org <mailto:Richard@damon-family.org>> wrote:

... practicality beats purity.

And practically, everyone in this thread understands what a float is, and what a NaN is and is not.

Richard: I am honestly confused about what you think we should do. Sure, you can justify why the statistics module doesn’t currently handle NaN’s well, but that doesn’t address the question of what it should do.

As far as I can tell, the only reasons for the current approach are ease of implementation and performance. Which are fine reasons, and why it was done that way in the first place.

But there seems to be (mostly) a consensus that it would be good to better handle NaNs in the statistics module.

I think the thing to do is decide what we want NaNs to mean: should they be interpreted as missing values or, essentially, errors.

You’ve made a good case that None is the “right” thing to use for missing values — and could be used with int and other types. So yes, if the statistics module were to grow support for missing values, that could be the way to do it.

Which means that NaNs should either raise an exception or return NaN as a result. Those are options that are better than the current state.

Nevertheless, I think there is a practical argument for NaN-as-missing value. Granted, it is widely used in other systems because it can be stored in a float data type, and that is not required for Python. But it is widely used, so is familiar to many.

But if we don’t go that route, it would be good to provide NaN-filtering routines in the statistics module — as the related thread shows, NaN detection is not trivial.

Frankly, I’m also confused as to why folks seem to think this is an issue to be addressed in the sort() functions — those are way too general and low level to be expected to solve this. And it would be a much heavier lift to make a change that central to Python anyway.

-CHB

The way I see it is that median doesn't handle NaNs in a reasonable way because sorted doesn't handle them, because it is easy and quick to not handle NaN; to handle them you need to define an official meaning for them, and there are multiple reasonable meanings.

The reason to push most solutions to sorted is that, except for ignore (which can easily be implemented as a data filter on the input of the function), the exact same problem occurs in multiple functions (in the statistics module, that would include quantile), so by the principle of DRY, that is the logical place to implement the solution (if we don't implement the solution as an input filter).

At its beginning, the statistics module disclaims being a complete all-encompassing statistics package, and suggests using one if you need more advanced features, which I would consider most processing of NaN to be included in. One big reason to NOT fix the issue with NaNs in median is that such a fix likely has a measurable impact on the processing of the median.

I suspect that the simplest solution, and one that doesn't impact other uses, would be simple filter functions (and perhaps median could be defined with an argument for what function to use, with a None option that would be fairly quick). One filter would remove NaNs (or None), one would throw an exception if there is a NaN, and another would just return the sequence [nan] if there are any NaNs in the input sequence (so the median would be nan). The same options could be added to other operations like quantile, which has the similar issue, and made available to the program for other use.

There is one other option that might be possible to fix sorted: the IEEE spec does define another comparison function that could be used by sorted to sort the numbers, called something like total_order(a, b), which returns true if a has a total order less than b. The total order is defined such that it acts like < for normal numbers, but also provides a total order for values that < doesn't work as well for (including equal values that have different representations, like -0 < +0 in the total_order but == in the normal order). total_order defines positive NaNs to be greater than infinity (and negative NaNs less than negative infinity), with NaNs of differing representations being ordered by their representation, which puts sNaNs on the extremes beyond the quiet NaNs.

To do this, float would need to define a dunder (maybe __ltto__) for total order compare, which sorted would use instead of __lt__ if it exists (with either sorted doing the fallback, or Object just defaulting this new dunder to call __lt__ if not overridden). Having Object do the fallback would allow classes like set to remove this new dunder so sorted generates an error if you try to sort them, since in general, sets don't provide anything close to a total order with <.

This would say that sorted would work with NaNs, but for median most NaNs are treated as more positive than infinity, so the median is biased, but at least you don't get absurd results. My expectation would be that if written in C or assembly, the total order comparison of two floats would be fast; maybe not as fast as a simple compare, but just a couple of machine instructions, so small compared to the other code in the loop. -- Richard Damon
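The three filter behaviors Richard sketches above (ignore, raise, or poison the result) are each only a few lines. A minimal sketch; the function names here are hypothetical, not an existing API:

```python
import math

def ignore_nan(data):
    # Treat NaNs as missing data: drop them.
    return [x for x in data if not math.isnan(x)]

def raise_on_nan(data):
    # Treat NaNs as errors.
    data = list(data)
    if any(math.isnan(x) for x in data):
        raise ValueError("NaN in data")
    return data

def poison_on_nan(data):
    # Propagate: any NaN reduces the whole dataset to [nan],
    # so the median (or quantile, etc.) comes out as NaN.
    data = list(data)
    if any(math.isnan(x) for x in data):
        return [math.nan]
    return data
```

Usage would be e.g. statistics.median(ignore_nan(values)); the same filters compose with quantile or any other order statistic, which is the DRY point above.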

On Sun, Dec 29, 2019 at 3:26 PM Richard Damon <Richard@damon-family.org> wrote:

Frankly, I’m also confused as to why folks seem to think this is an issue to be addressed in the sort() functions

The way I see it, is that median doesn't handle NaNs in a reasonable way, because sorted doesn't handle them,

I don't think so -- median doesn't handle NaNs because handling them takes a decision about how they should be treated, and code to write; maybe more code because you can't use the bare sort() functions. But sort will never solve the problem both generically and properly by itself.

because it is easy and quick to not handle NaN, and to handle them you need to define an Official meaning for them, and there are multiple reasonable meanings.

exactly.

The reason to push most solutions to sorted, is that except for ignore, which can easily be implemented as a data filter to the input of the function, the exact same problem occurs in multiple functions (in the statistics module, that would include quantile) so by the principle of DRY, that is the logical place to implement the solution (if we don't implement the solution as an input filter)

well, no -- the logical place for DRY is to use the SAME sort implementation for all functions in the statistics module that need a sort. It only makes sense to try to push this to the standard sort if it were to be used, in the same way, by many other uses of sort, and it didn't break any current uses. On the other hand, saying "this is how the statistics module interprets NaNs, and how things will be sorted" is localized -- it does not require it be useful for anything else, and it will, by definition, not break any code that doesn't use the statistics module.

At its beginning, the statistics module disclaims being a complete all-encompassing statistics package, and suggests using one if you need more advanced features, which I would consider most processing of NaN to be included in.

sure -- but that doesn't mean it couldn't be more complete than it currently is.

That's a perfectly valid opinion, but while I think that perhaps "handling missing values" could be considered advanced, I'm not sure "giving a correct and meaningful answer for all values of expressly supported data types" is "advanced" -- in a way, quite the opposite -- it's less "advanced" coders, ones that are not thinking about where NaNs might appear and what the implication of that is, that are going to be bitten by the current implementation. Docs can help, but I think we can, and should, do better than that -- after all, it's well known that "no one reads documentation".

One big reason to NOT fix the issue with NaNs in median is that such a fix likely has a measurable impact on the processing of the median.

You mean performance? Sure, but as I've argued before (no idea if anyone agrees with me) the statistics package is already not a high performance package anyway. If it turns out that it slows it down by, say, a factor of two or more, then yes, maybe we need to forget it.

I suspect that the simplest solution, and one that doesn't impact other uses would be simple filter functions (and perhaps median could be defined with a arguement for what function to use, with a None option that would be fairly quick. One filter would remove Nans (or None), one would throw an exception if there is a Nan, and another would just return the sequence [nan] if there are any NaNs in the input sequence (so the median would be nan). The same options could be added other operations like quantile which has the similar issue, and made available to the program for other use.

I agree -- this could be a good way to go.

There is one other option that might be possible to fix sorted,

<snip> see the last post if you want the details ...

This would say that sorted would work with NaNs, but for median most NaNs are treated as more positive than infinity, so the median is biased, but at least you don't get absurd results.

yeah, but this is what I meant above -- you'd still want to check for NaNs in the statistics functions. Though it would be a fast check, 'cause you could check only the ones on the end after sorting.

But the biggest barrier is that it would be a fair bit of churn on the sort() functions (and the float class), and would only help for floats anyway. If someone wants to propose this, please do -- but I don't think we should wait for that to do something with the statistics module. Also, if you want to pursue this, do go back and find the thread about type-checked sorting -- I think this is it: https://mail.python.org/pipermail/python-dev/2016-October/146613.html

I'm not sure if anything ever came of that.

- CHB

-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

On Sun, Dec 29, 2019 at 4:05 PM Christopher Barker <pythonchb@gmail.com> wrote:

You mean performance? Sure, but as I've argued before (no idea if anyone agrees with me) the statistics package is already not a high performance package anyway. If it turns out that it slows it down by, say, a factor of two or more, then yes, maybe we need to forget it.

You never know 'till you profile, so I did a quick experiment -- adding a NaN filter is substantial overhead. This is for a list of 10,000 random floats (no NaNs in there, but the check is made by pre-filtering with a generator comprehension):

# this just calls statistics.median directly
In [14]: %timeit plainmedian(lots_of_floats)
1.54 ms ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# this filters with math.isnan()
In [15]: %timeit nanmedianfloat(lots_of_floats)
3.5 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# this filters with a complex NaN-checker that works with most types and values: floats, Decimals, numpy scalars, ...
In [16]: %timeit nanmedian(lots_of_floats)
13.5 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

So the simple math.isnan filter slows it down by a factor of a bit more than two -- maybe tolerable -- and the full-featured isnan checker by almost a factor of ten -- that's pretty bad. I suspect if it were inlined more, median could be a bit faster, and I'm sure the NaN-checking code could be better optimized, but this is a pretty big hit.

Note that numpy has a number of "nan*" functions, for NaN-aware versions that treat NaN as missing values (including nanquantile) -- we could take a similar route, and have new names or a flag to disable or enable NaN-checking.

Code enclosed

- CHB

-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
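The enclosed benchmark code isn't reproduced in the archive; the two filtering strategies being compared were presumably something along these lines (a sketch with guessed names, not Christopher's actual code):

```python
import math
import statistics

def nanmedianfloat(data):
    # Float-only: pre-filter with math.isnan via a generator comprehension.
    return statistics.median(x for x in data if not math.isnan(x))

def is_nan(x):
    # Generic check: NaN is the only value that compares unequal to
    # itself, which also covers Decimal and most numpy scalar NaNs.
    return x != x

def nanmedian(data):
    return statistics.median(x for x in data if not is_nan(x))
```

The generic x != x check is what makes the type-agnostic version both slower and trickier to optimize than the float-only one.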

Sorry for all these posts -- maybe someone mentioned this already, but maybe this is the time to consider a new algorithm anyway: https://rcoh.me/posts/linear-time-median-finding/

And doing the NaN-check inline might be faster than pre-filtering.

-CHB

On Sun, Dec 29, 2019 at 4:39 PM Christopher Barker <pythonchb@gmail.com> wrote:

On Sun, Dec 29, 2019 at 4:05 PM Christopher Barker <pythonchb@gmail.com> wrote:

You mean performance? Sure, but as I've argued before (no idea if anyone agrees with me) the statistics package is already not a high performance package anyway. If it turns out that it slows it down by, say, a factor of two or more, then yes, maybe we need to forget it.

You never know 'till you profile, so I did a quick experiment -- adding a NaN filter is substantial overhead:

This is for a list of 10,000 random floats (no nans in there, but the check is made by pre-filtering with a generator comprehension)

# this just calls statistics.median directly In [14]: %timeit plainmedian(lots_of_floats)

1.54 ms ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# this filters with math.isnan() In [15]: %timeit nanmedianfloat(lots_of_floats)

3.5 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# this filters with a complex NAN-checker that works with most types and values: floats, Decimals, numpy scalars, ... In [16]: %timeit nanmedian(lots_of_floats)

13.5 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

So the simple math.isnan filter slows it down by a factor of a bit more than two -- maybe tolerable -- and the full-featured isnan checker by almost a factor of ten -- that's pretty bad.

I suspect if it were inline more, it could be median bit faster, and I'm sure the nan-checking code could be better optimized, but this is a pretty big hit.

Note that numpy has a number of "nan*" functions, for nan-aware versions that treat NaN as missing values (including nanquantile) -- we could take a similar route, and have new names or a flag to disable or enable nan-checking.

Code enclosed

- CHB

-- Christopher Barker, PhD

Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
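For reference, the linked article describes quickselect-style selection; a rough sketch (my own illustration, not the article's code -- and note it still needs a NaN strategy, since `<` comparisons with NaN are exactly the problem under discussion):

```python
import random

def quickselect(data, k):
    # return the k-th smallest element (0-based) in expected O(n) time
    data = list(data)
    while True:
        pivot = random.choice(data)
        lows = [x for x in data if x < pivot]
        highs = [x for x in data if x > pivot]
        pivots = [x for x in data if x == pivot]
        if k < len(lows):
            data = lows
        elif k < len(lows) + len(pivots):
            return pivot
        else:
            k -= len(lows) + len(pivots)
            data = highs

def median_low_quickselect(data):
    # low median without fully sorting the data
    data = list(data)
    return quickselect(data, (len(data) - 1) // 2)
```

This avoids the full O(n log n) sort, at the cost of copying the data on each partition in this naive version.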

[Christopher Barker <pythonchb@gmail.com>]

... But the biggest barrier is that it would be a fair bit of churn on the sort() functions (and the float class), and would only help for floats anyway. If someone want to propose this, please do -- but I don't think we should wait for that to do something with the statistics module. Also, if you want to pursue this, do go back and find the thread about type-checked sorting -- I think this is it:

https://mail.python.org/pipermail/python-dev/2016-October/146613.html

I'm not sure if anything ever came of that.

It was completed and is part of released CPython now. But it's useless for this particular purpose. Leaving aside that sort() is the wrong place to put this (as David M keeps saying, pass a key= function to sort if you want a non-default sorting order), type-checked sorting was purely a speed optimization, and applies only to lists with a single type of element. So, e.g., it could plug in a different sort order for lists containing only floats, but mix in a single int (or any other non-float object) and the optimization is 100% disabled, leaving you with the default comparison logic instead.

That said, if we could wind back the clock, I'd make Python's floats use something like the later IEEE total_order ordering instead. No goofy results, nothing special about NaNs. For people who really wanted IEEE's insane ;-) comparison semantics, that's fine - they'd get it via, say, math.ieee_goofy_compare(a, b, flags) instead, where flags is the bitwise-or of 0 or more of {LT, EQ, GT, UNORD}. E.g., pass GT | UNORD to return True iff the IEEE-mandated result is "greater than" or "unordered" (each IEEE comparison outcome is exactly one of those four).

On 12/29/19 7:43 PM, Tim Peters wrote:


Question: could that optimization detect that we have a list of just floats, and pass a 'total_order' comparison function to sort instead of the default __lt__? If so, then I don't think it would be a breaking change to do so. Code that explicitly does floating point comparisons still gets the IEEE-mandated results, but sorts that happen to have NaNs in them now get a defined order, rather than the currently undefined order.

I suppose that there may be some programs out there that depend on the current behavior (even though it is not promised, and subject to change if the sort algorithm changes). The total order maintains all the current ordered relationships; it just adds a defined order for NaNs, and an order for equal values that have differing representations (which I think for binary floating point is just -0 and +0). I don't think sorted makes any promises of stability for 'equal' values, so sorting -0 before +0 shouldn't be a problem.

-- Richard Damon

On 12/29/19 7:05 PM, Christopher Barker wrote:

On Sun, Dec 29, 2019 at 3:26 PM Richard Damon <Richard@damon-family.org <mailto:Richard@damon-family.org>> wrote:

> Frankly, I’m also confused as to why folks seem to think this is an > issue to be addressed in the sort() functions

The way I see it, is that median doesn't handle NaNs in a reasonable way, because sorted doesn't handle them,

I don't think so -- it doesn't handle NaNs because handling them requires a decision about what they should mean, and code to write; maybe more code because you can't use the bare sort() functions. But sort will never solve the problem both generically and properly by itself.

It doesn't handle NaNs because it decided to be a simple routine using the basic definition, the middle value based on the basic sort. I would expect that the basic sort routine has a possibility that it has been optimized by dropping down parts into raw C for speed, while

because it is easy and quick to not handle NaN, and to handle them you need to define an Official meaning for them, and there are multiple reasonable meanings.

exactly.

The reason to push most solutions to sorted, is that except for ignore, which can easily be implemented as a data filter to the input of the function, the exact same problem occurs in multiple functions (in the statistics module, that would include quantile) so by the principle of DRY, that is the logical place to implement the solution (if we don't implement the solution as an input filter)

well, no -- the logical place for DRY is to use the SAME sort implementation for all functions in the statistics module that need a sort. It only makes sense to push this into the standard sort if it would be used, in the same way, by many other uses of sort, and it didn't break any current uses. On the other hand, saying "this is how the statistics module interprets NaNs, and how things will be sorted" is localized -- it does not require that it be useful for anything else, and it will, by definition, not break any code that doesn't use the statistics module.

At its beginning, the statistics module disclaims being a complete all encompassing statistics package,

sure -- but that doesn't mean it couldn't be more complete than it currently is.

If being more complete is 'simple', yes, but it doesn't look to be (unless the fix goes into sorted)

and suggests using one if you need more advanced features, which I would consider most processing of NaN to be included in.

That's a perfectly valid opinion, but while I think that perhaps "handling missing values" could be considered advanced, I'm not sure "giving a correct and meaningful answer for all values of expressly supported data types" is "advanced" -- in a way, quite the opposite: it's the less "advanced" coders, the ones that are not thinking about where NaNs might appear and what the implications are, that are going to be bitten by the current implementation.

Docs can help, but I think we can, and should, do better than that -- after all it's well known that "no one reads documentation".

Which is EXACTLY the reason I say that if this is important enough to fix in median, it is important enough to fix in sorted. sorted gives exactly the same nonsense result, it is only a bit more obvious because it gives all the points. Is [3, nan, 1, 2, 4] a sorted list?

One big reason to NOT fix the issue with NaNs in median is that such a fix likely has a measurable impact on the processing of the median.

I suspect that the simplest solution, and one that doesn't impact other uses, would be simple filter functions (and perhaps median could be defined with an argument for which function to use, with a None option that would be fairly quick). One filter would remove NaNs (or None), one would throw an exception if there is a NaN, and another would just return the sequence [nan] if there are any NaNs in the input sequence (so the median would be nan). The same options could be added to other operations like quantile, which has a similar issue, and made available to the program for other use.

I agree -- this could be a good way to go.
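The three filters described above could be sketched like this (the names are mine, and only plain floats are handled; a real version would need the fancier NaN check discussed earlier in the thread):

```python
import math

def drop_nans(data):
    # filter 1: silently remove NaNs
    return (x for x in data if not math.isnan(x))

def raise_on_nan(data):
    # filter 2: fail loudly on the first NaN seen
    for x in data:
        if math.isnan(x):
            raise ValueError("NaN in data")
        yield x

def poison_on_nan(data):
    # filter 3: collapse to [nan], so the median itself is nan
    data = list(data)
    if any(math.isnan(x) for x in data):
        return [float('nan')]
    return data
```

Any of these composes with the existing function unchanged, e.g. statistics.median(drop_nans(values)).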

There is one other option that might be possible to fix sorted,

<snip> see the last post if you want the details ...

This would say that sorted would work with NaNs, but for median most NaNs are treated as more positive than infinity, so the median is biased, but at least you don't get absurd results.

yeah, but this is what I meant above -- you'd still want to check for NaNs in the statistics functions. Though it would be a fast check, 'cause you could check only the ones on the end after sorting.


_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/75BZ5U... Code of Conduct: http://python.org/psf/codeofconduct/

-- Richard Damon

On Sun, Dec 29, 2019 at 8:00 PM Richard Damon <Richard@damon-family.org> wrote:

Which is EXACTLY the reason I say that if this is important enough to fix in median, it is important enough to fix in sorted. sorted gives exactly the same nonsense result, it is only a bit more obvious because it gives all the points. Is [3, nan, 1, 2, 4] a sorted list?

As me and Uncle Timmy have pointed out, it IS FIXED in sorted(). You just need to call:

    sorted_stuff = sorted(stuff, key=nan_aware_transform)

Guido's time machine to the rescue... you've had this for more than a decade now.

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
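The thread never spells out `nan_aware_transform`; one plausible sketch (my assumption: NaNs sort after every ordinary value, including inf) is:

```python
import math

def nan_aware_transform(x):
    # hypothetical key: NaNs get a (1, ...) tuple so they land after
    # all ordinary values, which get (0, x)
    return (1, 0.0) if math.isnan(x) else (0, x)

stuff = [3.0, float('nan'), 1.0, float('inf'), 2.0]
sorted_stuff = sorted(stuff, key=nan_aware_transform)
# ordinary values (inf included) come first, in order; NaNs collect at the end
```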

On 12/29/19 8:13 PM, David Mertz wrote:


So we just need to make available a suitable key function and, because this issue was aimed at confusion for inexperienced users, make it the default for sorting floats. Since I think median passes its data to sorted, that means it is also solved for median (if you are willing to use that meaning for NaNs).

-- Richard Damon

median() does not currently take a key function. This is not hard to see. It could, but as I've written, I don't think that's the best approach.

    In [16]: statistics.median??
    Signature: statistics.median(data)
    Source:
    def median(data):
        """Return the median (middle value) of numeric data.

        When the number of data points is odd, return the middle data point.
        When the number of data points is even, the median is interpolated by
        taking the average of the two middle values:

        >>> median([1, 3, 5])
        3
        >>> median([1, 3, 5, 7])
        4.0
        """
        data = sorted(data)
        n = len(data)
        if n == 0:
            raise StatisticsError("no median for empty data")
        if n % 2 == 1:
            return data[n//2]
        else:
            i = n//2
            return (data[i - 1] + data[i])/2

On Sun, Dec 29, 2019 at 8:27 PM Richard Damon <Richard@damon-family.org> wrote:

On 12/29/19 8:13 PM, David Mertz wrote:

On Sun, Dec 29, 2019 at 8:00 PM Richard Damon <Richard@damon-family.org <mailto:Richard@damon-family.org>> wrote:

As me and Uncle Timmy have pointed out, it IS FIXED in sorted(). You just need to call:

sorted_stuff = sorted(stuff, key=nan_aware_transform)

Guido's time machine to the rescue... you've had this for more than a decade now.

So we just need to make available a suitable key function, and because this issue was aimed at confusion for inexperienced users, make it the default sorting floats.

Since I think median passes parameters to sorted, it says it is also solved for median (if you are willing to use that meaning for NaNs)

-- Richard Damon

On Sun, Dec 29, 2019 at 5:14 PM David Mertz <mertz@gnosis.cx> wrote:

As me and Uncle Timmy have pointed out, it IS FIXED in sorted(). You just need to call:

sorted_stuff = sorted(stuff, key=nan_aware_transform)

But what would that be? floats have inf and -inf -- so how could you force the NaNs to be at the end or beginning, or anywhere else deterministic? (Replacing NaN with inf would push all the NaNs and infs to the end, in arbitrary order, but who's to know whether a NaN is bigger than inf anyway?)

You could provide a custom cmp() function, though, which would be slower -- probably slower than pre-filtering the values. And I just noticed that Python 3 removed it anyway.

But it wouldn't "fix" median anyway -- you'd still need to do something with those NaNs. I guess saying "all NaNs are larger than any other value" would at least provide consistent results, but they wouldn't be meaningful. The only option for a meaningful result is to remove them. Raising, of course, is also correct.

-CHB

-- Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

[David Mertz]

As me and Uncle Timmy have pointed out, it IS FIXED in sorted(). You just need to call:

sorted_stuff = sorted(stuff, key=nan_aware_transform)

[Christopher Barker]

But what would that be? floats have inf and -inf -- so how could you force the NaNs to be at the end or beginning? ...

Python floats are encoded in 64 bits, but a key function can return anything at all - it doesn't have to return a float. David gave an example that returned a tuple, and I pointed to a msg in a bug report that returned a signed 64-bit integer implementing the relatively recent IEEE "total_order" function:

https://bugs.python.org/msg336487

although that last moves all "negative" NaNs "to the far left" and all "positive" NaNs "to the far right".
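The integer trick can be sketched for plain floats like this (my reconstruction, not the bug-tracker code: reinterpret the IEEE-754 bits as a signed 64-bit integer, then flip the magnitude bits for negatives so integer order matches total order):

```python
import struct

def total_order_key(x):
    # the bits of the double, reinterpreted as a signed 64-bit integer
    bits = struct.unpack('<q', struct.pack('<d', x))[0]
    # non-negative floats are already ordered by their bit patterns;
    # for negatives (sign bit set) the magnitude ordering is reversed,
    # so flip the other 63 bits
    return bits if bits >= 0 else bits ^ 0x7FFFFFFFFFFFFFFF
```

With this key, "negative" NaNs sort to the far left, "positive" NaNs to the far right, and -0.0 lands strictly before +0.0, as the message describes.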

On Sun, Dec 29, 2019 at 07:59:26PM -0500, Richard Damon wrote:

Mu.

https://en.wikipedia.org/wiki/Mu_%28negative%29#%22Unasking%22_the_question

From one perspective, the very concept of a linear smallest-to-biggest order is incoherent when you have a partial order (a little like asking which is the strongest out of Rock, Paper, Scissors[1]):

    Rock beats Scissors
    Scissors beats Paper
    Paper beats Rock

so there is no linear arrangement of [Paper, Scissors, Rock] that satisfies the order. Any linear arrangement is "not sorted" by that definition.

But another perspective is that, yes, it is sorted:

    Paper is beaten by Scissors, so it goes before Scissors;
    Scissors is beaten by Rock, so it goes before Rock

and the ordering relations are satisfied. In your example:

    3 is no bigger than NAN, so it goes before NAN;
    NAN is no bigger than 1, so it goes before 1;
    1 is no bigger than 2, so it goes before 2;
    2 is no bigger than 4, so it goes before 4;

giving the order [3, NAN, 1, 2, 4], and all the ordering relations are satisfied. (Just don't ask about *other* ordering relations.)

[1] "Good ol' Rock always wins."

-- Steven

Several points:

* NaN as missing-value is widely used outside the Python standard library. One could argue, somewhat reasonably, that Pandas and NumPy and PyTorch misinterpret the IEEE-754 intention here, but this is EVERYWHERE in numeric/scientific Python. We could DOCUMENT that None is a better placeholder for *missing*, but we shouldn't be obnoxious to millions of users of stuff outside the stdlib.

* sorted() is WAY too low-level to add this logic to, and numeric types with NaNs are much too special for the generic sorting. That said, we DO NOT NEED IT. list.sort() and sorted() and friends already take a key parameter. This lets the appropriate tool—i.e. the statistics module, and other things—develop a total_order() key function to match the IEEE-suggested ordering. There is absolutely no reason or need to change sorted() to accommodate this.

* Yes, obviously I made the subject line about statistics.median(), but the xtile() functions have all the same concerns, and live in the same module.

* For quiet NaNs, it really is easy to get them innocently. E.g.:

    def my_results(it):
        for x in it:
            x_1 = func1_with_asymptotes(x)
            x_2 = func2_with_asymptotes(x)
            result = x_1 / x_2
            yield result

    median = statistics.median(my_results(my_iter))

That's perfectly reasonable code that will SOMETIMES wind up with qNaNs in the collection of values... but that USUALLY will not.

* There is absolutely no need to lose any efficiency by making the statistics functions more friendly. All we need is an optional parameter whose spelling I've suggested as `on_nan` (but bikeshed freely). Under at least one value of that parameter, we can keep EXACTLY the current implementation, with all its warts and virtues as-is. Maybe a spelling for that option could be 'unsafe' or 'fast'?

* Another option can be 'ignore' (maybe 'skip', but 'ignore' is more Pandas-like), which is simply:

    def median(it, on_nan=DEFAULT):
        if on_nan == 'unsafe':
            ... do all the current stuff ...
        elif on_nan == "ignore":
            return median((x for x in it if not is_nan(x)), on_nan='unsafe')
        elif on_nan == "ieee_total_order":
            ... something with sorted(it, key=total_order) ...

Yes, this requires agreeing on the right implementation of is_nan(), with several plausible versions proposed in this thread.

* With the 'raise' and 'poison' ('propagate'?) options, the implementation would be more like this:

    items = []
    for x in it:
        if is_nan(x):
            if on_nan == 'raise':
                raise ValueError('No median exists of collections with NaNs')
            elif on_nan == 'poison':
                return float('nan')
        else:
            items.append(x)
    return median(items, on_nan='unsafe')

I think that's everything, really. Nothing gets any slower, all use cases are accommodated.

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
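For what it's worth, one minimal duck-typed is_nan() candidate (a sketch of mine, not a settled implementation: it relies on NaN being the only value unequal to itself, which holds for float, quiet Decimal NaNs, and numpy scalars, though signaling Decimal NaNs raise on comparison):

```python
def is_nan(x):
    # NaN is the only value for which x != x
    return x != x
```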

Thanks David for laying a proposal out clearly: +1 to the whole thing.

-CHB

On Sun, Dec 29, 2019 at 4:06 PM David Mertz <mertz@gnosis.cx> wrote:


-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

Oh... I made a mistake in my off-the-cuff code. The items.append() shouldn't be in an else, but just in the loop.

    def median(it, on_nan=DEFAULT):
        if on_nan == 'unsafe':
            ... do all the current stuff ...
        elif on_nan == "ignore":
            return median((x for x in it if not is_nan(x)), on_nan='unsafe')
        elif on_nan == "ieee_total_order":
            return median(sorted(it, key=total_order), on_nan='unsafe')
        else:
            items = []
            for x in it:
                if is_nan(x):
                    if on_nan == 'raise':
                        raise ValueError('No median exists of collections with NaNs')
                    elif on_nan == 'poison':
                        return float('nan')
                items.append(x)
            return median(items, on_nan='unsafe')

It needs a total_order() support function and an is_nan() support function, but other than that, I think this is complete code. Probably it would be faster if it moved the raise/poison conditions outside the loop and looped inside each of them. But this way is slightly fewer lines and conceptually more obvious, I think.

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Dec 29, 2019, at 16:08, David Mertz <mertz@gnosis.cx> wrote:

* There is absolutely no need to lose any efficiency by making the statistics functions more friendly. All we need is an optional parameter whose spelling I've suggested as `on_nan` (but bikeshed freely). Under at least one value of that parameter, we can keep EXACTLY the current implementation, with all its warts and virtues as-is. Maybe a spelling for that option could be 'unsafe' or 'fast'?

This seems like the right way to go to me. However, rather than coming up with appropriately-general implementations of each of these things, wouldn't taking a key function to pass through to sorted be simpler for some? In particular, coming up with a total_order function that works for all valid number-like types is difficult; letting the user pass key=math.total_order or decimal.Decimal.compare_total or partial(decimal.Decimal.compare_total, context=my_context) or whatever is appropriate is a lot simpler and a lot more flexible. Anyone who knows that's what they want should know how to pass it.

Plus, finding the median_low or _high with a key function actually seems useful even without NaNs. "Find the median employee by salary" doesn't seem like a meaningless operation.

A key function could also take care of raise, but not ignore or poison, and at least ignore seems like it's needed. So your API still makes sense otherwise. (But, while we're painting the shed, maybe enum values instead of bare strings? They could be StrEnum values where FAST.value == 'fast' for people who are used to Pandas, I suppose.)

Maybe the is_nan function could also be a parameter, like the key function. By default it's just the method with a fallback to math or cmath (or it's just the method, and float and complex add those methods, or it's a new function that calls a new protocol method, or whatever). That doesn't work for every possible type that might otherwise work with statistics, but if you have some other type—or want some other unusual but sensible behavior (e.g., you're the one guy who actually needs to ignore qNaNs but raise early on sNaNs)—you can write it and pass it. I'm still not convinced anyone will ever want anything other than the method/math/cmath version, but if they do, I think they'd know it and be fine with passing it in explicitly.

As far as your implementation, I don't think anything but ignore needs to preprocess things. Raise can just pass a key function that raises on NaN to sorted. Poison can do the same but handle the exception by returning NaN. Who cares that it might take slightly longer to hit the first NaN that way than by doing an extra pass, if it's simpler and slightly faster for the non-exceptional case?

On Sun, Dec 29, 2019 at 8:14 PM Andrew Barnert <abarnert@yahoo.com> wrote:

On Dec 29, 2019, at 16:08, David Mertz <mertz@gnosis.cx> wrote:

* There is absolutely no need to lose any efficiency by making the

statistics functions more friendly. All we need is an optional parameter whose spelling I've suggested as `on_nan` (but bikeshed freely). Under at least one value of that parameter, we can keep EXACTLY the current implementation, with all its warts and virtues as-is. Maybe a spelling for that option could be 'unsafe' or 'fast'?

This seems like the right way to go to me. However, rather than coming up with appropriately-general implementations of each of these things, wouldn’t taking a key function to pass through to sorted be simpler for some?

No, it wouldn't. That cannot deal with the 'raise' or 'poison' behaviors. Moreover, coming up with statistics.is_nan() and statistics.total_order() really isn't hard. Chris Barker already provided a pretty good is_nan() that we could use.

In particular, coming up with a total_order function that works for all valid number-like types is difficult; letting the user pass key=math.total_order or decimal.Decimal.compare_total or partial(decimal.Decimal.compare_total, context=my_context) or whatever is appropriate is a lot simpler and a lot more flexible. Anyone who knows that's what they want should know how to pass it.

Here it is. I could save a line by not using the 'else'.

    def total_order(x):
        if is_nan(x):
            return (math.copysign(1, x), x)
        else:
            return (0, x)

Plus, finding the median_low or _high, with a key function actually seems useful even without NaNs. "Find the median employee by salary" doesn't seem like a meaningless operation.

I'll give you that. That could be useful. However, I feel like that's too much for the module itself. It's easy to write it yourself without such a function.

    median_employee = statistics.median((e.salary, e) for e in employees)[1]

That's not so terribly much work for the user of the module.

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
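Spelled out, the decorate trick above looks like this (illustrative data of mine; note it wants odd-length input with distinct salaries, since an even count would try to average two tuples, and a salary tie would try to compare employees):

```python
import statistics
from collections import namedtuple

Employee = namedtuple('Employee', 'name salary')
employees = [Employee('ann', 50), Employee('bob', 70), Employee('cid', 60)]

# decorate with the sort key, take the median tuple, then undecorate
median_employee = statistics.median((e.salary, e) for e in employees)[1]
```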

On Dec 29, 2019, at 17:30, David Mertz <mertz@gnosis.cx> wrote:

On Sun, Dec 29, 2019 at 8:14 PM Andrew Barnert <abarnert@yahoo.com> wrote: On Dec 29, 2019, at 16:08, David Mertz <mertz@gnosis.cx> wrote:

* There is absolutely no need to lose any efficiency by making the statistics functions more friendly. All we need is an optional parameter whose spelling I've suggested as `on_nan` (but bikeshed freely). Under at least one value of that parameter, we can keep EXACTLY the current implementation, with all its warts and virtues as-is. Maybe a spelling for that option could be 'unsafe' or 'fast'?

This seems like the right way to go to me. However, rather than coming up with appropriately-general implementations of each of these things, wouldn’t taking a key function to pass through to sorted be simpler for some?

No, it wouldn't. That cannot deal with the 'raise' or 'poison' behaviors.

I already said the same thing in the very next paragraph: that you’d still want your API on top of the key parameter.

Moreover, coming up with statistics.is_nan() and statistics.total_order() really isn't hard. Chris Barker already provided a pretty good is_nan() that we could use.

In particular, coming up with a total_order function that works for all valid number-like types is difficult; letting the user pass key=math.total_order or decimal.Decimal.compare_total or partial(decimal.Decimal.compare_total, context=my_context) or whatever is appropriate is a lot simpler and a lot more flexible. Anyone who knows that’s what they want should know how to pass it.

Here it is. I could save a line by not using the 'else'.

def total_order(x):
    if is_nan(x):
        return (math.copysign(1, x), x)
    else:
        return (0, x)

This doesn’t give you IEEE total order. Under what circumstances would you prefer this to, say, Decimal.compare_total, which does? I suppose there may be _some_ users who really need a total order that’s sort of like the IEEE one but don’t actually need it to be the same, and also really need it to be general enough to work with Decimal or some other type that can convert to float (including overflowing properly for copysign to work) and know their type can do so, but who don’t know how to write that key function themselves. But enough such users that it’s worth catering to them?

Plus, finding the median_low or _high, with a key function actually seems useful even without NaNs. “Find the median employee by salary” doesn’t seem like a meaningless operation.

I'll give you that. That could be useful. However, I feel like that's too much for the module itself. It's easy to write it yourself with no function.

median_employee = statistics.median((e.salary, e) for e in employees)[1]

By the same argument, nothing actually needs a key function because you can always decorate-sort-undecorate. And yet, key functions are useful in all kinds of places. Likewise, it's even easier to write ignore-NaN yourself than to write the DSU yourself:

    median = statistics.median(x for x in xs if not math.isnan(x))

… and yet this whole proposal is still useful, isn't it? So, why isn't adding a key parameter (as well as an on_nan that takes fast/ignore/raise/poison) to median useful?
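A key-aware median along the lines of "find the median employee by salary" could look something like the following sketch; the `median_low_by` name and the `Employee` type are hypothetical illustrations, not part of the statistics module:

```python
import statistics
from dataclasses import dataclass

@dataclass
class Employee:
    name: str
    salary: float

def median_low_by(data, key):
    # Hypothetical key-aware analogue of statistics.median_low:
    # sort by the key function and return the lower middle element.
    s = sorted(data, key=key)
    return s[(len(s) - 1) // 2]

staff = [Employee('a', 50.0), Employee('b', 90.0), Employee('c', 70.0)]
mid = median_low_by(staff, key=lambda e: e.salary)  # the employee with salary 70.0
```

Unlike the decorate-sort-undecorate one-liner, this never compares the Employee objects themselves, so it works even when the decorated values would be incomparable on ties.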

On Sun, Dec 29, 2019, 9:23 PM Andrew Barnert

Here it is. I could save a line by not using the 'else'.

def total_order(x):
    if is_nan(x):
        return (math.copysign(1, x), x)
    else:
        return (0, x)

This doesn’t give you IEEE total order. Under what circumstances would you prefer this to, say, Decimal.compare_total, which does?

How does this differ from IEEE? I'm not being rhetorical, I really thought that was their order. But I haven't looked carefully.

median_employee = statistics.median((e.salary, e) for e in employees)[1]

By the same argument, nothing actually needs a key function because you can always decorate-sort-undecorate. And yet, key functions are useful in all kinds of places.

It's a balance, I think. `key_fn=` feels much too "heavy" for users of the statistics module. Handling NaNs one way or the other is the 95% concern, and for the 5%, DSU covers it. DSU or key_fn is for "advanced users." Beginners won't understand either, but they will grok NaN messing up their answers. Experts will have no difficulty with DSU. The built-in sorted() strikes a different balance since it is more generic in functionality.

I wrote privately that I thought having a statistics._median() with a key_fn could make sense. The regular statistics.median() could utilize that, but the public function should have a less confusing API.

I don't care about enum vs. string. Pandas and NumPy use strings, but they are older than the enum module.
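As a rough sketch of the API being discussed (the parameter name `on_nan` and its string values are just the proposal under debate here, not an actual statistics-module parameter):

```python
import math
import statistics

def median(data, on_nan='fast'):
    # Sketch of the proposed on_nan parameter. 'fast' keeps today's
    # behavior by passing the data through untouched.
    xs = list(data)
    if on_nan == 'fast':
        return statistics.median(xs)
    has_nan = any(isinstance(x, float) and math.isnan(x) for x in xs)
    if has_nan:
        if on_nan == 'raise':
            raise ValueError('NaN in data')
        if on_nan == 'poison':
            return float('nan')
        if on_nan == 'ignore':
            xs = [x for x in xs if not (isinstance(x, float) and math.isnan(x))]
    return statistics.median(xs)
```

Only the 'fast' path skips the extra scan, which is the point of keeping it as the opt-in (or default) behavior.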

On 12/29/19 9:42 PM, David Mertz wrote:

On Sun, Dec 29, 2019, 9:23 PM Andrew Barnert

Here it is. I could save a line by not using the 'else'.

def total_order(x):
    if is_nan(x):
        return (math.copysign(1, x), x)
    else:
        return (0, x)

This doesn’t give you IEEE total order. Under what circumstances would you prefer this to, say, Decimal.compare_total, which does?

How does this differ from IEEE? I'm not being rhetorical, I really thought that was their order. But I haven't looked carefully.

IEEE total_order puts NaN as bigger than infinity, and -NaN as less than -inf. One simple way to implement it is to reinterpret the representation of the float as a 64-bit signed integer (not its value, but its bit pattern) and, if the sign bit is set, complement the bottom 63 bits (because floats are sign-magnitude). For pure Python code, I don't know how hard it is to get the representation of a float as a 64-bit integer. In C or assembly it is fairly easy, as you can easily get around the type system, but I don't know Python well enough to do it.

-- Richard Damon
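Getting at a float's bit pattern in pure Python is straightforward with the struct module. A sketch of the transformation described above, assuming IEEE-754 binary64 floats (which CPython uses); the function name is mine:

```python
import struct

def ieee_total_order_key(x: float) -> int:
    # Reinterpret the float's bits as a signed 64-bit integer.
    (bits,) = struct.unpack('<q', struct.pack('<d', x))
    # Floats are sign-magnitude, so for negative values complement
    # the low 63 bits to make integer order match IEEE totalOrder.
    if bits < 0:
        bits ^= 0x7FFF_FFFF_FFFF_FFFF
    return bits
```

With this key, -0.0 sorts before +0.0, NaNs sort above +inf (or below -inf when the sign bit is set), and distinct NaN payloads get a definite order, e.g. `sorted(nums, key=ieee_total_order_key)`.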

On Sun, Dec 29, 2019 at 10:18 PM Richard Damon <Richard@damon-family.org> wrote:

IEEE total_order puts NaN as bigger than infinity, and -NaN as less than -inf.

You mean like this?

>>> def total_order(x):
...     if math.isnan(x):
...         return (math.copysign(1, x), x)
...     return (0, x)
...
>>> nums = [1, 2, float('-inf'), float('nan'), float('inf'), float('-nan')]
>>> nums
[1, 2, -inf, nan, inf, nan]
>>> sorted(nums, key=total_order)
[nan, -inf, 1, 2, inf, nan]

It's a little weird that -nan has a repr of 'nan', but bracketing that, my implementation is EXACTLY what you describe. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On 12/29/19 10:39 PM, David Mertz wrote:

On Sun, Dec 29, 2019 at 10:18 PM Richard Damon <Richard@damon-family.org <mailto:Richard@damon-family.org>> wrote:

IEEE total_order puts NaN as bigger than infinity, and -NaN as less than -inf.

You mean like this?

>>> def total_order(x):
...     if math.isnan(x):
...         return (math.copysign(1, x), x)
...     return (0, x)
...
>>> nums = [1, 2, float('-inf'), float('nan'), float('inf'), float('-nan')]
>>> nums
[1, 2, -inf, nan, inf, nan]
>>> sorted(nums, key=total_order)
[nan, -inf, 1, 2, inf, nan]

It's a little weird that -nan has a repr of 'nan', but bracketing that, my implementation is EXACTLY what you describe.

The difference probably is in places that Python doesn't show, and maybe isn't that important (like all NaNs regardless of sign being shown as just nan). In the IEEE total order, +0 and -0 are distinct, which your order doesn't handle. For NaNs, the issue is that NaN does NOT have a single representation: when the exponent field is all ones, each combination of the sign bit and the 52 mantissa bits (except all zeros, which signifies infinity) is a different NaN (and the MSB of the mantissa indicates quiet or signalling). Your code makes all these values unordered with respect to each other, as opposed to having a distinct strict order. That probably isn't that important a distinction. Thus your total_order, while not REALLY a total order, is likely good enough for most purposes.

Unless the code has done something to create specific NaNs, and then does something to recover that information, the fact that the NaNs aren't properly sorted within themselves probably isn't an issue. There is a possible issue that, depending on the exact sort algorithm, if the data includes two (or more) NaNs with the same sign, their lack of order might give the sort difficulties.

-- Richard Damon

On Dec 29, 2019, at 20:04, Richard Damon <Richard@damon-family.org> wrote:

Thus your total_order, while not REALLY a total order, is likely good enough for most purposes.

Well, it is a total order of equivalence classes (with all IEEE-equal values being equivalent, all negative NaNs being equivalent, and all positive NaNs being equivalent). But whether it's good enough for most purposes depends on what those purposes are.

I'm having a hard time imagining any purposes where "some well-defined but arbitrary, unnatural, and idiosyncratic total order of equivalence classes of the binary64 values" is of any use at all. Maybe that's just a failure of my imagination. But "the specific well-defined but arbitrary, unnatural order that almost every other similar bit of code in every language is likely to use" seems a lot more likely to be useful.

(And, needless to say, so does "this specific order that I defined myself". I don't know why the OP wanted to treat all NaNs as equivalent values greater than inf, but presumably he does know, and that's fine. So, I don't think that needs to be built into the module as one of the options, but I don't see why he shouldn't be able to specify it explicitly.)

On 12/30/19 12:06 AM, Andrew Barnert wrote:

On Dec 29, 2019, at 20:04, Richard Damon <Richard@damon-family.org> wrote:

Thus your total_order, while not REALLY a total order, is likely good enough for most purposes.

Well, it is a total order of equivalence classes (with all IEEE-equal values being equivalent, all negative NaNs being equivalent, and all positive NaNs being equivalent).

But whether it’s good enough for most purposes depends on what those purposes are.

I’m having a hard time imagining any purposes where “some well-defined but arbitrary, unnatural, and idiosyncratic total order of equivalence classes of the binary64 values” is of any use at all. Maybe that’s just a failure of my imagination. But “the specific well-defined but arbitrary, unnatural order that almost every other similar bit of code in every language is likely to use” seems a lot more likely to be useful.

(And, needless to say, so does "this specific order that I defined myself". I don't know why the OP wanted to treat all NaNs as equivalent values greater than inf, but presumably he does know, and that's fine. So, I don't think that needs to be built into the module as one of the options, but I don't see why he shouldn't be able to specify it explicitly.)

But it DOESN'T make all positive NaNs equivalent, as they do not compare 'equal'. Neither is less than the other, but they don't compare equal. If you use a sort function (not the one that Python currently uses) that uses an equality test as part of its sort (rather than just letting the falsehood of a < b and b < a imply that a == b), then it will break and perhaps return illogical results. This is getting into the weeds of corner cases, so it may not be important.

I will admit that if you want to sort a list that contains a mix of floats, Decimals, and Rationals, then the tuple approach might be better than the conversion to a 64-bit int as the key, since that conversion doesn't work for the other types except by converting them to float first.

As to why NaNs are greater than infinity, that comes from the IEEE specification of the total_order relationship on floating-point numbers.

-- Richard Damon

On Dec 30, 2019, at 06:50, Richard Damon <Richard@damon-family.org> wrote:

On 12/30/19 12:06 AM, Andrew Barnert wrote:

On Dec 29, 2019, at 20:04, Richard Damon <Richard@damon-family.org> wrote:

Thus your total_order, while not REALLY a total order, is likely good enough for most purposes.

Well, it is a total order of equivalence classes (with all IEEE-equal values being equivalent, all negative NaNs being equivalent, and all positive NaNs being equivalent).

But whether it’s good enough for most purposes depends on what those purposes are.

I’m having a hard time imagining any purposes where “some well-defined but arbitrary, unnatural, and idiosyncratic total order of equivalence classes of the binary64 values” is of any use at all. Maybe that’s just a failure of my imagination. But “the specific well-defined but arbitrary, unnatural order that almost every other similar bit of code in every language is likely to use” seems a lot more likely to be useful.

(And, needless to say, so does “this specific order that I defined myself”. I don’t know why the OP wanted to treat all NaNs as equivalents values greater than inf, but presumably he does know, and that’s fine. So, I don’t think that needs to be built into the module as one of the options, but I don’t see why he shouldn’t be able to specify it explicitly.)

But it DOESN'T make all positive NaNs equivalent, as they do not compare 'equal'. Neither is less than the other, but they don't compare equal. If you use a sort function (not the one that Python currently uses) that uses an equality test as part of its sort (rather than just letting the falsehood of a < b and b< a imply that a == b) then it will break and perhaps return illogical results. This is getting into the weeds of corner cases, so may not be important.

Yes it does. If all positive NaNs sort the same with this key function and the Python sorted function, they form an equivalence class under that function. The fact that they wouldn’t form an equivalence class under a different function is irrelevant. It’s like arguing that 1 and 8 can’t be in the same equivalence class for mod 7 addition because they’re not equivalent under mod 11 addition. And I don’t even know what you’re trying to argue here. That we shouldn’t allow this key function because it wouldn’t provide a total order if someone monkeypatched builtins to replace sorted? And therefore we should invent some “better” order that nobody has asked for and offer that?

I will admit that if you want to sort a list that contains a mix of floats, Decimals, and Rationals, then the tuple approach might be better than the conversion to a 64-bit int as the key, since that doesn't work for the other types except by converting them to float first.

Better for what? If there’s no good use for the OP’s order or the tuple order or IEEE total order, we shouldn’t try to provide any of them; figuring out which one “makes sense” or coming up with one that “makes more sense” when we don’t even know what case we’re trying to make sense of is pointless. Inventing a new order that nobody asked for and nobody can even imagine a use for doesn’t help anyone, it’s just more code to write and maintain and one more option to confuse users with that benefits nobody. It doesn’t matter what properties that new order has. If there is a good use for any of them, on the other hand… inventing a different order that isn’t one any of them asked for still doesn’t help anyone. We need to provide what they actually asked for, or give them a way to do so (most obviously, with a key parameter), or just reject their use as unimportant.

As to why NaNs are greater than infinity, that comes from the IEEE specification on the total_order relationship of floating point numbers.

So what? If you care about IEEE total order, you want IEEE total order, not an unspecified idiosyncratic order that’s similar to it in some ways.

On 12/30/19 4:22 PM, Andrew Barnert via Python-ideas wrote:

On Dec 30, 2019, at 06:50, Richard Damon <Richard@damon-family.org> wrote:

On 12/30/19 12:06 AM, Andrew Barnert wrote:

On Dec 29, 2019, at 20:04, Richard Damon <Richard@damon-family.org> wrote:

Thus your total_order, while not REALLY a total order, is likely good enough for most purposes.

Well, it is a total order of equivalence classes (with all IEEE-equal values being equivalent, all negative NaNs being equivalent, and all positive NaNs being equivalent).

But whether it’s good enough for most purposes depends on what those purposes are.

I’m having a hard time imagining any purposes where “some well-defined but arbitrary, unnatural, and idiosyncratic total order of equivalence classes of the binary64 values” is of any use at all. Maybe that’s just a failure of my imagination. But “the specific well-defined but arbitrary, unnatural order that almost every other similar bit of code in every language is likely to use” seems a lot more likely to be useful.

(And, needless to say, so does "this specific order that I defined myself". I don't know why the OP wanted to treat all NaNs as equivalent values greater than inf, but presumably he does know, and that's fine. So, I don't think that needs to be built into the module as one of the options, but I don't see why he shouldn't be able to specify it explicitly.)

But it DOESN'T make all positive NaNs equivalent, as they do not compare 'equal'. Neither is less than the other, but they don't compare equal. If you use a sort function (not the one that Python currently uses) that uses an equality test as part of its sort (rather than just letting the falsehood of a < b and b < a imply that a == b) then it will break and perhaps return illogical results. This is getting into the weeds of corner cases, so may not be important.

Yes it does. If all positive NaNs sort the same with this key function and the Python sorted function, they form an equivalence class under that function. The fact that they wouldn't form an equivalence class under a different function is irrelevant. It's like arguing that 1 and 8 can't be in the same equivalence class for mod 7 addition because they're not equivalent under mod 11 addition.

And I don’t even know what you’re trying to argue here. That we shouldn’t allow this key function because it wouldn’t provide a total order if someone monkeypatched builtins to replace sorted? And therefore we should invent some “better” order that nobody has asked for and offer that?

The issue is that the function generates keys of the form (1, nan) for NaNs, and two of these do not compare equal, so they do not form an equivalence class under equality. If the sort routine only uses < and assumes that when neither a < b nor b < a the two are equal, that's fine, but not all sort routines use that method. I think the current sorted function just uses <, so this works as a key; it just isn't guaranteed to work for all sort algorithms. Basically, if a sort ever compares with a NaN using normal comparison rules, you have opened the door to problems. In this case we will only ever compare a NaN to another NaN, and depending on the sort routine, it still might work; if the algorithm never uses an equality test to see that they aren't equal (and few sorts actually test for equality) it won't cause a problem.

One simple solution is to just make the NaN tuples be (1, 0) or (-1, 0) and make all NaNs a real equivalence class. If you want NaNs to sort by their representation, you could make that the second term of the tuple (as an int).

I will admit that if you want to sort a list that contains a mix of floats, Decimals, and Rationals, then the tuple approach might be better than the conversion to a 64-bit int as the key, since that doesn't work for the other types except by converting them to float first.

Better for what?

If there’s no good use for the OP’s order or the tuple order or IEEE total order, we shouldn’t try to provide any of them; figuring out which one “makes sense” or coming up with one that “makes more sense” when we don’t even know what case we’re trying to make sense of is pointless. Inventing a new order that nobody asked for and nobody can even imagine a use for doesn’t help anyone, it’s just more code to write and maintain and one more option to confuse users with that benefits nobody. It doesn’t matter what properties that new order has.

If there is a good use for any of them, on the other hand… inventing a different order that isn’t one any of them asked for still doesn’t help anyone. We need to provide what they actually asked for, or give them a way to do so (most obviously, with a key parameter), or just reject their use as unimportant.

Building the 64-bit int key to sort floats by IEEE total_order only works if all the values are floats, as that key won't compare properly with other types of numbers. If your data does have other types of numbers, then making the tuple forces NaNs into their right place and lets the normal value comparison work for normal numbers (and infinities, which I think get compared right).

As to why NaNs are greater than infinity, that comes from the IEEE specification of the total_order relationship on floating-point numbers.

So what? If you care about IEEE total order, you want IEEE total order, not an unspecified idiosyncratic order that's similar to it in some ways.

The question was why NaNs were sorted this way, and the basic answer is that to get a usable order for sorting, you have to put them somewhere, and the IEEE total order definition is as good as any. Ultimately, no position would make sense universally. The order that IEEE chose is 1) a simple order to create (being based fairly simply on the representation of binary floating point), and 2) putting NaNs at the extremes may make it easier to remove them if that is what is desired.

The order he generates is very close to the IEEE total order; the differences are:

1) It doesn't separate -0 from +0, which shouldn't matter for most applications.

2) It doesn't provide an order between NaNs, but unless you are taking special measures to distinguish the NaNs anyway, that doesn't really matter.

-- Richard Damon

On Dec 30, 2019, at 14:05, Richard Damon <Richard@damon-family.org> wrote:

On 12/30/19 4:22 PM, Andrew Barnert via Python-ideas wrote:

On Dec 30, 2019, at 06:50, Richard Damon <Richard@damon-family.org> wrote:

On 12/30/19 12:06 AM, Andrew Barnert wrote:

On Dec 29, 2019, at 20:04, Richard Damon <Richard@damon-family.org> wrote:

Thus your total_order, while not REALLY a total order, is likely good enough for most purposes.

Well, it is a total order of equivalence classes (with all IEEE-equal values being equivalent, all negative NaNs being equivalent, and all positive NaNs being equivalent). But whether it's good enough for most purposes depends on what those purposes are. I'm having a hard time imagining any purposes where "some well-defined but arbitrary, unnatural, and idiosyncratic total order of equivalence classes of the binary64 values" is of any use at all. Maybe that's just a failure of my imagination. But "the specific well-defined but arbitrary, unnatural order that almost every other similar bit of code in every language is likely to use" seems a lot more likely to be useful.

(And, needless to say, so does "this specific order that I defined myself". I don't know why the OP wanted to treat all NaNs as equivalent values greater than inf, but presumably he does know, and that's fine. So, I don't think that needs to be built into the module as one of the options, but I don't see why he shouldn't be able to specify it explicitly.)

But it DOESN'T make all positive NaNs equivalent, as they do not compare 'equal'. Neither is less than the other, but they don't compare equal. If you use a sort function (not the one that Python currently uses) that uses an equality test as part of its sort (rather than just letting the falsehood of a < b and b < a imply that a == b) then it will break and perhaps return illogical results. This is getting into the weeds of corner cases, so may not be important.

Yes it does. If all positive NaNs sort the same with this key function and the Python sorted function, they form an equivalence class under that function. The fact that they wouldn't form an equivalence class under a different function is irrelevant. It's like arguing that 1 and 8 can't be in the same equivalence class for mod 7 addition because they're not equivalent under mod 11 addition.

And I don't even know what you're trying to argue here. That we shouldn't allow this key function because it wouldn't provide a total order if someone monkeypatched builtins to replace sorted? And therefore we should invent some "better" order that nobody has asked for and offer that?

The issue is that the function generates keys of the form (1, nan) for NaNs, and two of these do not compare equal, so they do not form an equivalence class under equality.

Forming an equivalence class is always under a specific relation, not absolute. The fact that they don't form an equivalence class under == is no more relevant than the fact that they do form an equivalence class under `is`. There are a huge number of potential relations over binary64 x binary64; many of them are equivalence relations and the rest are not, and the only thing that matters is the one actually used by sorted. Which is not `is` or == or any of the zillions of other possible relations; it's `not (a < b or b < a)`.

For floats with no key, that is not an equivalence relation. For example, 2 R nan, and nan R 3, but not 2 R 3, so R is not transitive. And that's why floats with NaNs don't get sorted correctly without a key. That's the problem you're apparently trying to solve.

And this key does solve that problem. key(2) R key(nan) doesn't hold, so the original intransitivity doesn't arise. And neither does any other; it's transitive over the whole domain. And it's also reflexive and symmetric. And that's the definition of an equivalence relation. And any equivalence relation partitions the set into equivalence classes, again by definition. Here, the (key-processed) NaNs are in one class, the other (key-processed) values are in the same classes they would have been in with no key function (which also happen to be the same classes they'd be in with == on the unkeyed values), and these classes are totally ordered under <. And that's what sorted requires. QED.

That doesn't mean it's a _useful_ solution. If I want a specific total ordering over specific equivalence classes and you give me a different total ordering over different equivalence classes, that isn't useful. But not because it isn't a total ordering over equivalence classes. It is; it's just the wrong one. So offering a different total ordering over equivalence classes that still isn't the one I asked for is no better and no worse: it's equally valid and equally useless.

If the sort routine only uses <, it can assume that when neither a < b nor b < a the two are equal, but not all sort routines use that method. I think the current sorted function just uses <, so this works as a key; it just isn't guaranteed to work for all sort algorithms.

But who cares that a key that median is passing to sorted wouldn’t work with other sort algorithms that it will never be used with? That doesn’t affect whether it works with sorted.

Basically, if a sort ever compares with a Nan using normal comparison rules, you have opened the door to problems.

But because median is not ever using a sort that will ever compare a NaN with ==, that isn’t relevant.

In this case we will only ever compare a NaN to another NaN, and depending on the sort routine, it still might work, and if the algorithm never uses an equality test to see that they aren't equal (and few sorts actually test for equality) it won't cause a problem.

One simple solution is to just make the NaN tuples be (1, 0) or (-1, 0) and make all NaNs a real equivalence class. If you want NaNs to sort by their representation, you could make that the second term of the tuple (as an int).

Solution to what? You just said there's no problem, so what problem are you trying to solve? The simplest solution to "it won't cause a problem" is to change nothing. If you're trying to solve the problem that someone could theoretically monkeypatch builtins.sorted and make it use a different algorithm, you haven't solved that, because someone could just as easily monkeypatch builtins.sorted and make it always return [3, 1, 2] for any input. If "what if someone replaces the sorted function?" isn't consenting adults territory, I don't know what is.

I will admit that if you want to sort a list that contains a mix of floats, Decimals, and Rationals, then the tuple approach might be better than the conversion to a 64-bit int as the key, since that doesn't work for the other types except by converting them to float first.

Better for what? If there's no good use for the OP's order or the tuple order or IEEE total order, we shouldn't try to provide any of them; figuring out which one "makes sense" or coming up with one that "makes more sense" when we don't even know what case we're trying to make sense of is pointless. Inventing a new order that nobody asked for and nobody can even imagine a use for doesn't help anyone, it's just more code to write and maintain and one more option to confuse users with that benefits nobody. It doesn't matter what properties that new order has. If there is a good use for any of them, on the other hand… inventing a different order that isn't one any of them asked for still doesn't help anyone. We need to provide what they actually asked for, or give them a way to do so (most obviously, with a key parameter), or just reject their use as unimportant.

Building the 64-bit int key to sort floats by IEEE total_order only works if all the values are floats, as that key won't compare properly with other types of numbers. If your data does have other types of numbers, then making the tuple forces NaNs into their right place and lets the normal value comparison work for normal numbers (and infinities, which I think get compared right).

There are a huge number of possible total orders of equivalence classes you could induce over the floats, and there's not one that's "the right one". They're all valid, but if someone needs a specific one, only that one is the one they need.

As to why NaNs are greater than infinity, that comes from the IEEE specification of the total_order relationship on floating-point numbers.

So what? If you care about IEEE total order, you want IEEE total order, not an unspecified idiosyncratic order that's similar to it in some ways.

The question was why were NaNs sorted this way, and the basic answer is to get a usable order for sorting, you have to put them somewhere, and the IEEE total order definition is as good as any. Ultimately, no position would make sense universally. The order that IEEE chose is 1) a simple order to create (being based fairly simply on the representation for binary floating point), and 2) putting NaNs at the extreme may make it easier to remove them if that is what is desired.

The order he generates is very close to the IEEE total order; the differences are:

1) It doesn't separate -0 from +0, which shouldn't matter for most applications.

2) It doesn't provide an order between NaNs, but unless you are taking special measures to distinguish the NaNs anyway, that doesn't really matter.

And it also doesn’t distinguish equal but distinct-bit-pattern subnormal values. Sure, most of these things don’t come up very often. But that’s just why needing IEEE total order doesn’t come up very often. If it does come up, it’s because one or all of these things matter, so offering something that’s kind of similar to IEEE total order doesn’t help. If the argument is that nobody will ever need IEEE total order, that’s fine. The person who asked for it already said he can’t think of a real use for it. And neither can anyone else. But then this solution is even less useful—it’s failing to solve a problem that didn’t even exist in the first place. If the only things people will ever need to do with nans and median is ignore, raise, poison, or backward compatibility with whatever Python 3.8 did, it doesn’t matter what fifth option you invent, it’s not going to solve anyone’s actual problem.

The order he generates is very close to the IEEE total order; the differences are:

1) It doesn't separate -0 from +0, which shouldn't matter for most applications.

2) It doesn't provide an order between NaNs, but unless you are taking special measures to distinguish the NaNs anyway, that doesn't really matter.

And it also doesn’t distinguish equal but distinct-bit-pattern subnormal values.

This is more in the category of "things that definitely do not matter", but I had forgotten about subnormal floating-point values. How does IEEE totalOrder mandate ordering those vs. equivalent normal numbers? So yes, my implementation of totalOrder probably has another incompleteness relative to the IEEE spec. It was THREE LINES of code, and I never claimed the goal was useful even if done correctly.

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Dec 30, 2019, at 15:56, David Mertz <mertz@gnosis.cx> wrote:

The order he generates is very close to the IEEE total order; the differences are:

1) It doesn't separate -0 from +0, which shouldn't matter for most applications.

2) It doesn't provide an order between NaNs, but unless you are taking special measures to distinguish the NaNs anyway, that doesn't really matter.

And it also doesn’t distinguish equal but distinct-bit-pattern subnormal values.

This is more in the category of "things that definitely do not matter", but I had forgotten about subnormal floating-point values. How does IEEE totalOrder mandate ordering those vs. equivalent normal numbers?

I think it’s in the category of “things that matter about as often as totalOrder matters in the first place (which is rarely)”. But that’s because the only reasons I can think of to want totalOrder are to make sure I get the exact same order on every implementation (so I’ll take any standard as long as it’s actually a standard that everyone implements the same way), or to make sure I get an order that distinguishes all distinct bit patterns (probably because I’m smuggling non-float values in my float bits or something? I can’t think of a use that would be meaningful with median…). In fact, in any other case I can think of besides those two, I would explicitly _not_ want this behavior, because it breaks stability (and probably hurts performance?) for no visible benefit. But maybe that’s just a failure of my imagination.

Anyway, without rereading the spec, I believe it mandates that all equal but distinct bit patterns are ordered by the exponent times the sign, which handles both -0<0 and more vs. less subnormal representations of equal finite values. While we’re at it, IIRC, it specifically mandates a<=b rather than a<b for unequal values, and I’m not sure why that is.

On Mon, Dec 30, 2019 at 06:55:56PM -0500, David Mertz wrote:

The order he generates is very close to the IEEE total order; the differences are:

1) It doesn't separate -0 from +0, which shouldn't matter for most applications.
2) It doesn't provide an order between NaNs, but unless you are taking special measures to distinguish the NaNs anyway, that doesn't really matter.

And it also doesn’t distinguish equal but distinct-bit-pattern subnormal values.

This is more in the category of "things that definitely do not matter", but I had forgotten about subnormal floating-point values. How does IEEE totalOrder mandate ordering those vs. equivalent normal numbers?

Subnormals (or denormalised numbers) are ordered in ascending order, like normal numbers, and slot in between zero and the normalised numbers. If we enumerate floats in the order of their bitwise representation, every float maps to an integer in the range 0 to 2**64-1 inclusive:

    0x0000000000000000                       zero
    0x0000000000000001...0x000FFFFFFFFFFFFF positive denormalised numbers
    0x0010000000000000...0x7FEFFFFFFFFFFFFF positive normalised numbers
    0x7FF0000000000000                       positive infinity
    0x7FF0000000000001...0x7FF7FFFFFFFFFFFF signalling NANs with sign-bit cleared
    0x7FF8000000000000...0x7FFFFFFFFFFFFFFF quiet NANs with sign-bit cleared
    0x8000000000000000                       negative zero
    0x8000000000000001...0x800FFFFFFFFFFFFF negative denormalised numbers
    0x8010000000000000...0xFFEFFFFFFFFFFFFF negative normalised numbers
    0xFFF0000000000000                       negative infinity
    0xFFF0000000000001...0xFFF7FFFFFFFFFFFF signalling NANs with sign-bit set
    0xFFF8000000000000...0xFFFFFFFFFFFFFFFF quiet NANs with sign-bit set

The total order reverses the order of the floats from 0x8000000000000000 onwards, giving us the numbers in ascending numeric order (aside from the NANs, which are shoved to the ends):

    quiet NANs with sign-bit set
    signalling NANs with sign-bit set
    negative infinity
    negative normalised numbers
    negative denormalised numbers
    negative zero
    zero
    positive denormalised numbers
    positive normalised numbers
    positive infinity
    signalling NANs with sign-bit cleared
    quiet NANs with sign-bit cleared

I have been tediously pedantic in referring to NANs with the sign-bit set or cleared rather than "negative NANs" and "positive NANs", since strictly speaking the standard doesn't offer an interpretation of the sign-bit when applied to NANs. Or at least the original version of the standard didn't. I believe there have been at least two revisions since the original; it is possible they have added an interpretation. Strictly speaking, the order of signalling and quiet NANs is not specified by the original version of the standard. Most platforms follow the order above, but a few may reverse the interpretation of the "signalling bit" (bit 51, counting from 0 on the right). I don't know if Python runs on any of those platforms. -- Steven
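For what it's worth, the enumeration above, combined with the complement-the-low-63-bits trick discussed elsewhere in this thread, fits in a few lines of pure Python using the struct module. This is an illustrative sketch only, not anything in the statistics module:

```python
import struct

def total_order_key(x):
    """Sort key approximating IEEE-754 totalOrder for Python floats.

    Reinterpret the float's 8 bytes as a signed 64-bit integer; for
    floats with the sign bit set, flip the low 63 bits so that
    sign-magnitude bit order becomes plain integer order.
    """
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    if bits < 0:                      # sign bit set
        bits ^= 0x7FFFFFFFFFFFFFFF   # complement the low 63 bits
    return bits

data = [1.0, float("inf"), -0.0, float("-inf"), 0.0, float("nan"), -1.0]
print(sorted(data, key=total_order_key))
```

With this key, -0.0 sorts before 0.0, and the (positive, quiet) NaN produced by float("nan") sorts after +inf, matching the enumeration above.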

On 12/30/19 6:55 PM, David Mertz wrote:

The order he generates is very close to the IEEE total order; the differences are:

> 1) It doesn't separate -0 from +0, which shouldn't matter for most applications.
> 2) It doesn't provide an order between NaNs, but unless you are taking special measures to distinguish the NaNs anyway, that doesn't really matter.

And it also doesn’t distinguish equal but distinct-bit-pattern subnormal values.

This is more in the category of "things that definitely do not matter", but I had forgotten about subnormal floating-point values. How does IEEE totalOrder mandate ordering those vs. equivalent normal numbers?

So yes, my implementation of totalOrder probably has another incompleteness for the IEEE spec. It was THREE LINES of code, and I never claimed the goal was useful even if done correctly.

IEEE BINARY floating point doesn't have subnormal values that are the equivalent of normal numbers. The binary format has an assumed leading one in the mantissa for exponents greater than 0 (and less than all-ones, which are the infinities and NaNs). This assumed one makes multiple representations for most values impossible; it only happens for 0, giving +0 and -0. The first level of the ordering is the basic value of the number, which provides a total order for all finite numbers but 0 in the binary format. Other floating point bases, like decimal floating point or hex based, don't have a consistent leading digit, so it isn't omitted, and you could have subnormal values if the leading digit is 0. The clause on handling different representations really just applies to those non-binary formats. -- Richard Damon

[Richard Damon <Richard@damon-family.org>]

IEEE total_order puts NaN as bigger than infinity, and -NaN as less than -inf.

One simple way to implement it is to convert the representation to a 64 bit signed integer (not its value, but its representation) and if the sign bit is set, complement the bottom 63 bits (because floats are signed-magnitude). For pure python code, I don't know how hard it is to get the representation of a float as a 64 bit integer. In C or Assembly it is fairly easy as you can easily get around the type system, but I don't know python well enough to do it.

There's a Python implementation here: https://bugs.python.org/msg336487

How is that fancy bitmask version different from my 3-line version?

On Sun, Dec 29, 2019, 11:01 PM Tim Peters <tim.peters@gmail.com> wrote:

[Richard Damon <Richard@damon-family.org>]

IEEE total_order puts NaN as bigger than infinity, and -NaN as less than -inf.

One simple way to implement it is to convert the representation to a 64 bit signed integer (not its value, but its representation) and if the sign bit is set, complement the bottom 63 bits (because floats are signed-magnitude). For pure python code, I don't know how hard it is to get the representation of a float as a 64 bit integer. In C or Assembly it is fairly easy as you can easily get around the type system, but I don't know python well enough to do it.

There's a Python implementation here:

https://bugs.python.org/msg336487

[David]

How is that fancy bitmask version different from my 3-line version?

Where he's referring to my: https://bugs.python.org/msg336487 and, I presume, to his:

    def total_order(x):
        if math.isnan(x):
            return (math.copysign(1, x), x)
        return (0, x)

Richard spelled out some differences in the message of his I was replying to (but which I snipped in my reply to him, to get to the only point I had to make). So read his message again. For example, the IEEE total order requires that -0 be "less than" +0, but your `total_order()` function treats them as equal.

On 12/29/19 11:00 PM, Tim Peters wrote:

[Richard Damon <Richard@damon-family.org>]

IEEE total_order puts NaN as bigger than infinity, and -NaN as less than -inf.

One simple way to implement it is to convert the representation to a 64 bit signed integer (not its value, but its representation) and if the sign bit is set, complement the bottom 63 bits (because floats are signed-magnitude). For pure python code, I don't know how hard it is to get the representation of a float as a 64 bit integer. In C or Assembly it is fairly easy as you can easily get around the type system, but I don't know python well enough to do it.

There's a Python implementation here:

Yep, that looks like the canonical total_order; I just didn't know how to convert a float into an array of bytes and then to an int. Now if only this were added to the library, and either median used it when the data was all floats, or sorted used it as part of the 'all floats' optimization. -- Richard Damon

On Dec 29, 2019, at 18:42, David Mertz <mertz@gnosis.cx> wrote:

On Sun, Dec 29, 2019, 9:23 PM Andrew Barnert

Here it is. I could save a line by not using the 'else'.

    def total_order(x):
        if is_nan(x):
            return (math.copysign(1, x), x)
        else:
            return (0, x)

This doesn’t give you IEEE total order. Under what circumstances would you prefer this to, say, Decimal.compare_total, which does?

How does this differ from IEEE? I'm not being rhetorical, I really thought that was their order. But I haven't looked carefully.

IEEE total order specifies a distinct order for every distinct bit pattern, and tries to do so in a way that makes sense. IIRC, this means -0.0 comes before 0.0, distinct but equal denormal values are ordered by exponent, sNaN comes after qNaN, two NaNs of the same kind are ordered by payload, negative NaNs come at the front rather than the back, and… I think that’s all the rules, but I wouldn’t want to bet too heavily on that.

I don’t think you usually care too much about exactly how the values are ordered, but you do care that it’s consistent across implementations—two different programs sorting the same data in IEEE total order should give the same order. And leaving it up to the type and/or the end user means that you (or Stephen) don’t actually need to know how to correctly totally order Decimals or mpfrs, much less some crazy thing like Levi-Civita numbers; only the person writing the decimal or gmpy2 or infinitesimal module does.
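For Decimal, the stdlib already exposes this ordering as Decimal.compare_total, which makes rules like "-0 comes before 0" directly observable (compare_total returns -1, 0, or 1 as a Decimal):

```python
from decimal import Decimal

# compare_total uses the total ordering on the abstract representation,
# so it distinguishes values that compare equal with ==.
print(Decimal("-0").compare_total(Decimal("0")))          # prints -1: -0 sorts before 0
print(Decimal("NaN").compare_total(Decimal("Infinity")))  # prints 1: NaN sorts after +Infinity
```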

median_employee = statistics.median((e.salary, e) for e in employees)[1]

By the same argument, nothing actually needs a key function because you can always decorate-sort-undecorate. And yet, key functions are useful in all kinds of places.
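Spelled out, the decorate-sort-undecorate version looks something like this. The Employee data is hypothetical, purely for illustration; note that with an even-length input, median would try to average the two middle tuples and fail, so median_low/median_high are safer for the decorated form:

```python
import statistics
from collections import namedtuple

Employee = namedtuple("Employee", ["name", "salary"])
employees = [Employee("alice", 30_000),
             Employee("bob", 50_000),
             Employee("carol", 40_000)]

# Decorate each employee with its sort key; median picks the middle
# tuple (odd count, so no averaging happens); [1] undecorates.
median_employee = statistics.median((e.salary, e) for e in employees)[1]
print(median_employee)  # the employee with the middle salary
```

A key parameter would let this be written as median(employees, key=lambda e: e.salary), without the decorate/undecorate dance.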

It's a balance, I think. `key_fn=` feels much too "heavy" for users of statistics module.

Key functions are ubiquitous in Python, and I think people learn them reasonably early. You can always get around them with DSU, but the result is always less readable, and that’s why they’re ubiquitous. And that really isn’t any different here than in itertools.groupby or heapq.merge or anywhere else. If anything, DSU is trickier here than elsewhere (how do you undecorate after the sort but before the average-two-values step?), but key is not. IEEE total order, on the other hand… how many people even know that it’s a thing, much less know when and how to use it?

Handling NaNs one way or the other is the 95% concern. And for the 5%, DSU covers it.

The 95% case is handled by just ignore and raise. Novices should probably never be using anything else. Experts will definitely often want poison. And probably sometimes fast for backward compatibility and/or performance. That gets you to 98%. Experts will rarely but not never want total order. But when they do, they’ll know what they want, and how to find the appropriate function for the type they’re using, and how to pass it as a key function to a key parameter that works the same way as everywhere else in Python. And experts might also want something different from IEEE total order, like uniformly pushing all NaNs to the end. I’m not sure when you’d actually want that, but since it was the original suggestion that kicked off this whole discussion, it’s obviously not inconceivable. I get the idea that, once you’ve already got an on_nan param, adding another value to that param doesn’t add as much cognitive load as adding a whole other param would. But I think a total order value is so rarely useful that it’s probably more load than it’s worth, while a key param is a more widely useful and therefore worth more load (although maybe still not enough).

On Sun, Dec 29, 2019 at 11:33 PM Andrew Barnert <abarnert@yahoo.com> wrote:

IEEE total order specifies a distinct order for every distinct bit pattern, and tries to do so in a way that makes sense.

Ok, ok... I've got "learned up" about this three times now :-). Given we cannot control those bit patterns from Python, I'm a bit "meh"... but I get the rule (yeah, yeah, struct module)

The 95% case is handled by just ignore and raise. Novices should probably never be using anything else. Experts will definitely often want poison. And probably sometimes fast for backward compatibility and/or performance. That gets you to 98%.

Fair enough. I really only care about the 98% case. But if you can convince Steven to add `key=` as well, no real harm to me. My only concern is a beginner who types `help(median)` and scratches her head over the key oddness. But I guess the docstring can say "Don't worry about this if you don't need a custom sort order for your objects." It's also no real extra work to pass along a `key` argument to the `sorted()` internal to the function. I guess on the off chance the implementation moves to Quickselect it will be slightly more work. But I guess really not that much even then (hmmm... I think the implementation would have to contain a kind of DSU inside it though for that). Do remember that using `sorted()` is an implementation detail, not a promise of the functions in the statistics module.

And experts might also want something different from IEEE total order, like uniformly pushing all NaNs to the end. I’m not sure when you’d actually want that, but since it was the original suggestion that kicked off this whole discussion, it’s obviously not inconceivable.

I think that idea of "NaNs to the end" was just ill-conceived nonsense. I mean MAYBE I can see some good in putting the +/- NaNs at both ends of the order, so MAYBE the stuff in the middle winds up being the median. But if you have 100 NaNs and 50 real numbers, it seems just silly to automatically select NaN as the median. Especially under the "missing data" use that is so common in data science (Pandas, R, etc).

I get the idea that, once you’ve already got an on_nan param, adding another value to that param doesn’t add as much cognitive load as adding a whole other param would. But I think a total order value is so rarely useful that it’s probably more load than it’s worth, while a key param is a more widely useful and therefore worth more load (although maybe still not enough).

Oh... yeah, the 'ieee_total_order' value was absolutely silly and over-specialized. I just threw it in because some folks in the thread mentioned it. Albeit, it's a value that would rarely be used for the one parameter, so that's a little less burden than another parameter. In Pandas we see that a lot... some parameter will have 20 options, but 95% of the users use the default, and 4.9% use one non-default option. So the remaining 18 options cover the 0.1% use cases. Yours, David...

On Dec 29, 2019, at 21:00, David Mertz <mertz@gnosis.cx> wrote:

On Sun, Dec 29, 2019 at 11:33 PM Andrew Barnert <abarnert@yahoo.com> wrote:

IEEE total order specifies a distinct order for every distinct bit pattern, and tries to do so in a way that makes sense.

Ok, ok... I've got "learned up" about this three times now :-). Given we cannot control those bit patterns from Python, I'm a bit "meh"... but I get the rule (yeah, yeah, struct module)

That’s either me or AT&T being slow; either way, apologies.

The 95% case is handled by just ignore and raise. Novices should probably never be using anything else. Experts will definitely often want poison. And probably sometimes fast for backward compatibility and/or performance. That gets you to 98%.

Fair enough. I really only care about the 98% case. But if you can convince Steven to add `key=` as well, no real harm to me. My only concern is a beginner who types `help(median)` and scratches her head over the key oddness. But I guess the docstring can say "Don't worry about this if you don't need a custom sort order for your objects."

I’m more concerned that a beginner who types help(median) sees an on_nan that has five or more values they don’t understand, and has no idea which one to choose. But I guess the same thing here: the docstring can tell them that they almost always want the default RAISE, or IGNORE if they’re using NaN to mean missing values, and then describe the other options. (It’s not like they have to be in alphabetical order…) And anyway, the real problem is, as always, beginners who copy-paste something off StackOverflow or a blog post without reading any help or even the text around the code, and then want to know “why this crash” when they explicitly passed a list full of NaNs with on_nan=RAISE. And, given that there are at least two frequently-useful behaviors, there has to be something that those people are going to misuse…

Do remember that using `sorted()` is an implementation detail, not a promise of functions in statistics module.

Sure, but conceptually median really is about order in a way that, e.g., mean and stdev are not. You’re asking for the middle value in sorted order, even if the fact that this happens to be done by actually sorting the values is an implementation detail you shouldn’t care about.

On Sun, Dec 29, 2019 at 08:32:52PM -0800, Andrew Barnert via Python-ideas wrote:

The 95% case is handled by just ignore and raise. Novices should probably never be using anything else.

Experts will definitely often want poison. And probably sometimes fast for backward compatibility and/or performance. That gets you to 98%.

Experts will rarely but not never want total order.

Can you explain the scenario where somebody using median will want negative NANs to sort to the beginning, below -INF, and positive NANs to sort to the end, above +INF? I have no doubt that total order for floats is useful. I just want to know when it would be useful for users of median.

And experts might also want something different from IEEE total order, like uniformly pushing all NaNs to the end.

Likewise. I'm sure there are many uses of sorting NANs to one end, but when will it be useful for median? -- Steven

On Dec 29, 2019, at 23:50, Steven D'Aprano <steve@pearwood.info> wrote:

On Sun, Dec 29, 2019 at 06:23:03PM -0800, Andrew Barnert via Python-ideas wrote:

Likewise, it’s even easier to write ignore-nan yourself than to write the DSU yourself:

median = statistics.median(x for x in xs if not x.isnan())

Try that with xs = [1, 10**400, 2] and come back to me.

Presumably the end user (unlike the statistics module) knows what data they have. You use this code when you’re passing in Decimals where NaN means missing values. You don’t use it when you’re passing in integers, or Decimals where NaN means an unexpected error, or Decimals where None rather than NaN means missing data, or whatever. Any of those cases are trivial. Could there be some use case where you know your data is either all ints or all Decimals but not which, and NaNs mean missing values? I suppose that’s not impossible. In that case, you have to write the appropriate function to filter out Decimal NaNs without breaking on ints.

On Dec 29, 2019, at 23:59, Steven D'Aprano <steve@pearwood.info> wrote:

On Sun, Dec 29, 2019 at 08:32:52PM -0800, Andrew Barnert via Python-ideas wrote:

The 95% case is handled by just ignore and raise. Novices should probably never be using anything else.

Experts will definitely often want poison. And probably sometimes fast for backward compatibility and/or performance. That gets you to 98%.

Experts will rarely but not never want total order.

Can you explain the scenario where somebody using median will want negative NANs to sort to the beginning, below -INF, and positive NANs to sort to the end, above +INF?

I don’t know of one. I just assumed that since at least two people asked for it on this thread, some people will sometimes want it. (I could maybe imagine total order being useful for non-NaN cases, to make sure I get the same value for [0, -0, -10] in my Python code and my C++ code. But nobody’s actually asked for that; they asked for it specifically for NaN.)

And experts might also want something different from IEEE total order, like uniformly pushing all NaNs to the end.

Likewise. I'm sure there are many uses of sorting NANs to one end, but when will it be useful for median?

Likewise, I assumed that since the original post that started this thread was specifically asking for that, at least one person has a use for it. I have no idea what that use is. Maybe both of these requests were spurious. If no custom ordering is ever needed, then great; we’re down to raise and ignore, plus fast/backward-compatible, plus maybe poison. These can all be implemented pretty easily for the four types that median actually supports. And doing them all in the obvious way will mean they all happen to work for all types that are totally-ordered-except-NaN with a NaN whose semantics are IEEE and that can be detected by the same method as Decimal, which is already more than good enough. But if any custom ordering is ever needed, I can’t see how it helps to make it easy to use an order that’s somewhat similar to the two that people have asked for but not the same as either. If anyone needs the ordering defined by Decimal.compare_total, they probably know that’s the function they need, and they ought to know how to pass a key function. So, rather than trying to add a fifth option that gives them something they don’t want, just add a key parameter.

On Mon, Dec 30, 2019 at 3:32 AM Andrew Barnert via Python-ideas < python-ideas@python.org> wrote:

On Dec 29, 2019, at 23:50, Steven D'Aprano <steve@pearwood.info> wrote:

On Sun, Dec 29, 2019 at 06:23:03PM -0800, Andrew Barnert via Python-ideas wrote:

Likewise, it’s even easier to write ignore-nan yourself than to write the DSU yourself:

median = statistics.median(x for x in xs if not x.isnan())

Try that with xs = [1, 10**400, 2] and come back to me.

Presumably the end user (unlike the statistics module) knows what data they have.

No, Steven is right here. In Python we might very sensibly mix numeric datatypes. But this means we need an `is_nan()` function like some discussed in these threads, not rely on a method (and not the same behavior as math.isnan()). E.g.:

    my_data = {'observation1': 10**400,   # really big amount
               'observation2': 1,         # ordinary size
               'observation3': 2.0,       # ordinary size
               'observation4': math.nan}  # missing data

    median = statistics.median_high(x for x in my_data.values() if not is_nan(x))

The answer '2.0' is plainly right here, and there's no reason we shouldn't provide it.

On 12/30/19 11:54 AM, David Mertz wrote:

On Mon, Dec 30, 2019 at 3:32 AM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:

On Dec 29, 2019, at 23:50, Steven D'Aprano <steve@pearwood.info> wrote:
>
> On Sun, Dec 29, 2019 at 06:23:03PM -0800, Andrew Barnert via Python-ideas wrote:
>
>> Likewise, it’s even easier to write ignore-nan yourself than to write the DSU yourself:
>>
>> median = statistics.median(x for x in xs if not x.isnan())
>
> Try that with xs = [1, 10**400, 2] and come back to me.

Presumably the end user (unlike the statistics module) knows what data they have.

No, Steven is right here. In Python we might very sensibly mix numeric datatypes. But this means we need an `is_nan()` function like some discussed in these threads, not rely on a method (and not the same behavior as math.isnan()).

E.g.:

    my_data = {'observation1': 10**400,   # really big amount
               'observation2': 1,         # ordinary size
               'observation3': 2.0,       # ordinary size
               'observation4': math.nan}  # missing data

    median = statistics.median_high(x for x in my_data.values() if not is_nan(x))

The answer '2.0' is plainly right here, and there's no reason we shouldn't provide it.

My preference is that the interpretation that NaN means Missing Data isn't appropriate for the statistics module. In your code, because you put the filter in, YOU added that meaning, which is OK, but I see no grounds to say that statistics.median(my_data) MUST be 2.0, and several other logical results have been presented. For instance, if your last point had been computed as 1e400 - 1e399, which results in a nan, then 2.0 is NOT the reasonable answer; judging from the numbers (before we lost precision to the subtraction of infinities) it would be inf, or maybe something close to 4.5e399 had the e-notation numbers not overflowed to infinity but stayed bignums or decimals. Since Python DOES support mixed-type lists, I see no reason that Python needs to adopt the ancient, domain-specific (and not universal in the domain) usage of NaN as missing data; instead, the Python idiom should more likely be something like None (which gets around the difficulty of detecting the multiple forms of NaN). Now one issue with your example, which may be the point, is that currently the documentation of median says it does NOT support mixed-type lists like the one given above. It does seem to handle them as long as the comparison function gives reasonable results, but I suspect there are some combinations of extreme values of differing types where the comparison fails, and I am not sure there is an easy way to make ALL the Number classes always comparable to each other; one issue is that which type to do the comparison in most efficiently is value-dependent (magnitude and how close the values are together). -- Richard Damon
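Richard's 1e400 - 1e399 example is easy to check: Python float literals beyond the representable range overflow to infinity, and inf - inf is NaN. A quick sketch:

```python
import math

x = 1e400   # overflows the float range: becomes inf
y = 1e399   # likewise inf
print(x - y)  # nan: inf - inf is undefined
assert math.isinf(x) and math.isinf(y)
assert math.isnan(x - y)

# With exact integers there is no overflow and the true difference survives:
print(10**400 - 10**399 == 9 * 10**399)  # True
```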

On Mon, Dec 30, 2019 at 12:37 PM Richard Damon <Richard@damon-family.org> wrote:

My preference is that the interpretation that NaN means Missing Data isn't appropriate for the statistics module.

You need to tell the entire PyData ecosystem, the entire R ecosystem, and pretty much all of Data Science that they are wrong, then. I would generally prefer a different sentinel value as well, but you are saying to refuse to interoperate with hundreds of millions of lines of code that do not meet the rule you have now declared. I suppose purity beats practicality though.

On 12/30/19 12:45 PM, David Mertz wrote:

On Mon, Dec 30, 2019 at 12:37 PM Richard Damon <Richard@damon-family.org> wrote:

My preference is that the interpretation that NaN means Missing Data isn't appropriate for the statistics module.

You need to tell the entire PyData ecosystem, the entire R ecosystem, and pretty much all of Data Science that they are wrong, then. I would generally prefer a different sentinel value as well, but you are saying to refuse to interoperate with hundreds of millions of lines of code that do not meet the rule you have now declared.

I suppose purity beats practicality though.

First, for R and other languages where arrays of data are single-typed, NaN is a sort of reasonable (or at least a least-wrong) value. That is the environment where the convention started. There, practicality beats trying to be pure, and once you decide you need a No Data value, NaN is better than -99999 (one of the other historical choices). In the domain of advanced statistical packages that derive from that history, I can accept that usage, when used by people who understand its implications and who use packages adapted to Python from that domain. The statistics package does NOT come from that history, and explicitly refers people to packages that are for those usages. I would note that if median ignored NaNs, then so should things like mean and stdev, which don't; they return NaN. This would be an argument that the 'poison' option maybe should be the default option for median if a NaN policy is added. -- Richard Damon

On Dec 30, 2019, at 08:55, David Mertz <mertz@gnosis.cx> wrote:

Presumably the end user (unlike the statistics module) knows what data they have.

No, Steven is right here. In Python we might very sensibly mix numeric datatypes.

The statistics module explicitly doesn’t support doing so. Which means anyone who’s doing it anyway is into “experienced user” territory, and ought to know what they’re doing. At any rate, I wasn’t arguing that we don’t need a NaN test function in statistics. My point—lost by snipping off all the context—was nearly the opposite. The fact that you can NaN-filter things yourself (more easily than the statistics module can) doesn’t mean the module shouldn’t offer an ignore option—and therefore, the fact that you can DSU things yourself (less easily than using a key function) doesn’t mean the module shouldn’t offer a key parameter. (There may be other good arguments against a key parameter. The fact that all three of the alternate orders anyone’s asked for or suggested turned out to be spurious, and nobody can think of a good use for a different one, that’s a pretty good argument that YAGNI. But that doesn’t make the bogus argument from “theoretically you could do it yourself so we don’t need to offer it no matter how useful” any less bogus.)

But this means we need an `is_nan()` function like some discussed in these threads, not rely on a method (and not the same behavior as math.isnan()).

Wait, what’s wrong with the behavior of math.isnan for floats? If you want a NaN test that differs from the one defined by IEEE, I think we’re off into uncharted waters. Let’s get concrete: say we have a function that tries the method, and, on exception, tries math for floats, returns false for other Numbers, and finally raises a TypeError if all of the above failed. (If this were a general thing rather than a statistics thing, add trying cmath too.) What values of what types does that not serve? People keep trying to come up with “better” NaN tests than the obvious one, but better for what? If you don’t have an actual problem to solve, what use is a solution, no matter how clever?
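The function described here might be sketched like this. The name `is_nan` is taken from the thread; the fallback order (type's own method, then math/cmath for float/complex, then "not a NaN" for other Numbers, then TypeError) follows the description above, and this is only an illustrative sketch, not an existing stdlib function:

```python
import cmath
import math
import numbers

def is_nan(x):
    """NaN test that works across int, float, Fraction, Decimal, complex.

    Try the type's own is_nan() method first (Decimal provides one),
    then fall back to math/cmath for float/complex, then treat any
    remaining Number as not-a-NaN, and finally raise TypeError.
    """
    try:
        return bool(x.is_nan())        # e.g. Decimal.is_nan()
    except AttributeError:
        pass
    if isinstance(x, float):
        return math.isnan(x)
    if isinstance(x, complex):
        return cmath.isnan(x)
    if isinstance(x, numbers.Number):  # int, Fraction, ... can never be NaN
        return False
    raise TypeError(f"not a number: {x!r}")
```

Note that is_nan(10**400) returns False here, whereas math.isnan(10**400) raises OverflowError because it first converts the int to a float; the isinstance checks avoid that conversion.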

E.g.:

    my_data = {'observation1': 10**400,   # really big amount
               'observation2': 1,         # ordinary size
               'observation3': 2.0,       # ordinary size
               'observation4': math.nan}  # missing data

    median = statistics.median_high(x for x in my_data.values() if not is_nan(x))

The answer '2.0' is plainly right here, and there's no reason we shouldn't provide it.

Wait, are you arguing that we should just offer a generic is_nan function (as a builtin?), instead of adding an on_nan handler parameter to median and friends? If so, apologies; I guess I was disagreeing with someone else’s very different position above, not yours.

This helps users who are sophisticated enough to intentionally use NaNs for missing data, and to know they want to filter them out of a median, and to know how to do that with a genexpr, and to know when you can and can’t safely ignore the docs on which inputs are supported by statistics, but not sophisticated enough to write an isnan test for their mix of two types. But do any such users exist?

Writing a NaN test that works for your values even though you intentionally mixed two types isn’t the hard part. It’s knowing what to do with that NaN test. Which still isn’t all that hard, but it’s something a lot of novices haven’t learned yet. I think there are a lot more users of the statistics module who would be helped by raise and ignore options on median than by just giving them the simple tools to build that behavior themselves and hoping they figure out that they need to.

On Mon, Dec 30, 2019, 5:17 PM Andrew Barnert

The fact that all three of the alternate orders anyone’s asked for or suggested turned out to be spurious, and that nobody can think of a good use for a different one, is a pretty good argument for YAGNI.

I think everyone agrees that the only actual use cases are 'ignore', 'poison', 'raise', and I suppose 'fast/unsafe'. I missed some details in trying to emulate IEEE total_order(), but my point was never that it was desirable, just easy (but it turns out to be slightly less easy than I first thought). Any actual order involving NaNs is a fool's mission.

Wait, what’s wrong with the behavior of math.isnan for floats? If you want a NaN test that differs from the one defined by IEEE, I think we’re off into uncharted waters.

I do not believe or accept that statistics is meant to "blow up on non-floats". I might like math.isnan() to behave better with non-float numbers, but that's a different issue. You seem to propose a perfectly good version of a general is_nan() later in your comment.

Wait, are you arguing that we should just offer a generic is_nan function (as a builtin?), instead of adding an on_nan handler parameter to median and friends?

I don't think a built-in. Maybe in math, or statistics. I don't care if it's spelled '_isnan()' for private use by statistics. I'm not worried about public functionality. Just avoiding repetition in different paths in 'median_<whatever>' and xtile().

On Dec 30, 2019, at 14:35, David Mertz <mertz@gnosis.cx> wrote:

On Mon, Dec 30, 2019, 5:17 PM Andrew Barnert

The fact that all three of the alternate orders anyone’s asked for or suggested turned out to be spurious, and that nobody can think of a good use for a different one, is a pretty good argument for YAGNI.

I think everyone agrees that the only actual use cases are 'ignore', 'poison', 'raise', and I suppose 'fast/unsafe'.

I missed some details in trying to emulate IEEE total_order(), but my point was never that it was desirable, just easy (but it turns out to be slightly less easy than I first thought). Any actual order involving NaNs is a fool's mission.

OK. Sorry; I misinterpreted your point, and I think we agree here. Well, I guess I still disagree about total_order looking easy (nothing with IEEE floats is ever as easy as you expect…) but that isn’t relevant here. :)

Wait, what’s wrong with the behavior of math.isnan for floats? If you want a NaN test that differs from the one defined by IEEE, I think we’re off into uncharted waters.

I do not believe or accept that statistics is meant to "blow up on non-floats". I might like math.isnan() to behave better with non-float numbers, but that's a different issue. You seem to propose a perfectly good version of a general is_nan() later in your comment.

OK, I thought you were saying math.isnan is the wrong semantics even for floats. The key to me is that something “general enough for median and friends” is pretty easy, while something truly general that handles all possible nan-like values in all possible types is not.

Wait, are you arguing that we should just offer a generic is_nan function (as a builtin?), instead of adding an on_nan handler parameter to median and friends?

I don't think a built-in. Maybe in math, or statistics. I don't care if it's spelled '_isnan()' for private use by statistics. I'm not worried about public functionality. Just avoiding repetition in different paths in 'median_<whatever>' and xtile().

I think they all start with data = sorted(data), so they could all change to use data = _nan_smart_sorted(data, on_nan), and just put all the smarts (prefiltering and/or keying with isnan depending on the value of on_nan) inside that _nan_smart_sorted helper. The only issue to worry about is that I think some of them check for a minimum of 1 or 2 elements before sorting, and, given IGNORE (and POISON?) they probably need to instead check for that after the filter-and-sort step. (Presumably asking for the xtile of [nan, nan, nan] with IGNORE is an error, not nan.)
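A rough sketch of that refactoring (the helper name, the on_nan spellings, and the private exception are all hypothetical):

```python
import math

class _NanError(ValueError):
    pass

def _nan_smart_sorted(data, on_nan='fast'):
    # 'fast' skips all checks (current behaviour); 'ignore' filters NaNs
    # out; 'raise'/'poison' signal the first NaN via _NanError.
    data = list(data)
    if on_nan == 'ignore':
        data = [x for x in data if not math.isnan(x)]
    elif on_nan in ('raise', 'poison'):
        if any(math.isnan(x) for x in data):
            raise _NanError('NaN in data')
    return sorted(data)

def median_low(data, on_nan='fast'):
    try:
        data = _nan_smart_sorted(data, on_nan)
    except _NanError:
        if on_nan == 'poison':
            return math.nan
        raise
    if not data:
        # the minimum-size check has to happen *after* filtering
        raise ValueError('no data points left after NaN handling')
    return data[(len(data) - 1) // 2]
```

As noted above, the empty-data check moves after the filter step, so that all-NaN input under 'ignore' is an error rather than a silent answer.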

On Mon, Dec 30, 2019 at 2:58 AM Steven D'Aprano <steve@pearwood.info> wrote:

Can you explain the scenario where somebody using median will want negative NANs to sort to the beginning, below -INF, and positive NANs to sort to the end, above +INF?

I can kinda-sorta provide a case. But overall, despite my whimsical inclusion of an `ieee_total_order()` option in my sample implementation, I don't really think this is something we should care about.

    def my_results(it):
        for x in it:
            x_1 = func1_with_asymptotes(x)
            x_2 = func2_with_asymptotes(x)
            result = x_1 / x_2
            yield result

    median = statistics.median(my_results(my_iterator))

In concept, if some of my answers "fly off to +inf" and others "fly off to -inf", then maybe I want to balance those and look at the "sensible" stuff in the middle. But honestly, this feels more like an 'ignore' case. I don't know whether 'inf/inf' is less than or greater than 42. The operation in the Real domain might have a determinate answer (or might have had I used float128 instead, for example). But as is, ignoring the NaNs is the best I can do.

Moreover, it seems like Python does not preserve the sign of the NaNs for these operations. I would THINK that -inf/inf should be -nan. But nope. Operations that get NaNs in Python just don't seem to respect the "sensible" sign of them.

    >>> math.copysign(1, float('nan'))
    1.0
    >>> math.copysign(1, float('-nan'))
    -1.0
    >>> math.copysign(1, float('-inf'))
    -1.0
    >>> math.copysign(1, float('-inf') / float('-inf'))
    -1.0
    >>> math.copysign(1, float('-inf') / float('inf'))
    -1.0
    >>> math.copysign(1, float('inf') / float('inf'))
    -1.0

Steven D'Aprano <steve@pearwood.info> wrote:

Can you explain the scenario where somebody using median will want

negative NANs to sort to the beginning, below -INF, and positive NANs to sort to the end, above +INF?

No, and I don’t think anyone really wants that. I think it was proposed as a possible improvement on non-deterministic results, and/or a way to address this issue by only changing how the sorting is done. Whether IEEE total order is useful in some other contexts is a totally separate question.

-CHB

--
Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

On 12/30/19 12:18 PM, Christopher Barker wrote:

Steven D'Aprano <steve@pearwood.info <mailto:steve@pearwood.info>> wrote:

Can you explain the scenario where somebody using median will want negative NANs to sort to the beginning, below -INF, and positive NANs to sort to the end, above +INF?

No, and I don’t think anyone really wants that. I think it was proposed as a possible improvement on non-deterministic results, and/or a way to address this issue by only changing how the sorting is done.

Whether IEEE total order is useful in some other contexts is a totally separate question.

-CHB

I would say that the only reason to use that in median is that it at least produces results that are consistent and not dependent on input data order, if we assume NaN results are rare close to the actual median (whatever that really means) of the data. It converts Garbage In, Crazy Garbage Out into Garbage In, Somewhat Garbage Out (maybe it's abstract art?).

--
Richard Damon

On Sun, Dec 29, 2019 at 06:23:03PM -0800, Andrew Barnert via Python-ideas wrote:

Likewise, it’s even easier to write ignore-nan yourself than to write the DSU yourself:

median = statistics.median(x for x in xs if not x.isnan())

Try that with xs = [1, 10**400, 2] and come back to me.

So, why isn’t adding a key parameter (as well as an on_nan that takes fast/ignore/raise/poison) to median useful?

That's the wrong question. The right question is, why is adding a key parameter useful?

The statistics module is for calculating statistics. Think of it as like a button on your scientific calculator in STATS mode. It's not designed for tasks like finding the employee with the median salary or the highest performing sales representative for the third quarter of 2017. If you want SQL, you know how to get it :-)

-- Steven

Actually, I wouldn't mind passing a key function to _median(), but that is way too advanced for beginner users to have to think about. So maybe median() could call _median() internally where needed, but the underscore version could exist also.

On Sun, Dec 29, 2019 at 8:14 PM Andrew Barnert <abarnert@yahoo.com> wrote:

On Dec 29, 2019, at 16:08, David Mertz <mertz@gnosis.cx> wrote:

* There is absolutely no need to lose any efficiency by making the statistics functions more friendly. All we need is an optional parameter whose spelling I've suggested as `on_nan` (but bikeshed freely). Under at least one value of that parameter, we can keep EXACTLY the current implementation, with all its warts and virtues as-is. Maybe a spelling for that option could be 'unsafe' or 'fast'?

This seems like the right way to go to me.

However, rather than coming up with appropriately-general implementations of each of these things, wouldn’t taking a key function to pass through to sorted be simpler for some? In particular, coming up with a total_order function that works for all valid number-like types is difficult; letting the user pass key=math.total_order or decimal.Decimal.compare_total or partial(decimal.Decimal.compare_total, context=my_context) or whatever is appropriate is a lot simpler and a lot more flexible. Anyone who knows that’s what they want should know how to pass it.

Plus, finding the median_low or _high, with a key function actually seems useful even without NaNs. “Find the median employee by salary” doesn’t seem like a meaningless operation.

A key function could also take care of raise, but not ignore or poison, and at least ignore seems like it’s needed. So your API still makes sense otherwise. (But, while we’re painting the shed, maybe enum values instead of bare strings? They could be StrEnum values where FAST.value == 'fast' for people who are used to Pandas, I suppose.)
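That shed-painting idea might look like the following (an illustrative spelling only, using the plain `str` mixin so it works on older Pythons than `StrEnum`):

```python
from enum import Enum

class NanPolicy(str, Enum):
    # str mixin so members compare equal to the bare strings
    # Pandas users are used to
    FAST = 'fast'
    IGNORE = 'ignore'
    RAISE = 'raise'
    POISON = 'poison'
```

With this, both `NanPolicy.FAST == 'fast'` and `NanPolicy('fast') is NanPolicy.FAST` hold, so string-passing callers keep working.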

Maybe the is_nan function could also be a parameter, like the key function. By default it’s just the method with a fallback to math or cmath (or it’s just the method, and float and complex add those methods, or it’s a new function that calls a new protocol method, or whatever). That doesn’t work for every possible type that might otherwise work with statistics, but if you have some other type, or want some other unusual but sensible behavior (e.g., you’re the one guy who actually needs to ignore qNaNs but raise early on sNaNs), you can write it and pass it. I’m still not convinced anyone will ever want anything other than the method/math/cmath version, but if they do, I think they’d know it and be fine with passing it in explicitly.

As far as your implementation, I don’t think anything but ignore needs to preprocess things. Raise can just pass a key function that raises on NaN to sorted. Poison can do the same but handle the exception by returning NaN. Who cares that it might take slightly longer to hit the first NaN that way than by doing an extra pass, if it’s simpler and slightly faster for the non-exceptional case?
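Raise-and-poison via a key function alone might look like this sketch (the helper names are made up):

```python
import math

def _raise_on_nan(x):
    # key function that aborts the sort on the first NaN it sees
    if math.isnan(x):
        raise ValueError('NaN in data')
    return x

def median_poison(data):
    # 'poison': catch the key function's exception and return nan
    try:
        data = sorted(data, key=_raise_on_nan)
    except ValueError:
        return math.nan
    n = len(data)
    mid = n // 2
    return data[mid] if n % 2 else (data[mid - 1] + data[mid]) / 2
```

A 'raise' variant would simply let the exception propagate instead of catching it.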

On Sun, Dec 29, 2019 at 06:22:49PM -0500, Richard Damon wrote:

The way I see it, is that median doesn't handle NaNs in a reasonable way, because sorted doesn't handle them, because it is easy and quick to not handle NaN, and to handle them you need to define an Official meaning for them, and there are multiple reasonable meanings.

There is already an official way to handle NAN comparisons, and Python follows that official standard, and so does sorted. It might be annoying, it might be inconvenient, it might be surprising to people's intuition (if they only think about values with a total order and not the possibility of partial order), but the way NANs are sorted seemingly at random throughout a sequence is a logical and correct consequence of the semantics of NANs and the way sorting operates.

It might be surprising, but what are the alternatives? We can:

(1) Detect sets of values which make a partial order and raise an exception or warning. But that's *really* expensive: it is at least O(N**2) and requires comparing every element with every other element.

(2) Don't bother exhaustively comparing every element to every other element, but when sorting, each time you compare two elements, check for a partial order between those two elements alone. But that too increases the cost of sorting, it has many false negatives (you miss a lot of partially ordered sets), and it breaks the invariant that sorting requires only the less-than operator. See https://bugs.python.org/issue36095 for more details.

(3) Don't try to detect all sets of values with a partial order, but single out NANs only. That violates the "special cases aren't special enough" koan in the Zen, and it will slow down sorting for everyone even if they don't have NANs, but okay, maybe this special case *is* special enough. So what do we do now?

(4) There's a NAN in your data, so sort() will raise. That's a bit extreme. It will annoy and frustrate people for whom the odd placement of NANs is not a problem they care about. E.g. if I want to process a set of numbers in order from smallest to largest, I don't care if a NAN happens to get placed between 3.5 and 4.25.

(5) Move all the NANs to the front of the sequence! No, the end of the sequence! Half to the front and half to the back! Order them according to their sign bit! No, order them according to the payload! Or both! That's already six ways to order NANs. Let's focus on the two obvious ones:

(6) Move all NANs to the start of the list, or the end of the list. Either way, you artificially decrease or increase the median:

    [NAN, NAN, NAN, NAN, 1, 2, 3, 4, 5, 6, 7]  # median 2
    [1, 2, 3, 4, 5, 6, 7, NAN, NAN, NAN, NAN]  # median 6

(7) No no no: once you have sorted the NANs to one end or the other, median can check for the existence of even a single NAN, and then raise or return NAN or strip them out or do whatever you like. Great, but we can do the same without changing sort in any way. Doing this in sort when it could be done elsewhere is the wrong place, and probably less efficient too. Doing it in median, rather than sort, lets you short-circuit as soon as you see a single NAN.

So changing the way sort() works doesn't seem very promising. All the alternatives have serious downsides that outweigh the benefits.

(8) How about fixing NANs? (For some definition of "fixing".) That will violate the standard, and will change the behaviour of any code that uses comparisons, with some potentially far-reaching changes in behaviour. And it glosses over the question of *how* we should change NANs to order differently. Raise? Less than everything? Greater than everything? Sort between negatives and zero? Justify your choice.

One big reason to NOT fix the issue with NaNs in median is that such a fix likely has a measurable impact on the processing of the median.

I just did some quick and dirty tests to try to estimate the likely cost of NAN processing on median(). I get a slowdown of between 4 x and 8 x by applying NAN processing. For mean(), I get a slowdown of a factor of between 1.4 to 1.7 times. -- Steven

On Sat, Dec 28, 2019 at 10:16:28PM -0800, Christopher Barker wrote:

Richard: I am honestly confused about what you think we should do. Sure, you can justify why the statistics module doesn’t currently handle NaN’s well, but that doesn’t address the question of what it should do.

As far as I can tell, the only reasons for the current approach is ease of implementation and performance. Which are fine reasons, and why it was done that way in the first place.

Actually, the reason I didn't specify the behaviour with NANs or make any guarantees one way or another was that I wasn't sure what behaviour, or behaviours, would be desirable. I didn't want to lock in one behaviour and get it wrong, or impose my own preference without some real-world usage. (In the case of mode, I did get it wrong: raising an exception in the case of multi-modal data turned out to be annoying and less useful than I hoped. Raymond Hettinger convinced me to change the behaviour, based on real-world feedback and use-cases.)

But there seems to be (mostly) a consensus that it would be good to better handle NaNs in the statistics module.

I think the thing to do is decide what we want NaNs to mean: should they be interpreting as missing values or, essentially, errors.

Missing values aren't errors :-)

I haven't finished reading the entire thread yet, but I don't think we're going to reach a consensus as to what the One Correct Thing to do with NANs.

(1) Some people like the fact that NANs propagate through their calculations without halting computation; after all, that's why they were invented in the first place.

(2) Some people prefer an immediate failure (that's why signalling NANs were invented, but I think it was William Kahan who describes signalling NANs as a "compromise" that nobody uses in practice). Exceptions in Python are easier to handle than signals in a low-level language like Fortran or C, which makes this option more practical.

(3) Some people are dealing with datasets that use NANs as missing values. This is not "pure" (missing values weren't a motivating use-case for NANs), and arguably it's not "best practice" (a NAN could conceivably creep into your data as a calculation artifact, in which case you might not want to ignore that specific NAN), but it seems to work well enough in practice that this is very common.

(4) Some people may not want to pay the cost of handling NANs inside the statistics module, since they've already ensured there are no NANs in their data.

So I think we want to give the caller the option to choose the behaviour that works for them and their data set.

Proposal: add a keyword-only parameter to the statistics functions to choose a policy. The policy could be:

    RETURN
    RAISE
    IGNORE
    NONE (or just None?)

I hope the meanings are obvious, but in case they aren't:

(1) RETURN means to return a NAN if the data contains any NANs; this is "NAN poisoning".

(2) RAISE means to raise an exception if the data contains any NANs.

(3) IGNORE means to ignore any NANs and skip over them, as if they didn't exist in the data set.

(4) NONE means no policy is in place, and the behaviour with NANs is implementation dependent. In practice one should choose NONE only if you are sure you don't have any NANs in your data and want the extra speed of skipping any NAN checks.

I've been playing with some implementations, and option (3) could conceivably be relegated to a mere recipe:

    statistics.mean(x for x in data if not isnan(x))

but options (1) and (2) are not so easy for the caller. In practice, the statistics functions have to handle those cases themselves. At which point, once they are handling cases (1) and (2), it is no extra work to also handle (3) and take the burden off the caller.

The fine print:

- In the above, whenever I used the term "NAN", I mean a quiet NAN.
- Signalling NANs should always raise immediately regardless of the policy. That's the intended semantics of sNANs.
- In practice, there are no platform-independent or reliable ways to get float sNANs in Python, and they never come up except in contrived examples. I don't propose to officially support float sNANs until Python makes some reliable guarantees for them. If you manage to somehow put a float sNAN in your data, you get whatever happens to happen.
- Decimal sNANs are another story, and should do the right thing.

Question: what should this parameter be called? What should its default behaviour be? nan_policy=RAISE
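For plain floats, the option (3) recipe already works today:

```python
import math
import statistics

data = [1.0, 2.0, math.nan, 3.0]
# skip the quiet NaN, then average what's left
clean_mean = statistics.mean(x for x in data if not math.isnan(x))
```

The result is the mean of [1.0, 2.0, 3.0], with the NaN silently dropped.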

You’ve made a good case that None is the “right” thing to use for missing values — and could be used with int and other types. So yes, if the statistics module were to grow support for missing values, that could be the way to do it.

Regardless of whether None is a better "missing value" than NAN, some data sets will have already used NANs as missing values. E.g. it is quite common in both R and pandas. pandas also uses None as a missing value, and R has a special NA constant. The problem with None is that, compared to NANs, it is too easy for None to accidentally creep into your data. (Python functions return None by default, so it is easy to accidentally return None without intending to. It is hard to accidentally return NAN without intending to, since so few things in Python return a NAN.) With "None is a missing value" semantics, there are only two options: - ignore None (it is a missing value, so it should be skipped) - don't ignore None, and raise TypeError if you get one The first option is easy for the caller to do using a simple recipe: statistics.mean(x for x in data if x is not None) which is explicit, easy to understand and easy to remember. We get the second option for free, by the nature of None: you can't sort mixed numeric+None data, and you can't do arithmetic on None. So I don't think that the statistics functions need an additional parameter to ignore None. (Unlike NAN policies.)
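The first option's recipe in runnable form:

```python
import statistics

data = [1, None, 2, 3, None]
# skip missing values marked with None
result = statistics.mean(x for x in data if x is not None)
```

Passing the unfiltered list instead raises a TypeError, which is exactly the second option, for free.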

Frankly, I’m also confused as to why folks seem to think this is an issue to be addressed in the sort() functions — those are way too general and low level to be expected to solve this. And it would be a much heavier lift to make a change that central to Python anyway.

Indeed! Solving this in sort is the wrong solution. Only median() cares about sort, and even that is an implementation detail. -- Steven

On 12/28/19 10:41 PM, David Mertz wrote:

On Sat, Dec 28, 2019 at 10:31 PM Richard Damon <Richard@damon-family.org <mailto:Richard@damon-family.org>> wrote:

Every value of the type float, except NaN and perhaps +inf and -inf (depending on which version of the Real Number Line you use) IS actually a representation of a Real Number (so I don't understand in what way you can say they aren't "numbers"). And yes, they happen to also be Rationals.

This is not subtle, and it is not something you do not understand. C'mon.

Tell me three Real Numbers (or Rational Numbers) such that:

(a + b) + c != a + (b + c)

Yes, the floating point approximation means that + in computer programming is not exactly the ideal + in the domain of the Reals, and understanding the difference is important. It is well known that floating point arithmetic has properties that make it not a true field, and that ISN'T a real problem, as it is a well defined approximation to the Real field. When you take into account the limits of the approximation, you CAN say that it approximates the numeric field, which is good enough for computer science. (And there is a lot of math available to allow you to evaluate how good your approximation is.) This also brings us to the first rule of floating point math: looking for exact equality, particularly after a calculation, is almost never the correct action.

You will not succeed in that endeavor because of the nature of "numbers."

... and yes, because of the usual meaning given to the symbol '+', etc.

OK... now I'll tell you three floating point "numbers" that follow that pattern:

a = float("0.1") b = float("0.2") c = float("0.3")

We can conveniently write the constructors for these in Python as `0.1` and so on. These objects that occupy computer memory are not "numbers" in the sense of obeying the axioms of arithmetic. But they kinda-sorta resemble numbers, and computers can do things with them quickly, so we pretend.
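The challenge from earlier in the thread is indeed unanswerable for Real numbers, but trivially satisfiable with these three floats:

```python
a, b, c = 0.1, 0.2, 0.3
# (0.1 + 0.2) rounds up to 0.30000000000000004, so the two
# groupings land on different floats
assert (a + b) + c != a + (b + c)
```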

Yes, you can create these numbers in Python (and they are NOT the numbers 0.1, 0.2, and 0.3; by my tests, the first two are slightly larger, and the last slightly smaller), and that distinction is important and often missed by people who don't understand what floating point numbers in computers actually are. And YES, they ARE numbers in the sense of the Real Number Field; the issue is that the + operator isn't the exact mathematical operation, but an approximation to it, with a fairly precise definition of how it differs from the precise mathematical operation.

Let me ask you a question: do you have any real experience in computational number theory? You are taking a stance like you think you do, but your arguments don't show it, being a bit on the novice level.

--
Richard Damon

The signature could be: def median(it, exclude=None): With *exclude* being a value, or collection supporting the *in* operator On Thu, Dec 26, 2019 at 4:14 PM David Mertz <mertz@gnosis.cx> wrote:

Maybe we can just change the function signature:

statistics.median(it, do_wrong_ass_thing_with_nans=False)

:-)

But yes, the problem is really with sorted(). However, the implementation of statistics.median() doesn't HAVE TO use sorted(), that's just one convenient way to do it.

There IS NO right answer for `sorted([nan, 1, 2, 3])`. However, there is a very plausibly right answer for `statistics.median([nan, 1, 2, 3])` ... or rather, both 'nan' and '2' are plausible (one approach is what Numpy does, the other is what Pandas does).

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/524SWH... Code of Conduct: http://python.org/psf/codeofconduct/

-- Juancarlo *Añez*

    >>> nan1 = float('nan')
    >>> nan2 = 1e350 - 1e350
    >>> nan3 = 1e400 / 1e399
    >>> nan1, nan2, nan3
    (nan, nan, nan)
    >>> things = [1, 2, 3, nan1, nan2]
    >>> nan1 in things, nan3 in things
    (True, False)
    >>> nan1 == nan1
    False
    >>> nan1 is nan1
    True

The "in" operator might not do what you hope with NaNs.

On Fri, Dec 27, 2019 at 2:15 PM Juancarlo Añez <apalala@gmail.com> wrote:

The signature could be:

def median(it, exclude=None):

With *exclude* being a value, or collection supporting the *in* operator

On Thu, Dec 26, 2019 at 4:14 PM David Mertz <mertz@gnosis.cx> wrote:

Maybe we can just change the function signature:

statistics.median(it, do_wrong_ass_thing_with_nans=False)

:-)

But yes, the problem is really with sorted(). However, the implementation of statistics.median() doesn't HAVE TO use sorted(), that's just one convenient way to do it.

There IS NO right answer for `sorted([nan, 1, 2, 3])`. However, there is a very plausibly right answer for `statistics.median([nan, 1, 2, 3])` ... or rather, both 'nan' and '2' are plausible (one approach is what Numpy does, the other is what Pandas does).

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

-- Juancarlo *Añez*

I'm just glancing at this thread, but it sounds like you want to add the quickselect algorithm to the standard library. As you point out in another message, quickselect is faster than quicksort: it is worst-case linear time (provided the pivot is chosen by median of medians), whereas quicksort is expected linearithmic time. Using quickselect, you could select any order statistic: the median, the 25th percentile, etc. If it were added, I guess it probably belongs in statistics.

Best, Neil

On Thursday, December 26, 2019 at 3:14:33 PM UTC-5, David Mertz wrote:

Maybe we can just change the function signature:

statistics.median(it, do_wrong_ass_thing_with_nans=False)

:-)

But yes, the problem is really with sorted(). However, the implementation of statistics.median() doesn't HAVE TO use sorted(), that's just one convenient way to do it.
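For reference, a minimal quickselect sketch (random pivot, so expected rather than worst-case linear time; not the median-of-medians variant):

```python
import random

def quickselect(xs, k):
    """Return the k-th smallest element of xs (0-based), expected O(n)."""
    pivot = random.choice(xs)
    smaller = [x for x in xs if x < pivot]
    equal = [x for x in xs if x == pivot]
    if k < len(smaller):
        return quickselect(smaller, k)
    if k < len(smaller) + len(equal):
        return pivot
    larger = [x for x in xs if pivot < x]
    return quickselect(larger, k - len(smaller) - len(equal))
```

E.g. `quickselect(data, (len(data) - 1) // 2)` is a median_low without a full sort; the NaN caveats discussed in this thread apply here just as much as with sorted().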

On Sun, Dec 29, 2019 at 05:40:09PM -0800, Neil Girdhar wrote:

I'm just glancing at this thread, but it sounds like you want to add the quickselect algorithm to the standard library. As you point out in another message, quickselect is faster than quicksort: it is linear time (provided the pivot is chosen by median of medians) whereas quicksort is expected linearithmic time.

See https://bugs.python.org/issue21592 -- Steven

IMHO, another sorted function, slower than sorted(), should be added. I don't know exactly how timsort works, but I know that it requires only `__lt__()`. So presumably it checks, element by element, whether `elem_b < elem_a`, and exchanges them if this is `1`. If `0`, it does nothing. (Yes, `0` and `1`: I'm thinking about CPython.)

I think that a `sorted_slow()` function should, when `elem_b < elem_a` returns `0`, also check what `elem_a < elem_b` gives. If this check also returns `0`, there are two cases:

1. the objects are equal
2. elem_a (or also elem_b) is not orderable

In the first case, there's no problem: the function will switch `elem_a` and `elem_b`, and if they are equal, it changes nothing. If `elem_a` is unorderable, this way at the end of the algorithm `elem_a` will be at the end of the iterable. This is also good for the programmer, because (s)he can slice the NaNs away more easily, if (s)he wants :)

I think Java does things this way, because I tested `Collections.sort()` with a List<Double> and it puts all NaNs at the end of the list.
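For floats, that Java-like "NaNs at the end" order doesn't need a new sort; a key function can emulate it today (a sketch, floats only):

```python
import math

def sorted_nans_last(xs):
    # sort on (is_nan flag, value): real numbers order normally,
    # and every NaN gets flag True so it lands after all of them
    return sorted(xs, key=lambda x: (math.isnan(x), x))
```

Two NaN keys compare as equal tuples (the flags match and neither nan is less than the other), so the sort stays stable and never compares a NaN against a real value.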

Here is an implementation that:

A. Only relies on '<'
B. Has no special knowledge of NaN
C. Does not use sorted() or .sort() for anything
D. Is Pandas-like in choosing among only comparable items
E. Only picks an element from the original iterator, does not take the mean of two candidates (i.e. kinda like statistics.median_low() without the NaN bug)
F. Is barely tested and not-at-all benchmarked. I'm sure Timsort is way faster than this kinda-like-quicksort approach.

    from math import nan

    even_names = ['Bob', 'Alice', 'Charlie', 'Zack', 'Xavier', 'Mary']
    odd_names = even_names + ['Francis']
    nums = [2, 3, 4, nan, 1]

    def median(it, *, _offset=0):
        smalls = []
        bigs = []
        go_big = False
        it = iter(it)
        candidate = next(it)
        for x in it:
            if x < candidate:
                smalls.append(x)
            elif candidate < x:
                bigs.append(x)
            elif candidate == x:
                if go_big:
                    bigs.append(x)
                else:
                    smalls.append(x)
                go_big = not go_big

        if abs(len(smalls) - len(bigs) + _offset) <= 1:
            return candidate
        elif len(smalls) >= len(bigs):
            offset = -len(bigs)
            if bigs:
                smalls.append(min(bigs))
            return median(smalls, _offset=offset)
        else:
            offset = len(smalls)
            if smalls:
                bigs.append(max(smalls))
            return median(bigs, _offset=offset)

statistics.median_low(odd_names), median(odd_names) ('Francis', 'Francis') statistics.median_low(even_names), median(even_names) ('Charlie', 'Charlie') statistics.median_low(nums), median(nums) (4, 2)

FWIW, here is a timing:

```
many_nums = [randint(10, 100) for _ in range(1_000_000)]

%timeit statistics.median_low(many_nums)
87.2 ms ± 654 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit median(many_nums)
282 ms ± 3.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

I think almost all the slowdown is because `sorted()` is a C function. In big-O terms, mine should be an improvement since it does part of a Quicksort in partitioning elements, but it doesn't actually bother sorting the smaller partition. It *does* make one pass through to find the min or max though. I'm pretty sure that even in plain-Python, someone smarter than me, like Tim Peters (or others who watch here), could speed up my version by 2x.

Oh... mine doesn't handle unbalanced duplicate values correctly :-). Maybe I'll figure out how to fix it. But I'm not really proposing the implementation anyway, just making the abstract point that we don't need sorted().

FWIW, although no one cares, I "withdraw" my proposed implementation. While it bugs me that I'm not sure what error I made in dealing with duplicate values in an iterable, on reflection I think the whole idea is wrong. That is, I don't like the weirdness of the behavior of statistics.median. But what I guard against in my partitioning approach isn't every possible comparison of two items anyway. That would always take quadratic time. I just do a bunch of such comparisons according to some particular program flow, but not everything. "Incomparability" can be a property of any pair of objects, in principle.

However, I also realize the completely general question is irrelevant. NaNs really are just special in arising innocuously from relatively normal numeric operations. If I make some custom class IncomparableToEverything, it's my problem if I stick it in a list of things I want the median of. So we could get the Pandas-style behavior simply by calling median like so:

    statistics.median((x for x in it if not math.isnan(x)))

I still feel like having median (and friends) do that internally would be worthwhile under some optional parameter. But the default value of that parameter is indeed non-obvious. In a sort of Pandas way of using arguments, we might get `on_nan=["skip"|"poison"|"raise"|"random"]`. "Random" seems like the only wrong answer, but it is the status quo.
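To make the proposal above concrete, here is a minimal sketch of what such an optional parameter might look like. The name `median_on_nan` and the exact semantics are illustrative assumptions, not the statistics module's API, and the "random" status-quo behavior is left out on purpose:

```python
import math
import statistics

def median_on_nan(data, on_nan="skip"):
    """Hypothetical NaN-aware wrapper around statistics.median.

    on_nan="skip"   -> drop NaNs first (Pandas-style)
    on_nan="poison" -> return NaN if any NaN is present (NumPy-style)
    on_nan="raise"  -> raise ValueError on any NaN
    """
    values = list(data)
    has_nan = any(isinstance(x, float) and math.isnan(x) for x in values)
    if has_nan:
        if on_nan == "poison":
            return math.nan
        if on_nan == "raise":
            raise ValueError("NaN in data")
        if on_nan != "skip":
            raise ValueError(f"unsupported on_nan value: {on_nan!r}")
        values = [x for x in values
                  if not (isinstance(x, float) and math.isnan(x))]
    return statistics.median(values)
```

The default is the open question: "skip" silently changes the sample size, "poison" propagates the problem, "raise" is the noisiest but safest.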

David Mertz wrote:

So we could get the Pandas-style behavior simply by calling median like so: statistics.median((x for x in it if not math.isnan(x)))

This is wrong. Or maybe potentially wrong. This way you're removing items from the iterable, so you're moving the median. If the NaNs are not really members of your population, it's ok. On the contrary, if you use my median function with the key function I posted before, you don't have this problem. The iterable is sorted correctly and you get the real median.

Yes. Like Pandas does! Like I wrote! And yes, it is one of three plausible good behaviors that Steven described well. Which is kinda why I still like a named parameter like 'on_nan' to choose which behavior you want inside the function. Unfortunately, propagating/poisoning NaN like NumPy does cannot really be done with a comprehension, so doing it in the function makes sense. Similar with raising an exception.

Oh my... Mertz, listen to me, you don't need a parameter. You only need a key function to pass to `sorted()`. If you use this key function:

https://mail.python.org/archives/list/python-ideas@python.org/message/M3DEOZ...

in my median() function:

https://mail.python.org/archives/list/python-ideas@python.org/message/KN6BSM...

you can simply do:

```
median(iterable, key=iliadSort)
```

and you have "poisoned" your data. If you want to use the sorted iterable again later in the code, you can just do:

```
sorted_it = sorted(iterable, key=iliadSort)
median(sorted_it, sort_fn=None)
```

I prefer this approach, since this way you can avoid potentially re-sorting your data. For the same reason, if you want to remove the NaNs, it's better to create a separate iterable instead of creating it on the fly, because you can reuse it:

```
filtered_it = [x for x in iterable if not math.isnan(x)]
median(filtered_it)
```

If you want to raise an error, you can use another key function:

```
class NanError(ValueError):
    pass

def alertNan(x):
    if math.isnan(x):
        raise NanError("There's a NaN in my dish!")
    return x
```

and then use it:

```
median(iterable, key=alertNan)
```

But if you absolutely want a nan parameter, you can create a wrapper for `sorted`:

```
def nansorted(iterable, on_nan="poison", **kwargs):
    if on_nan == "poison":
        return sorted(iterable, key=iliadSort, **kwargs)
    if on_nan == "remove":
        new_iterable = [x for x in iterable if not math.isnan(x)]
        return sorted(new_iterable, **kwargs)
    if on_nan == "raise":
        return sorted(iterable, key=alertNan, **kwargs)
    raise ValueError(f"Unsupported on_nan parameter value: {on_nan}")
```

and then use it in my `median()`:

```
median(iterable, sort_fn=nansorted, on_nan="raise")
```

or, as before:

```
sorted_it = nansorted(iterable, on_nan="poison")
median(sorted_it, sort_fn=None)
```

The behavior of your sort function is not any of the desirable options. Moving NaNs to the end is not the widely used Pandas style of removing them; I cannot think of any situation where that behavior would be useful... even though I've read the Iliad. Other than wastefully creating an eager list, your filter is the same as I suggested for the one behavior. In general, passing a sort function, while powerful, is terrible API design for the non-experts who are the main users of the statistics module.

David Mertz wrote:

The behavior of your sort function is not any of the desirable options. Moving NaNs to the end is not the widely used Panda style of removing them

...Mertz, you are really hardheaded.... I supported **all** the options of your lovely Pandas, which also supports poisoning, and you say you don't like poisoning?

I cannot think of any situation where that behavior would be useful

Think about this: you have a population of 1 million people. You want to take the median of their heart rates. But for some reason, your calculations give you some NaNs. If you remove the NaNs, it's like removing people from your statistics. And since the median is the central value of the population, you're faking the result. Is it clear now?

Other than wastefully creating an eager list, your filter is the same as I suggested for the one behavior.

.....I really haven't understood what you're writing.

In general, passing a sort function, while powerful, is terrible API design for the non-experts who are main users of statistics module.

......non-experts do not need to pass anything, since the default is `sorted`. Non-experts probably do not even know what to pass to that parameter. Non-experts can also read the docs, you know?!? Anyway, I'm not here to convince you of anything. Continue to use Pandas, the slowest module in the history of Python, which my company asked me to remove from all their code because it slowed down their applications like an old man with a hat driving a truck down a street, and be happy. Sayonara.


On Fri, Dec 27, 2019 at 04:32:44AM -0000, Marco Sulla via Python-ideas wrote:

Think about this: you have a population of 1 million of people. You want to take the median of their heart rate. But for some reason, your calculations gives you some NaN.

The only reasonable scenario for that is if NANs indicate missing data. Otherwise, a NAN is an obvious measurement error, like a negative heart rate, or infinity.

If you remove the NaNs, it's like you remove people from your statistics.

If the value is missing, then you don't know what value it has. It's as if you never collected the data in the first place! (That's because you *didn't* collect the data -- if you did, it wouldn't be missing.) How do you deal with data you don't have? Obviously you don't include measurements you don't have in your data.

And since the median is the central value of the population, you're faking the result.

That's not faking the result, it is the only reasonable way to handle data that is missing and unrecoverable. For example, SPSS simply omits missing data from its calculations:

https://stats.idre.ucla.edu/spss/modules/missing-data/

More here:

http://www.real-statistics.com/descriptive-statistics/missing-data/

Dealing with missing data is not really something that functions like mean, median etc can deal with other than to ignore it, raise an exception or return a NAN. The statistician collecting the data may *sometimes* be in a good position to do something about missing data: for example, chasing up respondents and convincing them to provide the missing values, or interpolating missing values from the available values ("data imputation").

https://measuringu.com/handle-missing-data/

Any sort of data imputation is, to my mind, suspicious: you are effectively making up data and hoping it is representative. But whether data imputation is justified or not, it is *far* out of scope for the statistics module. -- Steven

On Fri, Dec 27, 2019 at 03:40:10AM -0000, Marco Sulla via Python-ideas wrote:

Oh my... Mertz, listen to me, you don't need a parameter. You only need a key function to pass to `sorted()`

How do you pass the key function to sorted() without a parameter?

median(iterable, key=iliadSort)

What is the `key` parameter, if it's not a parameter? Putting aside handling of NANs, are there any use-cases for a key parameter to median? -- Steven

David Mertz wrote:

Here is an implementation that: A. Only relies on '<'

Well, no. There's an `==` check. Can you please read these two posts of mine? https://mail.python.org/archives/list/python-ideas@python.org/message/7255SH... https://mail.python.org/archives/list/python-ideas@python.org/message/KN6BSM...

On Dec 26, 2019, at 12:19, Marco Sulla via Python-ideas <python-ideas@python.org> wrote:

IMHO, another sorted function, slower than it, should be added.

You can very easily just write a key function that does this for you. In fact, you can write different key functions for different variations. For example, if you're specifically sorting floats and want to shift nans to the right (even beyond inf):

```
class shift_nan_to_end_key(float):
    def __lt__(self, other):
        return math.isnan(other) or not math.isnan(self) and float(self) < float(other)
```

If you want to sort nans to the start, or deal with some other type with incomparable values, or do something more general with all pairs of incomparable values, you can just write a different key function. Whatever's most appropriate for readability or type safety or performance for your particular use case, do that. In some cases, it might be more readable to write as a cmp function instead of a key function, but that's fine too. To show an example (although maybe not a compelling one), you can get the behavior of your algorithm below:

```
@functools.cmp_to_key
def flip_incomparables_key(a, b):
    if a < b:
        return -1
    if b < a:
        return 1
    return 1
```

This treats incomparable values (and equal values) as greater instead of as equal, which is what you described… although that doesn't actually accomplish what you wanted. If you don't want to have to specify the key function every time, you can write your own sorted_slow:

```
sorted_slow_nan = functools.partial(sorted, key=shift_nan_to_end_key)
```

Or, if you want to take and wrap existing key functions, that's almost as trivial. If any of these are common enough, they could be added to the Sorting HOWTO docs, or even included somewhere in the stdlib so you can just import and use, or even I suppose added to builtins (although I doubt any of them are nearly common enough for that last one).

I don't know exactly how timsort works, but I know that it requires only `__lt__()`.

So probably it checks, element by element, if `elem_b < elem_a`, and probably exchange them if this is `1`. If `0`, it does nothing.

No, it’s not that simple. It couldn’t be that simple unless it was a quadratic time algorithm. Which means your change won’t actually work.

I think that a `sorted_slow()` function should also check, if `elem_a < elem_b == 0`, what `elem_a < elem_b` gives.

Presumably you mean `elem_b < elem_a` here? But at any rate once you discover that the two elements are incomparable, you need to decide what to do about them. If you treat them as equal, you get the same behavior as currently. If you do the opposite of that, you get… well, it’s complicated, but it doesn’t inherently make any more sense than treating it as ==, and also doesn’t do what you want for nans. And, without keeping track of additional information, there’s no trivial way to get the behavior you want.

If also this check returns `0`, there are two cases: 1. the objects are equal 2. elem_a (or also elem_b) are not orderable

In this case, there's no problem: the function will switch `elem_a` and `elem_b`. If they are equal, it changes nothing.

No, it breaks the stability property. Two values can compare equal but be meaningfully distinct. This is especially important when using a key function: e.g., the common "sort by column C, then by column B" idiom relies on the fact that two values that are equal in column B but different in column C remain in their column-C order.
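The stability guarantee behind that idiom can be demonstrated with a small, self-contained example (the data here is made up for illustration):

```python
# Stable sorting: items that compare equal keep their original order.
# The "sort by the secondary key first, then by the primary key" idiom
# relies on exactly this guarantee.
rows = [("b", 2), ("a", 1), ("b", 1), ("a", 2)]

by_number = sorted(rows, key=lambda r: r[1])                   # secondary key
by_letter_then_number = sorted(by_number, key=lambda r: r[0])  # primary key

print(by_letter_then_number)
# [('a', 1), ('a', 2), ('b', 1), ('b', 2)]
```

Because the second sort is stable, rows tied on the letter stay in the number order established by the first sort.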

If `elem_a` is unorderable, this way at the end of the algorithm, `elem_a` will be at the end of the iterable.

No it won’t. You can try it and see with nans with the key function above. Try tweaking it to return 0 on equal and any of the three choices for the remaining incomparable case—none of them do what you want. This is the wrong way to solve the problem. More importantly, “orderable” isn’t a property of a single value, it’s a property of a pair of values. The fact that float has values that are incomparable with everything (including itself) and everything else is comparable with everything else, is a very special case. For most partially ordered types, there are values that can be compared with some values but not with others. What should happen if a and b are incomparable, but a>c while b<c? What if b and c are also incomparable? So you can’t just say “push all the incomparable values to the right”. Except in the very special case of float, in which case you might as well say “push all the nan values to the right” in the first place, which is a lot simpler to explain, and to implement.
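The point that incomparability is pairwise, not a property of a single value, is easy to see with sets, which are partially ordered by the subset relation:

```python
# A genuinely partial order: sets under the subset relation.
# Incomparability is a property of a *pair* of values, not of one value.
a, b, c = {1, 2}, {2, 3}, {1}

print(a < b, b < a)  # False False: a and b are incomparable
print(a > c)         # True: c is a proper subset of a
print(b < c, c < b)  # False False: b and c are also incomparable
```

Here `a` is incomparable with `b` but comparable with `c`, so no rule of the form "push this value to the right" can be attached to a single value.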

On Fri, Dec 27, 2019 at 9:07 AM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:

You can very easily just write a key function that does this for you. In fact, you can write different key functions for different variations.

For example, if you’re specifically sorting floats and want to shift nans to the right (even beyond inf):

```
class shift_nan_to_end_key(float):
    def __lt__(self, other):
        return math.isnan(other) or not math.isnan(self) and float(self) < float(other)
```

If you want to sort nans to the start, or deal with some other type with incomparable values, or do something more general with all pairs of incomparable values, you can just write a different key function. Whatever’s most appropriate for readability or type safety or performance for your particular use case, do that.

You've created a key class, effectively equivalent to a cmp_to_key form. Is there a flaw in this much simpler version?

```
def nans_right(n):
    return math.isnan(n), n
```

ChrisA
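For reference, a small usage sketch of that key function (the sample data is made up for illustration):

```python
import math

def nans_right(n):
    # Lexicographic key: (False, n) for ordinary values, (True, nan) for
    # NaNs, so every NaN compares greater than every non-NaN.
    return math.isnan(n), n

nan, inf = float("nan"), float("inf")
data = [1.0, nan, 3.0, -inf, nan, 2.0, inf]
result = sorted(data, key=nans_right)
print(result)  # [-inf, 1.0, 2.0, 3.0, inf, nan, nan]
```

The NaNs land after +inf, and the finite values sort normally; NaN keys compare as ties among themselves, so the sort stays stable.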

On Dec 26, 2019, at 14:13, Chris Angelico <rosuav@gmail.com> wrote:


You've created a key class, effectively equivalent to a cmp_to_key form. Is there a flaw in this much simpler version?

def nans_right(n): return math.isnan(n), n

No, I just thought being explicit would be clearer here. Seeing them both written out, I think you’re right, and all the extra boilerplate gets in the way much more than making you think through the (trivial) lexicographical compare does. And actually, where there is a difference, yours is usually better. For example, if you’re not sure the inputs will be floats, yours will raise in more cases than mine. (If you want to sort stringified floats out of a CSV or something, you probably want to say so explicitly, and I don’t think `nans_right` does say that.) But the fact that you can use whichever one you think is best for your case is certainly a nice feature of Python’s sorting API. :)

On Fri, Dec 27, 2019 at 9:35 AM Andrew Barnert <abarnert@yahoo.com> wrote:


No, I just thought being explicit would be clearer here. Seeing them both written out, I think you’re right, and all the extra boilerplate gets in the way much more than making you think through the (trivial) lexicographical compare does.

And actually, where there is a difference, yours is usually better. For example, if you’re not sure the inputs will be floats, yours will raise in more cases than mine. (If you want to sort stringified floats out of a CSV or something, you probably want to say so explicitly, and I don’t think `nans_right` does say that.)

But the fact that you can use whichever one you think is best for your case is certainly a nice feature of Python’s sorting API. :)

Yeah, agreed :) Lots of languages have support for a comparison function, but in the common cases, a key function is WAY simpler to write (unless you forego stability, in which case it's merely "notably simpler"), and can perform better too. But when you need the flexibility, the comparison function is easily implemented, courtesy of operator overloading. ChrisA

Andrew Barnert wrote:

you can get the behavior of your algorithm below:

```
@functools.cmp_to_key
def flip_incomparables_key(a, b):
    if a < b:
        return -1
    if b < a:
        return 1
    return 1
```

^_____^ we arrived at the same conclusion: https://mail.python.org/archives/list/python-ideas@python.org/message/LM22TN... But your key is better! It's not only for NaNs, and I don't know why you say it does not do exactly what I wanted.

If also this check returns 0, there are two cases:

the objects are equal elem_a (or also elem_b) are not orderable

In this case, there's no problem: the function will switch elem_a and elem_b. If they are equal, it changes nothing. No, it breaks the stability property. Two values can compare equal but be meaningful distinct. This is especially important when using a key function—e.g., the common “sort by column C, then by column B” idiom relies on the fact that two values that are equal in column B but different in column C remain in their column-C order.

Well, so we can't use `__lt__()`. We have to require `__le__()`:

1. if `a <= b` is 1, do nothing
2. if `a <= b` is 0, then switch

This way, equal objects are not switched. Greater and incomparable objects are switched. Call it IliadSort :D (Well, I think this joke is for Italians only....)

What should happen if a and b are incomparable, but a>c while b<c?

bca

What if b and c are also incomparable?

bca or cba

So you can’t just say “push all the incomparable values to the right”.

Wait, I'll push them while they are incomparable. I always check "<=" for every element; I don't mark them as "incomparable" the first time and stop. I think we can simply apply your hacked key to sorted, with a little mod:

```
@functools.cmp_to_key
def iliadSort(a, b):
    if a <= b:
        return -1
    if b <= a:
        return 0
    return 1
```

I'll test it.

On Dec 26, 2019, at 15:27, Marco Sulla via Python-ideas <python-ideas@python.org> wrote:

Andrew Barnert wrote:

On Dec 26, 2019, at 12:19, Marco Sulla via Python-ideas python-ideas@python.org wrote: you can get the behavior of your algorithm below: @functools.cmp_to_key def flip_incomparables_key(a, b): if a < b: return -1 if b < a: return 1 return 1

^_____^ we arrived to the same conclusion: https://mail.python.org/archives/list/python-ideas@python.org/message/LM22TN...

But you're key is better! It's not only for NaNs, and I don't know why you say it does not do exactly what I wanted.

```
>>> a = [1, nan, 3, nan, 2, inf, 0, nan]
>>> sorted(a, key=flip_incomparables_key)
[1, nan, 3, nan, 0, 2, inf, nan]
```

What you wanted is [0, 1, 2, 3, inf, nan, nan, nan], and it does not do exactly that, or even anything remotely similar. If you tweak it, you can make something that would be correct if sorted were a bubble sort, but still isn't correct with the actual timsort algorithm, or for quicksort or anything else you'd ever want to use. On the other hand, my shift_nan function, or, even better, Chris's, does do exactly what you want, obviously works with any valid sorting algorithm, and is explicit about why it does so: it treats nan as bigger than any other value, so of course all the nans end up at the right.

Wait, I'll push them until they are incomparable. I always check for "<=" for every element, I don't mark them as "incomparable" the first time and stop.

I think we can simply apply to sorted you hack key, with a little mod:

```
@functools.cmp_to_key
def iliadSort(a, b):
    if a <= b:
        return -1
    if b <= a:
        return 0
    return 1
```

I'll test it.

Using <= instead of < doesn't fix anything that couldn't be fixed by the == clause (except maybe for types where a<=b iff a<b or a==b is false, and <= is reasonable but < is silly, but that's not true for float or set or any other reasonable type).

This does fix the stability problem. Extend a with 5, 5.0 and the original version will end with 5.0, 5, inf but this version will end with 5, 5.0, inf. But it doesn't fix anything else. And you could have fixed the stability problem just by changing one character in the original version; there's no need to switch to <=.

The root problem is the whole idea of "push them until they're comparable". That can't work unless you're testing each value against every potentially incomparable value to the right, which is inherently quadratic. You have to come up with a rule for incomparable values that depends on nothing but the two values -- a rule like "nan is bigger" (as long as you get the stability right) does that; a rule like "left is bigger" does not.

Nope. flip_incomparables_key does not work, and neither does my key. This one works:

```
import functools

@functools.cmp_to_key
def iliadSort(a, b):
    if a < b:
        res = -1
    elif not b == b:
        res = -1
    else:
        res = 0
    return res

x = float("nan")
y = float("nan")
print(sorted([x, 6, -10, float("-inf"), 1981, 8, y, float("+inf"), 19, 23], key=iliadSort))
```

result:

```
[-inf, -10, 6, 8, 19, 23, 1981, inf, nan, nan]
```

but it works only for NaNs and also requires "=="... 0_____o

Forcing NANs to the end is not the right solution. Consider the median of [NAN, 2, 3, 4, 5]. If you force the NAN to remain at the start, the median is 3. If you force the NAN to the end of the list, the median is 4. Your choice to force NANs to the end is equivalent to introducing a bias towards larger values. -- Steven
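The order-dependence Steven describes can be checked directly. A small illustrative example (not from the thread): because every comparison with a NAN is False, sorted() leaves a NAN roughly where it started, so median() picks a different middle element depending on the input order.

```python
# NAN comparisons are always False, so sorting does not move the NAN:
# median() then returns whichever value ends up in the middle slot.
import statistics

nan = float("nan")
print(statistics.median([nan, 2, 3, 4, 5]))  # NAN stays at the front: 3
print(statistics.median([2, 3, 4, 5, nan]))  # NAN stays at the end: 4
```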

Thanks everyone commenting on this thread. I haven't quite read it all yet (I will) but I wanted to get a few comments now. On Thu, Dec 26, 2019 at 10:31:00AM -0500, David Mertz wrote:

Anyway, I propose that the obviously broken version of `statistics.median()` be replaced with a better implementation.

To be precise, the problem is not just the implementation, but the interface, as median is explicitly noted to require orderable data. Data with NANs is not orderable. Richard is correct: this is a case of garbage in, garbage out: if you ignore the documented requirements, you'll get garbage results.

However, I am happy to accept that silent failure may not be the ideal result for everyone. Unfortunately, there is no consensus on what the ideal result is, with at least four valid responses:

- the status quo: the caller is responsible for dealing with NANs, just as they are responsible for dealing with unorderable values passed to min, max, sort, etc. If you know that there are no NANs in your data, any extra processing to check for NANs is just wasted effort;
- NANs represent missing values, so they should be ignored;
- the presence of a NAN is an error, and should raise an exception;
- NANs should propagate through the calculation: a NAN anywhere in your data should return NAN (this is sometimes called "NAN poisoning").

Also note that NANs are not just a problem for median. They are a problem for all order statistics, including percentiles, quartiles and general quantiles. Python 3.8 adds a quantiles function which has the same problem:

```
py> statistics.quantiles([NAN, 3, 4, 7, 5])
[nan, 4.0, 6.0]
py> statistics.quantiles([3, 4, 7, NAN, 5])
[3.5, nan, nan]
```

NANs aren't as big a problem for other functions like mean and stdev, but the caller may still want the choice of ignore, raise or return a NAN. So I would like to avoid an ad hoc response to NANs in median alone, and treat them consistently across the entire module.

Marco, you don't have to use median_low and median_high if you don't like them, but they aren't any worse than any other choice for calculating order statistics.
All order statistics (apart from min and max) require you to sometimes make a choice between returning a data value or interpolating between two data values, and in general there are *lots* of choices. Here are just a few of them:

"Sample Quantiles in Statistical Packages", Hyndman & Fan, The American Statistician 1996, Vol 50, No 4, pp. 361-365. https://www.amherst.edu/media/view/129116/original/Sample+Quantiles.pdf

"Quartiles in Elementary Statistics", Langford, Journal of Statistics Education, Volume 14, Number 3 (2006). http://www.amstat.org/publications/jse/v14n3/langford.html

For median, there are only three choices when the midpoint falls between two values: the lower value, the higher value, and the average of the two. All three choices have their pros and cons.

-- Steven
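The four responses listed above could, as a sketch only, be exposed through a policy parameter. Nothing like `nan_policy` exists in the statistics module; the helper name and signature here are made up for illustration:

```python
# Hypothetical sketch of the policies discussed above; none of these
# names are part of the statistics module.
import math
import statistics

def median_with_nan_policy(data, nan_policy="ignore"):
    """'ignore' drops NANs, 'raise' errors out, 'poison' returns NAN."""
    data = list(data)
    has_nan = any(isinstance(x, float) and math.isnan(x) for x in data)
    if has_nan:
        if nan_policy == "raise":
            raise ValueError("NAN in data")
        if nan_policy == "poison":
            return float("nan")
        # default: treat NANs as missing values and drop them
        data = [x for x in data if not (isinstance(x, float) and math.isnan(x))]
    return statistics.median(data)

nan = float("nan")
print(median_with_nan_policy([nan, 2, 3, 4, 5]))       # 3.5, after dropping the NAN
print(median_with_nan_policy([nan, 2, 3], "poison"))   # nan
```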

Steven D'Aprano wrote:

Marco, you don't have to use median_low and median_high if you don't like them, but they aren't any worse than any other choice for calculating order statistics. All order statistics (apart from min and max) require you to sometimes make a choice between returning a data value or interpolating between two data values, and in general there are lots of choices.

Of course, but usually they are unbiased solutions by default. On the contrary, if you don't have numeric iterables, you have to choose between median_low and median_high, which introduce bias. And since the module is called `statistics`, I think it's very important not to introduce biases. My function corrects this.

Furthermore, with my function you can simply pass it as the key parameter for sorted:

```
@functools.cmp_to_key
def iliadSort(a, b):
    if a < b:
        res = -1
    elif not b == b:
        res = -1
    else:
        res = 0
    return res
```

and NaNs are moved to the end of the list.

On Fri, Dec 27, 2019 at 02:03:57AM -0000, Marco Sulla via Python-ideas wrote:

Steven D'Aprano wrote:

Marco, you don't have to use median_low and median_high if you don't like them, but they aren't any worse than any other choice for calculating order statistics. All order statistics (apart from min and max) require you to sometimes make a choice between returning a data value or interpolating between two data values, and in general there are lots of choices.

Of course, but usually they are unbiased solutions by default.

That's really not true. Please read the links I provided, and keep in mind that in general, most of these methods will give slightly different answers.

On the contrary, if you don't have numeric iterables, you have to choose between median_low and median_high, which introduce bias.

If your data is ordinal but not numeric, and you want the median value, and there are an even number of values, *you have no choice* but to pick one of the two middle values. To be clear, let's say we want to know the average (median) mark for a test out of F (fail), D, C, B, A:

```
# ten students in the class
[F, F, D, C, C, B, B, B, A, A]
```

There is no such mark as "halfway between C and B", and no good way to decide whether the median should be taken as C or B. The best you can do is choose *ahead of time* (so that your choice is not biased by the result) the low median or the high median, and then be consistent with any other calculations you are doing.

Even if the data is numeric, you may not want to average the two middle values, if it could give an impossible result ("2.35 children"). For example: https://stackoverflow.com/questions/48268263/finding-median-without-averagin...

The method chosen is entirely up to the person doing the calculation, depending on their needs and what it is that they are measuring. It's not for me to tell the user which median is right for them, or for you to say that the low and high versions are always wrong.

-- Steven
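The grade example above can be worked through with the existing functions. This sketch maps the marks to ranks first, since median_low/median_high would otherwise sort the letters alphabetically, which is the reverse of grade order (the `rank`/`grade` mappings are mine, for illustration):

```python
import statistics

marks = ["F", "F", "D", "C", "C", "B", "B", "B", "A", "A"]
rank = {"F": 0, "D": 1, "C": 2, "B": 3, "A": 4}
grade = {v: k for k, v in rank.items()}

ranks = sorted(rank[m] for m in marks)
# Ten values: the two middle ranks correspond to C and B. There is no
# mark halfway between them, so you must pick one ahead of time.
print(grade[statistics.median_low(ranks)])   # C
print(grade[statistics.median_high(ranks)])  # B
```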

Thanks Steven, for your thoughtful response. On Thu, Dec 26, 2019 at 5:44 PM Steven D'Aprano <steve@pearwood.info> wrote:

However, I am happy to accept that silent failure may not be the ideal result for everyone.

I would argue that it is not the ideal result for ANYONE. The only reason for it is that it's easy to implement :-)

And it is a fact that NaNs are part of what is probably the most commonly used type with the statistics module: floats (and also Decimal, probably third, after integers). Frankly, if statistics only supported floats, it'd still be a very useful module.

This thread has gotten a bit distracted (it IS python-ideas, after all) with discussion of sorting algorithms, and total ordering, and ..., but it seems that all of that is secondary to deciding what NaNs should *mean* to the statistics module. Options are:

A) a missing value
B) an error of some sort

If (A), then the functions should remove the NaNs from the data as (or before) it is processed. If (B), then they should either raise or return NaN. I think that missing-value support would be a nice feature to have, so I propose to use NaNs that way. Of course, another option would be to have the user set a flag indicating how they want NaNs treated (default TBD).

So that leaves us with how to detect a NaN. As Steven pointed out in another thread, it's not trivially easy, but the smart folks on this list can solve that, I'm sure. I notice that the module docs say this: "Unless explicitly noted, these functions support int, float, Decimal and Fraction. Behaviour with other types (whether in the numeric tower or not) is currently unsupported". So while it would be nice to have a fully robust solution for any numeric type, we should be able to get a good-enough solution that at least supports the built-in types (see the other thread Steven started for details).

-CHB

-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

By the way, it's not as easy as you might think to test whether a value is a NAN or not. There are currently three ways to test whether a value is a NAN. (Four if you include cmath.isnan.)

1. math.isnan(x), but it can fail in annoying ways:

```
# No int is a NAN so this should return False
py> math.isnan(10**1000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: int too large to convert to float

# Decimal signalling NANs are NANs so this should return True
py> x = Decimal('snan')
py> math.isnan(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: cannot convert signaling NaN to float
```

2. Decimal.is_nan(), but it only works with Decimals. However, it does work with signalling NANs.

3. Back before math.isnan existed, the idiomatic way to check for NANs was to test whether they were equal to themselves. Assuming that x is a kind of numeric type, we should expect that `x == x` except for NANs. But this too fails for signalling NANs:

```
# Using the default Decimal context
py> x == x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
decimal.InvalidOperation: [<class 'decimal.InvalidOperation'>]
```

So there is no good way to check for an arbitrary NAN that duck-types floats, Decimals, ints, Fractions and other numeric types. I have come up with this:

```
def isnan(x, _isnan=math.isnan):
    try:
        return _isnan(x)
    except ValueError:
        # Could be a Decimal signalling NAN.
        try:
            return x.is_nan()
        except AttributeError:
            return x != x
    except OverflowError:
        return False
    except TypeError:
        from cmath import isnan
        return isnan(x)
```

but I fear that there could be other landmines waiting, other odd corner cases I haven't thought of.

See also https://bugs.python.org/issue27975

-- Steven
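Steven's isnan() can be exercised across the types he mentions. A sketch (his definition is repeated here so the examples run; the particular calls are chosen to hit each except branch):

```python
import math

def isnan(x, _isnan=math.isnan):
    try:
        return _isnan(x)
    except ValueError:
        # Could be a Decimal signalling NAN.
        try:
            return x.is_nan()
        except AttributeError:
            return x != x
    except OverflowError:
        return False
    except TypeError:
        from cmath import isnan
        return isnan(x)

from decimal import Decimal
from fractions import Fraction

print(isnan(float("nan")))              # True  (plain math.isnan path)
print(isnan(Decimal("snan")))           # True  (ValueError -> Decimal.is_nan)
print(isnan(10**1000))                  # False (OverflowError path)
print(isnan(Fraction(1, 3)))            # False
print(isnan(complex(1, float("nan"))))  # True  (TypeError -> cmath.isnan)
```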

Great work! I had not known of those corners. Have you tried with NumPy zero-D arrays or PyTorch Values? Both of those should have sensible answers. On Fri, Dec 27, 2019, 7:42 PM Steven D'Aprano <steve@pearwood.info> wrote:

By the way, it's not as easy as you might think to test whether a value is a NAN or not.


Never mind... the cases I mentioned are happy already:

```
>>> import numpy as np
>>> import torch
>>> import math
>>> math.isnan(np.nan), type(np.nan)
(True, <class 'float'>)
>>> np_nan = np.array(np.nan)
>>> math.isnan(np_nan), type(np_nan)
(True, <class 'numpy.ndarray'>)
>>> torch_nan = torch.Tensor([float('nan')])
>>> math.isnan(torch_nan), type(torch_nan)
(True, <class 'torch.Tensor'>)
```

Interestingly:

```
>>> np_nans = np.array([np.nan, np.nan])
>>> math.isnan(np_nans)
Traceback (most recent call last):
  File "<ipython-input-26-2f5da5ba4604>", line 1, in <module>
    math.isnan(np_nans)
TypeError: only size-1 arrays can be converted to Python scalars
```

```
>>> torch_nans = torch.Tensor([float('nan'), np.nan])
>>> torch_nans
tensor([nan, nan])
>>> math.isnan(torch_nans)
Traceback (most recent call last):
  File "<ipython-input-22-72715a2d3347>", line 1, in <module>
    math.isnan(torch_nans)
ValueError: only one element tensors can be converted to Python scalars
```

And yet:

```
>>> np_nan_arr = np.array([np.nan]).reshape(1, 1, 1, 1)
>>> math.isnan(np_nan_arr), np_nan_arr.shape
(True, (1, 1, 1, 1))
```

So it's really only the 1-element-ness being checked, not the 0-D shape. On Fri, Dec 27, 2019 at 8:14 PM David Mertz <mertz@gnosis.cx> wrote:

Great work! I had not known of those corners.

Have you tried with NumPy zero-D arrays or PyTorch Values? Both of those should have sensible answers.


Is duck typing float or Decimal worth the bother? Barring that it could be done with some isinstance() checks (in the user code, not in math.isnan()). Otherwise, your only resort is a PEP, like the final comment of that bug says. On Fri, Dec 27, 2019 at 18:16 David Mertz <mertz@gnosis.cx> wrote:

Great work! I had not known of those corners.

Have you tried with NumPy zero-D arrays or PyTorch Values? Both of those should have sensible answers.


-- --Guido (mobile)

We have a fundamental issue here: on the one hand, we like duck typing, and it is baked into Python. On the other hand, we have a "math" module that started as a wrapper around the C lib, and to this day is essentially a "float math" module. Which indeed is very useful, but not compatible with duck typing.

This issue came up when we added math.isclose (https://www.python.org/dev/peps/pep-0485/). I started out with a pure Python implementation that would work with duck types (or, at least, Decimal and Fraction). But when it came time for implementation, the math module seemed to be the only logical place for it, and it's written in C, and all (mostly?) about floats -- so we dumped the duck typing version.

At the time, it was proposed that we rewrite the math module in Python, calling out to C for most stuff, and then we could add to it making use of Python features. That idea was rejected at the time, probably because it seemed like way too much churn for the case at hand. But maybe it's time for a "duck math" module -- one that would provide a full set of math functions that work on a wide range of numeric types. isnan() would be a nice candidate to kick it off :-)

In any case, if this is too big a lift, then it would still be nice to have a duck-typeable isnan() function -- but where to put it? Maybe the statistics module is not such a bad place?

-CHB

On Fri, Dec 27, 2019 at 5:39 PM Guido van Rossum <guido@python.org> wrote:

Is duck typing float or Decimal worth the bother? Barring that it could be done with some isinstance() checks (in the user code, not in math.isnan()).

Otherwise, your only resort is a PEP, like the final comment of that bug says.


-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

On Fri, Dec 27, 2019 at 5:39 PM Guido van Rossum <guido@python.org> wrote:

Is duck typing float or Decimal worth the bother? Barring that it could be done with some isinstance() checks (in the user code, not in math.isnan()).

Well, for the topic at hand in another thread -- in the statistics module. And I was coming to the same conclusion, but it dawned on me that another option would be to add an .is_nan() method to floats (same as Decimal). It would be a lighter-weight option than a new dunder, but accomplish a similar effect -- anyone implementing a new numeric type that supports NaN could add that method.

BTW, could you simply do:

```
def is_nan(num):
    try:
        return num.is_nan()
    except AttributeError:
        if isinstance(num, complex):
            return cmath.isnan(num)
        try:
            return math.isnan(num)
        except:
            return False
```

I don't like the bare except, but it may be OK to say that anything that can't be coerced to a float is not a NaN. (And you could trap the exceptions we expect anyway.)

And this doesn't require you to import the Decimal module, and you can document that it will work with any type that either has an is_nan() method, or can have its NaN values successfully coerced into a float.

And we could remove the complex support -- does the rest of the statistics module support it anyway? But it did make me think -- what if complex's __float__() worked for NaN, inf, and -inf -- then you could have a single isnan() implementation in the math module. (In fact, that could be a standard part of the __float__ protocol.)

By the way:

```
----> 1 float(Decimal('snan'))
ValueError: cannot convert signaling NaN to float
```

Why can't it convert a signalling NaN to a float? Isn't a signalling NaN part of the IEEE 754 spec?

-CHB

-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

Oops, forgot to include my test code -- could be handy:

Personally I still like the fundamental:

```
def is_nan(num):
    """Test for NaN."""
    return num != num
```

So simple!

On Sat, Dec 28, 2019 at 7:03 PM Steve Barnes <GadgetSteve@live.co.uk> wrote:

Personally I still like the fundamental:

```
def is_nan(num):
    """Test for NaN."""
    return num != num
```

Which was in Steven's original post, and which is dangerous because a signalling NaN will bomb. There's no easy way to get a safe True/False answer about whether a value is a NaN or not.

ChrisA
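Chris's caveat can be shown concretely. A small sketch (assuming the default Decimal context, as in Steven's earlier traceback): the `x != x` idiom works for float NANs but raises for a Decimal signalling NaN.

```python
from decimal import Decimal, InvalidOperation

x = float("nan")
print(x != x)  # True: the idiom works fine for float NANs

s = Decimal("snan")
try:
    s != s
except InvalidOperation:
    # Comparisons involving a signalling NaN signal InvalidOperation,
    # which the default Decimal context traps and raises.
    print("x != x raised InvalidOperation for a signalling NaN")
```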