NAN handling in statistics functions
At the moment, the handling of NANs in the statistics module is
implementation dependent. In practice, that *usually* means that if your
data has a NAN in it, the result you get will probably be a NAN.

    >>> statistics.mean([1, 2, float('nan'), 4])
    nan

But there are unfortunate exceptions to this:

    >>> statistics.median([1, 2, float('nan'), 4])
    nan
    >>> statistics.median([float('nan'), 1, 2, 4])
    1.5

I've spoken to users of other statistics packages and languages, such as
R, and I cannot find any consensus on what the "right" behaviour should
be for NANs except "not that!".

So I propose that statistics functions gain a keyword-only parameter to
specify the desired behaviour when a NAN is found:

- raise an exception
- return NAN
- ignore it (filter out NANs)

which seem to be the three most common preferences. (Opinion seems to be
split roughly equally between the three.)

Thoughts? Objections?

Does anyone have any strong feelings about what should be the default?

-- Steve
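For concreteness, a minimal sketch of what such a keyword could look
like, written as a plain wrapper. The parameter name `nan_policy` and
the policy strings are hypothetical, chosen only for illustration;
nothing here is part of any actual or proposed API:

    import math
    import statistics

    def mean(data, *, nan_policy='propagate'):
        # Hypothetical policies (names invented for this sketch):
        #   'raise'     -> raise an exception on any NaN
        #   'propagate' -> return NaN if any NaN is present
        #   'ignore'    -> filter NaNs out before computing
        values = list(data)
        if any(math.isnan(x) for x in values):
            if nan_policy == 'raise':
                raise ValueError('NAN found in data')
            if nan_policy == 'propagate':
                return math.nan
            values = [x for x in values if not math.isnan(x)]
        return statistics.mean(values)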
We had this discussion about a year and a half ago, in which I strongly advocated exactly this keyword argument to median*(). As before, I don't care about the default if there is an option. I don't even really care about the exception case, but don't object to it. On Mon, Aug 23, 2021 at 11:55 PM Steven D'Aprano <steve@pearwood.info> wrote:
> So I propose that statistics functions gain a keyword only parameter
> to specify the desired behaviour when a NAN is found:
>
> - raise an exception
> - return NAN
> - ignore it (filter out NANs) [...]
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
Note that numpy has a set of nan* functions that ignore NaNs. I'm not
suggesting that here, but it is prior art to be considered, and I do
like that it is explicit about ignoring NaNs.
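For reference, a quick illustration of that prior art (numpy's
nan-aware variants live alongside the regular functions):

    import numpy as np

    data = np.array([1.0, 2.0, np.nan, 4.0])

    np.mean(data)       # nan -- the NaN propagates
    np.nanmean(data)    # 2.333... -- NaNs are ignored
    np.nanmedian(data)  # 2.0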
> - raise an exception
> - return NAN
> - ignore it (filter out NANs)
>
> Does anyone have any strong feelings about what should be the default?
Filtering out NaNs should *not* be the default. NaN often means missing
data, but it can also be the result of an error of some sort. Incorrect
results are much worse than errors -- NaNs should never be ignored
unless explicitly asked for.

Beyond that, I'd prefer returning NaN to raising an exception, but
either is OK.

-CHB
-- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
Urgh. That's a nasty dilemma. I propose that the default should be return NAN, since that's what you'd expect if you did the super-naive arithmetic version (e.g. mean(x, y, z) = (x+y+z)/3). On Mon, Aug 23, 2021 at 8:55 PM Steven D'Aprano <steve@pearwood.info> wrote:
> At the moment, the handling of NANs in the statistics module is
> implementation dependent. In practice, that *usually* means that if
> your data has a NAN in it, the result you get will probably be a NAN.
> [...]
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
On 24.08.2021 05:53, Steven D'Aprano wrote:
> So I propose that statistics functions gain a keyword only parameter
> to specify the desired behaviour when a NAN is found:
>
> - raise an exception
> - return NAN
> - ignore it (filter out NANs)
>
> Thoughts? Objections?
Sounds good. This is similar to the errors argument we have for codecs where users can determine what the behavior should be in case of an error in processing.
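For comparison, the codecs behaviour being referred to (the errors
argument selects the policy):

    >>> b'abc\xff'.decode('utf-8')            # default 'strict': raise
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 3: invalid start byte
    >>> b'abc\xff'.decode('utf-8', errors='replace')
    'abc\ufffd'
    >>> b'abc\xff'.decode('utf-8', errors='ignore')
    'abc'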
> Does anyone have any strong feelings about what should be the default?
No strong preference, but if the objective is to continue calculations
as much as possible even in the face of missing values, returning NAN is
the better choice.

Second best would be an exception, IMO, to signal: please be explicit
about what to do about NANs in the calculation. It helps reduce the
needed backtracking when the end result of a calculation turns out to be
NAN.

Filtering out NANs should always be an explicit choice to make. Ideally
such filtering should happen *before* any calculations get applied. In
some cases, it's better to replace NANs with use case specific default
values. In others, removing them is the right thing to do.

Note that e.g. SQL defaults to ignoring NULLs in aggregate functions
such as AVG(), so there are standard precedents for ignoring NAN values
per default as well. And yes, that default can lead to wrong results in
reports which are hard to detect.

--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Experts
Perhaps a warning could be raised but the NaNs are ignored. For example:

    Input:  statistics.mean([4, 2, float('nan')])
    Output: [warning blah blah blah] 3

Or the NaNs could be treated as zeros and a warning raised:

    Input:  statistics.mean([4, 2, float('nan')])
    Output: [warning blah blah blah] 2

I do feel there should be a catchable warning but not an outright
exception, and a non-NaN value should still be returned. This allows
calculations to still be made quickly and easily with or without NaNs,
but an alternative course of action can be taken in the presence of a
NaN value if desired.

In any case, the current behavior should definitely be changed.

On Tue, Aug 24, 2021, 1:46 AM Marc-Andre Lemburg <mal@egenix.com> wrote:
> No strong preference, but if the objective is to continue calculations
> as much as possible even in the face of missing values, returning NAN
> is the better choice. [...]
On Wed, Aug 25, 2021 at 5:39 PM Finn Mason <finnjavier08@gmail.com> wrote:
> Or the NaNs could be treated as zeros and a warning raised:
Absolutely not! NaN in no way means zero, ever. We should never provide a known incorrect result.
> I do feel there should be a catchable warning but not an outright
> exception, and a non-NaN value should still be returned.
I disagree -- warnings are way too easy to ignore. Give people a way to
opt-in to silent NaN handling, but don't rely on a warning to let people
know they need to think about it.

> In any case, the current behavior should definitely be changed.
I think we all agree on that!

-CHB
On Wed, Aug 25, 2021 at 10:40:59PM -0700, Christopher Barker wrote:
> On Wed, Aug 25, 2021 at 5:39 PM Finn Mason <finnjavier08@gmail.com> wrote:
>
>> Or the NaNs could be treated as zeros and a warning raised:
>
> Absolutely not! NaN in no way means zero, ever. We should never
> provide a known incorrect result.
I agree that NANs should not be replaced by zero. If the user wants to replace NANs with some constant, they can filter and replace the data themselves.
>> I do feel there should be a catchable warning but not an outright
>> exception, and a non-NaN value should still be returned.
>
> I disagree -- warnings are way too easy to ignore. Give people a way
> to opt-in to silent NaN handling, but don't rely on a warning to let
> people know they need to think about it.
Such a warning would be opt-in. If someone chooses to ignore the
warnings that they explicitly asked to receive, that's not our problem
:-)

I think it would be useful to say "ignore NANs, but give me a warning if
you do". That gives a meaningful result (treating NANs as if they were
missing data) while still alerting the user to the fact that they had
missing data and might want to find out why.

-- Steve
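A minimal sketch of that "ignore, but warn" behaviour as a plain
wrapper (the function name and the choice of RuntimeWarning are made up
for illustration):

    import math
    import statistics
    import warnings

    def mean_skipping_nans(data):
        # Drop NaNs before averaging, but warn so the caller knows
        # that some of their data went missing.
        values = list(data)
        kept = [x for x in values if not math.isnan(x)]
        if len(kept) != len(values):
            warnings.warn(
                f"ignoring {len(values) - len(kept)} NAN value(s)",
                RuntimeWarning,
                stacklevel=2,
            )
        return statistics.mean(kept)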
On 26.08.2021 02:36, Finn Mason wrote:
> Perhaps a warning could be raised but the NaNs are ignored. For
> example:
>
>     Input:  statistics.mean([4, 2, float('nan')])
>     Output: [warning blah blah blah] 3
>
> Or the NaNs could be treated as zeros and a warning raised:
>
>     Input:  statistics.mean([4, 2, float('nan')])
>     Output: [warning blah blah blah] 2
>
> I do feel there should be a catchable warning but not an outright
> exception, and a non-NaN value should still be returned. This allows
> calculations to still be made quickly and easily with or without NaNs,
> but an alternative course of action can be taken in the presence of a
> NaN value if desired.
With the keyword argument, you can decide what to do.

As for the default: for codecs we made raising an exception the default,
simply because this highlights the need to make an explicit decision.
For long running calculations this may not be desirable, but then
getting NAN as the end result isn't the best compromise either.

In practice it's better to check for NANs before entering a calculation
and then apply case specific handling, e.g. replace NANs with fixed
default values, remove them, use a different heuristic for the
calculation, stop the calculation and ask for better input, etc. There
are many ways to process things in the face of NANs.

In Python you can use a simple test for this:
    >>> nan = float('nan')
    >>> l = [1, 2, 3, nan]
    >>> d = {nan: 1, 2: 3, 4: 5, 5: nan}
    >>> s = set(l)
    >>> nan in l
    True
    >>> nan in d
    True
    >>> nan in s
    True
but this really only makes sense for smaller data sets. If you have a large data set where you rarely get NANs, using the keyword argument may indeed be a better way to go about this.
> In any case, the current behavior should definitely be changed.
Indeed. The NAN handling in median() looks like a bug, more than anything else:
    >>> import statistics
    >>> statistics.mean(l)
    nan
    >>> statistics.mean(d)
    nan
    >>> statistics.mean(s)
    nan

    >>> l1 = [1, 2, nan, 4]
    >>> statistics.mean(l1)
    nan
    >>> l2 = [nan, 1, 2, 4]
    >>> statistics.mean(l2)
    nan

    >>> statistics.median(l)
    2.5
    >>> statistics.median(l1)
    nan
    >>> statistics.median(l2)
    1.5
--
Marc-Andre Lemburg
eGenix.com
On 26/08/2021 09:36, Marc-Andre Lemburg wrote:
> In Python you can use a simple test for this:
I think you need math.isnan().
>     >>> nan = float('nan')
>     >>> l = [1, 2, 3, nan]
>     >>> d = {nan: 1, 2: 3, 4: 5, 5: nan}
>     >>> s = set(l)
>     >>> nan in l
>     True
That only works with identical nan-s, and because the container omits the equality check for identical objects:
nan = float("nan") nan in [nan] True
But:
    >>> nan == nan
    False
    >>> nan in [float("nan")]
    False
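A reliable containment test therefore has to go through math.isnan(); a
minimal sketch (the helper name here is made up):

    import math

    def has_nan(values):
        # True if any element is a NaN.  Works for any NaN, not just
        # the identical object, because it never relies on equality.
        return any(math.isnan(x) for x in values)

    # has_nan([1, 2, float('nan')])  -> True
    # has_nan([1, 2, 3])             -> False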
On 26.08.2021 10:02, Peter Otten wrote:
>> In Python you can use a simple test for this:
>
> I think you need math.isnan().
>
> That only works with identical nan-s, and because the container omits
> the equality check for identical objects [...]
Oh, good point. I was under the impression that NAN is handled as a
singleton. Perhaps this should be changed to make it easier to detect
NANs?!

--
Marc-Andre Lemburg
eGenix.com
On Thu, Aug 26, 2021 at 11:05:01AM +0200, Marc-Andre Lemburg wrote:
> Oh, good point. I was under the impression that NAN is handled as a
> singleton.
There are 4503599627370496 distinct quiet NANs (plus about the same
number of signalling NANs). So it would need to be 4-quadrillion-ton :-)

(If anyone is concerned about the large number of NANs, it's less than
0.05% of the total number of floats.)

Back in the mid-80s, Apple's floating point library, SANE, distinguished
different classes of error with distinct NANs. Few systems have followed
that lead, but each NAN still has 51 bits available for a diagnostic
code, plus the sign bit. While Python itself only generates a single NAN
value, if you are receiving data from outside sources it could contain
NANs with distinct payloads.

The IEEE-754 standard doesn't mandate that NANs preserve the payload,
but it does recommend it. We shouldn't gratuitously discard that
information. It could be meaningful to whoever is generating the data.

-- Steve
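For the curious, a binary64 NaN payload can be inspected with the
struct module; a minimal sketch (the helper names are made up):

    import struct

    def nan_payload(x):
        # Reinterpret the float's bits and mask off the low 51 bits,
        # i.e. the quiet-NaN payload field of an IEEE-754 binary64.
        bits = struct.unpack('<Q', struct.pack('<d', x))[0]
        return bits & ((1 << 51) - 1)

    def nan_with_payload(payload):
        # Build a quiet NaN carrying the given payload
        # (0 <= payload < 2**51).
        bits = 0x7FF8000000000000 | (payload & ((1 << 51) - 1))
        return struct.unpack('<d', struct.pack('<Q', bits))[0]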
On 26.08.2021 12:15, Steven D'Aprano wrote:
> Back in the mid-80s, Apple's floating point library, SANE,
> distinguished different classes of error with distinct NANs. [...]
>
> The IEEE-754 standard doesn't mandate that NANs preserve the payload,
> but it does recommend it. We shouldn't gratuitously discard that
> information. It could be meaningful to whoever is generating the data.
Fair enough. Would it then make sense to at least have all possible NAN
objects compare equal, treating the extra error information as an
attribute value rather than a distinct value, and perhaps exposing it as
such?

I'm after "practicality beats purity" here. The math.isnan() test
doesn't work well in practice, since you'd have to iterate over all
sequence members and call that test function, which is expensive when
done in Python.

--
Marc-Andre Lemburg
eGenix.com
There have been a number of discussions on this list, and at least one
PEP, about NaN (and other special values). Let's keep this thread about
handling them in the statistics lib.

But briefly: NaNs are weird on purpose, and Python should absolutely not
deviate from IEEE. That's (one reason) Python has None :-)

If you are that worried about performance, you should probably use numpy
anyway :-)

-CHB

On Thu, Aug 26, 2021 at 3:47 AM Marc-Andre Lemburg <mal@egenix.com> wrote:
> I'm after the "practicality beats purity" here. The math.isnan() test
> doesn't work well in practice, since you'd have to iterate over all
> sequence members and call that test function, which is expensive when
> done in Python.
On 26.08.2021 17:36, Christopher Barker wrote:
> There have been a number of discussions on this list, and at least one
> PEP, about NaN (and other special values).
>
> Let's keep this thread about handling them in the statistics lib.
>
> But briefly: NaNs are weird on purpose, and Python should absolutely
> not deviate from IEEE.
Agreed. I was just surprised that NANs are more Medusa-like than expected ;-)
> That's (one reason) Python has None :-)
>
> If you are that worried about performance, you should probably use
> numpy anyway :-)
Sure, and pandas, which both have methods to replace NANs in arrays.
On Thu, Aug 26, 2021, 6:46 AM Marc-Andre Lemburg wrote:
> Fair enough. Would it then make sense to at least have all possible
> NAN objects compare equal, treating the extra error information as an
> attribute value rather than a distinct value and perhaps exposing this
> as such ?
No, no, no! Almost the entire point of a NaN is that it doesn't compare as equal to anything... Not even to itself!
On 27.08.2021 03:24, David Mertz, Ph.D. wrote:
>> Fair enough. Would it then make sense to at least have all possible
>> NAN objects compare equal [...]
>
> No, no, no!
>
> Almost the entire point of a NaN is that it doesn't compare as equal
> to anything... Not even to itself!
Yeah, you're right, it would break the logic that NAN should "infect"
most (or even all) other operations they are used in, to signal "no idea
what to do here".

--
Marc-Andre Lemburg
eGenix.com
On Thu, Aug 26, 2021 at 12:44:18PM +0200, Marc-Andre Lemburg wrote:
> Fair enough. Would it then make sense to at least have all possible
> NAN objects compare equal, treating the extra error information as an
> attribute value rather than a distinct value and perhaps exposing this
> as such ?
>
> I'm after the "practicality beats purity" here. The math.isnan() test
> doesn't work well in practice, since you'd have to iterate over all
> sequence members and call that test function, which is expensive when
> done in Python.
Yes, it is expensive in Python (but maybe the work Mark Shannon is doing
will speed up function calls?). It's even more expensive because
Decimals and floats don't support the same API for testing for NANs,
which makes me sad.

But having NANs compare equal to each other would be a major backwards
compatibility break, and it would go against the standard.

Right now, the only fast way to check for NANs without worrying about
their type is to take advantage of the fact that they don't equal
themselves: `x != x`.

-- Steve
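A quick illustration of that type-agnostic trick. (Note that with
Decimal this works for quiet NaNs; a signalling Decimal NaN would raise
on comparison.)

    from decimal import Decimal

    def is_nan(x):
        # NaNs are the only values that don't compare equal to themselves.
        return x != x

    # is_nan(float('nan'))   -> True
    # is_nan(Decimal('NaN')) -> True
    # is_nan(1.5)            -> False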
26.08.21 12:05, Marc-Andre Lemburg wrote:
> Oh, good point. I was under the impression that NAN is handled as a
> singleton. Perhaps this should be changed to make it easier to detect
> NANs ?!
Even ignoring a NaN payload, there are many different NaNs of different types. For example, Decimal('nan') cannot be the same as float('nan').
On 27.08.2021 09:58, Serhiy Storchaka wrote:
>> Oh, good point. I was under the impression that NAN is handled as a
>> singleton. Perhaps this should be changed to make it easier to detect
>> NANs ?!
>
> Even ignoring a NaN payload, there are many different NaNs of
> different types. For example, Decimal('nan') cannot be the same as
> float('nan').
Right, it's a much larger problem than I thought :-) cmath has its own
NANs as well. Too many NANs...

It's probably better to stick with NumPy for handling data sets with
embedded NANs. It provides consistent handling for NANs across integers,
floats, complex and even date/time values (as NATs).

--
Marc-Andre Lemburg
eGenix.com
On Thu, Aug 26, 2021 at 09:36:27AM +0200, Marc-Andre Lemburg wrote:
> Indeed. The NAN handling in median() looks like a bug, more than
> anything else:
[slightly paraphrased]
    >>> l1 = [1, 2, nan, 4]
    >>> l2 = [nan, 1, 2, 4]
    >>> statistics.median(l1)
    nan
    >>> statistics.median(l2)
    1.5
Looks can be deceiving; it's actually a feature *wink*

That behaviour is actually the direct consequence of NANs being
unordered. The IEEE-754 standard requires that comparisons with NANs all
return False (apart from not-equal, which returns True). So NANs are
neither less than, equal to, nor greater than other values. Which makes
sense numerically: NANs do not appear on the number line and are not
ordered with numbers.

So when you sort a list containing NANs, they end up in some arbitrary
position that depends on the sort implementation, the other values in
the list, and their initial position. NANs can even throw out the order
of other values:

    >>> sorted([3, nan, 4, 2, nan, 1])
    [3, nan, 1, 2, 4, nan]

and *that* violates `median`'s assumption that sorting values actually
puts them in sorted order, which is why median returns the wrong value.

I don't think that Timsort is buggy here. I expect that every sort
algorithm on the planet will require a total order to get sensible
results, and NANs violate that expectation.

https://eli.thegreenplace.net/2018/partial-and-total-orders/

If we define the less than operator `<` as "isn't greater than (or
equal to)", then we can see that sorted is *locally* correct:

* 3 isn't greater than nan;
* nan isn't greater than 1;
* 1 isn't greater than 2;
* 2 isn't greater than 4;
* and 4 isn't greater than nan.

sorted() has correctly sorted the values in the sense that the invariant
"a comes before b iff a isn't greater than b" is satisfied between each
pair of consecutive values, but globally the order is violated because
NANs are unordered and mess up transitivity: 3 isn't greater than NAN,
and NAN isn't greater than 1, but it is not true that 3 isn't greater
than 1.

In the general case of sorting elements, I think that the solution is
"don't do that". If you have objects which don't form a total order,
then you can't expect to get sensible results from sorting them.

In the case of floats, it would be nice to have a totalOrder function as
specified in the 2008 revision of IEEE-754:

https://irem.univ-reunion.fr/IMG/pdf/ieee-754-2008.pdf

Then we could sensibly do:

    sorted(floats_or_decimals, key=totalorder)

and at least NANs would end up in a consistent place and everything else
sorted correctly.

-- Steve
On 28.08.2021 05:32, Steven D'Aprano wrote:
> So when you sort a list containing NANs, they end up in some arbitrary
> position that depends on the sort implementation, the other values in
> the list, and their initial position. [...]
>
> In the case of floats, it would be nice to have a totalOrder function
> as specified in the 2008 revision of IEEE-754 [...] and at least NANs
> would end up in a consistent place and everything else sorted
> correctly.
Thanks for the analysis. To me, the behavior looked a lot like stripping
NANs left and right from the list, but what you're explaining makes this
appear even more as a bug in the implementation of median() - basically
wrong assumptions about NANs sorting correctly. The outcome could be
more or less random, it seems.

In SQL, NULLs always sort smaller than anything else. Perhaps that would
be a strategy to use here as well. The totalOrder predicate in the IEEE
spec would make NANs get shifted to the left or right part of the
sequence, depending on the NAN sign.

In any case, +1 on anything which fixes this :-)

--
Marc-Andre Lemburg
eGenix.com
On 8/28/21 6:23 AM, Marc-Andre Lemburg wrote:
> To me, the behavior looked a lot like stripping NANs left and right
> from the list, but what you're explaining makes this appear even more
> as a bug in the implementation of median() - basically wrong
> assumptions about NANs sorting correctly. The outcome could be more or
> less random, it seems.
It isn't a 'bug in median()' making the wrong assumption about NANs
sorting; it is an error in GIVING median a NAN, which violates its
precondition that the input have a total order under the less-than
operator.

Asking for the median value of a list that doesn't have a proper total
order is a nonsense question, so you get a nonsense answer.

It costs too much to have median test whether the input does have a
total order, just to try to report this sort of condition, so that won't
be done for a general purpose operation.

-- Richard Damon
On 28.08.2021 14:33, Richard Damon wrote:
>> To me, the behavior looked a lot like stripping NANs left and right
>> from the list, but what you're explaining makes this appear even more
>> as a bug in the implementation of median() - basically wrong
>> assumptions about NANs sorting correctly. The outcome could be more
>> or less random, it seems.
>
> It isn't a 'bug in median()' making the wrong assumption about NANs
> sorting, it is an error in GIVING median a NAN which violates its
> precondition that the input have a total-order by the less than
> operator.
That precondition is not documented as such, though: https://docs.python.org/3/library/statistics.html#statistics.median
> Asking for the median value of a list that doesn't have a proper total
> order is a nonsense question, so you get a nonsense answer.
Leaving aside that many programmers will probably not know that NANs
cause the total ordering of Python floats to fail (even though they are
of type float), you'd expect Python to do the right thing and either:

- raise an exception, or
- apply a work-around to regain total ordering, as suggested by Steven, or
- return NAN for the calculation, as NumPy does:
    >>> import statistics
    >>> statistics.median([1, 2, 3])
    2
    >>> nan = float('nan')
    >>> statistics.median([1, 2, 3, nan])
    2.5
    >>> statistics.median([1, 2, nan, 3])
    nan
    >>> statistics.median([1, nan, 2, 3])
    nan
    >>> statistics.median([nan, 1, 2, 3])
    1.5
    >>> nan < 1
    False
    >>> nan < nan
    False
    >>> 1 < nan
    False
vs.
    >>> import numpy as np
    >>> nan = np.nan
    >>> np.median(np.array([1, 2, 3, nan]))
    nan
    >>> np.median(np.array([1, 2, nan, 3]))
    nan
    >>> np.median(np.array([1, nan, 2, 3]))
    nan
    >>> np.median(np.array([nan, 1, 2, 3]))
    nan
    >>> nan < nan
    False
    >>> nan < 1
    False
    >>> 1 < nan
    False
> It costs too much to have median test if the input does have a total
> order, just to try to report this sort of condition, that it won't be
> done for a general purpose operation.
If NumPy can do it, why not Python?

--
Marc-Andre Lemburg
eGenix.com
Returning a NaN by default has the advantage of being consistent with IEEE 754 semantics for sequence-based operations (like `sum` and `dot`) and with existing Python `math` module functions like `fsum`, `prod` and `hypot`. In IEEE 754, the majority of operations silently return a NaN (not signalling any floating-point exception) when given a NaN as input.
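A quick check of the math functions mentioned, all of which quietly
propagate NaN:

    >>> import math
    >>> math.fsum([1.0, float('nan'), 2.0])
    nan
    >>> math.prod([1.0, float('nan'), 2.0])
    nan
    >>> math.hypot(float('nan'), 3.0)
    nan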
On 2021-08-23 20:53, Steven D'Aprano wrote:
> So I propose that statistics functions gain a keyword only parameter
> to specify the desired behaviour when a NAN is found:
>
> - raise an exception
> - return NAN
> - ignore it (filter out NANs)
>
> which seem to be the three most common preferences. (It seems to be
> split roughly equally between the three.)
>
> Thoughts? Objections?
I agree that these are the three options that should be available because they're the most commonly used ones in other tools that handle NANs (like numpy and pandas).
> Does anyone have any strong feelings about what should be the default?
I'm conflicted. The NAN-aware tool I use most is Pandas, which for the
most part handles NANs by filtering them out, and this is very handy.
But that's partly because Pandas has a lot of NAN-awareness built in
(making it easy to, for instance, fill in NANs with some default or
imputed value). I think I'd lean toward "return NAN" as the best
default, as it seems most consistent with how NAN works in ordinary
mathematical expressions (e.g., `2 + nan`).

One important thing we should think about is whether to add similar
handling to `max` and `min`. These are builtin functions, not in the
statistics module, but they have similarly confusing behavior with NAN:
compare `max(1, 2, float('nan'))` with `max(float('nan'), 1, 2)`. As
long as we're handling this for median and so on, it would be nice to
have the ability to do NAN-aware max and min as well.

--
Brendan Barnwell
"Do not follow where the path may lead. Go, instead, where there is no
path, and leave a trail." --author unknown
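The asymmetry Brendan describes, plus one possible NAN-aware wrapper (a
sketch only; the helper name is made up):

    import math

    # max() just compares left to right, so a leading NaN "wins" every
    # comparison by making them all return False:
    max(1, 2, float('nan'))   # 2
    max(float('nan'), 1, 2)   # nan

    def nanmax(values):
        # A NaN-ignoring max over an iterable of floats.
        return max(v for v in values if not math.isnan(v))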
On 26/08/2021 19:41, Brendan Barnwell wrote:
> On 2021-08-23 20:53, Steven D'Aprano wrote:
>
>> So I propose that statistics functions gain a keyword only parameter
>> to specify the desired behaviour when a NAN is found:
>>
>> - raise an exception
>> - return NAN
>> - ignore it (filter out NANs)
>>
>> which seem to be the three most common preferences. (It seems to be
>> split roughly equally between the three.)
>>
>> Thoughts? Objections?
I'd like to suggest that there isn't a single answer that is most
natural for all functions. There may be as few as two.

Guido's proposal was that mean() return NAN because the naive arithmetic
formula would return NAN. The awkward first example was median(), which
is based on order (comparison). Now Brendan has pointed out:
> One important thing we should think about is whether to add similar
> handling to `max` and `min`. These are builtin functions, not in the
> statistics module, but they have similarly confusing behavior with
> NAN: compare `max(1, 2, float('nan'))` with `max(float('nan'), 1, 2)`.
The real behaviour of max() is to return the first argument that is not exceeded by any that follow, so:
    >>> max(nan, nan2, 1, 2) is nan
    True
    >>> max(nan2, nan, 1, 2) is nan2
    True
As a definition, that is not as easy to understand as "return the
largest argument". The behaviour arises because in Python, x > nan is
False. This choice, which is often sensible, makes the set of float
values less than totally ordered.

It seems to me an error in principle to apply a function whose simple
definition assumes a total ordering to a set that cannot be ordered. So
most natural to me would be to raise an error for this class of
function. Meanwhile, functions that have a purely arithmetic definition
most naturally return NAN. Are there any other classes of function than
comparison or arithmetic? Counting, perhaps, or is that comparison
again?

Proposals for a general solution, especially if based on a replacement
value, are more a question of how you would like to pre-filter your set.
An API could offer some filters, or it may be clearer left to the
caller. It is no doubt too late to alter the default behaviour of
familiar functions, but there could be a "strict" mode.

--
Jeff Allen
If folks want faster processing (checking for, replacing) of NaNs in
sequences, a function written in C could be added to the math module (or
the statistics module).

Now that I said that, it might make sense to put such a function in the
statistics package, for use there anyway.

Personally, I think if you are working with large enough datasets to
care, you probably should use numpy anyway.

-CHB

On Fri, Aug 27, 2021 at 3:39 AM Jeff Allen <ja.py@farowl.co.uk> wrote:
> I'd like to suggest that there isn't a single answer that is most
> natural for all functions. There may be as few as two. [...]
Perhaps a math.hasnan() function for collections could be implemented
with binary search?

    math.hasnan(seq)

Though it is true that if you're using datasets large enough to care
about speed, you should probably be using the SciPy stack instead of
statistics in the first place.

On Fri, Aug 27, 2021, 11:25 AM Christopher Barker <pythonchb@gmail.com> wrote:
> If folks want faster processing (checking for, replacing) of NaNs in
> sequences, a function written in C could be added to the math module
> (or the statistics module). [...]
On 27Aug2021 15:50, Finn Mason <finnjavier08@gmail.com> wrote:
> Perhaps a math.hasnan() function for collections could be implemented
> with binary search?
>
>     math.hasnan(seq)
Why would a binary search be of use? A straight sequential scan of the
sequence seems the only reliable method. Binary search is for finding a
value in an ordered sequence.

Cheers, Cameron Simpson <cs@cskk.id.au>
Not to go off on too much of a tangent, but isn't NaN unorderable? It's greater than nothing, and less than nothing, so you can't even really sort a list with a NaN value in it (though I'm sure Python does sort it by some metric for practical reasons) - it would be impossible to find a NaN with a binary search... it would be impossible to have a NaN in an ordered sequence... wouldn't it?

On Sunday, August 29, 2021 5:36 PM, Cameron Simpson <cs@cskk.id.au> wrote:

On 27Aug2021 15:50, Finn Mason <finnjavier08@gmail.com> wrote:
Perhaps a math.hasnan() function for collections could be implemented with binary search?
math.hasnan(seq)
Why would a binary search be of use? A straight sequential scan of the sequence seems the only reliable method. Binary search is for finding a value in an ordered sequence.

Cheers, Cameron Simpson <cs@cskk.id.au>
On Sun, Aug 29, 2021 at 08:20:07PM -0400, tritium-list@sdamon.com wrote:
Not to go off on too much of a tangent, but isn't NaN unorderable? It's greater than nothing, and less than nothing, so you can't even really sort a list with a NaN value in it (though I'm sure Python does sort it by some metric for practical reasons) - it would be impossible to find a NaN with a binary search... it would be impossible to have a NaN in an ordered sequence... wouldn't it?
Sorting NANs will end up arranging them in arbitrary positions, and spoil the order of other values:

>>> from math import nan
>>> sorted([4, nan, 2, 5, 1, nan, 3, 0])
[4, nan, 0, 1, 2, 5, nan, 3]

I *think* Timsort will end up leaving each NAN in its original position, but other implementations may do something different. However you sort, they end up messing the order up.

However we could add a function, totalorder, which can be used as a key function to force an order on NANs. The 2008 version of the IEEE-754 standard recommends such a function:

from some_module import totalorder
sorted([4, nan, 2, 5, 1, nan, 3, 0], key=totalorder)
# --> [nan, nan, 0, 1, 2, 3, 4, 5]

It would be nice if such a totalorder function worked correctly on both floats and Decimals. Anyone feel up to writing one? Decimal already has a `compare_total` method, but I'm unsure if it behaves the expected way. But we have no equivalent key function for floats.

-- Steve
On 2021-08-30 04:31, Steven D'Aprano wrote:
On Sun, Aug 29, 2021 at 08:20:07PM -0400, tritium-list@sdamon.com wrote:
Not to go off on too much of a tangent, but isn't NaN unorderable? It's greater than nothing, and less than nothing, so you can't even really sort a list with a NaN value in it (though I'm sure Python does sort it by some metric for practical reasons) - it would be impossible to find a NaN with a binary search... it would be impossible to have a NaN in an ordered sequence... wouldn't it?
Sorting NANs will end up arranging them in arbitrary positions, and spoil the order of other values:
>>> from math import nan
>>> sorted([4, nan, 2, 5, 1, nan, 3, 0])
[4, nan, 0, 1, 2, 5, nan, 3]
I *think* Timsort will end up leaving each NAN in its original position, but other implementations may do something different. However you sort, they end up messing the order up.
However we could add a function, totalorder, which can be used as a key function to force an order on NANs. The 2008 version of the IEEE-754 standard recommends such a function:
from some_module import totalorder
sorted([4, nan, 2, 5, 1, nan, 3, 0], key=totalorder)
# --> [nan, nan, 0, 1, 2, 3, 4, 5]
It would be nice if such a totalorder function worked correctly on both floats and Decimals. Anyone feel up to writing one?
How about:

import math

def totalorder(x):
    # NANs sort before everything else; other values keep their natural order.
    return (0,) if math.isnan(x) else (1, x)
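A quick check of that key function (assuming the definition above), matching the output Steven sketched:

>>> from math import nan
>>> sorted([4, nan, 2, 5, 1, nan, 3, 0], key=totalorder)
[nan, nan, 0, 1, 2, 3, 4, 5]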
Decimal already has a `compare_total` method, but I'm unsure if it behaves the expected way. But we have no equivalent key function for floats.
On Mon, Aug 30, 2021 at 1:33 PM Steven D'Aprano <steve@pearwood.info> wrote:
However we could add a function, totalorder, which can be used as a key function to force an order on NANs. The 2008 version of the IEEE-754 standard recommends such a function:
from some_module import totalorder
sorted([4, nan, 2, 5, 1, nan, 3, 0], key=totalorder)
# --> [nan, nan, 0, 1, 2, 3, 4, 5]
It would be nice if such a totalorder function worked correctly on both floats and Decimals. Anyone feel up to writing one?
I really don't feel like buying the standards document itself, so I'm going based on this, which appears to be quoting the standard: https://github.com/rust-lang/rust/issues/5585

Based on that, I don't think it's possible to have a totalorder function that will work 100% correctly on float and Decimal in a mixture. I suspect it's not even possible to make it generic while still being fully compliant. The differences from the vanilla less-than operator are:

1) Positive zero sorts after negative zero
2) NaNs sort at one end or the other depending on their sign bit
3) Signalling NaNs are closer to zero than quiet
4) NaNs are sorted by payload
5) Different representations of the same value are distinguished by their exponents. I'm not sure when that would come up.

So here are two partial implementations.

1) Ensure that NaNs are at the end, but otherwise unchanged. Compatible with all numeric types.

def simpleorder(val):
    # A NAN is the only value for which val != val is True, so NANs sort last.
    return (val != val, val)

2) Acknowledge signs on NaNs and zeroes. Compatible with floats, not sure about Decimals. I haven't figured out how to make a negative NaN in Decimal.

import math

def floatorder(val):
    # Sort by sign first (copysign extracts the sign bit even from NANs and -0.0),
    # then push NANs to the end of their sign group.
    return (math.copysign(1, val), val != val, val)

Neither is fully compliant. A signalling NaN will probably cause an error. (I have no idea. Never worked with them.) NaN payloads... I don't know how to access those in Python other than with ctypes. And if two floats represent the same number, under what circumstances could their exponents differ?

I doubt we'll get a fully compliant implementation in Python. If one is to exist, it'd probably be best to write it in C, using someone else's code: https://www.gnu.org/software/libc/manual/html_node/FP-Comparison-Functions.h...

And it would be specific to float, not Decimal, which would need a completely different implementation. I'm not sure how many of the same concepts even exist.

TBH, I would just use simpleorder for most situations. It's simple, easy, and doesn't care about data types. All NaNs get shoved to the end, everything else gets compared normally.

ChrisA
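A quick sanity check of simpleorder on Steven's example list (assuming the definition above):

>>> from math import nan
>>> sorted([4, nan, 2, 5, 1, nan, 3, 0], key=simpleorder)
[0, 1, 2, 3, 4, 5, nan, nan]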
On Tue, Aug 24, 2021 at 01:53:51PM +1000, Steven D'Aprano wrote:
I've spoken to users of other statistics packages and languages, such as R, and I cannot find any consensus on what the "right" behaviour should be for NANs except "not that!".
So I propose that statistics functions gain a keyword only parameter to specify the desired behaviour when a NAN is found:
Thanks everyone for the feedback. Does anyone have a strong opinion on what to name this parameter?

In R, the parameter is usually named "na.rm", to remove them:

https://stat.ethz.ch/R-manual/R-patched/library/base/html/mean.html
https://stat.ethz.ch/R-manual/R-patched/library/stats/html/sd.html

Matlab optionally takes one of two strings; it doesn't seem to have named parameters:

https://au.mathworks.com/help/matlab/ref/mean.html?#d123e832786

I'm leaning towards "nans=..." with an enum.

-- Steve
On Sat, 2021-08-28 at 11:49 +1000, Steven D'Aprano wrote:
On Tue, Aug 24, 2021 at 01:53:51PM +1000, Steven D'Aprano wrote:
I've spoken to users of other statistics packages and languages, such as R, and I cannot find any consensus on what the "right" behaviour should be for NANs except "not that!".
So I propose that statistics functions gain a keyword only parameter to specify the desired behaviour when a NAN is found:
Thanks everyone for the feedback, does anyone have a strong opinion on what to name this parameter?
In R, the usual parameter name is typically "na.rm" to remove them:
https://stat.ethz.ch/R-manual/R-patched/library/base/html/mean.html
https://stat.ethz.ch/R-manual/R-patched/library/stats/html/sd.html
Matlab optionally takes one of two strings:
https://au.mathworks.com/help/matlab/ref/mean.html?#d123e832786
It doesn't seem to have named parameters.
I'm leaning towards "nans=..." with an enum.
SciPy should probably also be a data-point. It uses:

nan_policy : {'propagate', 'raise', 'omit'}, optional

statsmodels seems to use:

missing : str
    Available options are ‘none’, ‘drop’, and ‘raise’

pandas has skipna=bool.

Since pandas and statsmodels hint at "missing values", there is likely a good reason not to worry about them. I guess it was already noted that both statsmodels and SciPy default to propagating. [1]

Cheers,

Sebastian

[1] In general Python is more careful, since it raises errors sometimes. But this is almost only(?) when creating a non-finite value from finite values, not when propagating non-finite values (which are not normally IEEE warnings, although creating NaN from inf with `inf - inf` is). In that sense it is different, but probably not much.
SciPy should probably also be a data-point, it uses:
nan_policy : {'propagate', 'raise', 'omit'}, optional
+1

Also +1 on a string flag, rather than an Enum.

-CHB

--
Christopher Barker, PhD (Chris)

Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
I like the statsmodels spelling better:

missing : str
    Available options are ‘none’, ‘drop’, and ‘raise’

But this is bikeshed painting if the options exist. However, I WOULD urge the argument to take EITHER a string OR an enum. I don't think any other libraries mentioned do that, but it would just seem friendly. The names in the enum, of course, should match the string names (other than casing perhaps).

On Sat, Aug 28, 2021 at 1:16 AM Christopher Barker <pythonchb@gmail.com> wrote:
SciPy should probably also be a data-point, it uses:
nan_policy : {'propagate', 'raise', 'omit'}, optional
+1
Also +1 on a string flag, rather than an Enum.
-CHB -- Christopher Barker, PhD (Chris)
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On Sat, Aug 28, 2021, 1:58 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, Aug 28, 2021 at 01:36:33AM -0400, David Mertz, Ph.D. wrote:
I like the statsmodels spelling better: missing : str; Available options are ‘none’, ‘drop’, and ‘raise’
NANs do not necessarily represent missing data.
I think in the context of `stats` they do. But this is color of bikeshed, and I defer to you, of course.
On 28.08.2021 07:14, Christopher Barker wrote:
SciPy should probably also be a data-point, it uses:
nan_policy : {'propagate', 'raise', 'omit'}, optional
+1
Also +1 on a string flag, rather than an Enum.
Same here. Codecs use strings as well: 'strict', 'ignore', 'replace' (and a bunch of others).
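That precedent in action:

>>> 'naïve'.encode('ascii', 'ignore')
b'nave'
>>> 'naïve'.encode('ascii', 'replace')
b'na?ve'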
--
Marc-Andre Lemburg
eGenix.com
On 28 Aug 2021, at 07:14, Christopher Barker <pythonchb@gmail.com> wrote:
SciPy should probably also be a data-point, it uses:
nan_policy : {'propagate', 'raise', 'omit'}, optional
+1
Also +1 on a string flag, rather than an Enum.
Why do you prefer strings for the options rather than an Enum? The enum clearly documents what the valid options are.

Ronald

—
Twitter / micro.blog: @ronaldoussoren
Blog: https://blog.ronaldoussoren.net/
On Mon, Aug 30, 2021 at 12:57 AM Ronald Oussoren <ronaldoussoren@mac.com> wrote:
On 28 Aug 2021, at 07:14, Christopher Barker <pythonchb@gmail.com> wrote:
Also +1 on a string flag, rather than an Enum.
Why do you prefer strings for the options rather than an Enum? The enum clearly documents what the valid options are.
So does documentation (docstrings, useful error messages). I don't think the documentation built in to an Enum is any easier to access. In fact, looking now, I'm trying to see how an Enum provides any easy-to-access documentation -- other than looking at the creation code. As a rule, I don't think Enums provide documentation of the valid values, but rather, enforcement.

e.g.: what are the valid values? what do they mean?

To be honest, I haven't really used Enums much (in fact, only to mirror C enums in extension code), but part of that is because I have yet to see what the point is in Python, over simple string flags. I suppose they provide a real advantage for static typing, but other than that I just don't see it.

But what they do is create a burden of extra code to read and write. Compare:

from statistics import median
result = median(the_data, nan_policy='omit')

with:

from statistics import median, NaNPolicy
result = median(the_data, nan_policy=NaNPolicy.Omit)

or maybe:

import statistics as stats
result = stats.median(the_data, nan_policy=stats.Omit)

There are any number of ways to import, and names to create, but they all seem to me to be more awkward to use than a simple text flag. Ever since I started using Python, I've really appreciated the use of string flags :-)

I do see how using an Enum makes things a bit easier for the author of a package (more DRY), but I don't think that should be the priority here.

No one needs to convince me, but I would be interested in seeing the recommended way to import and use Enum flags - maybe it doesn't have to be as awkward as I think it is.

-CHB

--
Christopher Barker, PhD (Chris)

Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
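For what it's worth, one thing an Enum does give you is introspection at the REPL -- a sketch with a hypothetical NanPolicy, made up for illustration and not an existing statistics API:

>>> import enum
>>> class NanPolicy(enum.Enum):
...     PROPAGATE = 'propagate'
...     RAISE = 'raise'
...     OMIT = 'omit'
...
>>> list(NanPolicy)
[<NanPolicy.PROPAGATE: 'propagate'>, <NanPolicy.RAISE: 'raise'>, <NanPolicy.OMIT: 'omit'>]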
On Tue, Aug 31, 2021 at 2:19 AM Christopher Barker <pythonchb@gmail.com> wrote:
On Mon, Aug 30, 2021 at 12:57 AM Ronald Oussoren <ronaldoussoren@mac.com> wrote:
On 28 Aug 2021, at 07:14, Christopher Barker <pythonchb@gmail.com> wrote:
Also +1 on a string flag, rather than an Enum.
Why do you prefer strings for the options rather than an Enum? The enum clearly documents what the valid options are.
So does documentation (docstrings, useful error messages). I don't think the documentation built in to an Enum is any easier to access. In fact, looking now, I'm trying to see how an Enum provides any easy to access documentation -- other than looking at the creation code. As a rule, I don't think Enums provide documentation of the valid values, but rather, enforcement.
e.g.: what are the valid values? what do they mean?
To be honest, I haven't really used Enums much (in fact, only to mirror C enums in extension code), but part of that is because I have yet to see what the point is in Python, over simple string flags.
I suppose they provide a real advantage for static typing, but other than that I just don't see it.
They provide a *huge* advantage when they can be combined. It's easy to accept a flags argument that is the bitwise Or of a collection of flags, and then ascertain whether or not a specific flag was included. The repr of such a combination is useful and readable, too.

ChrisA
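A minimal illustration with enum.Flag (the Channels names are made up for the example; the exact repr of a combination varies slightly between Python versions, so only membership tests are shown):

>>> import enum
>>> class Channels(enum.Flag):
...     EMAIL = enum.auto()
...     SMS = enum.auto()
...     PUSH = enum.auto()
...
>>> combo = Channels.EMAIL | Channels.PUSH
>>> Channels.PUSH in combo
True
>>> Channels.SMS in combo
False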
On Tue, Aug 31, 2021 at 02:23:29AM +1000, Chris Angelico wrote:
On Tue, Aug 31, 2021 at 2:19 AM Christopher Barker <pythonchb@gmail.com> wrote:
I suppose they provide a real advantage for static typing, but other than that I just don't see it.
They provide a *huge* advantage when they can be combined. It's easy to accept a flags argument that is the bitwise Or of a collection of flags, and then ascertain whether or not a specific flag was included. The repr of such a combination is useful and readable, too.
I'm not a big user of Enums, but I *think* that only applies for IntEnums? In any case, in this case it wouldn't make sense to combine NAN policies. What would it mean to combine the "raise exception on NAN" and "ignore NANs" policies? -- Steve
On Tue, Aug 31, 2021 at 11:47 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Aug 31, 2021 at 02:23:29AM +1000, Chris Angelico wrote:
On Tue, Aug 31, 2021 at 2:19 AM Christopher Barker <pythonchb@gmail.com> wrote:
I suppose they provide a real advantage for static typing, but other than that I just don't see it.
They provide a *huge* advantage when they can be combined. It's easy to accept a flags argument that is the bitwise Or of a collection of flags, and then ascertain whether or not a specific flag was included. The repr of such a combination is useful and readable, too.
I'm not a big user of Enums, but I *think* that only applies for IntEnums?
In any case, in this case it wouldn't make sense to combine NAN policies. What would it mean to combine the "raise exception on NAN" and "ignore NANs" policies?
Agreed. In this case, an enum offers little that a string can't do just as well. But there are plenty of other situations where an enum would be better (*ahem* open modes?), although they do come with a performance hit in some cases. ChrisA
On Mon, Aug 30, 2021 at 6:50 PM Steven D'Aprano <steve@pearwood.info> wrote:
They provide a *huge* advantage when they can be combined. It's easy to accept a flags argument that is the bitwise Or of a collection of flags,
I'm not a big user of Enums, but I *think* that only applies for IntEnums?
Actually, I think that's "Flag" Enums :-) -- and this is a nice use case, but not for the topic at hand. -CHB -- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On 2021-08-30 09:23, Chris Angelico wrote:
On Tue, Aug 31, 2021 at 2:19 AM Christopher Barker <pythonchb@gmail.com> wrote:
To be honest, I haven't really used Enums much (in fact, only to mirror C enums in extension code), but part of that is because I have yet to see what the point is in Python, over simple string flags.
I suppose they provide a real advantage for static typing, but other than that I just don't see it.
They provide a *huge* advantage when they can be combined. It's easy to accept a flags argument that is the bitwise Or of a collection of flags, and then ascertain whether or not a specific flag was included. The repr of such a combination is useful and readable, too.
In general I find that harder to grok than just using separate boolean arguments for each flag. -- Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown
On 30 Aug 2021, at 18:19, Christopher Barker <pythonchb@gmail.com> wrote:
On Mon, Aug 30, 2021 at 12:57 AM Ronald Oussoren <ronaldoussoren@mac.com> wrote:
On 28 Aug 2021, at 07:14, Christopher Barker <pythonchb@gmail.com> wrote:
Also +1 on a string flag, rather than an Enum.

Why do you prefer strings for the options rather than an Enum? The enum clearly documents what the valid options are.
So does documentation (docstrings, useful error messages). I don't think the documentation built in to an Enum is any easier to access.
The enum definition shows the valid names that can be used; string literals are more open-ended. Documentation helps, but can get out of sync.
In fact, looking now, I'm trying to see how an Enum provides any easy to access documentation -- other than looking at the creation code. As a rule, I don't think Enums provide documentation of the valid values, but rather, enforcement.
e.g.: what are the valid values? what do they mean?
To be honest, I haven't really used Enums much (in fact, only to mirror C enums in extension code), but part of that is because I have yet to see what the point is in Python, over simple string flags.
I suppose they provide a real advantage for static typing, but other than that I just don't see it.
Not just static typing, but static analysis in general. Tools like flake8 will complain about typos, completion tools can help with typing, and so on.
But what they do is create a burden of extra code to read and write. Compare:
from statistics import median
result = median(the_data, nan_policy='omit')
with:
from statistics import median, NaNPolicy
result = median(the_data, nan_policy=NaNPolicy.Omit)
or maybe:
import statistics as stats
result = stats.median(the_data, nan_policy=stats.Omit)
It is not necessarily a hard choice; it is possible to define enums that compare equal to plain strings, such as (see the sketch below):

class NanPolicy(str, enum.Enum): ….
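A minimal sketch of that idea -- the member names are illustrative, not an existing statistics API:

import enum

class NanPolicy(str, enum.Enum):
    PROPAGATE = 'propagate'
    RAISE = 'raise'
    OMIT = 'omit'

# Because NanPolicy subclasses str, members compare equal to the bare strings,
# so an API could accept either spelling interchangeably:
assert NanPolicy.OMIT == 'omit'
assert NanPolicy('omit') is NanPolicy.OMIT  # lookup by value also works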
There are any number of ways to import, and names to create, but they all seem to me to be more awkward to use than a simple text flag.
Ever since I started using Python, I've really appreciated the use of string flags :-)
I tend to use constants instead of string literals because of better static analysis, and try to convert uses to enums over time. String flags work as well, but I've had too many problems due to typos in string literals that weren't caught by incomplete tests. Being able to at least run a linting tool to find basic issues like typos is very convenient. But YMMV.
I do see how using an Enum makes things a bit easier for the author of a package (more DRY), but I don't think that should be the priority here.
No one needs to convince me, but I would be interested in seeing the recommended way to import and use Enum flags - maybe it doesn't have to be as awkward as I think it is.
I have no opinion on this particular API discussion; my question was out of interest in this particular remark.

Ronald
-CHB
-- Christopher Barker, PhD (Chris)
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
— Twitter / micro.blog: @ronaldoussoren Blog: https://blog.ronaldoussoren.net/
First: I started this specifically in the context of the stats package and the NaN handling flag, but it did turn into a more general discussion of Enums, so a final thought:

On Tue, Aug 31, 2021 at 4:17 AM Ronald Oussoren <ronaldoussoren@mac.com> wrote:
Not just static typing, but static analysis in general. Tools like flake8 will complain about typos, completion tools can help with typing, and so on.
...
I tend to use constants instead of string literals because of better static analysis, and try to convert uses to enums over time. String flags work as well, but I've had too many problems due to typos in string literals that weren't caught by incomplete tests. Being able to at least run a linting tool to find basic issues like typos is very convenient. But YMMV.
I think this is the crux of it -- Enums are far more suited to static analysis.

And that reflects a shift in Python over the years. Most of the changes in Python have made it a better "systems" language, and some have made it a somewhat more awkward "scripting" language. Features to support static analysis are a good example -- far less important for "scripting" than "systems programming".

Personally, I find it odd -- I've spent literally a couple of decades telling people that Python's dynamic nature is a net plus, and that the bugs static typing (and static analysis, I suppose) catch for you are generally shallow bugs (for example: misspelling a string flag is a shallow bug). But here we are.

Anyway, if you are writing a quick script to calculate a few statistics, I think a string flag is easier. If you are writing a big system with some statistical calculations built in, you will appreciate the additional safety of an Enum. It's hard to optimize for different use cases.

- CHB

--
Christopher Barker, PhD (Chris)

Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
I've honestly never really used an Enum, so I'm not an expert here. An idea might be using string flags, but setting module-level constants equal to the string flag, so that you can use either. For example (using the ipython shell because it's easier in email with quoting and all):

In [1]: import statistics as stats

In [2]: stats.NAN_IGNORE
Out[2]: 'ignore'

In [3]: stats.median([3, float('nan'), 5], nan_policy=stats.NAN_IGNORE) == stats.median([3, float('nan'), 5], nan_policy='ignore')
Out[3]: True

I do think that it doesn't matter too much and an issue should be submitted soon.

On Tue, Aug 31, 2021, 4:59 PM Christopher Barker <pythonchb@gmail.com> wrote:
First:
I started this specifically in the context of the stats package and the NaN handling flag, but it did turn into a more general discussion of Enums, so a final thought:
On Tue, Aug 31, 2021 at 4:17 AM Ronald Oussoren <ronaldoussoren@mac.com> wrote:
Not just static typing, but static analysis in general. Tools like flake8 will complain about typos, completion tools can help with typing, and so on.
...
I tend to use constants instead of string literals because of better static analysis, and try to convert uses to enums over time. String flags work as well, but I've had too many problems due to typos in string literals that weren't caught by incomplete tests. Being able to at least run a linting tool to find basic issues like typos is very convenient. But YMMV.
I think this is the crux of it -- Enums are far more suited to static analysis.
And that reflects a shift in Python over the years. Most of the changes in Python have made it a better "systems" language, and some have made it a somewhat more awkward "scripting" language.
Features to support static analysis are a good example -- far less important for "scripting' that "systems programming".
Personally, I find it odd -- I've spent literally a couple decades telling people that Python's dynamic nature is a net plus, and the bugs that static typing (and static analysis I suppose) catch for you are generally shallow bugs. (for example: misspelling a string flag is a shallow bug).
But here we are.
Anyway, if you are writing a quick script to calculate a few statistics, I think a string flag is easier. If you are writing a big system with some statistical calculations built in, you will appreciate the additional safety of an Enum. It's hard to optimize for different use cases.
- CHB
-- Christopher Barker, PhD (Chris)
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
participants (18)

- Brendan Barnwell
- Cameron Simpson
- Chris Angelico
- Christopher Barker
- David Mertz, Ph.D.
- Finn Mason
- Guido van Rossum
- Jeff Allen
- Marc-Andre Lemburg
- Mark Dickinson
- MRAB
- Peter Otten
- Richard Damon
- Ronald Oussoren
- Sebastian Berg
- Serhiy Storchaka
- Steven D'Aprano
- tritium-list@sdamon.com