Have max and min functions ignore None

What do people think about having `max` and `min` ignore `None`? Examples:

    max(1, None) == 1
    min(1, None) == 1
    max([1, None]) == 1
    max(None, None) == max()  or  max(None, None) == max([])

(The last one currently throws two different errors.)

This change would allow one to use `None` as a default value. For example:

    def my_max(lst):
        best = None
        for x in lst:
            best = max(best, x)
        return best

Currently, you would initialize `best` to the first element, or to float('-inf'). There are more complicated examples, which aren't just replications of `max`'s functionality. The example I have in mind wants to update several running maximums during iteration.

I know that there are other ways to do it (having given one above). What if this became _the_ obvious way to do it? I'm concerned about this silencing some bugs which would have been caught before. I'm also worried about whether it would make sense to people learning Python. I'm less concerned about custom types which allow comparisons to `None`, because I don't understand why you would want that, but you can change my mind.
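For illustration, the proposed behaviour can be approximated today with a small wrapper. (`max_ignoring_none` is a hypothetical helper name used only for this sketch; it is not part of the proposal, and returning None for an all-None input is just one possible reading of the ambiguous empty case above.)

```python
def max_ignoring_none(iterable):
    """Sketch of the proposal: treat None as 'no value' and skip it."""
    values = [x for x in iterable if x is not None]
    if not values:
        # One possible reading of max(None, None): there were no values.
        return None
    return max(values)

print(max_ignoring_none([1, None]))     # -> 1
print(max_ignoring_none([None, None]))  # -> None
```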

On 12/28/2015 11:08 PM, Franklin? Lee wrote:
This amounts to saying that the comparisons 1 < None and 1 > None are both defined and both True.
Rewrite this as:

    def my_best(iterable):
        it = iter(iterable)
        try:
            best = next(it)
        except StopIteration:
            raise ValueError('Empty iterable has no maximum')
        for x in it:
            if x > best:
                best = x
        return best
Currently, you would initialize `best` to the first element,
Since an empty iterable has no max (unless one wants to define the equivalent of float('-inf') as a default), initializing with the first element is the proper thing to do. ...
None currently means 'no value', which means that most operations on None are senseless. -- Terry Jan Reedy

I was hoping that my message was clear enough that I wouldn't get suggestions of alternatives. I know how to do without it. Take this example:

    from collections import defaultdict
    from math import inf

    def maxes(lst):
        bests = defaultdict(lambda: -inf)
        for x, y in lst:
            bests[x] = max(bests[x], y)
        return bests

The proposed change would only save me an import, and I could've used `inf = float('inf')` instead.

    from collections import defaultdict

    def maxes(lst):
        bests = defaultdict(lambda: None)
        for x, y in lst:
            bests[x] = max(bests[x], y)
        return bests

On Mon, Dec 28, 2015 at 11:52 PM, Terry Reedy <tjreedy@udel.edu> wrote:
Not exactly: max(1, None) == max(None, 1) == 1. There is no definable comparison to None which allows both max and min to return the correct value.
Well, `my_best = max` is the cleanest way. It's not the point.
Mathematically, the max of the empty set is the min of the ambient set. So the max of an empty collection of natural numbers is 0, while the max of an empty collection of reals is -inf. Of course, Python doesn't know what type of elements your collection is expected to have, so (as Nick said) you would manually specify the default with a keyword argument. But that's not the point.
No value, like a lack of something to consider in the calculation of the maximum?

PS: This change would also allow one to use a `key` function which returns None for an object that shouldn't be considered. Now _that_ might be more useful. But again, I know how to deal without it: have the key function return `inf` or `-inf` instead. I'm asking if using `None` could become the "one obvious way to do it". There is semantic meaning to initializing `best = None`, after all: "At this point, there is no best yet."
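The `-inf` workaround for key functions mentioned above can be sketched like this (`best_by_key` is an invented name for illustration only):

```python
from math import inf

def best_by_key(items, key):
    # Workaround: map a None key to -inf so that item can never win,
    # instead of teaching max() to skip None keys directly.
    return max(items, key=lambda x: -inf if key(x) is None else key(x))

words = ['aa', 'b', 'cccc']
# Pretend 'cccc' should not be considered: its key is None.
print(best_by_key(words, key=lambda s: None if s == 'cccc' else len(s)))  # -> 'aa'
```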

On Dec 28, 2015, at 20:08, Franklin? Lee <leewangzhong+python@gmail.com> wrote:
This change would allow one to use `None` as a default value.
Actually, it might be useful to allow a default value in general. (In typed functional languages, you often specify a default, or use a type that has a default value, so max(list[A]) can always return an A.) Then again, you can write this pretty easily yourself:

    _sentinel = object()

    def my_max(iterable, *, default=_sentinel):
        try:
            return max(iterable)
        except ValueError:  # what max() raises for an empty iterable
            if default is _sentinel:
                raise
            return default

On 29 December 2015 at 15:43, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
min() and max() both support a "default" keyword-only parameter in 3.4+:

    >>> max([])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: max() arg is an empty sequence
    >>> max([], default=None)

That means using "None" as the default result for an empty iterable is already straightforward:

    def my_max(iterable):
        return max(iterable, default=None)

    def my_min(iterable):
        return min(iterable, default=None)

You only have to filter the input data or use a custom key function in order to ignore None values that exist in the input, not to produce None rather than an exception when the input iterable is empty.

Cheers, Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Mon, Dec 28, 2015 at 11:08:49PM -0500, Franklin? Lee wrote:
What do people think about having `max` and `min` ignore `None`?
I'd want to think about it carefully. I think that there might be a good case for making max and min a bit more sophisticated, but I'm not quite sure how sophisticated. There's more to it than just None.

If you think of None as just some value, then including None in a list of numbers (say) is an error, and should raise an exception as it does now (in Python 3, not Python 2). So that's perfectly reasonable, and correct, behaviour.

If you think of None as representing a missing value, then there are two equally good interpretations of max(x, None): either we ignore missing values and return x, or we say that if one value is unknown, the max is also clearly unknown, and propagate that missing value as the answer.

So that's three perfectly reasonable behaviours:

- max(x, None) is an error and should raise;
- max(x, None) ignores None and returns x;
- max(x, None) is unknown or missing and returns None (or some other sentinel representing NA/Missing/Unknown).

In R, the max or min of a list with missing values is the missing value, unless you specifically tell R to ignore NA.
In Javascript, I guess the equivalent would be null, which appears to be coerced to 0:

    js> Math.max(1, 2, null, 4)
    4
    js> Math.min(1, 2, null, 4)
    0

I don't think there is any good justification for that behaviour. That's the sort of thing which gives weakly typed languages a bad name.

Just tossing this out to be shot down... What if the builtin max and min remained unchanged, but we added variants of them to the statistics module which treated None as a missing value, to be either ignored or propagated, as R does?

--
Steve
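Such a variant might be sketched as follows. (This is an illustrative sketch only; `max_missing` and its `skipna` flag are invented names, loosely following R and pandas, and are not a concrete API proposal.)

```python
_EMPTY = object()  # sentinel: no non-missing value seen yet

def max_missing(iterable, *, skipna=True):
    """Max that either ignores None (skipna=True) or propagates it (skipna=False)."""
    best = _EMPTY
    for x in iterable:
        if x is None:
            if not skipna:
                return None  # one unknown value makes the answer unknown, R-style
            continue
        if best is _EMPTY or x > best:
            best = x
    if best is _EMPTY:
        raise ValueError('no non-missing values')
    return best

print(max_missing([3, None, 7]))                # -> 7
print(max_missing([3, None, 7], skipna=False))  # -> None
```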

On 29 December 2015 at 21:22, Steven D'Aprano <steve@pearwood.info> wrote:
    df.min(skipna=False)
    0   NaN
If the statistics module were to start borrowing selected concepts from R, it makes sense to me to look at how those have been translated into the Python ecosystem by NumPy/SciPy/pandas first. In the case of min/max, the most relevant APIs appear to be:

- pandas.DataFrame.min
- pandas.DataFrame.max
- numpy.amin
- numpy.amax
- numpy.nanmin
- numpy.nanmax

The pandas variants support a "skipna" argument, which indicates whether or not to ignore missing values (e.g. None, NaN). This defaults to True, so such null values are ignored. If you set it to False, they get included and propagate to the result.

For NumPy, amin and amax propagate NaN/None, while nanmin/nanmax are able to filter out floating point NaN values, but emit TypeError if asked to cope with None as a value.

I think the fact both NumPy and pandas support R-style handling of min() and max() counts in favour of having variants of those with additional options for handling missing data values in the standard library statistics module.

Regards, Nick.

P.S. Another option might be to consider the question as part of a general "data cleaning" strategy for the statistics module, similar to the one discussed for pandas at http://pandas.pydata.org/pandas-docs/stable/missing_data.html

Even if the statistics module itself doesn't provide the tools to address those problems, it could provide some useful pointers on when someone may want to switch from the standard library module to a more comprehensive solution like pandas that better handles the messy complications of working with real world data (and data formats).

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

but emit TypeError if asked to cope with None as a value.
Well, sort of. Numpy arrays are homogeneous; you can't have a None in an array (other than in an object dtype). All the Numpy "ufuncs" create an array from the input first -- that's where you get your TypeError.

But the Numpy experience is informative -- there have been years of "we need a better masked array" discussions, but no consensus on what it should be. For floats, NaN can be used for missing values, but there is no such value for integers, and each use case has a different "obvious" interpretation. That's why you say explicitly what you want with the nan* functions.

I don't think python should decide for users what None means in this context.

-CHB

On Tue, Dec 29, 2015, 11:41 AM Chris Barker - NOAA Federal < chris.barker@noaa.gov> wrote:
None is obviously the sound of one hand clapping. When you understand its proper use, you become Enlightened.
NumPy and Pandas have a slightly different audience than Python core. The scientific community often veers more practical than pure, in some cases to the detriment of code clarity.
I prefer this option. Why solve the special case of max/min when we can solve (or help solve) the general case of missing data? There's already the internal ``_coerce`` method. Maybe clean that up for public consumption, or something like it, adding drop-missing functionality? If that flies, then there might be room for an ``interpolate(sequence, method='linear')``, which would be awesome.

On Thu, Dec 31, 2015 at 04:30:24AM +0000, Michael Selik wrote:
If that flies, then there might be room for an ``interpolate(sequence, method='linear')`` which would be awesome.
(I presume you're still talking about the statistics module here, not pandas.) What did you have in mind? -- Steve

On Wed, Dec 30, 2015 at 11:44 PM Steven D'Aprano <steve@pearwood.info> wrote:
While the scientific community is well-served by NumPy and Pandas, there are many users trying to do a lighter amount of data wrangling that does not include linear algebra. In my anecdotal experience, the most common tasks are:

1. drop records with missing/bad data
2. replace missing/bad values with a constant value
3. interpolate missing values with either a pad-forward or linear method

While NumPy often has methods doing in-place mutation, the users I'm thinking of are generally not worried about memory size and would be better served by pure functions.

Going back to the original topic of skipping None values: I'd like to add that many datasets use bizarre values like all 9s or -1 or '.' or whatever to represent missingness. So, I'm not confident there's a good general-purpose solution more simple than comprehensions.
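The three tasks above might look like this as pure functions. (Names and signatures here are illustrative sketches, not a proposal for the statistics module; the `missing` tuple stands in for whatever markers a given dataset uses.)

```python
def drop_missing(seq, missing=(None,)):
    """Task 1: drop records whose value is one of the missing markers."""
    return [x for x in seq if x not in missing]

def fill_constant(seq, value, missing=(None,)):
    """Task 2: replace missing values with a constant."""
    return [value if x in missing else x for x in seq]

def pad_forward(seq, missing=(None,)):
    """Task 3: interpolate by carrying the last good value forward."""
    out, last = [], None
    for x in seq:
        if x in missing:
            out.append(last)
        else:
            out.append(x)
            last = x
    return out

data = [1, None, 3, -1, 5]  # a dataset using both None and -1 as markers
print(drop_missing(data, missing=(None, -1)))      # -> [1, 3, 5]
print(fill_constant(data, 0, missing=(None, -1)))  # -> [1, 0, 3, 0, 5]
print(pad_forward(data, missing=(None, -1)))       # -> [1, 1, 3, 3, 5]
```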

participants (8)

- Andrew Barnert
- Chris Angelico
- Chris Barker - NOAA Federal
- Franklin? Lee
- Michael Selik
- Nick Coghlan
- Steven D'Aprano
- Terry Reedy