[Python-ideas] Have max and min functions ignore None

Tue Dec 29 07:02:24 EST 2015

On 29 December 2015 at 21:22, Steven D'Aprano <steve at pearwood.info> wrote:
> What if the builtin max and min remained unchanged, but we added
> variants of them to the statistics module which treated None as a
> missing value, to be either ignored or propagated, as R does?

If the statistics module were to start borrowing selected concepts
from R, it makes sense to me to look at how those have been translated
into the Python ecosystem by NumPy/SciPy/pandas first.

In the case of min/max, the most relevant APIs appear to be:

    pandas.DataFrame.min
    pandas.DataFrame.max
    numpy.amin
    numpy.amax
    numpy.nanmin
    numpy.nanmax

The pandas variants support a "skipna" argument, which indicates
whether or not to ignore missing values (e.g. None, NaN). It
defaults to True, so such null values are skipped. If you set it to
False, they are included and propagate to the result:

>>> import pandas
>>> df = pandas.DataFrame([1, 2, 3, None, float("nan")])
>>> df.min()
0    1
dtype: float64
>>> df.min(skipna=False)
0   NaN
dtype: float64

For NumPy, amin and amax propagate NaN/None, while nanmin/nanmax are
able to filter out floating point NaN values, but emit TypeError if
asked to cope with None as a value.

I think the fact both NumPy and pandas support R-style handling of
min() and max() counts in favour of having variants of those with
additional options for handling missing data values in the standard
library statistics module.
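
To make the idea concrete, here is a rough pure-Python sketch of what
such variants might look like. The names missing_min and missing_max,
the "skipna" keyword (borrowed from pandas), and the choice to treat
only None and float NaN as missing are all assumptions made for
illustration; none of this is an existing or proposed statistics API.

```python
import math


def _is_missing(value):
    # Assumption of this sketch: only None and float NaN count as missing.
    # Other NaN-like values (e.g. Decimal("NaN")) are not handled here.
    return value is None or (isinstance(value, float) and math.isnan(value))


def _extreme(func, data, skipna):
    values = list(data)
    if skipna:
        # Ignore missing values, as pandas does with skipna=True.
        values = [v for v in values if not _is_missing(v)]
    elif any(_is_missing(v) for v in values):
        # Propagate: any missing value makes the result NaN.
        return float("nan")
    # Like the builtins, this raises ValueError if no values remain.
    return func(values)


def missing_min(data, skipna=True):
    """Hypothetical min() variant with optional missing-value handling."""
    return _extreme(min, data, skipna)


def missing_max(data, skipna=True):
    """Hypothetical max() variant with optional missing-value handling."""
    return _extreme(max, data, skipna)
```

With these definitions, missing_min([1, 2, 3, None, float("nan")])
returns 1, while passing skipna=False for the same data returns NaN.
Unlike pandas, this sketch raises ValueError rather than returning NaN
when every value is missing; which of those is preferable would be
part of the design discussion.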

Regards,
Nick.

P.S. Another option might be to consider the question as part of a
general "data cleaning" strategy for the statistics module, similar to
the one discussed for pandas at
http://pandas.pydata.org/pandas-docs/stable/missing_data.html

Even if the statistics module itself doesn't provide the tools to
address those problems, it could provide some useful pointers on when
someone may want to switch from the standard library module to a more
comprehensive solution like pandas that better handles the messy
complications of working with real world data (and data formats).

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia