[Pandas-dev] [pydata] Feedback request for return value of empty or all-NA sum (0 or NA?)
Stephen Simmons
mail at stevesimmons.com
Sun Dec 3 05:34:06 EST 2017
Nat Smith wrote:
> I am baffled by the idea that sum([]) would return NaN.
So am I. Here are two cases that leave me confused about what the intention is.
Case #1 - Summing an empty integer series
Not only does the answer change from 0 to NaN, but the type changes from int to float.
That occurs whether skipna is True or False!
> pd.Series([], dtype=int).sum()
nan
> pd.Series([], dtype=int).sum(skipna=True)
nan
> pd.Series([], dtype=int).sum(skipna=False)
nan
This confused me, so I went back to the docstring and tried it with a float Series:
> pd.Series.sum?
Signature: pd.Series.sum(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
Docstring:
Return the sum of the values for the requested axis

Parameters
----------
axis : {index (0)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA or empty, the result
    will be NA
level : int or level name, default None
    If the axis is a MultiIndex (hierarchical), count along a
    particular level, collapsing into a scalar
numeric_only : boolean, default None
    Include only float, int, boolean columns. If None, will attempt to use
    everything, then use only numeric data. Not implemented for Series.
I would expect skipna being True to mean we don't want NaNs affecting the sum.
So why would we want NaN when the series is empty?
In fact, for an empty float series, sum() returns NaN whether skipna is
True or False:
> pd.Series([], dtype=float).sum(skipna=False)
nan
> pd.Series([], dtype=float).sum(skipna=True)
nan
This looks even weirder in this case:
> pd.Series([0, float('nan')], dtype=float).sum(skipna=True)
0.0 # NaN is skipped, sum is non-NaN. So far so good...
So what happens with a different non-empty input?
> pd.Series([float('nan')], dtype=float).sum(skipna=True)
nan # Skip all NaNs, get empty series to sum, so return NaN???
So if we want to avoid NaNs in our output, the skipna parameter doesn't help.
For every use of sum(), we now need to separately check two special cases:
- empty input
- input with only NaNs
I can't see how this behaviour helps anyone!
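
A minimal sketch in plain Python of the semantics argued for above, where an empty or all-NaN input sums to 0 when NaNs are skipped. The helper name sum_skipping_nan is made up for this illustration; it is not a pandas function and says nothing about how pandas is implemented.

```python
import math


def sum_skipping_nan(values, skipna=True):
    """Hypothetical helper (not part of pandas).

    Sums an iterable of numbers, starting from the additive identity 0.0,
    so an empty input, or one left empty after skipping NaNs, gives 0.0.
    """
    total = 0.0
    for v in values:
        if math.isnan(v):
            if skipna:
                continue          # ignore the NaN entirely
            return math.nan       # skipna=False: any NaN poisons the sum
        total += v
    return total


print(sum_skipping_nan([]))                            # empty input
print(sum_skipping_nan([math.nan]))                    # only NaNs
print(sum_skipping_nan([0.0, math.nan]))               # NaN skipped
print(sum_skipping_nan([1.0, math.nan], skipna=False)) # NaN propagates
```

With this definition neither special case needs a separate check by the caller: only skipna=False together with an actual NaN in the input produces NaN.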
Regards
Stephen