API: Make silent casting behavior consistent by deprecating silent _object_-dtype casting
TLDR
----
We have inconsistent silent-casting vs raising logic for numpy vs EA dtypes (and inconsistencies within EA dtypes). By deprecating silently casting to *object* dtype, we can *mostly* make the behaviors match.

Background
----------
A number of Series/DataFrame methods will silently cast when dealing with mismatched values. With a numpy dtype, each of the following silently cast to float64:

    ser = pd.Series([1, 2, 3], dtype="i8")

    ser.shift(1, fill_value=1.5)
    ser.mask([True, False, False], 1.5)
    ser.where([False, True, True], 1.5)
    ser.replace(1, 1.5)
    ser[0] = 1.5
    ser.fillna(1.5)  # <- this one doesn't cast as it is a no-op

If we were to pass "foo" or a pd.Period, these would coerce to object instead of float.

By contrast, similar mixed-type operations with an ExtensionDtype Series _mostly_ raise:

    ser2 = pd.Series(pd.period_range("2016-01-01", periods=3, freq="D"))

    ser2.shift(1, fill_value=1.5)           # <- ValueError
    ser2.mask([True, False, False], 1.5)    # <- ValueError
    ser2.where([False, True, True], 1.5)    # <- ValueError
    ser2.fillna(1.5)                        # <- TypeError
    ser2.replace(ser2[0], 1.5)              # <- coerces to object
    ser2[0] = 1.5                           # <- coerces to object

    ser3 = pd.Series([pd.NA, 2, 3], dtype="Int64")

    ser3.shift(1, fill_value=1.5)           # <- TypeError
    ser3.mask([True, False, False], 1.5)    # <- TypeError
    ser3.where([False, True, True], 1.5)    # <- TypeError
    ser3.fillna(1.5)                        # <- TypeError
    ser3.replace(ser3[0], 1.5)              # <- TypeError
    ser3[0] = 1.5                           # <- TypeError

timedelta64, datetime64, and datetime64tz mostly behave like the numpy dtypes, with a few exceptions:

- shift raises on mismatch
- fillna raises on mismatch for timedelta64, casts for the others

Categorical mostly behaves like other ExtensionDtypes, except for replace which has special logic.

Goals
-----
- Have matching behavior across dtypes.
- Share code.
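The outcomes listed in the Background section reflect pandas at the time of writing and may differ in other releases. A minimal sketch for checking them against whatever version is installed follows. The helper names `probe` and `_setitem` are made up for this illustration and are not part of pandas; each operation is run on a fresh copy and only the resulting dtype, or the name of the exception raised, is recorded.

```python
import warnings

import pandas as pd


def _setitem(ser, value):
    # Hypothetical helper: assign on a copy so the caller's Series is untouched.
    ser = ser.copy()
    ser[0] = value
    return ser


def probe(ser, value):
    """Run each operation with a mismatched value; record dtype or exception name."""
    ops = {
        "shift": lambda s: s.shift(1, fill_value=value),
        "mask": lambda s: s.mask([True, False, False], value),
        "where": lambda s: s.where([False, True, True], value),
        "fillna": lambda s: s.fillna(value),
        "replace": lambda s: s.replace(s[0], value),
        "setitem": lambda s: _setitem(s, value),
    }
    outcome = {}
    for name, op in ops.items():
        try:
            with warnings.catch_warnings():
                # Some versions emit deprecation warnings here; ignore them.
                warnings.simplefilter("ignore")
                outcome[name] = str(op(ser.copy()).dtype)
        except Exception as err:
            outcome[name] = "raises " + type(err).__name__
    return outcome


examples = {
    "int64": pd.Series([1, 2, 3], dtype="i8"),
    "period[D]": pd.Series(pd.period_range("2016-01-01", periods=3, freq="D")),
    "Int64": pd.Series([pd.NA, 2, 3], dtype="Int64"),
}

report = {label: probe(ser, 1.5) for label, ser in examples.items()}

for label, outcome in report.items():
    print(label)
    for name, result in outcome.items():
        print(f"  {name:8s} {result}")
```

Passing "foo" or a pd.Period instead of 1.5 exercises the object-dtype cases in the same way.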
Options
-------
1) Change EA (and dt64/td64) behavior to match non-EA behavior
2) Change non-EA behavior to match EA behavior (or stricter, xref https://github.com/pandas-dev/pandas/issues/39584)
3) Deprecate (and eventually raise on) silent casting to _object_ dtype, allowing silent casting otherwise.

Here I am advocating for option 3). The advantages as I see them:

A) For numpy dtypes, we retain the most useful cases (int->float)
B) Deprecates cases most likely to be unintentional (e.g. typo "2016-01-01" -> "2p16-01-01" causing a datetime64 Series to silently cast)
C) For td64/dt64/dt64tz/period, the *only* silent casting is to object, so this completely gets rid of special-casing among that code
D) For IntegerArray, FloatingArray, IntervalArray leaves open the option of allowing e.g. Integer->Floating casting (xref https://github.com/pandas-dev/pandas/issues/25288#issuecomment-941762174)
E) Does not preclude later deciding on the stricter options in 2)
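To make option 3) concrete, here is a toy sketch of the rule in plain Python. It is not pandas code: `resolve_dtype`, `value_kind`, and the three-level ordering int64 < float64 < object are assumptions made purely for illustration, and they model only the numpy int/float case from advantage A). Casting upward to float64 stays silent, while any cast that would land on object emits a FutureWarning (which, under the proposal, would eventually become an error).

```python
import warnings

# Hypothetical ordering used only in this sketch: int64 < float64 < object.
_RANK = {"int64": 0, "float64": 1, "object": 2}


def value_kind(value):
    """Classify a scalar into one of the toy dtype names."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; treat it as a mismatch here.
        return "object"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    return "object"


def resolve_dtype(current, value):
    """Return the dtype name needed to hold `value` alongside `current` data.

    Warns when a non-object dtype would silently become object.
    """
    result = max(current, value_kind(value), key=_RANK.__getitem__)
    if result == "object" and current != "object":
        warnings.warn(
            f"casting {current} to object to hold {value!r} is deprecated",
            FutureWarning,
            stacklevel=2,
        )
    return result


print(resolve_dtype("int64", 1.5))  # silent int -> float cast is retained

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(resolve_dtype("int64", "foo"))  # falls back to object, with a warning
    print([str(w.message) for w in caught])
```

For td64/dt64/dt64tz/period the analogous rule would have no silent middle step at all, since per C) the only silent cast for those dtypes is the one to object.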
Thanks for bringing this up.

Limiting the discussion to getitem for a moment (I think other methods like fillna could deviate if we really want, or could have keywords for it), I am personally in favor of option 2: making everything strict (since I opened that referenced issue about it: https://github.com/pandas-dev/pandas/issues/39584)

Now, in the short term, already starting to deprecate silent casting to object (so the first aspect of option 3) doesn't prevent us from later becoming even more strict (it just wouldn't fully solve the existing inconsistencies), so from that point of view, I am personally fine with that.

Joris

On Wed, 27 Oct 2021 at 06:38, Brock Mendel <jbrockmendel@gmail.com> wrote:
[...]

_______________________________________________
Pandas-dev mailing list
Pandas-dev@python.org
https://mail.python.org/mailman/listinfo/pandas-dev