[Pandas-dev] Fwd: What could a pandas 2.0 look like?

Tom Augspurger tom.augspurger88 at gmail.com
Tue Feb 18 12:20:01 EST 2020


(Accidentally dropped the mailing list)

On Mon, Feb 17, 2020 at 7:17 PM Brock Mendel <jbrockmendel at gmail.com> wrote:

> > You have no problem with changing the behavior of NaT, or changing to
> use pd.NA?
>
> If/when we get to a point where we propagate NAs in all other comparisons,
> I would have no problem with editing `NaT.__richcmp__` to match that
> convention.
>

What are the advantages of a NaT with NA-like comparison semantics over using
NA (or NA[datetime])?

1. Retain the dtype in array-scalar ops with a scalar NA (see the sketch below)
2. ...
3. Less disruptive than changing to NA

My ... could include things like `isinstance(NaT, Timestamp)` being true and
`NaT.<attr>` for Timestamp attributes. But those don't strike me as
necessarily good things. They seem sometimes useful and sometimes harmful.
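
To make point 1 concrete, a quick sketch (pandas >= 1.0; how `pd.NA` should
behave in these ops is exactly what's under discussion, so treat the NA half
as an open question rather than settled behavior):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2020-01-01", "2020-01-02"]))

# NaT carries datetime-like type information, so the result dtype of an
# array-scalar op can be inferred: datetime64 - NaT -> timedelta64,
# matching (s - s.iloc[0]).dtype.
print((s - pd.NaT).dtype)  # timedelta64[ns]

# An untyped NA can't disambiguate: datetime64 - datetime -> timedelta64,
# but datetime64 - timedelta -> datetime64, so the result dtype of
# `s - pd.NA` is underdetermined.
```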

The downsides of changing NaT in comparison operations are

1. We're diverging from NumPy's NaT (`np.datetime64("NaT")`). I don't know
   how problematic this actually is.
2. It's a special case. Should users need to know that datelikes use their
   own NA value because the underlying storage is able to store them
   "in-band" rather than as a mask? My gut reaction is "no, users shouldn't
   be exposed to this."
3. Changing NaT would leave just NaN with the "always unequal in comparisons"
   behavior (compared side by side below).
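
For reference, the three scalars' comparison behaviors next to each other (an
interactive check, pandas >= 1.0):

```python
import numpy as np
import pandas as pd

# NaN: IEEE 754 mandates "always unequal" comparisons.
print(np.nan == np.nan)  # False

# NaT currently mirrors NaN; this is the behavior under discussion.
print(pd.NaT == pd.NaT)  # False

# NA propagates missingness through comparisons instead.
print(pd.NA == pd.NA)    # <NA>
```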

Thus far, I see three options going forward:

1. Use NaN for floats, NaT for datelikes, NA for everything else.
  1-a: Leave NaT always unequal in comparisons
  1-b: Change NaT to have NA-like comparison behavior
2. Use NA everywhere (no NaN for float, no NaT for datelike).
3. Implement a typed `NA<T>`, where we have an `NA` per dtype.

Option 3, I think, solves the array-scalar op issue (sketched below). It's
more complex for users, though hopefully not too complex. My biggest worry is
that it makes the implementation much more complex, though perhaps I'm being
pessimistic.
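
As a very rough sketch of what option 3 could look like (everything below is
hypothetical and invented for illustration; nothing here is an existing
pandas API, and a real implementation would presumably hook into the dtype
machinery rather than being a standalone class):

```python
# Hypothetical: a dtype-parametrized NA, so array-scalar ops can infer
# result dtypes the way NaT allows today, while keeping NA-like semantics.
class NAValue:
    def __init__(self, dtype: str):
        self.dtype = dtype  # e.g. "datetime64[ns]", "timedelta64[ns]"

    def __repr__(self) -> str:
        return f"NA[{self.dtype}]"

    def _propagate(self, other):
        return self  # comparisons propagate NA, like pd.NA

    __eq__ = __ne__ = __lt__ = __le__ = __gt__ = __ge__ = _propagate


NA_datetime = NAValue("datetime64[ns]")
NA_timedelta = NAValue("timedelta64[ns]")

# With the type attached, Brock's example below becomes unambiguous again:
# Series[timedelta64] * NA_timedelta can raise (as pd.NaT does today),
# while Series[timedelta64] * a numeric-typed NA would be timedelta64.
```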

On balance, I'm not sure where I come down yet. Good news: we can take time
to figure this out :)


> On Mon, Feb 17, 2020 at 10:06 AM Tom Augspurger <
> tom.augspurger88 at gmail.com> wrote:
>
>>
>>
>>
>> On Mon, Feb 17, 2020 at 11:58 AM Brock Mendel <jbrockmendel at gmail.com>
>> wrote:
>>
>>> > or changing the behavior of NaT in comparisons to be like NA.
>>>
>>> Pending the kinks being worked out of pd.NA, I have no problem with that.
>>>
>>
>> You have no problem with changing the behavior of NaT, or changing to use
>> pd.NA?
>>
>> Is changing the defined behavior of NaT even an option? Is it defined in a
>> spec like NaN, or did NumPy just choose that behavior?
>>
>> Assuming NaT had NA-like behavior in comparisons, what are the remaining
>> arguments for keeping NaT? Preserving dtypes in scalar-array ops? Anything
>> else?
>>
>> On Mon, Feb 17, 2020 at 9:55 AM Tom Augspurger <
>>> tom.augspurger88 at gmail.com> wrote:
>>>
>>>> Is NaT defined to be unequal in all comparisons, just like NaN? I think
>>>> the goal of propagating NA
>>>> requires either using NA or changing the behavior of NaT in comparisons
>>>> to be like NA.
>>>>
>>>> On Mon, Feb 17, 2020 at 11:50 AM Brock Mendel <jbrockmendel at gmail.com>
>>>> wrote:
>>>>
>>>>> > I think consistently propagating NA in comparison operations is a
>>>>> worthwhile goal.
>>>>>
>>>>> That's an argument for having a three-valued bool-dtype, not for
>>>>> replacing all other NA-like values.
>>>>>
>>>>> On Mon, Feb 17, 2020 at 8:34 AM Tom Augspurger <
>>>>> tom.augspurger88 at gmail.com> wrote:
>>>>>
>>>>>> > 2) The "only one NA value is simpler" argument strikes me as a
>>>>>> solution in search of a problem.
>>>>>>
>>>>>> I don't think that's correct. I think consistently propagating NA in
>>>>>> comparison operations is a worthwhile goal.
>>>>>>
>>>>>> On Mon, Feb 17, 2020 at 10:25 AM Brock Mendel <jbrockmendel at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> > It's not fully clear to me what you want to say with this, so a
>>>>>>> more detailed clarification is welcome (I mean, I understand the sentence
>>>>>>> and remember the discussion, but don't fully understand the point being
>>>>>>> made in context, or in what direction you think more discussion is needed).
>>>>>>>
>>>>>>> I don't particularly think more discussion is needed, as this is a
>>>>>>> rehash of #28095, where this horse has already been beaten to death.
>>>>>>>
>>>>>>> As Tom noted here
>>>>>>> <https://github.com/pandas-dev/pandas/issues/28095#issuecomment-537501744>,
>>>>>>> using pd.NA in places where we currently use NaT breaks the usual identity
>>>>>>> (that we rely on A LOT)
>>>>>>>
>>>>>>> ```(array + array)[0].dtype <=> (array + array[0]).dtype```
>>>>>>>
>>>>>>> (Yes, this holds only imperfectly for NaT because NaT serves as both
>>>>>>> NotADatetime and NotATimedelta, and I'll refer you to the decomposing horse
>>>>>>> in #28095.)
>>>>>>>
>>>>>>> Also from #28095:
>>>>>>>
>>>>>>> ```Series[timedelta64] * pd.NaT``` unambiguously raises, but
>>>>>>> ```Series[timedelta64] * pd.NA``` could be timedelta64
>>>>>>>
>>>>>>> > Assume we introduce a new "nullable datetime" dtype that uses a
>>>>>>> mask to track NAs, and can still have NaT in the values. In practice, this
>>>>>>> still means that we "replace NaT with NA"
>>>>>>>
>>>>>>> This strikes me as contradictory.
>>>>>>>
>>>>>>> > So do you mean: "in my opinion, we should not do this" (what I
>>>>>>> just described above), because in practice that would mean breaking
>>>>>>> arithmetic consistency? Or that if we want to start using NA for
>>>>>>> datetimelike dtypes, you think "dtype-parametrized" NA values are necessary
>>>>>>> (so you can distinguish NA[datetime] and NA[timedelta] ?)
>>>>>>>
>>>>>>> I think:
>>>>>>>
>>>>>>> 1) pd.NA solves an _actual_ problem which is that we used to use
>>>>>>> np.nan in places (categorical, object) where np.nan was semantically
>>>>>>> misleading.
>>>>>>>    a) What these have in common is that they are in general
>>>>>>> non-arithmetic dtypes.
>>>>>>>    b) This is an improvement, and I'm glad you put in the effort to
>>>>>>> make it happen.
>>>>>>>    c) Trying to shoe-horn pd.NA into cases where it is semantically
>>>>>>> misleading based on the Highlander Principle is counter-productive.
>>>>>>>
>>>>>>> 2) The "only one NA value is simpler" argument strikes me as a
>>>>>>> solution in search of a problem.
>>>>>>>    a) All the more so if you want this to supplement np.nan/pd.NaT
>>>>>>> instead of replace them.
>>>>>>>    b) *the idea of replacing vs supplementing needs to be made much
>>>>>>> more explicit/clear*
>>>>>>>
>>>>>>> 3) The "dtype-parametrized" NA did come up in #28095, but I never
>>>>>>> advocated it.
>>>>>>>    a) I am open to separating out a NaTimedelta (xref #24983) from
>>>>>>> pd.NaT, and don't particularly care what it is called.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 17, 2020 at 3:37 AM Joris Van den Bossche <
>>>>>>> jorisvandenbossche at gmail.com> wrote:
>>>>>>>
>>>>>>>> > This would also imply creating a nullable float dtype and making
>>>>>>>>> our datelikes use NA rather than NaT too. That seemed to be generally OK,
>>>>>>>>> but wasn't discussed too much.
>>>>>>>>>
>>>>>>>>> My understanding of the discussion is that using a mask on top of
>>>>>>>>> datetimelike arrays would not _replace_ NaT, but supplement it with
>>>>>>>>> something semantically different.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, if we see it as similar to NaN for floats (where NaN is a
>>>>>>>> specific float value in the data array, while NAs are tracked in the
>>>>>>>> mask array), then we can do something similar for datetimelike
>>>>>>>> arrays. And the same discussions we are going to have for float
>>>>>>>> dtypes, about to what extent to distinguish NaN and NA and whether
>>>>>>>> we need to provide options, will also be relevant for datetimelike
>>>>>>>> dtypes (but then for NaT and NA).
>>>>>>>>
>>>>>>>> But note that in practice, I *think* the vast majority of use cases
>>>>>>>> will mostly use NA and not NaT in the data (e.g. when reading from
>>>>>>>> files that have missing data).
>>>>>>>>
>>>>>>>> Replacing NaT with NA breaks arithmetic consistency, as has been
>>>>>>>>> discussed ad nauseum.
>>>>>>>>>
>>>>>>>>
>>>>>>>> It's not fully clear to me what you want to say with this, so a
>>>>>>>> more detailed clarification is welcome (I mean, I understand the sentence
>>>>>>>> and remember the discussion, but don't fully understand the point being
>>>>>>>> made in context, or in what direction you think more discussion is needed).
>>>>>>>>
>>>>>>>> Assume we introduce a new "nullable datetime" dtype that uses a mask
>>>>>>>> to track NAs, and can still have NaT in the values. In practice,
>>>>>>>> this still means that we "replace NaT with NA" (because even though
>>>>>>>> NaT is still possible, I think you would mostly get NAs as mentioned
>>>>>>>> above; e.g. reading a file would now give NA instead of NaT).
>>>>>>>> So do you mean: "in my opinion, we should not do this" (what I just
>>>>>>>> described above), because in practice that would mean breaking arithmetic
>>>>>>>> consistency? Or that if we want to start using NA for datetimelike dtypes,
>>>>>>>> you think "dtype-parametrized" NA values are necessary (so you can
>>>>>>>> distinguish NA[datetime] and NA[timedelta] ?)
>>>>>>>>
>>>>>>>> Joris