[Pandas-dev] What could a pandas 2.0 look like?

Tom Augspurger tom.augspurger88 at gmail.com
Mon Feb 17 11:33:52 EST 2020


> 2) The "only one NA value is simpler" argument strikes me as a solution
> in search of a problem.

I don't think that's correct. I think consistently propagating NA in
comparison operations is a worthwhile goal.
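
To make the comparison point concrete, here is a minimal pure-Python sketch.
`ToyNA` is a hypothetical stand-in written for this illustration, not pandas'
actual `pd.NA` implementation; it only mimics the propagate-on-comparison
behaviour, contrasted with a float NaN, whose comparisons quietly come back
as False.

```python
import math


class ToyNA:
    """Hypothetical propagating missing-value marker (not pandas' pd.NA)."""

    def __repr__(self):
        return "<NA>"

    # Every comparison returns the marker itself instead of a bool,
    # so "unknown" stays visible in the result.
    def _propagate(self, other):
        return self

    __eq__ = __ne__ = __lt__ = __le__ = __gt__ = __ge__ = _propagate

    # Defining __eq__ would otherwise make the class unhashable.
    __hash__ = object.__hash__

    def __bool__(self):
        # Refuse to be coerced to True/False: the answer is not known.
        raise TypeError("boolean value of a missing marker is ambiguous")


NA = ToyNA()

nan_result = math.nan == 1.0  # NaN-style: the comparison yields a plain False
na_result = NA == 1.0         # NA-style: the missing marker propagates
reflected = 1.0 < NA          # reflected comparisons propagate as well
```

With NaN-style semantics a missing value is indistinguishable from "not
equal" after a comparison; with the propagating marker the result is still
flagged as missing and has to be handled explicitly.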

On Mon, Feb 17, 2020 at 10:25 AM Brock Mendel <jbrockmendel at gmail.com>
wrote:

> > It's not fully clear to me what you want to say with this, so a more
> detailed clarification is welcome (I mean, I understand the sentence and
> remember the discussion, but don't fully understand the point being made in
> context, or in what direction you think more discussion is needed).
>
> I don't particularly think more discussion is needed, as this is a rehash
> of #28095, where this horse has already been beaten to death.
>
> As Tom noted here
> <https://github.com/pandas-dev/pandas/issues/28095#issuecomment-537501744>,
> using pd.NA in places where we currently use NaT breaks the usual identity
> (that we rely on A LOT)
>
> ```(array + array)[0].dtype <=> (array + array[0]).dtype```
>
> (Yes, this holds only imperfectly for NaT because NaT serves as both
> NotADatetime and NotATimedelta, and I'll refer you to the decomposing horse
> in #28095.)
>
> Also from #28095:
>
> ```Series[timedelta64] * pd.NaT``` unambiguously raises, but
> ```Series[timedelta64] * pd.NA``` could be timedelta64
>
> > Assume we introduce a new "nullable datetime" dtype that uses a mask to
> track NAs, and can still have NaT in the values. In practice, this still
> means that we "replace NaT with NA"
>
> This strikes me as contradictory.
>
> > So do you mean: "in my opinion, we should not do this" (what I just
> described above), because in practice that would mean breaking arithmetic
> consistency? Or that if we want to start using NA for datetimelike dtypes,
> you think "dtype-parametrized" NA values are necessary (so you can
> distinguish NA[datetime] and NA[timedelta] ?)
>
> I think:
>
> 1) pd.NA solves an _actual_ problem which is that we used to use np.nan in
> places (categorical, object) where np.nan was semantically misleading.
>    a) What these have in common is that they are in general non-arithmetic
> dtypes.
>    b) This is an improvement, and I'm glad you put in the effort to make
> it happen.
>    c) Trying to shoe-horn pd.NA into cases where it is semantically
> misleading based on the Highlander Principle is counter-productive.
>
> 2) The "only one NA value is simpler" argument strikes me as a solution in
> search of a problem.
>    a) All the more so if you want this to supplement np.nan/pd.NaT instead
> of replace them.
>    b) *the idea of replacing vs supplementing needs to be made much more
> explicit/clear*
>
> 3) The "dtype-parametrized" NA did come up in #28095, but I never
> advocated it.
>    a) I am open to separating out a NaTimedelta (xref #24983) from pd.NaT,
> and don't particularly care what it is called.
>
>
> On Mon, Feb 17, 2020 at 3:37 AM Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>> > This would also imply creating a nullable float dtype and making our
>>> datelikes use NA rather than NaT too. That seemed to be generally OK, but
>>> wasn't discussed too much.
>>>
>>> My understanding of the discussion is that using a mask on top of
>>> datetimelike arrays would not _replace_ NaT, but supplement it with
>>> something semantically different.
>>>
>>
>> Yes, if we treat it like NaNs for floats (where NaN is a specific
>> float value in the data array, while NAs are tracked in the mask array),
>> then we can do something similar for datetimelike arrays. The same
>> discussions we are going to have for float dtypes, about how far to
>> distinguish NaN from NA and whether we need to provide options, will
>> also be relevant for datetimelike dtypes (but then for NaT and NA).
>>
>> But note that in practice, I *think* that the big majority of use cases
>> will mostly use NA and not NaT in the data (eg when reading from files that
>> have missing data).
>>
>>> Replacing NaT with NA breaks arithmetic consistency, as has been
>>> discussed ad nauseam.
>>>
>>
>> It's not fully clear to me what you want to say with this, so a more
>> detailed clarification is welcome (I mean, I understand the sentence and
>> remember the discussion, but don't fully understand the point being made in
>> context, or in what direction you think more discussion is needed).
>>
>> Assume we introduce a new "nullable datetime" dtype that uses a mask to
>> track NAs, and can still have NaT in the values. In practice, this still
>> means that we "replace NaT with NA" (because even though NaT is still
>> possible, I think you would mostly get NAs as mentioned above; eg reading a
>> file would now give NA instead of NaT).
>> So do you mean: "in my opinion, we should not do this" (what I just
>> described above), because in practice that would mean breaking arithmetic
>> consistency? Or that if we want to start using NA for datetimelike dtypes,
>> you think "dtype-parametrized" NA values are necessary (so you can
>> distinguish NA[datetime] and NA[timedelta] ?)
>>
>> Joris
>>
>>
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev
>>