[Pandas-dev] What could a pandas 2.0 look like?

Mon Feb 17 12:50:38 EST 2020

> I think consistently propagating NA in comparison operations is a
> worthwhile goal.

That's an argument for having a three-valued bool-dtype, not for replacing
all other NA-like values.
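
To make the distinction concrete, here is a minimal sketch of three-valued
comparison and boolean logic, assuming pandas >= 1.0 (where pd.NA and the
nullable "boolean" dtype exist). The variable names are illustrative only and
do not come from the thread.

```python
import pandas as pd

# A scalar comparison against pd.NA yields pd.NA rather than False.
eq_result = pd.NA == 1

# NaT follows the older convention: comparisons return False.
nat_result = pd.NaT == pd.NaT

# Kleene logic: the answer is NA only when it depends on the unknown value.
or_true = pd.NA | True      # True, whatever the unknown is
and_false = pd.NA & False   # False, whatever the unknown is
and_true = pd.NA & True     # still unknown, so pd.NA

# The nullable "boolean" dtype applies the same rules elementwise.
flags = pd.array([True, False, None], dtype="boolean")
combined = flags & True
```

None of this requires datetime-like or float data to give up NaT or NaN; it
only needs a result dtype that can hold a third, "unknown" state.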

On Mon, Feb 17, 2020 at 8:34 AM Tom Augspurger <tom.augspurger88 at gmail.com>
wrote:

> > 2) The "only one NA value is simpler" argument strikes me as a solution
> > in search of a problem.
>
> I don't think that's correct. I think consistently propagating NA in
> comparison operations is a worthwhile goal.
>
> On Mon, Feb 17, 2020 at 10:25 AM Brock Mendel <jbrockmendel at gmail.com>
> wrote:
>
>> > It's not fully clear to me what you want to say with this, so a more
>> > detailed clarification is welcome (I mean, I understand the sentence and
>> > remember the discussion, but don't fully understand the point being made in
>> > context, or in what direction you think more discussion is needed).
>>
>> I don't particularly think more discussion is needed, as this is a rehash
>> of #28095, where this horse has already been beaten to death.
>>
>> As Tom noted here
>> <https://github.com/pandas-dev/pandas/issues/28095#issuecomment-537501744>,
>> using pd.NA in places where we currently use NaT breaks the usual identity
>> (that we rely on A LOT)
>>
>> ```(array + array)[0].dtype <=> (array + array[0]).dtype```
>>
>> (Yes, this holds only imperfectly for NaT because NaT serves as both
>> NotADatetime and NotATimedelta, and I'll refer you to the decomposing horse
>> in #28095.)
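
A rough sketch of the identity quoted above, read here as "combining an array
with one of its own elements should give the same dtype as combining it with
itself". This assumes a recent pandas release; the names tds, missing,
array_op and scalar_op are hypothetical and only serve the illustration.

```python
import pandas as pd

# A timedelta64 Series whose second entry is missing (stored as NaT).
tds = pd.Series(pd.to_timedelta(["1 day", None]))
missing = tds.iloc[1]  # the scalar pulled out of the array: pd.NaT

array_op = tds + tds        # array + array
scalar_op = tds + missing   # array + array[i], where array[i] is missing

# NaT is treated as timedelta-like here, so both results keep the same dtype.
same_dtype = array_op.dtype == scalar_op.dtype

# A bare pd.NA carries no dtype of its own; by itself it is inferred as object,
# whereas a bare NaT is inferred as datetime-like.
na_alone = pd.Series([pd.NA]).dtype
nat_alone = pd.Series([pd.NaT]).dtype
```

The last two lines are the crux: once the scalar is extracted, NaT still hints
at a datetime-like result, while pd.NA says nothing about what it was missing
from.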
>>
>> Also from #28095:
>>
>> ```Series[timedelta64] * pd.NaT``` unambiguously raises, but
>> ```Series[timedelta64] * pd.NA``` could be timedelta64
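
A small sketch of the first half of that claim, assuming a recent pandas
release. The second half (what timedelta64 * pd.NA should return) is exactly
what is under discussion, so it is deliberately not exercised here; the
nullable integer case is shown only as a point of comparison. The helper
variable names are hypothetical.

```python
import pandas as pd

tds = pd.Series(pd.to_timedelta(["1 day", "2 days"]))

# Multiplying timedeltas by NaT has no sensible meaning, so pandas raises.
try:
    tds * pd.NaT
    nat_mul_raised = False
except TypeError:
    nat_mul_raised = True

# For comparison: a nullable integer array multiplied by pd.NA does not raise.
# It returns an all-missing array of the same dtype.
ints = pd.array([1, 2], dtype="Int64")
na_mul = ints * pd.NA
```

If timedelta64 data followed the nullable-integer pattern, the NA product
would silently come back as all-missing timedeltas instead of raising.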
>>
>> > Assume we introduce a new "nullable datetime" dtype that uses a mask to
>> > track NAs, and can still have NaT in the values. In practice, this still
>> > means that we "replace NaT with NA"
>>
>> This strikes me as contradictory.
>>
>> > So do you mean: "in my opinion, we should not do this" (what I just
>> > described above), because in practice that would mean breaking arithmetic
>> > consistency? Or that if we want to start using NA for datetimelike dtypes,
>> > you think "dtype-parametrized" NA values are necessary (so you can
>> > distinguish NA[datetime] and NA[timedelta] ?)
>>
>> I think:
>>
>> 1) pd.NA solves an _actual_ problem which is that we used to use np.nan
>> in places (categorical, object) where np.nan was semantically misleading.
>>    a) What these have in common is that they are in general
>> non-arithmetic dtypes.
>>    b) This is an improvement, and I'm glad you put in the effort to make
>> it happen.
>>    c) Trying to shoe-horn pd.NA into cases where it is semantically
>> misleading based on the Highlander Principle is counter-productive.
>>
>> 2) The "only one NA value is simpler" argument strikes me as a solution
>> in search of a problem.
>>    a) All the more so if you want this to supplement np.nan/pd.NaT
>> instead of replace them.
>>    b) *the idea of replacing vs supplementing needs to be made much more
>> explicit/clear*
>>
>> 3) The "dtype-parametrized" NA did come up in #28095, but I never
>> advocated it.
>>    a) I am open to separating out a NaTimedelta (xref #24983) from
>> pd.NaT, and don't particularly care what it is called.
>>
>>
>> On Mon, Feb 17, 2020 at 3:37 AM Joris Van den Bossche <
>> jorisvandenbossche at gmail.com> wrote:
>>
>>>> > This would also imply creating a nullable float dtype and making our
>>>> > datelikes use NA rather than NaT too. That seemed to be generally OK, but
>>>> > wasn't discussed too much.
>>>>
>>>> My understanding of the discussion is that using a mask on top of
>>>> datetimelike arrays would not _replace_ NaT, but supplement it with
>>>> something semantically different.
>>>>
>>>
>>> Yes, if we see it similar as NaNs for floats (where NaN is a specific
>>> float value in the data array, while NAs are tracked in the mask array),
>>> then for datetimelike arrays we can do something similar. And the same
>>> discussions about to what extent to distinguish NaN and NA or whether we
>>> need to provide options that we are going to have for float dtypes, will
>>> also be relevant for datetimelike dtypes (but then for NaT and NA).
>>>
>>> But note that in practice, I *think* that the big majority of use cases
>>> will mostly use NA and not NaT in the data (eg when reading from files that
>>> have missing data).
>>>
>>>> Replacing NaT with NA breaks arithmetic consistency, as has been
>>>> discussed ad nauseam.
>>>>
>>>
>>> It's not fully clear to me what you want to say with this, so a more
>>> detailed clarification is welcome (I mean, I understand the sentence and
>>> remember the discussion, but don't fully understand the point being made in
>>> context, or in what direction you think more discussion is needed).
>>>
>>> Assume we introduce a new "nullable datetime" dtype that uses a mask to
>>> track NAs, and can still have NaT in the values. In practice, this still
>>> means that we "replace NaT with NA" (because even though NaT is still
>>> possible, I think you would mostly get NAs as mentioned above; eg reading a
>>> file would now give NA instead of NaT).
>>> So do you mean: "in my opinion, we should not do this" (what I just
>>> described above), because in practice that would mean breaking arithmetic
>>> consistency? Or that if we want to start using NA for datetimelike dtypes,
>>> you think "dtype-parametrized" NA values are necessary (so you can
>>> distinguish NA[datetime] and NA[timedelta] ?)
>>>
>>> Joris
>>>
>>>
>>> _______________________________________________
>>> Pandas-dev mailing list
>>> Pandas-dev at python.org
>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>