[Pandas-dev] What could a pandas 2.0 look like?

Brock Mendel jbrockmendel at gmail.com
Mon Feb 17 11:25:26 EST 2020


> It's not fully clear to me what you want to say with this, so a more
> detailed clarification is welcome (I mean, I understand the sentence and
> remember the discussion, but don't fully understand the point being made in
> context, or in what direction you think more discussion is needed).

I don't particularly think more discussion is needed, as this is a rehash
of #28095, where this horse has already been beaten to death.

As Tom noted here
<https://github.com/pandas-dev/pandas/issues/28095#issuecomment-537501744>,
using pd.NA in places where we currently use NaT breaks the usual identity
(that we rely on A LOT)

```(array + array)[0].dtype <=> (array + array[0]).dtype```

(Yes, this holds only imperfectly for NaT because NaT serves as both
NotADatetime and NotATimedelta, and I'll refer you to the decomposing horse
in #28095.)
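
For concreteness, a minimal sketch of the identity (and of the NaT caveat)
with plain pandas arrays:

```
import pandas as pd

# Array-with-array and array-with-scalar arithmetic should agree on the
# result dtype:
dta = pd.array(["2020-01-01", "2020-01-02"], dtype="datetime64[ns]")
tda = pd.array([pd.Timedelta("1D"), pd.Timedelta("2D")])

assert (dta + tda).dtype == (dta + tda[0]).dtype  # datetime64[ns] either way

# The NaT caveat: Timestamp + Timestamp raises TypeError, but
# Timestamp + NaT is allowed, because NaT doubles as not-a-timedelta:
ts = pd.Timestamp("2020-01-01")
ts + pd.NaT  # returns NaT instead of raising
```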

Also from #28095:

```Series[timedelta64] * pd.NaT``` unambiguously raises, but
```Series[timedelta64] * pd.NA``` could be timedelta64
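
Spelled out below (the pd.NA branch is the open design question, not a claim
about current behavior):

```
import pandas as pd

tdser = pd.Series(pd.to_timedelta(["1 day", "2 days"]))

# NaT is (also) datetimelike, and timedelta-times-datetimelike is
# undefined, so this unambiguously raises:
try:
    tdser * pd.NaT
except TypeError:
    pass

# pd.NA carries no dtype information, so the result of tdser * pd.NA is
# ambiguous: timedelta64 with a missing entry is one defensible answer,
# raising is another.
```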

> Assume we introduce a new "nullable datetime" dtype that uses a mask to
> track NAs, and can still have NaT in the values. In practice, this still
> means that we "replace NaT with NA"

This strikes me as contradictory: it is pitched as supplementing NaT rather
than replacing it, while conceding that in practice it amounts to replacing
NaT with NA.
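
For reference, the layout being described is roughly the following (a
hypothetical sketch, not an existing pandas API):

```
import numpy as np

# "Nullable datetime" sketch: NaT can still live in the values array,
# while NA is tracked separately in a boolean mask.
values = np.array(["2020-01-01", "NaT", "2020-01-03"], dtype="datetime64[ns]")
mask = np.array([False, False, True])  # last entry is NA; the NaT stays NaT
```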

> So do you mean: "in my opinion, we should not do this" (what I just
> described above), because in practice that would mean breaking arithmetic
> consistency? Or that if we want to start using NA for datetimelike dtypes,
> you think "dtype-parametrized" NA values are necessary (so you can
> distinguish NA[datetime] and NA[timedelta])?

I think:

1) pd.NA solves an _actual_ problem: we used to use np.nan in places
(categorical, object) where np.nan was semantically misleading (see the
sketch after this list).
   a) What these have in common is that they are in general non-arithmetic
dtypes.
   b) This is an improvement, and I'm glad you put in the effort to make it
happen.
   c) Trying to shoehorn pd.NA into cases where it is semantically
misleading, based on the Highlander Principle ("there can be only one"),
is counter-productive.

2) The "only one NA value is simpler" argument strikes me as a solution in
search of a problem.
   a) All the more so if you want this to supplement np.nan/pd.NaT rather
than replace them.
   b) *the idea of replacing vs supplementing needs to be made much more
explicit/clear*

3) The "dtype-parametrized" NA did come up in #28095, but I never advocated
it.
   a) I am open to separating out a NaTimedelta (xref #24983) from pd.NaT,
and don't particularly care what it is called.
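
To illustrate point 1, a sketch using the nullable string dtype (one of the
places where pd.NA now lives):

```
import pandas as pd

# Missing entries in object/categorical data historically came back as
# np.nan -- a float, which is semantically misleading for non-arithmetic
# data.  The new nullable dtypes use pd.NA instead:
s = pd.array(["a", None, "b"], dtype="string")
s[1]  # <NA>, rather than the float np.nan
```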


On Mon, Feb 17, 2020 at 3:37 AM Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

>> This would also imply creating a nullable float dtype and making our
>> datelikes use NA rather than NaT too. That seemed to be generally OK, but
>> wasn't discussed too much.
>>
>> My understanding of the discussion is that using a mask on top of
>> datetimelike arrays would not _replace_ NaT, but supplement it with
>> something semantically different.
>>
>
> Yes, if we treat it similarly to NaN for floats (where NaN is a specific
> float value in the data array, while NAs are tracked in the mask array),
> then we can do something analogous for datetimelike arrays. And the same
> discussions we are going to have for float dtypes (about to what extent to
> distinguish NaN and NA, or whether we need to provide options) will also
> be relevant for datetimelike dtypes (but then for NaT and NA).
>
> But note that in practice, I *think* that the large majority of use cases
> will mostly use NA and not NaT in the data (e.g. when reading from files
> that have missing data).
>
>> Replacing NaT with NA breaks arithmetic consistency, as has been discussed
>> ad nauseam.
>>
>
> It's not fully clear to me what you want to say with this, so a more
> detailed clarification is welcome (I mean, I understand the sentence and
> remember the discussion, but don't fully understand the point being made in
> context, or in what direction you think more discussion is needed).
>
> Assume we introduce a new "nullable datetime" dtype that uses a mask to
> track NAs, and can still have NaT in the values. In practice, this still
> means that we "replace NaT with NA" (because even though NaT is still
> possible, I think you would mostly get NAs as mentioned above; e.g. reading
> a file would now give NA instead of NaT).
> So do you mean: "in my opinion, we should not do this" (what I just
> described above), because in practice that would mean breaking arithmetic
> consistency? Or that if we want to start using NA for datetimelike dtypes,
> you think "dtype-parametrized" NA values are necessary (so you can
> distinguish NA[datetime] and NA[timedelta])?
>
> Joris