[Pandas-dev] Fwd: What could a pandas 2.0 look like?

Wed Feb 19 17:55:27 EST 2020

On Tue, 18 Feb 2020 at 18:20, Tom Augspurger <tom.augspurger88 at gmail.com>
wrote:

>
> On Mon, Feb 17, 2020 at 7:17 PM Brock Mendel <jbrockmendel at gmail.com>
> wrote:
>
>> > You have no problem with changing the behavior of NaT, or changing to
>> > use pd.NA?
>>
>> If/when we get to a point where we propagate NAs in all other
>> comparisons, I would have no problem with editing `NaT.__richcmp__` to
>> match that convention.
>>
>
> What are the advantages of a NaT with NA-like comparison semantics over
> using NA (or NA[datetime])?
>
> 1. Retain dtype in array - scalar ops with a scalar NA
> 2. ...
> 3. Less disruptive than changing to NA
>
> My ... could include things like `isinstance(NaT, Timestamp)` being true
> and `NaT.<attr>` for Timestamp attributes. But those don't strike me as
> necessarily good things. They seem sometimes useful and sometimes harmful.
>
> The downsides of changing NaT in comparison operations are
>
> 1. We're diverging from `np.NaT`. I don't know how problematic this
>    actually is.
> 2. It's a special case. Should users need to know that datelikes use their
>    own NA value because the underlying storage is able to store them
>    "in-band" rather than as a mask? My gut reaction is "no, users shouldn't
>    be exposed to this."
> 3. Changing NaT would leave just NaN with the "always unequal in
>    comparisons" behavior.
>

Personally, I think changing the behaviour of NaT in pandas, and thus
deviating from the behaviour of the same value in numpy, is not a good
idea. For me, that seems more confusing than having a clearly distinct
value (pd.NA) that has the different behaviour.
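
To make the difference concrete, here is a minimal sketch of the comparison
behaviours under discussion. It assumes pandas >= 1.0 (where pd.NA exists)
and a recent NumPy; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# NaT and NaN follow the "always unequal" convention, in pandas and NumPy alike.
nat_eq = pd.NaT == pd.NaT                                        # False
np_nat_eq = bool(np.datetime64("NaT") == np.datetime64("NaT"))   # False
nan_eq = np.nan == np.nan                                        # False

# pd.NA instead propagates through comparisons: the result is itself missing.
na_eq = pd.NA == pd.NA        # pd.NA
na_vs_value = pd.NA == 1      # pd.NA

print(nat_eq, np_nat_eq, nan_eq, na_eq, na_vs_value)
```

Changing `NaT.__richcmp__` would move the first line into the second group
while `np.datetime64("NaT")` stays in the first, which is the divergence I
would rather avoid.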

>
> Thus far, I see three options going forward
>
> 1. Use NaN for floats, NaT for datelikes, NA for other.
>   1-a: Leave NaT with always unequal
>   1-b: Change NaT to have NA-like comparison behavior
> 2. Use NA everywhere (no NaN for float, no NaT for datelike)
> 3. Implement a typed `NA<T>`, where we have an `NA` per dtype.
>
> Option 3 I think solves the array - scalar op issue. It's more complex for
> users, though hopefully not too complex? My biggest worry is that it makes
> the implementation much more complex, though perhaps I'm being pessimistic.
>
> On balance, I'm not sure where I come down yet. Good news: we can take
> time to figure this out :)
>

Thanks for the summary!
Personally, I don't like the first option *long term*, as it keeps different
missing values (e.g. NaN) with different behaviours as the default for some
dtypes, while I would like to see us move to a consistent missing value
indicator.
And I think we can take an approach similar to the one we more or less
settled on in the original discussion on pd.NA: let's start with a single
pd.NA, and we can see later if there is a need to make it typed.
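
As an illustration of the inconsistency I mean, here is a small sketch
comparing a NaN-based float column with a pd.NA-based nullable integer
column. It assumes pandas >= 1.0, and the names are made up for the example.

```python
import numpy as np
import pandas as pd

floats = pd.Series([1.0, np.nan])            # missing value stored as NaN
ints = pd.Series([1, None], dtype="Int64")   # missing value stored as pd.NA

# NaN: the missing slot silently compares as False, result is plain bool.
float_cmp = floats == 1.0

# pd.NA: the missing slot stays missing, result is the nullable "boolean" dtype.
int_cmp = ints == 1

print(float_cmp.dtype, float_cmp.tolist())
print(int_cmp.dtype, int_cmp.tolist())
```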

Joris