[Pandas-dev] Fwd: What could a pandas 2.0 look like?

Brock Mendel jbrockmendel at gmail.com
Wed Feb 19 18:52:10 EST 2020


Pivoting: Joris, on the call you mentioned a TimestampArray.  Can you
expand on that a bit?

On Wed, Feb 19, 2020 at 2:55 PM Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

>
>
> On Tue, 18 Feb 2020 at 18:20, Tom Augspurger <tom.augspurger88 at gmail.com>
> wrote:
>
>>
>> On Mon, Feb 17, 2020 at 7:17 PM Brock Mendel <jbrockmendel at gmail.com>
>> wrote:
>>
>>> > You have no problem with changing the behavior of NaT, or changing
>>> > to use pd.NA?
>>>
>>> If/when we get to a point where we propagate NAs in all other
>>> comparisons, I would have no problem with editing `NaT.__richcmp__` to
>>> match that convention.
>>>
>>
>> What are the advantages of a NaT with NA-like comparison semantics over
>> using NA (or NA[datetime])?
>>
>> 1. Retain dtype in array - scalar ops with a scalar NA
>> 2. ...
>> 3. Less disruptive than changing to NA
>>
>> My ... could include things like `isinstance(NaT, Timestamp)` being true
>> and `NaT.<attr>` for Timestamp attributes. But those don't strike me as
>> necessarily good things. They seem sometimes useful and sometimes harmful.
>>
>> The downsides of changing NaT in comparison operations are
>>
>> 1. We're diverging from `np.NaT`. I don't know how problematic this
>>    actually is.
>> 2. It's a special case. Should users need to know that datelikes use
>>    their own NA value because the underlying storage is able to store
>>    them "in-band" rather than as a mask? My gut reaction is "no, users
>>    shouldn't be exposed to this."
>> 3. Changing NaT would leave just NaN with the "always unequal in
>>    comparisons" behavior.
>>
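For reference, the two comparison conventions being contrasted here can be checked directly. This is a minimal sketch, assuming pandas >= 1.0 (the first release with `pd.NA`); the variable names are illustrative only.

```python
import math

import pandas as pd

# "Always unequal" semantics: NaN and NaT comparisons return a plain boolean.
nan_eq = math.nan == math.nan   # False
nat_eq = pd.NaT == pd.NaT       # False
nat_ne = pd.NaT != pd.NaT       # True

# NA-like semantics: the comparison result is itself missing.
na_eq = pd.NA == pd.NA          # <NA>
na_vs_int = pd.NA == 1          # <NA>

print(nan_eq, nat_eq, nat_ne, na_eq, na_vs_int)
```

Option 1-b in the list further down would move NaT from the first group into the second.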
>
> Personally, I think changing the behaviour of NaT in pandas, and thus
> deviating from the behaviour of the same value in numpy, is not a good
> idea. For me, that seems more confusing than having a clearly distinct
> value (pd.NA) that has the different behaviour.
>
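The NumPy behaviour referred to above can be illustrated as follows. This is a sketch assuming a recent NumPy release (older releases compared NaT differently and emitted a warning); note that the NumPy value is spelled `np.datetime64("NaT")`.

```python
import numpy as np
import pandas as pd

np_nat = np.datetime64("NaT")

# NumPy's NaT is "always unequal", the same convention pd.NaT follows today.
np_eq = bool(np_nat == np_nat)    # False
np_ne = bool(np_nat != np_nat)    # True
pd_eq = bool(pd.NaT == pd.NaT)    # False

# Missingness is detected with dedicated predicates rather than with ==.
np_isnat = bool(np.isnat(np_nat))  # True
pd_isna = bool(pd.isna(np_nat))    # True

print(np_eq, np_ne, pd_eq, np_isnat, pd_isna)
```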
>
>>
>> Thus far, I see three options going forward:
>>
>> 1. Use NaN for floats, NaT for datelikes, NA for other.
>>   1-a: Leave NaT with "always unequal" comparison behavior
>>   1-b: Change NaT to have NA-like comparison behavior
>> 2. Use NA everywhere (no NaN for float, no NaT for datelike)
>> 3. Implement a typed `NA<T>`, where we have an `NA` per dtype.
>>
>> Option 3, I think, solves the array - scalar op issue. It's more complex
>> for users, though hopefully not too complex? My biggest worry is that it
>> makes the implementation much more complex, though perhaps I'm being
>> pessimistic.
>>
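One way to read the "array - scalar op issue" is that an untyped NA cannot tell `datetime_array - NA` whether the result should be a datetime or a timedelta array, whereas a typed NA can. A toy, pure-Python sketch of that idea follows; `TypedNA` and `sub_result_dtype` are hypothetical names invented for illustration and are not pandas APIs or a proposed implementation.

```python
class TypedNA:
    """Hypothetical missing-value scalar that remembers its dtype."""

    def __init__(self, dtype):
        self.dtype = dtype

    def __repr__(self):
        return f"NA[{self.dtype}]"

    def __eq__(self, other):
        # NA-like semantics: the comparison result is itself missing.
        return TypedNA("boolean")

    # All rich comparisons propagate in the same way.
    __ne__ = __lt__ = __le__ = __gt__ = __ge__ = __eq__

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")


def sub_result_dtype(array_dtype, na):
    """Toy dtype inference for `array - na` on a datetime array."""
    if array_dtype != "datetime64[ns]":
        raise NotImplementedError(array_dtype)
    # datetime - datetime -> timedelta; datetime - timedelta -> datetime
    table = {
        "datetime64[ns]": "timedelta64[ns]",
        "timedelta64[ns]": "datetime64[ns]",
    }
    return table[na.dtype]


na_dt = TypedNA("datetime64[ns]")
na_td = TypedNA("timedelta64[ns]")
cmp_result = na_dt == na_dt

print(cmp_result)                                # NA[boolean]
print(sub_result_dtype("datetime64[ns]", na_dt))  # timedelta64[ns]
print(sub_result_dtype("datetime64[ns]", na_td))  # datetime64[ns]
```

The cost mentioned above shows up even in this toy: every operation needs a rule for which dtype the resulting NA carries.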
>> On balance, I'm not sure where I come down yet. Good news: we can take
>> time to figure this out :)
>>
>
> Thanks for the summary!
> Personally, I don't like the first option *long term*, as it keeps
> different missing values (e.g. NaN) with different behaviours as the
> default for some dtypes, while I would like to see us moving to a
> consistent missing value indicator.
> And I think we can take an approach similar to the one we more or less
> settled on in the original discussion on pd.NA: let's start with a single
> pd.NA, and we can see later if there is a need to make it typed.
>
> Joris
>