PDEP-14: Dedicated string data type for pandas 3.0
Hi all,

I want to notify you of a new PDEP (pandas enhancement proposal) being discussed at https://github.com/pandas-dev/pandas/pull/58551. Rendered version of the current text: https://jorisvandenbossche.github.io/pandas-website-preview/pdeps/0014-strin...

*Summary*

This PDEP proposes to introduce a dedicated string dtype that will be used by default in pandas 3.0:

- In pandas 3.0, enable a "string" dtype by default, using PyArrow if available, or otherwise falling back to a string dtype that uses numpy object-dtype under the hood.
- The default string dtype will use missing value semantics (using NaN) consistent with the other default data types.

This will give users a long-awaited proper string dtype for 3.0, while 1) not (yet) making PyArrow a *hard* dependency, but only a dependency used by default, and 2) leaving room for future improvements (different missing value semantics, using NumPy 2.0 strings, etc) after 3.0.

*Backwards compatibility*

The biggest compatibility issues will be present for users of the existing StringDtype. Users have been able to specify dtype="string" for several years, and right now that gives a string dtype that uses pd.NA as the missing value sentinel, while with this proposal that will change to NaN by default in pandas 3.0. The current proposal is to start raising a deprecation warning for this in pandas 2.3 before changing the behaviour in pandas 3.0, and users will have the option to explicitly keep using the variant of the dtype that uses pd.NA (although most users should be fine with the new default). See the "Backwards compatibility" section in the PDEP text.

For the full proposal and background, see the links above.

Feedback welcome!
Hey Joris,

I’ve been trying to understand the impact of this for our internal codebase.

I had just started to introduce the non-object-based string type. This was based on my understanding that PyArrow would become a requirement in 3.0. We already use (Py)Arrow, mainly for Parquet, so I was not worried about this.

This proposal sounds a little bit like two steps forward, one step backward with respect to the direction pandas is taking for NA values. Or does this mean that the pandas-specific “NA”, masked arrays, etc., are effectively dead?

How is the I/O compatibility going to work?

I’m worried about the change from NA to NaN in logic expressions. If I understand this correctly, it might require code audits to validate?

Looking forward to your thoughts, and please let me know if there is a better forum for this discussion.

Regards,
Maarten.
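To make the concern about logic expressions concrete, here is a minimal sketch comparing an object-dtype column (NaN-style semantics) with the existing "string" dtype (pd.NA semantics). The variable names are illustrative only, and the snippet assumes a pandas version in which dtype="string" still selects the pd.NA variant.

```python
import pandas as pd

values = ["a", None, "b"]
obj = pd.Series(values, dtype=object)     # NaN-style missing values
na = pd.Series(values, dtype="string")    # pd.NA missing values

# Comparing against a missing entry: plain False vs. "unknown" (<NA>).
obj_mask = obj == "a"   # dtype bool:    [True, False, False]
na_mask = na == "a"     # dtype boolean: [True, <NA>, False]

# Negation is where the two diverge: the missing row flips to True in the
# NaN-style mask, but stays <NA> in the pd.NA-style mask.
inverted_obj = ~obj_mask
inverted_na = ~na_mask

print(inverted_obj.tolist())         # [False, True, True]
print(inverted_na.isna().tolist())   # [False, True, False]
```

Code that negates masks, or that passes them to `if` statements, is the kind of place an audit would probably need to look at first.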
Hi,

I am on the same page as Maarten. A default string dtype is great, but going back to NaN for missing values is strange. I read the PDEP, which clearly motivates using NaN instead of NA for the default string dtype. The argument, consistency among the default dtypes, makes sense, but as only floats are nullable, it looks like a weak consistency.

I see more arguments for keeping NA for the default strings than against it.

- Since 1.0, pd.NA is envisioned to replace all other sentinels to improve consistency. Introducing new dtypes with a NaN sentinel sends a strange message to the community that pd.NA might not eventually become the default sentinel. While pandas has made fantastic progress in terms of consistency with CoW, loc in-place semantics, method renaming and deprecation, etc., there is still a long way to go (one major improvement would be the simplification of the dtypes with all nullable dtypes by default). Each improvement comes at the price of a lot of work for pandas users to support the changes in behavior. It is a fair price to pay for a more consistent pandas, but if there are doubts about the direction, it might send the wrong signal to pandas users.
- The argument for having only NaN for default dtypes is weak in my opinion. Every user (even beginners) must be aware of and use extended dtypes. It makes no sense to use the current default string dtype that identifies as object, as it is an avenue for bugs. Using non-nullable numpy ints is also an issue, due to implicit float casting as soon as there are missing values (for numerous kinds of operations and methods). Nullable ints are today stable and easy to use. Moreover, all methods working with sentinel values seamlessly support NaN and NA, so mixing NaN and NA sentinels in different columns (besides the mere display) is never an issue.
- Once numpy 2.0 supports nullable strings with efficient storage, NaN (Not a Number) will be a poor choice of sentinel (of course it is not a number, as it is a string array). It will be much better for pandas, with the new numpy string dtype, to use NA (when displaying and setting) instead of NaN, so that the user manipulates NA even if it is backed by NaN under the hood.
- Supporting multiple string dtypes with different sentinels is a significant code, maintenance, and feature discovery issue. We currently have 3 string dtypes, which is already too many.

Using the current string dtype as the default one would be a great simplification and improvement in the default semantics for strings. Using NA as the sentinel is, in my opinion, natural to all current users and can be easily explained to new users. It is indeed a backward compatibility issue, but one with a strong benefit and moderate impact.

Regards,
Arnaud.
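The implicit float casting mentioned in the second bullet can be reproduced with a small sketch. The series names and index labels are made up for the example; reindexing is just one of several operations that introduce missing values.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])

# Label 5 does not exist, so reindexing introduces a missing value.
widened = ints.reindex([0, 1, 5])                    # int64 is cast to float64
nullable = ints.astype("Int64").reindex([0, 1, 5])   # stays Int64, gap is <NA>

print(widened.dtype, nullable.dtype)   # float64 Int64

# isna() treats both sentinels the same way.
print(widened.isna().tolist())    # [False, False, True]
print(nullable.isna().tolist())   # [False, False, True]
```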
Hi Arnaud,

Thanks for the feedback!

On Tue, 21 May 2024 at 11:10, Arnaud Legout <arnaud.legout@inria.fr> wrote:

> Hi,
>
> I am in the same line with Maarten. A default string dtype is great, but going back to NaN for missing values is strange. I read the PDEP that clearly motivates the reason for using NaN instead of NA for the default string dtype. The argument makes sense, having consistency in default dtypes, but as only floats are nullable, it looks like a weak consistency.

While NaN is only native in floating dtypes, it's not only the float dtype that is nullable (using NaN semantics) among the default dtypes. We use NaN for strings in object dtype as well, and the NaT in datetimelike dtypes has similar semantics (and categorical and interval also use NaN).

> I see more arguments for keeping NA for the default strings than against it.
>
> - since 1.0, pd.NA is envisioned to replace all other sentinels to improve consistency. Introducing new dtypes with NaN sentinel sends a strange message to the community that pd.NA might not eventually become the default sentinel. Whereas pandas made fantastic progress in terms of consistency with CoW, loc in-place semantic, methods renaming and deprecation, etc. there is still a long way to go (one major improvement would be the simplification of the dtypes with all nullable dtypes by default). Each improvement comes at the price of a lot of work for pandas users to support the changes in behavior. It is a fair price to pay for a more consistent pandas, but if there are doubt on the way it goes, it might send the wrong signal to pandas users.

I fully agree on the "major improvement would be the simplification of the dtypes with all nullable dtypes by default", and I would love to see that. While pd.NA has already existed for several years, for various reasons this is not fully ready, and so this won't happen in pandas 3.0.

We can't have all major improvements at the same time, and I personally hope we will be able to have all nullable dtypes by default in pandas 4.0. We could try to make that message clearer (but one complication is that we haven't yet officially decided on this through a PDEP).

> - the argument for having only NaN for default dtypes is weak in my opinion. Every user (even beginners) must be aware and use extended dtypes. It makes no sense to use the current default string dtype that identifies as object, as it is an avenue for bugs, using non nullable numpy ints is also an issue due to implicit float casting as long as we have missing values (for numerous kind of operation and methods). Nullable ints are today stable and easy to use. Moreover, all methods working with sentinel values support seamlessly NaN and NA, so mixing NaN and NA sentinels in different columns (beside the mere display) is never an issue.

Even though you might think it makes no sense, I do think that the reality is that quite a lot of people still use the default object dtype for string columns, for better or worse. And I think this is one of the most important considerations in the discussion, but at the same time also very uncertain / difficult to know ... Without hard facts, my _guess_ is that a majority of pandas users is still using it (with my guess informed by seeing user questions, beginner tutorials from a quick google, the fact that our IO methods still give you that by default unless you explicitly opt in, ..). But I could also well be wrong in that guess.

If most users were already using the current StringDtype with NA, then for sure, let's just make that the default. But if a good chunk of our user base is not yet using any opt-in data types, then for this part of the user base a future default string dtype with NaN in 3.0 will be more natural and easier to migrate to.

Maybe one of the answers to this is to make the proposal less breaking and to use a different name for this new dtype (which has also been brought up in the discussion on the PR <https://github.com/pandas-dev/pandas/pull/58551>). For example, we could use "str" for the new default dtype using NaN, so that existing users of dtype="string" don't have to worry about any of this. Essentially, only users that were not yet using the current string dtype / that were still using object dtype would then see the new string dtype using NaN. Of course, that does not resolve the complexity/confusion around having multiple string dtypes (and might actually only make it more confusing?).

> ...
> - supporting multiple string dtypes with different sentinels is a significant code, maintenance, and feature discovery issue. we currently have string 3 dtypes, it is already too much. Using the current string dtype as the default one would be a great simplification and improvement in the default semantic for strings. Using NA for sentinel is, in my opinion, natural to all current users and can be easily explained to new users. It is indeed a backward compatibility issue, but a strong benefit, moderate impact.

I hear you. And I fully agree this is an unfortunate situation, in that we would be creating even more string dtypes (and as the original person pushing for pd.NA, I definitely agree that this would be a big improvement). But I think you might underestimate the impact of switching to pd.NA (eg compatibility with numpy). My personal stance is that I wouldn't introduce a _default_ dtype that uses pd.NA at this point (as the only default one using it), so from that point of view it is either a string dtype with NaN in 3.0, or just no default string dtype at all (continue to use object dtype by default, and hopefully get a string dtype by default in 4.0).

In any case, regardless of whether we end up doing a string dtype in pandas 3.0 or not, it is clear that we should make a concerted effort to clarify the situation around pd.NA and try to get us to use it for all dtypes in pandas 4.0.

Cheers,
Joris
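The remark that several default dtypes, not only floats, already carry NaN-like missing values can be illustrated with a short sketch. The series below are invented for the example and are not taken from the PDEP.

```python
import numpy as np
import pandas as pd

floats = pd.Series([1.5, np.nan])                        # NaN
objects = pd.Series(["a", np.nan], dtype=object)         # NaN inside object dtype
dates = pd.Series(pd.to_datetime(["2024-05-20", None]))  # NaT
cats = pd.Series(pd.Categorical(["a", None]))            # NaN for a missing category

# All four default dtypes report the second entry as missing.
missing_flags = [s.isna().tolist() for s in (floats, objects, dates, cats)]
print(missing_flags)

# Comparisons involving NaT yield plain False, like NaN, rather than <NA>.
date_mask = dates == pd.Timestamp("2024-05-20")
print(date_mask.tolist())   # [True, False]
```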
Hi all,

I’m the primary author of the new variable-width UTF-8 string dtype shipping in NumPy 2.0. The timing of all this is a little unfortunate, since if the new string dtype had existed even a year or two earlier, it might have been viable as a migration target in this period.

One neat thing about the numpy dtype is that the null string is parametrized, so this is perfectly fine:
dt_with_na = np.dtypes.StringDType(na_object=pd.NA)
This is in fact how I implemented native support in pandas for the numpy StringDType (which we’re hoping to have a non-draft PR open for soon). Note that this isn’t using NaN under the hood (although internally it has the same semantics as NaN, for ease of compatibility); it’s using pd.NA directly. It also supports missing string values like None that generate errors, as well as default string values.

Another neat thing is that because the underlying representation for null values in the array buffer is independent of the NA value stored in the dtype, in principle it’s cheap to change the NA value in-place, without a copy or changing any values, if you want to change the NA semantics temporarily.

Just wanted to point that out because there was a mention earlier in this thread about numpy 2.0 using NaN under the hood, which isn’t quite right.
Hi all,

An update on this topic: based on the feedback here and on the PR, the PDEP has been updated as follows (summarizing). As in the initial version, the proposed default string dtype for pandas 3.0 will still use the "NaN-semantics" like all other default dtypes (so not yet using pd.NA), but for existing users of the "string" / pd.StringDtype() dtype (which does use pd.NA), the proposal is now backwards compatible. So if you were already using this string dtype, it should keep working as is.

This backwards compatibility is mostly achieved by keeping the "string" alias reserved for the existing version of the dtype that uses pd.NA, and using "str" for the new default string dtype.

PR with all the discussion: https://github.com/pandas-dev/pandas/pull/58551. Rendered version of the updated text: https://jorisvandenbossche.github.io/pandas-website-preview/pdeps/0014-strin...

If no further substantive discussion happens, we are planning to vote on this in two weeks' time.

Best,
Joris
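As a rough illustration of what "keep working as is" means for existing users, the sketch below pins the pd.NA variant in the two ways named in the update, the "string" alias and pd.StringDtype(). It assumes the updated proposal is implemented as described; the series names are made up.

```python
import pandas as pd

by_alias = pd.Series(["a", None], dtype="string")
by_class = pd.Series(["a", None], dtype=pd.StringDtype())

# Both spellings should resolve to the same dtype ...
same_dtype = by_alias.dtype == by_class.dtype

# ... and both keep pd.NA as the missing value sentinel.
sentinel_is_na = (by_alias[1] is pd.NA) and (by_class[1] is pd.NA)

print(same_dtype, sentinel_is_na)   # True True
```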
Hi Maarten,

On Mon, 20 May 2024 at 22:15, Maarten Ballintijn <maartenb@xs4all.nl> wrote:

> Hey Joris,
>
> I’ve been trying to understand the impact of this for our internal codebase.
>
> I’d just been starting to introduce the non object based string type. This was based on my understanding of PyArrow becoming a requirement in 3.0.

(Note that the NA vs NaN issue is independent of pyarrow becoming a required dependency. Even if we actually make pyarrow required in 3.0 (although given you already use it, that doesn't matter for your use case), we would still need to decide on whether to use NaN or NA by default (an important aspect which was unfortunately not considered when discussing pyarrow as a required dependency).)

> We use (Py)Arrow) already, mainly for Parquet, so I was not worried about this.
>
> This proposal sounds a little bit as two steps forward, one step backward wrt to the direction Pandas is taking for NA values. Or does this mean that Pandas specific “NA”, masked arrays, etc, is effectively dead?

The NA dtypes are definitely not dead, but essentially this PDEP means "not ready for 3.0" (not ready to use for *all* dtypes by default, and I think we should only use it by default if we can use it by default for all dtypes). Given that other data types still use NaN semantics, the proposal is to use the same semantics for the _default_ string dtype. That doesn't mean that you can't continue to use the string dtype using NA, though. And there are discussions (planned) about moving to NA for all dtypes in a later major release of pandas, but not yet a concrete proposal or timeline.

> How is the I/O compatibility going to work?

You mean for eg Parquet files that have been written with pandas < 3 using the string dtype, and then reading those in pandas >= 3? By default that will work out of the box for the default string dtype. But in case you still want the NA-variant of the dtype, that's less clear (the metadata in the Parquet file does not allow us to distinguish that). That is something we should look into.

> I’m worried about the NA vs. NaN in logic expressions change. If I understand this correctly, it might require code audits to validate?

Yes, that could have an impact. First, if you were already using the string dtype, you will need to decide for your code base whether you want to continue using the variant with NA, or whether you would prefer to just use the default dtype (and so use the variant with NaN). The answer to this will depend on your use case.

For example, if your motivation to introduce the non-object-based string type in your code base was the better performance using pyarrow, but you are not actively using the pd.NA aspect of it, you might be fine with just using the future default string dtype. In that case, you can enable it (pd.options.future.infer_string = True) and test your code.

If you do have specific code using the various nullable dtypes with pd.NA and masked arrays, you can simply keep using those. You will then need some code changes to make sure you keep using the NA variant (for example, at occurrences of dtype="string"), but apart from that there should be no behaviour change or further audits necessary.

Some feedback on how big the impact is on a real world code base is definitely welcome and valuable!

Joris
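For deciding which parts of a code base rely on pd.NA, one possible starting point is to list the columns whose dtype declares pd.NA as its missing value. This is only a sketch: `columns_using_pd_na` and the example frame are hypothetical helpers for illustration, not part of pandas or the PDEP.

```python
import pandas as pd


def columns_using_pd_na(df: pd.DataFrame) -> list:
    """Return names of columns whose dtype declares pd.NA as missing value."""
    found = []
    for name, dtype in df.dtypes.items():
        # NumPy dtypes (object, float64, ...) have no na_value attribute,
        # so they fall through to the None default and are skipped.
        if getattr(dtype, "na_value", None) is pd.NA:
            found.append(name)
    return found


frame = pd.DataFrame(
    {
        "legacy": pd.Series(["x", None], dtype=object),
        "pinned": pd.Series(["x", None], dtype="string"),
        "count": pd.Series([1, None], dtype="Int64"),
        "score": pd.Series([1.0, None], dtype="float64"),
    }
)

na_columns = columns_using_pd_na(frame)
print(na_columns)   # ['pinned', 'count']
```

Columns reported here are the ones whose comparisons and masks follow pd.NA logic, so they are likely the ones to pin explicitly or to review when trying out the future default.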
participants (4)
- Arnaud Legout
- Joris Van den Bossche
- Maarten Ballintijn
- Nathan