Proposal: pandas 1.0 and ideas for pandas 2.0 / future
hi folks,

As a continuation of ongoing discussions on GitHub and on the mailing list around deprecations, future innovation, and internal reworkings of pandas, I had a couple of ideas to share that I am looking for feedback on.

As of pandas 0.19.x today, I would like to propose that we consider releasing the project as pandas 1.0 in the next major release or the one after. The Python community does have a penchant for "eternal betas", but after all the hard work of the core developers and community over the last 5 years, I think we can safely consider making a stable 1.X production release.

If we do decide to release pandas 1.0, I also propose that we strongly consider making 1.X an LTS / Long Term Support branch where we continue to make releases, but with bug fixes and documentation improvements only. Alternatively, we can add new features, but on an extremely conservative basis. This might require some changes to the development process, so I am looking for feedback on this.

If we commit to this path, I would suggest that we start a pandas-2.0 integration branch where we can begin more seriously planning and executing on:

- Cleanup and removal of years' worth of accumulated cruft / legacy code
- Removal of deprecated features
- A Series and DataFrame internals revamp

I had hoped that 2016 would offer me more time to work on the internals revamp, but between my day job and the 2nd edition of "Python for Data Analysis" that turned out to be a little too ambitious. I have been almost continuously thinking about how to go about this, though, and it might be good to figure out a process where we can start documenting it and coming up with a more granular development roadmap. Part of this will be carefully documenting any APIs we change or unit tests we break along the way.

We would want to give ample time for heavy pandas users to run their 3rd-party code against pandas 2.0-dev to give feedback on whether our assumptions about the impact of changes affect real production code.
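As one small sketch of the kind of rough edge heavy users have coded around (exact dtypes may vary slightly across pandas versions):

```python
import pandas as pd

# An integer Series keeps its NumPy int64 dtype...
s = pd.Series([1, 2, 3])
print(s.dtype)  # int64

# ...but introducing a missing value forces an implicit upcast to float64,
# because NumPy integer arrays have no way to represent NA.
print(pd.Series([1, 2, None]).dtype)  # float64

# Boolean data with a missing value degrades all the way to object dtype.
print(pd.Series([True, False, None]).dtype)  # object
```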
As a concrete example: integer and boolean Series would be able to accommodate missing data without implicitly casting to float or object NumPy dtype, respectively. Since many users will have inserted workarounds / data-massaging code because of such rough edges, this may cause code breakage, or simply redundancy, in some cases.

As another example: we should probably remove the .ix indexing attribute altogether. I'm sure many users are still using .ix, but it would be worthwhile to go through such code and decide whether it's really .loc or .iloc.

My hope would be (being a deadline-motivated person) that we could see a pandas 2.0 alpha release sometime in mid- or second-half 2017, with a target beta / pre-production QA release in early 2018 or thereabouts. Part of this would be creating a 1.0-to-2.0 migration guide for users.

My biggest concern with pandas in recent years is how not to be held back by strict backwards compatibility while still being able to innovate and stay relevant into the 2020s.

For pandas 2.0, some of the most important issues I've been thinking about are:

- Logical type abstraction layer / decoupling: pandas-only data types (Categorical, DatetimeTZ, Period, etc.) would become equal citizens alongside data types mapping 1-1 onto NumPy numeric dtypes
- Decoupling physical storage to permit non-NumPy data structures inside Series
- Removal of the BlockManager and 2D block consolidation in DataFrame, in favor of a native C++ internal table (vector-of-arrays) data structure
- Consistent NA semantics across all data types
- Significantly improved handling of string/UTF-8 data (performance, memory use, elimination of PyObject boxes). Combining the two items above, we could even make all string arrays internally categorical (with the option to explicitly cast to categorical); in the database world this is often called dictionary encoding
- Refactor of most Cython algorithms into C++11/14 templates
- Copy-on-write for Series and DataFrame
- Removal of Panel and ndim > 3 data structures
- Analytical expression VM (for example, things like df[boolean_arr].groupby(...).agg(...) could be evaluated by a small Numexpr-like VM, not dissimilar to R's dplyr library, with significantly improved memory use and maybe performance too)

There's a lot to unpack here, but let me know what everyone thinks about these things. The "pandas 2.0" / internals-revamp discussion we can tackle in a separate thread, or perhaps in a GitHub repo or a design folder in the pandas codebase.

Thanks,
Wes
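The expression-VM bullet above is easiest to picture with a toy deferred-evaluation sketch in plain Python (all names here are hypothetical, not a proposed pandas API): operations record a plan instead of materializing intermediates, and a single collect() pass evaluates the whole pipeline, which is the point where a Numexpr-like VM could fuse the filter and aggregation to save memory.

```python
# Toy sketch of deferred evaluation: each method records an operation
# instead of executing eagerly; .collect() runs the whole pipeline in
# one pass, so an optimizer could fuse steps without intermediates.
class Plan:
    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops

    def filter(self, pred):
        # Records the predicate; does not run it yet.
        return Plan(self.data, self.ops + (("filter", pred),))

    def agg(self, fn):
        # Records the aggregation; does not run it yet.
        return Plan(self.data, self.ops + (("agg", fn),))

    def collect(self):
        # Single evaluation pass over the recorded plan.
        rows = self.data
        for kind, f in self.ops:
            if kind == "filter":
                rows = [r for r in rows if f(r)]
            elif kind == "agg":
                rows = f(rows)
        return rows

result = Plan([1, -2, 3, -4, 5]).filter(lambda x: x > 0).agg(sum).collect()
# result == 9 (filters to [1, 3, 5], then sums)
```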
I know I expressed concerns about cross-compatibility with the rest of the SciPy ecosystem before (especially xarray), but this plan sounds very solid to me. Flexible data types in N-dimensional arrays are important for other use cases, but also not really a problem for pandas.

A separate 2.0 release will let us make the major breaking changes to the pandas data model necessary for it to work well in the long term. There are a few other API warts that we will be able to clean up this way (detailed in github.com/pydata/pandas/issues/10000), indexing on DataFrames being the most obvious one.

On Tue, Jul 26, 2016 at 1:51 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
_______________________________________________
Pandas-dev mailing list
Pandas-dev@python.org
https://mail.python.org/mailman/listinfo/pandas-dev
I applaud the vision and ambition for the roadmap of the future of pandas.

However, the resources are lacking for many of these changes. Currently pandas is just barely keeping up with the (recently increased) flow of user pull requests, not to mention the issue reports. These are all great indicators of community use and exercising of the edge cases.

A roadmap is an excellent start, but the resource question needs to be front and center.

The current process *could* evolve into LTS. In 0.19.0, lots of progress is happening towards removing older code (and of course deprecating things). An aggressive push of this into 0.20.0 will go a long way towards de-facto establishing 1.0 / LTS (and maybe that's what we simply call 0.20.0).

I would agree we could simply release 1.0 / LTS without adding any 'new' features (like fixed getitem indexing and such).

I would like to see 2.0 with a user-facing API that is a drop-in replacement (though allowing for some breaking changes that are NOT backwards-compatible, e.g. getitem indexing). I think it would be acceptable to break the back-end API (meaning, towards NumPy) though.

For the resource question, as I have mentioned off-list, I will format this roadmap so that pandas can support a fund-raising effort to garner resources for these changes.

Jeff

On Tue, Jul 26, 2016 at 5:13 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
Wes, thanks for your mail!

I like the idea of first releasing a pandas 1.0 before the 'big refactor'. We know for sure that this will take a while to stabilize (even with a lot of resources), and I think the idea was to provide a kind of LTS release. In that regard, it is just clearer to name this pandas 1.x rather than 0.19.x.

Maybe we can start a separate thread to discuss this 1.0, as there are of course some questions to discuss:
- do we first release 0.19 (we didn't specifically discuss this, but I think the rough idea was to have a release candidate somewhere in August), or do we directly aim at 1.0?
- are there certain changes we want to make before 1.0 that are feasible in the short term?
- are there some of the current ideas for deprecations that we should exclude/include for this release? (e.g. I think deprecating PanelND (as just landed in master) is good, but the idea of deprecating Panel should rather wait until 2.0?)
- ...

How exactly to tackle those bug-fix releases / the LTS branch is also something that can be discussed, but I would not worry too much about that (there are enough examples of other projects doing something similar; we just have to search for a process that suits us).

What I think is a more important issue with this process is the community of contributors. We would effectively have a period of about two years (before a final 2.0 release) where for the current (1.0) version only certain bug fixes are considered, while on the other hand it is still difficult to contribute to the new version. We would maybe have to say no to many of the PRs or enhancement ideas. Such a situation could hinder community contributions and participation. And there are currently a lot of contributions: as Jeff also said, the current active contributors are barely keeping up with managing all issues and pull requests.

I have worked more on pandas the last few weeks (thanks to Continuum), and indeed I spent most of my time answering issues and reviewing PRs, and hardly had any time to do much coding myself. But of course this is also a choice that I currently make. And I (we) could also make the choice to focus more on pandas 1.0/2.0-related issues, or try to steer some of the active contributors to that.

I also have some concerns about compatibility with the rest of the ecosystem, but at the same time it is clear, I think, that there should be some kind of refactor, and it is in the further elaboration of the roadmap that such concerns can be addressed.

Joris

2016-07-27 12:04 GMT+02:00 Jeff Reback <jeffreback@gmail.com>:
1) I would be in favour of releasing 0.19.0, in part because we already pushed back and actually forwent 0.18.2. I think these plans are better served by the release after this one, to give more time to map this out but also to push out the changes that have already been made in preparation for this release.

2) In terms of organisation, I wonder if we might be better served by *reorganising* the way in which PRs are reviewed during the time period between one release and the next, instead of having these parallel tracks of development, in light of the concern brought up by @jorisvandenbossche. Perhaps rather than just reviewing PRs as they come in, specify which types of PRs should be submitted during certain periods of time. For example, a large chunk of the period could be devoted to accepting enhancements / new features, after which the remaining time before a release could be devoted to just organisation / refactoring / deprecations / what have you (maybe including bug fixes too). That way we could have a *contiguous* block of time to focus on stabilising and tidying up the release. It would also allow for the refactoring to take place (perhaps incrementally) without the concern of being destabilised by a new feature.

For this to work, this would have to be clearly stated in the contributing docs as well as circulated in emails to pandas-dev AND other related groups, so that people know what's going on in terms of the development cycle.

On Wed, Jul 27, 2016 at 7:51 PM, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
Wes, thanks for your mail!
I like the idea of first releasing a pandas 1.0 before the 'big refactor'. We for sure know that this will take a while to stabilize (even with a lot of resources), and I think the idea was to provide a kind of LTS release. In that regard, it is just clearer to name this pandas 1.x then 0.19.x.
Maybe we can start a separate thread to discuss on this 1.0, as there are of course some questions to discuss: - do we first release 0.19 (we didn't specifically discuss this, but I think the rough idea was to have somewhere in august a release candidate), or do we directly aim at 1.0? - are there some certain changes we want to do before 1.0 that are feasible in the short term? - are there some of the current ideas of deprecations that we should exclude/include for this release? (eg I think deprecating PanelND (as just landed in master) is good, but the idea of deprecating Panel should rather wait until 2.0?) - ...
How exactly to tackle those bug fix releases / LTS branch, is also something that can be discussed, but I would not worry too much about that (there are enough examples of other projects to do something similar, we just have to search for a process that suits us).
What I think a more important issue or problem with this process is the community of contributors. If we would effectively have a period of about two years (before a final 2.0 release) where for the current (1.0) version only certain bug-fixes are considered, but on the other hand it is still difficult to contribute to the new version. We would maybe have to say no to many of the PRs or enhancement ideas. Such a situation could hinder the process of community contributions and participation. And there are currently a lot of contributions. As Jeff also said, the current active contributors are barely keeping up with managing all issues and pull requests. I have worked the last few weeks more on pandas (thanks to Continuum), and indeed I spent most of my time answering issues and reviewing PRs, and hardly have any time to do much coding myself. But of course this is also a choice that I currently make. And I (we) could also make the choice to focus more on pandas 1.0/2.0 related issues, or try to steer some of the active contributors to that.
I also have some concerns about the compatibility with the rest of the ecosystem, but at the same time it is clear I think that there should be some kind of refactor, and it is in the further elaboration of the roadmap that such concerns can be addressed.
Joris
2016-07-27 12:04 GMT+02:00 Jeff Reback <jeffreback@gmail.com>:
I applaud the vision and ambition for the roadmap of the future of pandas.
However, the resources are lacking for much of these changes. Currently pandas is just barely keeping up with the (recently increased) user flow of pull-requests, not to mention the issue reports. These are all great indicators of community use and exercising the edge cases.
A roadmap is an excellent start, but the resource question needs to be front and center.
The current process *could* evolve into LTS. In 0.19.0, lots of progress towards removing older code (and of course deprecating things) is happening. An aggressive push of this into 0.20.0 will go a long ways towards de-facto establishing 1.0 / LTS. (and maybe that's what we simply call 0.20.0).
I would agree we could simply release 1.0 / LTS without adding any 'new' features (like fixed getitem indexing and such).
I would like to see 2.0 with a user facing API that is a drop-in replacement (though allowing for some breaking changes that are NOT back-compat, e.g. getitem indexing). I think it would be acceptable to break the back-end API (meaning to numpy) though.
For the resource question, as I have mentioned off-list, I will format this roadmap in order for pandas to support a fund-raising effort to garner resources for these changes.
Jeff
On Tue, Jul 26, 2016 at 5:13 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
I know I expressed concerns about cross-compatibility with the rest of the SciPy ecosystem before (especially xarray), but this plan sounds very solid to me. Flexible data types in N-dimensional arrays are important for other use cases, but also not really a problem for pandas.
A separate 2.0 release will let us make the major breaking changes to the pandas data model necessary for it to work well in the long term. There are a few other API warts that we will be able to clean up this way (detailed in github.com/pydata/pandas/issues/10000), indexing on DataFrames being the most obvious one.
On Tue, Jul 26, 2016 at 1:51 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
hi folks,
As a continuation of ongoing discussions on GitHub and on the mailing list around deprecations and future innovation and internal reworkings of pandas, I had a couple of ideas to share that I am looking for feedback on.
As far as pandas 0.19.x today, I would like to propose that we consider releasing the project as pandas 1.0 in the next major release or the one after. The Python community does have a penchant for "eternal betas", but after all the hard work of the core developers and community over the last 5 years, I think we can safely consider making a stable 1.X production release.
If we do decide to release pandas 1.0, I also propose that we strongly consider making 1.X an LTS / Long Term Support branch where we can continue to make releases, but bug fixes and documentation improvements only. Or, we can add new features, but on an extremely conservative basis. This might require some changes to development process, so looking for feedback on this.
If we commit to this path, I would suggest that we start a pandas-2.0 integration branch where we can begin more seriously planning and executing on
- Cleanup and removal of years' worth of accumulated cruft / legacy code
- Removal of deprecated features
- Series and DataFrame internals revamp
I had hoped that 2016 would offer me more time to work on the internals revamp, but between my day job and the 2nd ed of "Python for Data Analysis" that turned out to be a little too ambitious. I have been almost continuously thinking about how to go about this though, and it might be good to figure out a process where we can start documenting and coming up with a more granular development roadmap for this. Part of this will be carefully documenting any APIs we change or unit tests we break along the way.
We would want to give ample time for heavy pandas users to run their 3rd-party code based on pandas 2.0-dev to give feedback on whether our assumptions about the impact of changes affect real production code. As a concrete example: integer and boolean Series would be able to accommodate missing data without implicitly casting to float or object NumPy dtype respectively. Since many users will have inserted workarounds / data massaging code because of such rough edges, this may cause code breakage or simply redundancy in some cases. As another example: we should probably remove the .ix indexing attribute altogether. I'm sure many users are still using .ix, but it would be worthwhile to go through such code and decide whether it's really .loc or .iloc.
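To make the casting rough edge concrete, here is a short snippet showing the current behaviour with the default NumPy-backed dtypes (dtype names assume a 64-bit platform):

```python
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)  # int64

# Introducing a missing value silently upcasts integers to float64...
print(s.reindex(range(4)).dtype)  # float64

# ...and booleans to object, since NumPy bool arrays cannot hold NaN
b = pd.Series([True, False])
print(b.reindex(range(3)).dtype)  # object
```

Workarounds for this (sentinel values, explicit casts) are exactly the kind of user code that the 2.0 changes would make redundant.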
My hope would be (being a deadline-motivated person) that we could see a pandas 2.0 alpha release sometime mid- or 2nd half 2017, with a target beta / pre-production QA release in early 2018 or thereabouts. Part of this would be creating a 1.0 to 2.0 migration guide for users.
My biggest concern with pandas in recent years is how not to be held back by strict backwards compatibility and still be able to innovate and stay relevant into the 2020s.
For pandas 2.0 some of the most important issues I've been thinking about are:
- Logical type abstraction layer / decoupling. pandas-only data types (Categorical, DatetimeTZ, Period, etc.) will become equal citizens as compared with data types mapping 1-1 on NumPy numeric dtypes
- Decoupling physical storage to permit non-NumPy data structures inside Series
- Removal of BlockManager and 2D block consolidation in DataFrame, in favor of a native C++ internal table (vector-of-arrays) data structure
- Consistent NA semantics across all data types
- Significantly improved handling of string/UTF8 data (performance, memory use -- elimination of PyObject boxes). From the above 2 items, we could even make all string arrays internally categorical (with the option to explicitly cast to categorical) -- in the database world this is often called dictionary encoding.
- Refactor of most Cython algorithms into C++11/14 templates
- Copy-on-write for Series and DataFrame
- Removal of Panel, ndim > 3 data structures
- Analytical expression VM (for example -- things like df[boolean_arr].groupby(...).agg(...) could be evaluated by a small Numexpr-like VM, not dissimilar to R's dplyr library, with significantly improved memory use and maybe performance too)
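As a rough sketch of the dictionary-encoding idea in the string-handling item above, pandas's existing Categorical already demonstrates the memory effect for low-cardinality string data (exact byte counts vary by platform and pandas version):

```python
import pandas as pd

words = pd.Series(["red", "green", "blue"] * 100000)
cat = words.astype("category")

# Dictionary encoding stores each distinct string once,
# plus a compact integer code per row
plain_bytes = words.memory_usage(deep=True)
dict_bytes = cat.memory_usage(deep=True)
print(plain_bytes, dict_bytes)
```

The proposal would make this representation an internal detail of string storage rather than an opt-in dtype.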
There's a lot to unpack here, but let me know what everyone thinks about these things. The "pandas 2.0" / internals revamp discussion we can tackle in a separate thread or perhaps in a GitHub repo or design folder in the pandas codebase.
Thanks, Wes

_______________________________________________ Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
OK, let me try to collect some of the feedback and give my thoughts.

1) 0.19 and 0.20: I think we should push to release 0.19.0 soon and then plan what we want to add/change/deprecate for 0.20.0, which might otherwise have been 1.0. Since we already pushed back 0.18.2, and there are some significant new patches (merge_asof and variable rolling windows), rather than delaying 0.19.0 it would be good to get this into production before we declare a stable 1.0.

2) We will need to raise a significant amount of money for pandas (I estimate in the ballpark of US $300-500K; better to have too much than too little) to be able to pursue the pandas 2.0 plan wholeheartedly. I would like to dedicate a minimum of 5-10 hours per week to it in 2017, but this will not be sufficient to do everything (I am also a human being, and have a day job). It would be better to collaborate with one or two good freelance developers (with proven experience in C++ and Python) who can spend at least 50% of their time on pandas next year. I am going to start spending some time on design documentation so that we can start resolving some of the design questions and tradeoffs (not all of these decisions will be easy). We'll work on this offline and look to start soliciting funding (if anyone with the ability to write checks is reading, feel free to contact me offline).

3) I agree we will need to come up with a development process that facilitates invasive modification of pandas internals while also supporting production users of pandas 1.X. Cherry-picking bug fixes into the pandas 2.x branch will grow increasingly complicated; we need to factor this into our process (for example: we might collect all the unit tests for bug fixes, assuming they rely on definitely stable behavior, into a "to fix" folder so that we can return and adapt the bug fixes once the 2.x branch is getting more stable). To have developers both maintaining 1.x and trying to drive forward the 2.x branch at the same time does not seem realistic; we should talk to the IPython/Jupyter devs to understand how they handled this through their long-lived IPython 1.0 branch IIRC (see http://ipython.org/news.html#ipython-1-0).

4) My goal, which I think we're all aligned on, would be for pandas 2.0 to be a drop-in replacement for 90-95% of normal pandas use. Many power users will have embraced some of the idiosyncrasies of pandas's implementation details, but I think some of the changes (e.g. missing data consistency, copy-on-write / improved semantics around memory ownership and views) will be welcomed. We should clearly document (in a dedicated "pandas's internal relationship with NumPy" document) and maintain very tight contracts around what kinds of zero-copy NumPy interoperability are supported. It is not clear to me, for example, that arrays of Python string/unicode objects are a NumPy use case that is especially important to preserve, but most numeric data use cases are. This will also be helpful for power users to understand the nuances and how things are going to stay the same or change (for example: boolean and integer arrays with NAs will probably not be zero-copyable to NumPy arrays).

We should maybe start side threads about each of these items. Just deciding what we want to deprecate or do in 0.20 aka 1.0 is a large enough task.

Thanks all
Wes

On Wed, Jul 27, 2016 at 8:39 PM, G Young <gfyoung17@gmail.com> wrote:
1) I would be in favour of releasing 0.19.0, in part because we already pushed back and ultimately forwent 0.18.2. I think these plans are better served for the release after this one, to give more time to map this out but also to push out the changes that have already been made in preparation for this release.
2) In terms of organisation, I wonder if we might be better served by reorganising the way PRs are reviewed in the period between one release and the next, instead of having these parallel tracks of development, in light of the concern brought up by @jorisvandenbossche. Perhaps rather than just reviewing PRs as they come in, we could specify which types of PRs should be submitted during certain periods of time.
For example, a large chunk of the period could be devoted to accepting enhancements / new features after which the remaining time before a release could be devoted to just organisation / refactoring / deprecations / what have you (maybe include bug fixes too). That way we could have a contiguous block of time to focus on stabilising and tidying up the release. It would also allow for the refactoring to take place (perhaps incrementally) without the concern of being destabilised by a new feature.
For this to work, this would have to be clearly stated in the contributing docs as well as circulated in emails to pandas-dev AND other related groups so that way people know what's going on in terms of the development cycle.
On Wed, Jul 27, 2016 at 7:51 PM, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
Wes, thanks for your mail!
I like the idea of first releasing a pandas 1.0 before the 'big refactor'. We for sure know that this will take a while to stabilize (even with a lot of resources), and I think the idea was to provide a kind of LTS release. In that regard, it is just clearer to name this pandas 1.x rather than 0.19.x.
Maybe we can start a separate thread to discuss this 1.0, as there are of course some questions to discuss:
- do we first release 0.19 (we didn't specifically discuss this, but I think the rough idea was to have a release candidate somewhere in August), or do we directly aim at 1.0?
- are there certain changes we want to make before 1.0 that are feasible in the short term?
- are there some of the current ideas for deprecations that we should exclude/include for this release? (e.g. I think deprecating PanelND (as just landed in master) is good, but the idea of deprecating Panel should rather wait until 2.0?)
- ...
How exactly to tackle those bug-fix releases / LTS branch is also something that can be discussed, but I would not worry too much about that (there are enough examples of other projects doing something similar; we just have to find a process that suits us).
Regarding 1), I agree it is a good idea to push a 0.19.0 release soonish, and we can then discuss what we further want to do (or not do) for the 1.0 release. I am on holiday for the coming week and a half, but afterwards I will also focus on getting 0.19.0 out. Is a release candidate in the last week of August maybe a good deadline?

Joris

2016-07-29 0:15 GMT+02:00 Wes McKinney <wesmckinn@gmail.com>:
OK, let me try to collect some of the feedback and give my thoughts
1) 0.19 and 0.20: I think we should push to release 0.19.0 soon and then plan what we want to add/change/deprecate for 1.0 which might otherwise have been 1.0. I think delaying 0.19.0 since we already pushed back 0.18.2, and there are some significant new patches (asof_merge and variable rolling windows), it would be good to get this into production before we declare a stable 1.0.
2) We will need to raise a significant amount of money for pandas (I estimate in the ballpark of US $300-500K -- better to have too much than too little) to be able to pursue the pandas 2.0 plan wholeheartedly. I would like to dedicate a minimum 5-10 hours per week to it in 2017 but this will not be sufficient to do everything (I am also a human being, and have a day job). It would be better to collaborate with one or two good freelance developers (with proven experience in C++ and Python) who are spending at least 50% of their time on pandas next year. I am going to start spending some time on design documentation so that we can start resolving some of the design questions and tradeoffs (not all of these decisions will be easy). We'll work on this offline and look to start soliciting funding (if anyone with the ability to write checks is reading, feel free to contact me offline).
3) I agree we will need to come up with a development process that facilitates both an invasive modification of pandas internals while also supporting production users of pandas 1.X. Cherry-picking bug fixes into the pandas 2.x branch will grow increasingly complicated; we need to factor this into our process (for example: we might collect all the unit tests for bug fixes -- assuming they rely on definitely stable behavior -- into a "to fix" folder so that we can return and adapt the bug fixes once the 2.x branch is getting more stable). To have developers both maintaining 1.x and trying to drive forward the 2.x branch at the same time does not seem realistic -- we should talk to the IPython/Jupyter devs to understand how they handled this through their long-lived IPython 1.0 branch IIRC (see http://ipython.org/news.html#ipython-1-0).
4) My goal, which I think we're all aligned on, would be for pandas 2.0 to be a drop-in replacement for 90-95% of normal pandas use. Many power users will have embraced some of the idiosyncrasies of pandas's implementation details, but I think some of the changes (e.g. missing data consistency, copy-on-write / improved semantics around memory ownership and views) will be welcomed. We should clearly document (in a dedicated "pandas's internal relationship with NumPy" document) and maintain very tight contracts around what kinds of zero-copy NumPy interoperability are supported -- it is not clear to me for example that arrays of Python string/unicode objects are a NumPy use case that is especially important to preserve, but most numeric data use cases are. This will also be helpful for power users to understand the nuances and how things are going to stay the same or change (for example: boolean and integer arrays with NAs will probably not be zero-copyable to NumPy arrays).
We should maybe start side threads about each of these items. Just deciding what we want to deprecate or do in 0.20 aka 1.0 is a large enough task.
Thanks all Wes
1) I would be in favour of releasing 0.19.0 in part because we already pushed back and actually forgone 0.18.2. I think these plans are better served for the release after this one to give more time to map this but also to push out the changes that have already been made in preparation for
release.
2) In terms of organisation, I wonder if we might be better served reorganising the way in which PR's are reviewed during the time period between one release and the next instead of having these parallel tracks of development in light of the concern brought up by @jorisvanenbossche. Perhaps rather than just reviewing PR's as they come in, specify which types of PR's should be submitted during certain periods of time.
For example, a large chunk of the period could be devoted to accepting enhancements / new features after which the remaining time before a release could be devoted to just organisation / refactoring / deprecations / what have you (maybe include bug fixes too). That way we could have a contiguous block of time to focus on stabilising and tidying up the release. It would also allow for the refactoring to take place (perhaps incrementally) without the concern of being destabilised by a new feature.
For this to work, this would have to be clearly stated in the contributing docs as well as circulated in emails to pandas-dev AND other related groups so that way people know what's going on in terms of the development cycle.
On Wed, Jul 27, 2016 at 7:51 PM, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
Wes, thanks for your mail!
I like the idea of first releasing a pandas 1.0 before the 'big
refactor'.
We for sure know that this will take a while to stabilize (even with a lot of resources), and I think the idea was to provide a kind of LTS release. In that regard, it is just clearer to name this pandas 1.x then 0.19.x.
Maybe we can start a separate thread to discuss on this 1.0, as there are of course some questions to discuss: - do we first release 0.19 (we didn't specifically discuss this, but I think the rough idea was to have somewhere in august a release candidate), or do we directly aim at 1.0? - are there some certain changes we want to do before 1.0 that are feasible in the short term? - are there some of the current ideas of deprecations that we should exclude/include for this release? (eg I think deprecating PanelND (as just landed in master) is good, but the idea of deprecating Panel should rather wait until 2.0?) - ...
How exactly to tackle those bug fix releases / LTS branch, is also something that can be discussed, but I would not worry too much about
(there are enough examples of other projects to do something similar, we just have to search for a process that suits us).
What I think a more important issue or problem with this process is the community of contributors. If we would effectively have a period of about two years (before a final 2.0 release) where for the current (1.0) version only certain bug-fixes are considered, but on the other hand it is still difficult to contribute to the new version. We would maybe have to say no to many of the PRs or enhancement ideas. Such a situation could hinder the process of community contributions and participation. And there are currently a lot of contributions. As Jeff also said, the current active contributors are barely keeping up with managing all issues and pull requests. I have worked the last few weeks more on pandas (thanks to Continuum), and indeed I spent most of my time answering issues and reviewing PRs, and hardly have any time to do much coding myself. But of course this is also a choice that I currently make. And I (we) could also make the choice to focus more on pandas 1.0/2.0 related issues, or try to steer some of the active contributors to that.
I also have some concerns about the compatibility with the rest of the ecosystem, but at the same time it is clear I think that there should be some kind of refactor, and it is in the further elaboration of the roadmap that such concerns can be addressed.
Joris
2016-07-27 12:04 GMT+02:00 Jeff Reback <jeffreback@gmail.com>:
I applaud the vision and ambition for the roadmap of the future of pandas.
However, the resources are lacking for much of these changes. Currently pandas is just barely keeping up with the (recently increased) user
flow
of pull-requests, not to mention the issue reports. These are all great indicators of community use and exercising the edge cases.
A roadmap is an excellent start, but the resource question needs to be front and center.
The current process *could* evolve into LTS. In 0.19.0, lots of
On Wed, Jul 27, 2016 at 8:39 PM, G Young <gfyoung17@gmail.com> wrote: this that progress
towards removing older code (and of course deprecating things) is happening. An aggressive push of this into 0.20.0 will go a long ways towards de-facto establishing 1.0 / LTS. (and maybe that's what we simply call 0.20.0).
I would agree we could simply release 1.0 / LTS without adding any 'new' features (like fixed getitem indexing and such).
I would like to see 2.0 with a user facing API that is a drop-in replacement (though allowing for some breaking changes that are NOT back-compat, e.g. getitem indexing). I think it would be acceptable to break the back-end API (meaning to numpy) though.
For the resource question, as I have mentioned off-list, I will format this roadmap in order for pandas to support a fund-raising effort to garner resources for these changes.
Jeff
On Tue, Jul 26, 2016 at 5:13 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
I know I expressed concerns about cross-compatibility with the rest of the SciPy ecosystem before (especially xarray), but this plan sounds
very
solid to me. Flexible data types in N-dimensional arrays are important for other use cases, but also not really a problem for pandas.
A separate 2.0 release will let us make the major breaking changes to the pandas data model necessary for it to work well in the long term. There are a few other API warts that will be able to clean up this way (detailed in github.com/pydata/pandas/issues/10000), indexing on DataFrames being the most obvious one.
On Tue, Jul 26, 2016 at 1:51 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
hi folks,
As a continuation of ongoing discussions on GitHub and on the mailing list around deprecations and future innovation and internal
reworkings
of pandas, I had a couple of ideas to share that I am looking for feedback on.
As far as pandas 0.19.x today, I would like to propose that we consider releasing the project as pandas 1.0 in the next major release or the one after. The Python community does have a penchant for "eternal betas", but after all the hard work of the core developers and community over the last 5 years, I think we can safely consider making a stable 1.X production release.
If we do decide to release pandas 1.0, I also propose that we strongly consider making 1.X an LTS / Long Term Support branch where we can continue to make releases, but bug fixes and documentation improvements only. Or, we can add new features, but on an extremely conservative basis. This might require some changes to development process, so looking for feedback on this.
If we commit to this path, I would suggest that we start a pandas-2.0 integration branch where we can begin more seriously planning and executing on
- Cleanup and removal of years' worth of accumulated cruft / legacy code - Removal of deprecated features - Series and DataFrame internals revamp.
I had hoped that 2016 would offer me more time to work on the internals revamp, but between my day job and the 2nd ed of "Python for Data Analysis" that turned out to be a little too ambitious. I have been almost continuously thinking about how to go about this though, and it might be good to figure out a process where we can start documenting and coming up with a more granular development roadmap for this. Part of this will be carefully documenting any APIs we change or unit tests we break along the way.
We would want to give ample time for heavy pandas users to run their 3rd-party code based on pandas 2.0-dev to give feedback on whether our assumptions about the impact of changes affect real production code. As a concrete example: integer and boolean Series would be able to accommodate missing data without implicitly casting to float or object NumPy dtype respectively. Since many users will have inserted workarounds / data massaging code because of such rough edges, this may cause code breakage or simply redundancy in some cases. As another example: we should probably remove the .ix indexing attribute altogether. I'm sure many users are still using .ix, but it would be worthwhile to go through such code and decide whether it's really .loc or .iloc.
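To make the first example concrete, here is the implicit casting in question as it behaves today (a quick sketch; exact warnings vary by version):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], dtype="int64")

# Introducing a missing value today implicitly upcasts int64 -> float64
s.loc[3] = np.nan
print(s.dtype)  # float64

b = pd.Series([True, False])

# ... and bool -> object
b.loc[2] = np.nan
print(b.dtype)  # object

# Under the pandas 2.0 proposal, both would stay logically
# integer/boolean with first-class NA support.
```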
My hope would be (being a deadline-motivated person) that we could see a pandas 2.0 alpha release sometime mid- or 2nd half 2017, with a target beta / pre-production QA release in early 2018 or thereabouts. Part of this would be creating a 1.0 to 2.0 migration guide for users.
My biggest concern with pandas in recent years is how not to be held back by strict backwards compatibility and still be able to innovate and stay relevant into the 2020s.
For pandas 2.0 some of the most important issues I've been thinking about are:
- Logical type abstraction layer / decoupling: pandas-only data types (Categorical, DatetimeTZ, Period, etc.) will become first-class citizens, on equal footing with data types that map 1-1 onto NumPy numeric dtypes
- Decoupling physical storage to permit non-NumPy data structures inside Series
- Removal of BlockManager and 2D block consolidation in DataFrame, in favor of a native C++ internal table (vector-of-arrays) data structure
- Consistent NA semantics across all data types
- Significantly improved handling of string/UTF8 data (performance, memory use -- elimination of PyObject boxes). From the above 2 items, we could even make all string arrays internally categorical (with the option to explicitly cast to categorical) -- in the database world this is often called dictionary encoding.
- Refactor of most Cython algorithms into C++11/14 templates
- Copy-on-write for Series and DataFrame
- Removal of Panel, ndim > 3 data structures
- Analytical expression VM (for example -- things like df[boolean_arr].groupby(...).agg(...) could be evaluated by a small Numexpr-like VM, not dissimilar to R's dplyr library, with significantly improved memory use and maybe performance too)
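For a taste of the last item: today's df.query / pd.eval already evaluate a predicate as a single expression (optionally dispatching to numexpr), though the surrounding groupby/agg pipeline is still executed eagerly. A 2.0-style expression VM would fuse the whole pipeline:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({"key": rng.choice(list("abc"), size=1000),
                   "val": rng.randn(1000)})

# Eager evaluation: the boolean mask and the filtered frame
# are both fully materialized as intermediates
eager = df[df["val"] > 0].groupby("key")["val"].mean()

# query() evaluates the predicate as one expression; an expression
# VM could extend this to the entire filter -> groupby -> agg chain
fused = df.query("val > 0").groupby("key")["val"].mean()

assert eager.equals(fused)
```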
There's a lot to unpack here, but let me know what everyone thinks about these things. The "pandas 2.0" / internals revamp discussion we can tackle in a separate thread, or perhaps in a GitHub repo or design folder in the pandas codebase.
Thanks,
Wes
_______________________________________________
Pandas-dev mailing list
Pandas-dev@python.org
https://mail.python.org/mailman/listinfo/pandas-dev
Crazy thought. Perhaps y'all could put together a road map and the resources you will need to get it done (as in money for FTEs). I would like to see NumFOCUS try to push our sponsors to fund more FTEs for projects like this. If we have a road map in hand it makes the conversations much more tangible.
-- Andy
On Sun, Jul 31, 2016 at 5:03 PM, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
Regarding 1), I agree it is a good idea to push a 0.19.0 release soonish, and we can then discuss what we further want to do (or not to do) for the 1.0 release. I am on holidays the coming week and a half, but afterwards I will also focus on getting 0.19.0 out. A release candidate in the last week of August is maybe a good deadline?
Joris
2016-07-29 0:15 GMT+02:00 Wes McKinney <wesmckinn@gmail.com>:
OK, let me try to collect some of the feedback and give my thoughts
1) 0.19 and 0.20: I think we should push to release 0.19.0 soon and then plan what we want to add/change/deprecate for 0.20, which might otherwise have been 1.0. Rather than delaying 0.19.0 (we already pushed back 0.18.2, and there are some significant new patches: asof_merge and variable rolling windows), it would be good to get these into production before we declare a stable 1.0.
2) We will need to raise a significant amount of money for pandas (I estimate in the ballpark of US $300-500K -- better to have too much than too little) to be able to pursue the pandas 2.0 plan wholeheartedly. I would like to dedicate a minimum 5-10 hours per week to it in 2017 but this will not be sufficient to do everything (I am also a human being, and have a day job). It would be better to collaborate with one or two good freelance developers (with proven experience in C++ and Python) who are spending at least 50% of their time on pandas next year. I am going to start spending some time on design documentation so that we can start resolving some of the design questions and tradeoffs (not all of these decisions will be easy). We'll work on this offline and look to start soliciting funding (if anyone with the ability to write checks is reading, feel free to contact me offline).
3) I agree we will need to come up with a development process that facilitates both an invasive modification of pandas internals while also supporting production users of pandas 1.X. Cherry-picking bug fixes into the pandas 2.x branch will grow increasingly complicated; we need to factor this into our process (for example: we might collect all the unit tests for bug fixes -- assuming they rely on definitely stable behavior -- into a "to fix" folder so that we can return and adapt the bug fixes once the 2.x branch is getting more stable). To have developers both maintaining 1.x and trying to drive forward the 2.x branch at the same time does not seem realistic -- we should talk to the IPython/Jupyter devs to understand how they handled this through their long-lived IPython 1.0 branch IIRC (see http://ipython.org/news.html#ipython-1-0).
4) My goal, which I think we're all aligned on, would be for pandas 2.0 to be a drop-in replacement for 90-95% of normal pandas use. Many power users will have embraced some of the idiosyncrasies of pandas's implementation details, but I think some of the changes (e.g. missing data consistency, copy-on-write / improved semantics around memory ownership and views) will be welcomed. We should clearly document (in a dedicated "pandas's internal relationship with NumPy" document) and maintain very tight contracts around what kinds of zero-copy NumPy interoperability are supported -- it is not clear to me for example that arrays of Python string/unicode objects are a NumPy use case that is especially important to preserve, but most numeric data use cases are. This will also be helpful for power users to understand the nuances and how things are going to stay the same or change (for example: boolean and integer arrays with NAs will probably not be zero-copyable to NumPy arrays).
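To illustrate the kind of zero-copy contract meant here (today's behavior, which a 2.0 interop document would pin down explicitly):

```python
import numpy as np
import pandas as pd

# A mixed-dtype DataFrame cannot hand its memory back zero-copy:
# .values must allocate a fresh array in a common dtype
df = pd.DataFrame({"i": [1, 2], "f": [1.5, 2.5]})
arr = df.values
print(arr.dtype)                                  # float64
assert not np.shares_memory(arr, df["f"].values)  # a copy, not a view

# A homogeneous numeric Series, by contrast, can expose its buffer
s = pd.Series([1.0, 2.0, 3.0])
assert np.shares_memory(s.values, np.asarray(s))
```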
We should maybe start side threads about each of these items. Just deciding what we want to deprecate or do in 0.20 aka 1.0 is a large enough task.
Thanks all,
Wes
1) I would be in favour of releasing 0.19.0, in part because we already pushed back and actually forwent 0.18.2. I think these plans are better served by the release after this one, to give more time to map this out but also to push out the changes that have already been made in preparation for this release.
2) In terms of organisation, I wonder if we might be better served reorganising the way in which PRs are reviewed during the time period between one release and the next, instead of having these parallel tracks of development, in light of the concern brought up by @jorisvandenbossche. Perhaps rather than just reviewing PRs as they come in, specify which types of PRs should be submitted during certain periods of time.
For example, a large chunk of the period could be devoted to accepting enhancements / new features, after which the remaining time before a release could be devoted to just organisation / refactoring / deprecations / what have you (maybe include bug fixes too). That way we could have a contiguous block of time to focus on stabilising and tidying up the release. It would also allow for the refactoring to take place (perhaps incrementally) without the concern of being destabilised by a new feature.
For this to work, this would have to be clearly stated in the contributing docs as well as circulated in emails to pandas-dev AND other related groups so that people know what's going on in terms of the development cycle.
On Wed, Jul 27, 2016 at 7:51 PM, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
Wes, thanks for your mail!
I like the idea of first releasing a pandas 1.0 before the 'big refactor'. We for sure know that this will take a while to stabilize (even with a lot of resources), and I think the idea was to provide a kind of LTS release. In that regard, it is just clearer to name this pandas 1.x than 0.19.x.
Maybe we can start a separate thread to discuss this 1.0, as there are of course some questions to discuss:
- do we first release 0.19 (we didn't specifically discuss this, but I think the rough idea was to have a release candidate somewhere in August), or do we directly aim at 1.0?
- are there certain changes we want to do before 1.0 that are feasible in the short term?
- are there some of the current ideas of deprecations that we should exclude/include for this release? (e.g. I think deprecating PanelND (as just landed in master) is good, but the idea of deprecating Panel should rather wait until 2.0?)
- ...
How exactly to tackle those bug fix releases / LTS branch is also something that can be discussed, but I would not worry too much about that (there are enough examples of other projects doing something similar; we just have to search for a process that suits us).
What I think is a more important issue with this process is the community of contributors. We could effectively have a period of about two years (before a final 2.0 release) where for the current (1.0) version only certain bug fixes are considered, while on the other hand it is still difficult to contribute to the new version; we would maybe have to say no to many of the PRs or enhancement ideas. Such a situation could hinder community contributions and participation. And there are currently a lot of contributions. As Jeff also said, the current active contributors are barely keeping up with managing all issues and pull requests. I have worked the last few weeks more on pandas (thanks to Continuum), and indeed I spent most of my time answering issues and reviewing PRs, and hardly have any time to do much coding myself. But of course this is also a choice that I currently make. And I (we) could also make the choice to focus more on pandas 1.0/2.0 related issues, or try to steer some of the active contributors to that.
I also have some concerns about the compatibility with the rest of the ecosystem, but at the same time it is clear I think that there should be some kind of refactor, and it is in the further elaboration of the roadmap that such concerns can be addressed.
Joris
2016-07-27 12:04 GMT+02:00 Jeff Reback <jeffreback@gmail.com>:
I applaud the vision and ambition for the roadmap of the future of pandas.
However, the resources are lacking for many of these changes. Currently pandas is just barely keeping up with the (recently increased) flow of user pull-requests, not to mention the issue reports. These are all great indicators of community use and exercising of the edge cases.
A roadmap is an excellent start, but the resource question needs to be front and center.
The current process *could* evolve into LTS. In 0.19.0, lots of progress towards removing older code (and of course deprecating things) is happening. An aggressive push of this into 0.20.0 will go a long way towards de-facto establishing 1.0 / LTS (and maybe that's what we simply call 0.20.0).
I would agree we could simply release 1.0 / LTS without adding any 'new' features (like fixed getitem indexing and such).
I would like to see 2.0 with a user facing API that is a drop-in replacement (though allowing for some breaking changes that are NOT back-compat, e.g. getitem indexing). I think it would be acceptable to break the back-end API (meaning to numpy) though.
For the resource question, as I have mentioned off-list, I will format this roadmap in order for pandas to support a fund-raising effort to garner resources for these changes.
Jeff
On Tue, Jul 26, 2016 at 5:13 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
I know I expressed concerns about cross-compatibility with the rest of the SciPy ecosystem before (especially xarray), but this plan sounds very solid to me. Flexible data types in N-dimensional arrays are important for other use cases, but also not really a problem for pandas.
hey Andy -- that makes sense to me. What I'm hoping to do this month is scope out a more granular plan for the specific things (problems and their possible solutions, with lists of pros/cons of various approaches) we want to accomplish in a pandas 2.x effort and make sure we all agree (up to 70-80% of the big picture items). If we're going to raise a significant amount of money we owe it to the donors to explain how the money will be directed, and we won't want to be dealing with a lot of uncertainty about the roadmap once we have engaged FTEs beginning to help with moving things forward.
- Wes
On Mon, Aug 1, 2016 at 1:54 PM, Andy Ray Terrel <andy.terrel@gmail.com> wrote:
Crazy thought.
Perhaps ya'll could put together a road map and resources you will need to get it done (as in money for FTEs). I would like to see NumFOCUS try to push our sponsors to fund more FTEs for projects like this. If we have a road map in hand it makes the conversations much more tangible.
-- Andy
On Sun, Jul 31, 2016 at 5:03 PM, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
Regarding 1), I agree it is a good idea to push a 0.19.0 release soonish, en we can then discuss what we further want to do (or not to do) for the 1.0 release. I am on holidays the coming week and a half, but afterwards I will also focus on getting 0.19.0 out. A release candidate in the last week of August is maybe a good deadline?
Joris
2016-07-29 0:15 GMT+02:00 Wes McKinney <wesmckinn@gmail.com>:
OK, let me try to collect some of the feedback and give my thoughts
1) 0.19 and 0.20: I think we should push to release 0.19.0 soon and then plan what we want to add/change/deprecate for 1.0 which might otherwise have been 1.0. I think delaying 0.19.0 since we already pushed back 0.18.2, and there are some significant new patches (asof_merge and variable rolling windows), it would be good to get this into production before we declare a stable 1.0.
2) We will need to raise a significant amount of money for pandas (I estimate in the ballpark of US $300-500K -- better to have too much than too little) to be able to pursue the pandas 2.0 plan wholeheartedly. I would like to dedicate a minimum 5-10 hours per week to it in 2017 but this will not be sufficient to do everything (I am also a human being, and have a day job). It would be better to collaborate with one or two good freelance developers (with proven experience in C++ and Python) who are spending at least 50% of their time on pandas next year. I am going to start spending some time on design documentation so that we can start resolving some of the design questions and tradeoffs (not all of these decisions will be easy). We'll work on this offline and look to start soliciting funding (if anyone with the ability to write checks is reading, feel free to contact me offline).
3) I agree we will need to come up with a development process that facilitates both an invasive modification of pandas internals while also supporting production users of pandas 1.X. Cherry-picking bug fixes into the pandas 2.x branch will grow increasingly complicated; we need to factor this into our process (for example: we might collect all the unit tests for bug fixes -- assuming they rely on definitely stable behavior -- into a "to fix" folder so that we can return and adapt the bug fixes once the 2.x branch is getting more stable). To have developers both maintaining 1.x and trying to drive forward the 2.x branch at the same time does not seem realistic -- we should talk to the IPython/Jupyter devs to understand how they handled this through their long-lived IPython 1.0 branch IIRC (see http://ipython.org/news.html#ipython-1-0).
4) My goal, which I think we're all aligned on, would be for pandas 2.0 to be a drop-in replacement for 90-95% of normal pandas use. Many power users will have embraced some of the idiosyncrasies of pandas's implementation details, but I think some of the changes (e.g. missing data consistency, copy-on-write / improved semantics around memory ownership and views) will be welcomed. We should clearly document (in a dedicated "pandas's internal relationship with NumPy" document) and maintain very tight contracts around what kinds of zero-copy NumPy interoperability are supported -- it is not clear to me for example that arrays of Python string/unicode objects are a NumPy use case that is especially important to preserve, but most numeric data use cases are. This will also be helpful for power users to understand the nuances and how things are going to stay the same or change (for example: boolean and integer arrays with NAs will probably not be zero-copyable to NumPy arrays).
We should maybe start side threads about each of these items. Just deciding what we want to deprecate or do in 0.20 aka 1.0 is a large enough task.
Thanks all Wes
On Wed, Jul 27, 2016 at 8:39 PM, G Young <gfyoung17@gmail.com> wrote:
1) I would be in favour of releasing 0.19.0 in part because we already pushed back and actually forgone 0.18.2. I think these plans are better served for the release after this one to give more time to map this but also to push out the changes that have already been made in preparation for this release.
2) In terms of organisation, I wonder if we might be better served reorganising the way in which PR's are reviewed during the time period between one release and the next instead of having these parallel tracks of development in light of the concern brought up by @jorisvanenbossche. Perhaps rather than just reviewing PR's as they come in, specify which types of PR's should be submitted during certain periods of time.
For example, a large chunk of the period could be devoted to accepting enhancements / new features after which the remaining time before a release could be devoted to just organisation / refactoring / deprecations / what have you (maybe include bug fixes too). That way we could have a contiguous block of time to focus on stabilising and tidying up the release. It would also allow for the refactoring to take place (perhaps incrementally) without the concern of being destabilised by a new feature.
For this to work, this would have to be clearly stated in the contributing docs as well as circulated in emails to pandas-dev AND other related groups so that way people know what's going on in terms of the development cycle.
On Wed, Jul 27, 2016 at 7:51 PM, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
Wes, thanks for your mail!
I like the idea of first releasing a pandas 1.0 before the 'big refactor'. We for sure know that this will take a while to stabilize (even with a lot of resources), and I think the idea was to provide a kind of LTS release. In that regard, it is just clearer to name this pandas 1.x then 0.19.x.
Maybe we can start a separate thread to discuss on this 1.0, as there are of course some questions to discuss: - do we first release 0.19 (we didn't specifically discuss this, but I think the rough idea was to have somewhere in august a release candidate), or do we directly aim at 1.0? - are there some certain changes we want to do before 1.0 that are feasible in the short term? - are there some of the current ideas of deprecations that we should exclude/include for this release? (eg I think deprecating PanelND (as just landed in master) is good, but the idea of deprecating Panel should rather wait until 2.0?) - ...
How exactly to tackle those bug fix releases / LTS branch, is also something that can be discussed, but I would not worry too much about that (there are enough examples of other projects to do something similar, we just have to search for a process that suits us).
What I think a more important issue or problem with this process is the community of contributors. If we would effectively have a period of about two years (before a final 2.0 release) where for the current (1.0) version only certain bug-fixes are considered, but on the other hand it is still difficult to contribute to the new version. We would maybe have to say no to many of the PRs or enhancement ideas. Such a situation could hinder the process of community contributions and participation. And there are currently a lot of contributions. As Jeff also said, the current active contributors are barely keeping up with managing all issues and pull requests. I have worked the last few weeks more on pandas (thanks to Continuum), and indeed I spent most of my time answering issues and reviewing PRs, and hardly have any time to do much coding myself. But of course this is also a choice that I currently make. And I (we) could also make the choice to focus more on pandas 1.0/2.0 related issues, or try to steer some of the active contributors to that.
I also have some concerns about the compatibility with the rest of the ecosystem, but at the same time it is clear I think that there should be some kind of refactor, and it is in the further elaboration of the roadmap that such concerns can be addressed.
Joris
2016-07-27 12:04 GMT+02:00 Jeff Reback <jeffreback@gmail.com>:
I applaud the vision and ambition for the roadmap of the future of pandas.
However, the resources are lacking for much of these changes. Currently pandas is just barely keeping up with the (recently increased) user flow of pull-requests, not to mention the issue reports. These are all great indicators of community use and exercising the edge cases.
A roadmap is an excellent start, but the resource question needs to be front and center.
The current process *could* evolve into LTS. In 0.19.0, lots of progress towards removing older code (and of course deprecating things) is happening. An aggressive push of this into 0.20.0 will go a long ways towards de-facto establishing 1.0 / LTS. (and maybe that's what we simply call 0.20.0).
I would agree we could simply release 1.0 / LTS without adding any 'new' features (like fixed getitem indexing and such).
I would like to see 2.0 with a user facing API that is a drop-in replacement (though allowing for some breaking changes that are NOT back-compat, e.g. getitem indexing). I think it would be acceptable to break the back-end API (meaning to numpy) though.
For the resource question, as I have mentioned off-list, I will format this roadmap in order for pandas to support a fund-raising effort to garner resources for these changes.
Jeff
On Tue, Jul 26, 2016 at 5:13 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
> I know I expressed concerns about cross-compatibility with the rest of the
> SciPy ecosystem before (especially xarray), but this plan sounds very solid
> to me. Flexible data types in N-dimensional arrays are important for other
> use cases, but also not really a problem for pandas.
>
> A separate 2.0 release will let us make the major breaking changes to the
> pandas data model necessary for it to work well in the long term. There are
> a few other API warts that we will be able to clean up this way (detailed
> in github.com/pydata/pandas/issues/10000), indexing on DataFrames being the
> most obvious one.
>
> On Tue, Jul 26, 2016 at 1:51 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>
>> [...]
>>
>> We would want to give ample time for heavy pandas users to run their
>> 3rd-party code based on pandas 2.0-dev to give feedback on whether our
>> assumptions about the impact of changes affect real production code.
>> As a concrete example: integer and boolean Series would be able to
>> accommodate missing data without implicitly casting to float or object
>> NumPy dtype respectively. Since many users will have inserted
>> workarounds / data massaging code because of such rough edges, this
>> may cause code breakage or simply redundancy in some cases. As another
>> example: we should probably remove the .ix indexing attribute
>> altogether. I'm sure many users are still using .ix, but it would be
>> worthwhile to go through such code and decide whether it's really .loc
>> or .iloc.
>>
>> My hope would be (being a deadline-motivated person) that we could see
>> a pandas 2.0 alpha release sometime mid- or 2nd-half 2017, with a
>> target beta / pre-production QA release in early 2018 or thereabouts.
>> Part of this would be creating a 1.0 to 2.0 migration guide for users.
>>
>> My biggest concern with pandas in recent years is how not to be held
>> back by strict backwards compatibility and still be able to innovate
>> and stay relevant into the 2020s.
>>
>> For pandas 2.0 some of the most important issues I've been thinking
>> about are:
>>
>> - Logical type abstraction layer / decoupling. pandas-only data types
>>   (Categorical, DatetimeTZ, Period, etc.) will become equal citizens
>>   compared with data types mapping 1-1 onto NumPy numeric dtypes
>>
>> - Decoupling physical storage to permit non-NumPy data structures
>>   inside Series
>>
>> - Removal of BlockManager and 2D block consolidation in DataFrame, in
>>   favor of a native C++ internal table (vector-of-arrays) data structure
>>
>> - Consistent NA semantics across all data types
>>
>> - Significantly improved handling of string/UTF8 data (performance,
>>   memory use -- elimination of PyObject boxes). From the above 2 items,
>>   we could even make all string arrays internally categorical (with the
>>   option to explicitly cast to categorical) -- in the database world
>>   this is often called dictionary encoding.
>>
>> - Refactor of most Cython algorithms into C++11/14 templates
>>
>> - Copy-on-write for Series and DataFrame
>>
>> - Removal of Panel, ndim > 3 data structures
>>
>> - Analytical expression VM (for example -- things like
>>   df[boolean_arr].groupby(...).agg(...) could be evaluated by a small
>>   Numexpr-like VM, not dissimilar to R's dplyr library, with
>>   significantly improved memory use and maybe performance too)
>>
>> There's a lot to unpack here, but let me know what everyone thinks
>> about these things. The "pandas 2.0" / internals revamp discussion we
>> can tackle in a separate thread or perhaps in a GitHub repo or design
>> folder in the pandas codebase.
>>
>> Thanks,
>> Wes
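The dtype rough edges and the dictionary-encoding idea quoted above are easy to demonstrate. A short sketch using present-day pandas, where the integer-with-NA proposal eventually landed as the nullable "Int64" extension dtype:

```python
import pandas as pd

# Missing data forces implicit upcasts in classic NumPy-backed Series:
ints = pd.Series([1, 2, None])
print(ints.dtype)    # float64 -- the integers were upcast to hold NaN

bools = pd.Series([True, False, None])
print(bools.dtype)   # object -- booleans fall back to boxed Python objects

# The nullable extension dtypes that eventually shipped avoid the upcast:
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)         # Int64
print(nullable.isna().sum())  # 1

# "Dictionary encoding" for strings, via Categorical: each distinct string
# is stored once, plus an array of small integer codes.
words = pd.Series(["spam", "eggs", "spam", "spam", "eggs"] * 1000)
encoded = words.astype("category")
print(list(encoded.cat.categories))  # ['eggs', 'spam']
print(words.memory_usage(deep=True) > encoded.memory_usage(deep=True))  # True
```

The memory comparison at the end is exactly the win Wes describes: 5000 boxed Python strings collapse to two stored values plus 5000 one-byte codes.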
_______________________________________________ Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
Masaaki -- on your point re: accepting new features into the 1.x branch. The main issue is how we can keep a pandas 2.0 branch (which will be unstable for the first 3-6 months of its life) relatively in sync with 1.x until the 2.0 branch stabilizes. The worst case scenario is that you have to do double the amount of work for each pull request (essentially: independent patches to 1.x and 2.x), but if it could be reduced to 1.5x as much work then perhaps that's OK. Even "forward-porting" bug fixes will be a challenge. We shouldn't allow these things to halt progress on advancing the library internals to a more sustainable / future-proof place. Our problem is not unlike the Python language moratorium instituted in 2009: https://www.python.org/dev/peps/pep-3003/. - Wes On Mon, Aug 1, 2016 at 2:01 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
hey Andy -- that makes sense to me. What I'm hoping to do this month is scope out a more granular plan for the specific things (problems and their possible solutions with lists of pros/cons of various approaches) we want to accomplish in a pandas 2.x effort and make sure we all agree (up to 70-80% of the big picture items). If we're going to raise a significant amount of money we owe it to the donors to explain how the money will be directed, and we won't want to be dealing with a lot of uncertainty about the roadmap once we have engaged FTEs beginning to help with moving things forward.
- Wes
On Mon, Aug 1, 2016 at 1:54 PM, Andy Ray Terrel <andy.terrel@gmail.com> wrote:
Crazy thought.
Perhaps y'all could put together a road map and the resources you will need to get it done (as in money for FTEs). I would like to see NumFOCUS try to push our sponsors to fund more FTEs for projects like this. If we have a road map in hand it makes the conversations much more tangible.
-- Andy
On Sun, Jul 31, 2016 at 5:03 PM, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
Regarding 1), I agree it is a good idea to push a 0.19.0 release soonish, and we can then discuss what we further want to do (or not do) for the 1.0 release. I am on holidays for the coming week and a half, but afterwards I will also focus on getting 0.19.0 out. A release candidate in the last week of August is maybe a good deadline?
Joris
2016-07-29 0:15 GMT+02:00 Wes McKinney <wesmckinn@gmail.com>:
OK, let me try to collect some of the feedback and give my thoughts
1) 0.19 and 0.20: I think we should push to release 0.19.0 soon and then plan what we want to add/change/deprecate for 0.20, which might otherwise have been 1.0. Since we already pushed back 0.18.2, and there are some significant new patches (merge_asof and variable rolling windows), it would be good to get these into production before we declare a stable 1.0.
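The merge_asof patch mentioned here performs an ordered, most-recent-match join: each left-hand row is matched to the last right-hand row at or before its key. A minimal sketch (the trade/quote column names are illustrative only):

```python
import pandas as pd

# Both frames must be sorted on the "on" key for merge_asof.
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2016-07-26 13:00:00", "2016-07-26 13:00:02"]),
    "price": [10.0, 10.5],
})
trades = pd.DataFrame({
    "time": pd.to_datetime(["2016-07-26 13:00:01", "2016-07-26 13:00:03"]),
    "qty": [100, 200],
})

# Each trade picks up the most recent quote at or before its timestamp
# (the default direction is "backward").
merged = pd.merge_asof(trades, quotes, on="time")
print(merged["price"].tolist())  # [10.0, 10.5]
```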
2) We will need to raise a significant amount of money for pandas (I estimate in the ballpark of US $300-500K -- better to have too much than too little) to be able to pursue the pandas 2.0 plan wholeheartedly. I would like to dedicate a minimum of 5-10 hours per week to it in 2017, but this will not be sufficient to do everything (I am also a human being, and have a day job). It would be better to collaborate with one or two good freelance developers (with proven experience in C++ and Python) who can spend at least 50% of their time on pandas next year. I am going to start spending some time on design documentation so that we can start resolving some of the design questions and tradeoffs (not all of these decisions will be easy). We'll work on this offline and look to start soliciting funding (if anyone with the ability to write checks is reading, feel free to contact me offline).
3) I agree we will need to come up with a development process that facilitates both an invasive modification of pandas internals while also supporting production users of pandas 1.X. Cherry-picking bug fixes into the pandas 2.x branch will grow increasingly complicated; we need to factor this into our process (for example: we might collect all the unit tests for bug fixes -- assuming they rely on definitely stable behavior -- into a "to fix" folder so that we can return and adapt the bug fixes once the 2.x branch is getting more stable). To have developers both maintaining 1.x and trying to drive forward the 2.x branch at the same time does not seem realistic -- we should talk to the IPython/Jupyter devs to understand how they handled this through their long-lived IPython 1.0 branch IIRC (see http://ipython.org/news.html#ipython-1-0).
4) My goal, which I think we're all aligned on, would be for pandas 2.0 to be a drop-in replacement for 90-95% of normal pandas use. Many power users will have embraced some of the idiosyncrasies of pandas's implementation details, but I think some of the changes (e.g. missing data consistency, copy-on-write / improved semantics around memory ownership and views) will be welcomed. We should clearly document (in a dedicated "pandas's internal relationship with NumPy" document) and maintain very tight contracts around what kinds of zero-copy NumPy interoperability are supported -- it is not clear to me for example that arrays of Python string/unicode objects are a NumPy use case that is especially important to preserve, but most numeric data use cases are. This will also be helpful for power users to understand the nuances and how things are going to stay the same or change (for example: boolean and integer arrays with NAs will probably not be zero-copyable to NumPy arrays).
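A rough sketch of what such zero-copy contracts mean in practice, using np.shares_memory to check whether a conversion to NumPy copied the data (the exact behavior varies across pandas versions, so treat this as illustrative):

```python
import numpy as np
import pandas as pd

# A homogeneous float64 Series can hand out its buffer without copying:
s = pd.Series([1.0, 2.0, 3.0])
print(np.shares_memory(s.to_numpy(), s.to_numpy()))  # True -- same buffer

# Asking for a different dtype necessarily copies:
i = pd.Series([1, 2, 3])
as_float = i.to_numpy(dtype="float64")
print(np.shares_memory(as_float, i.to_numpy()))      # False

# Integer data containing NA cannot be viewed as a NumPy integer array at
# all; classic NumPy-backed pandas upcasts it to float64 first, so there is
# no zero-copy path back to an int array:
print(pd.Series([1, 2, None]).dtype)                 # float64
```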
We should maybe start side threads about each of these items. Just deciding what we want to deprecate or do in 0.20 aka 1.0 is a large enough task.
Thanks all Wes
On Wed, Jul 27, 2016 at 8:39 PM, G Young <gfyoung17@gmail.com> wrote:
1) I would be in favour of releasing 0.19.0, in part because we already pushed back and ultimately forwent 0.18.2. I think these plans are better saved for the release after this one, both to give more time to map them out and to push out the changes that have already been made in preparation for this release.
2) In terms of organisation, I wonder if we might be better served reorganising the way in which PRs are reviewed during the period between one release and the next, instead of having these parallel tracks of development, in light of the concern brought up by @jorisvandenbossche. Perhaps rather than just reviewing PRs as they come in, we could specify which types of PRs should be submitted during certain periods of time.
For example, a large chunk of the period could be devoted to accepting enhancements / new features after which the remaining time before a release could be devoted to just organisation / refactoring / deprecations / what have you (maybe include bug fixes too). That way we could have a contiguous block of time to focus on stabilising and tidying up the release. It would also allow for the refactoring to take place (perhaps incrementally) without the concern of being destabilised by a new feature.
For this to work, this would have to be clearly stated in the contributing docs as well as circulated in emails to pandas-dev AND other related groups, so that people know what's going on in terms of the development cycle.
On Wed, Jul 27, 2016 at 7:51 PM, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
Wes, thanks for your mail!
I like the idea of first releasing a pandas 1.0 before the 'big refactor'. We know for sure that this will take a while to stabilize (even with a lot of resources), and I think the idea was to provide a kind of LTS release. In that regard, it is just clearer to name this pandas 1.x rather than 0.19.x.
Maybe we can start a separate thread to discuss this 1.0, as there are of course some questions to discuss:
- do we first release 0.19 (we didn't specifically discuss this, but I think the rough idea was to have a release candidate somewhere in August), or do we directly aim at 1.0?
- are there certain changes we want to make before 1.0 that are feasible in the short term?
- are there some of the current ideas for deprecations that we should exclude/include for this release? (e.g. I think deprecating PanelND (as just landed in master) is good, but the idea of deprecating Panel should rather wait until 2.0?)
- ...
How exactly to tackle those bug fix releases / LTS branch, is also something that can be discussed, but I would not worry too much about that (there are enough examples of other projects to do something similar, we just have to search for a process that suits us).
What I think is a more important issue with this process is the community of contributors. If we effectively have a period of about two years (before a final 2.0 release) where for the current (1.0) version only certain bug fixes are considered, while it is still difficult to contribute to the new version, we would maybe have to say no to many of the PRs or enhancement ideas. Such a situation could hinder community contributions and participation. And there are currently a lot of contributions. As Jeff also said, the current active contributors are barely keeping up with managing all issues and pull requests. I have worked more on pandas the last few weeks (thanks to Continuum), and indeed I spent most of my time answering issues and reviewing PRs, and hardly have any time to do much coding myself. But of course this is also a choice that I currently make. And I (we) could also make the choice to focus more on pandas 1.0/2.0 related issues, or try to steer some of the active contributors to that.
I also have some concerns about compatibility with the rest of the ecosystem, but at the same time I think it is clear that there should be some kind of refactor, and it is in the further elaboration of the roadmap that such concerns can be addressed.
Joris
2016-07-27 12:04 GMT+02:00 Jeff Reback <jeffreback@gmail.com>:
> I applaud the vision and ambition for the roadmap of the future of pandas.
>
> However, the resources are lacking for much of these changes. Currently
> pandas is just barely keeping up with the (recently increased) flow of
> user pull requests, not to mention the issue reports. These are all great
> indicators of community use and exercising the edge cases.
>
> A roadmap is an excellent start, but the resource question needs to be
> front and center.
>
> The current process *could* evolve into LTS. In 0.19.0, lots of progress
> towards removing older code (and of course deprecating things) is
> happening. An aggressive push of this into 0.20.0 will go a long way
> towards de facto establishing 1.0 / LTS (and maybe that's what we simply
> call 0.20.0).
>
> I would agree we could simply release 1.0 / LTS without adding any 'new'
> features (like fixed getitem indexing and such).
>
> I would like to see 2.0 with a user-facing API that is a drop-in
> replacement (though allowing for some breaking changes that are NOT
> back-compat, e.g. getitem indexing). I think it would be acceptable to
> break the back-end API (meaning to NumPy) though.
>
> For the resource question, as I have mentioned off-list, I will format
> this roadmap in order for pandas to support a fund-raising effort to
> garner resources for these changes.
>
> Jeff
ICYMI, we have a discussion going about some of the ideas referenced here (and in discussions earlier this year) for making changes to pandas's internals: https://github.com/pydata/pandas/pull/13944

There is also the discussion around what we may call "pandas 1.0", possibly (if we reach consensus about it) a stable maintenance release similar to the way that IPython / Jupyter approached its internal rearchitecture: https://github.com/pydata/pandas/issues/10000

Interested developers and users of pandas are highly encouraged to get involved in these discussions and contribute their perspectives, even if you don't plan to help do the actual coding work.

cheers
Wes
>>>> As a concrete example: integer and boolean Series would be able to >>>> accommodate missing data without implicitly casting to float or >>>> object >>>> NumPy dtype respectively. Since many users will have inserted >>>> workarounds / data massaging code because of such rough edges, this >>>> may cause code breakage or simply redundancy in some cases. As >>>> another >>>> example: we should probably remove the .ix indexing attribute >>>> altogether. I'm sure many users are still using .ix, but it would >>>> be >>>> worthwhile to go through such code and decide whether it's really >>>> .loc >>>> or .iloc. >>>> >>>> My hope would be (being a deadline-motivated person) that we could >>>> see >>>> a pandas 2.0 alpha release sometime mid- or 2nd half 2017, with a >>>> target beta / pre-production QA release in early 2018 or >>>> thereabouts. >>>> Part of this would be creating a 1.0 to 2.0 migration guide for >>>> users. >>>> >>>> My biggest concern with pandas in recent years is how not to be >>>> held >>>> back by strict backwards compatibility and still be able to >>>> innovate >>>> and stay relevant into the 2020s. >>>> >>>> For pandas 2.0 some of the most important issues I've been thinking >>>> about are: >>>> >>>> - Logical type abstraction layer / decoupling. pandas-only data >>>> types >>>> (Categorical, DatetimeTZ, Period, etc.) will become equal citizens >>>> as >>>> compared with data types mapping 1-1 on NumPy numeric dtypes >>>> >>>> - Decoupling physical storage to permit non-NumPy data structures >>>> inside Series >>>> >>>> - Removal of BlockManager and 2D block consolidation in DataFrame, >>>> in >>>> favor of a native C++ internal table (vector-of-arrays) data >>>> structure >>>> >>>> - Consistent NA semantics across all data types >>>> >>>> - Significantly improved handling of string/UTF8 data (performance, >>>> memory use -- elimination of PyObject boxes). 
From the above 2 >>>> items, >>>> we could even make all string arrays internally categorical (with >>>> the >>>> option to explicitly cast to categorical) -- in the database world >>>> this is often called dictionary encoding. >>>> >>>> - Refactor of most Cython algorithms into C++11/14 templates >>>> >>>> - Copy-on-write for Series and DataFrame >>>> >>>> - Removal of Panel, ndim > 3 data structures >>>> >>>> - Analytical expression VM (for example -- things like >>>> df[boolean_arr].groupby(...).agg(...) could be evaluated by a small >>>> Numexpr-like VM, not dissimilar to R's dplyr library, with >>>> significantly improved memory use and maybe performance too) >>>> >>>> There's a lot to unpack here, but let me know what everyone thinks >>>> about these things. The "pandas 2.0" / internals revamp discussion >>>> we >>>> can tackle in a separate thread or in perhaps in a GitHub repo or >>>> design folder in the pandas codebase. >>>> >>>> Thanks, >>>> Wes >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev@python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >>> >>> >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev@python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev@python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev@python.org > https://mail.python.org/mailman/listinfo/pandas-dev >
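[Editor's note: a minimal sketch of the implicit-casting rough edge described in Wes's first concrete example above, as it behaves in NumPy-backed pandas; the consistent-NA proposal aims to eliminate these silent dtype changes.]

```python
import pandas as pd

# An integer Series keeps dtype int64 as long as no values are missing.
ints = pd.Series([1, 2, 3])
assert ints.dtype == "int64"

# Introducing a missing entry silently upcasts to float64, because
# NumPy's int64 has no representation for NaN.
with_gap = ints.reindex([0, 1, 2, 3])
assert with_gap.dtype == "float64"

# Boolean Series degrade even further: a missing value forces the
# generic object dtype.
bools = pd.Series([True, False]).reindex([0, 1, 2])
assert bools.dtype == "object"
```

This is exactly the behavior users work around today, and why removing it in 2.0 could break (or make redundant) existing data-massaging code.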
_______________________________________________ Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
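[Editor's note: a sketch of the `.ix` migration mentioned in the thread, with made-up data. `.ix` guessed between label-based and position-based lookup depending on the key type; `.loc` and `.iloc` make the intent explicit.]

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 12.5, 9.9]}, index=["a", "b", "c"])

# Old, ambiguous style (since removed from pandas):
#   df.ix["b", "price"]   -- label-based here...
#   df.ix[1, 0]           -- ...but silently positional on integer keys

# Explicit replacements -- this is the "decide whether it's really
# .loc or .iloc" exercise Wes describes:
by_label = df.loc["b", "price"]    # label-based lookup
by_position = df.iloc[1, 0]        # position-based lookup
assert by_label == by_position == 12.5
```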
_______________________________________________ Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
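[Editor's note: pandas already ships a small restricted expression evaluator (`DataFrame.query` / `pd.eval`, backed by numexpr when available), which hints at the whole-pipeline "analytical expression VM" proposed for 2.0. A sketch with made-up data:]

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5), "b": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Today: chained operations materialize each intermediate eagerly
# (the boolean mask, then the filtered frame, then the column).
eager = df[df["a"] > 2]["b"].sum()

# query() routes a restricted expression through a small evaluation
# engine instead; a 2.0 expression VM would generalize this to whole
# filter/groupby/agg pipelines, as dplyr does in R.
lazy = df.query("a > 2")["b"].sum()
assert eager == lazy == 9.0
```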
Just created the repo https://github.com/pydata/pandas-design to house the design documents and discussion (possibly temporarily -- we may want to move the docs back to the main pandas repo after the process is near completion).

I think this will help more people engage with the process, as they can watch this repo and only get notifications for the design discussion, rather than subscribing to the entire pandas issue/PR firehose. If you'd like to participate, definitely Watch the repo!

thanks
Wes

On Thu, Aug 11, 2016 at 9:06 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
ICYMI, we have a discussion going about some of the ideas referenced here (and in discussions earlier this year) for making changes to pandas's internals:
https://github.com/pydata/pandas/pull/13944
There is also the discussion around what we may call "pandas 1.0", possibly (if we reach consensus about it) a stable maintenance release similar to the way that IPython / Jupyter approached its internal rearchitecture:
https://github.com/pydata/pandas/issues/10000
Interested developers and users of pandas are highly encouraged to get involved in these discussions and contribute their perspectives, even if you don't plan to help do the actual coding work.
cheers
Wes
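[Editor's note: the dictionary-encoding idea from the pandas 2.0 list above can be previewed with today's Categorical type: repeated strings are stored once as a dictionary of categories, and the values become small integer codes. A sketch with made-up data:]

```python
import pandas as pd

# A string column with heavy repetition, as is typical of real data.
colors = pd.Series(["red", "green", "blue"] * 10_000)

# Dictionary-encode it: each unique string is stored once, and the
# values become integer codes pointing into that dictionary.
encoded = colors.astype("category")
assert list(encoded.cat.categories) == ["blue", "green", "red"]

# The encoded form is much smaller, because per-row PyObject string
# boxes are replaced by fixed-width integer codes.
assert encoded.memory_usage(deep=True) < colors.memory_usage(deep=True)
```

Making string arrays internally categorical, as proposed, would give this representation (and its memory savings) by default rather than on explicit request.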
participants (6)

- Andy Ray Terrel
- G Young
- Jeff Reback
- Joris Van den Bossche
- Stephan Hoyer
- Wes McKinney