[Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 / future

Wes McKinney wesmckinn at gmail.com
Tue Jul 26 16:51:15 EDT 2016


hi folks,

As a continuation of ongoing discussions on GitHub and on the mailing
list around deprecations, future innovation, and internal reworkings
of pandas, I have a couple of ideas to share and would like your
feedback on them.

With pandas at 0.19.x today, I would like to propose that we consider
releasing the project as pandas 1.0 in the next major release or the
one after. The Python community does have a penchant for "eternal
betas", but after all the hard work of the core developers and
community over the last five years, I think we can safely consider
making a stable 1.x production release.

If we do decide to release pandas 1.0, I also propose that we strongly
consider making 1.x an LTS (Long Term Support) branch on which we
continue to make releases, but limited to bug fixes and documentation
improvements, or at most new features added on an extremely
conservative basis. This might require some changes to our development
process, so I am looking for feedback on this as well.

If we commit to this path, I would suggest that we start a pandas-2.0
integration branch where we can begin more seriously planning and
executing on:

- Cleanup and removal of years' worth of accumulated cruft / legacy code
- Removal of deprecated features
- A revamp of the Series and DataFrame internals

I had hoped that 2016 would offer me more time to work on the
internals revamp, but between my day job and the second edition of
"Python for Data Analysis" that turned out to be a little too
ambitious. I have been thinking almost continuously about how to go
about this, though, and it would be good to figure out a process for
documenting the work and building a more granular development roadmap.
Part of this will be carefully documenting any APIs we change or unit
tests we break along the way.

We would want to give heavy pandas users ample time to run their
third-party code against pandas 2.0-dev and give feedback on whether
our assumptions about the impact of the changes hold up in real
production code.
As a concrete example: integer and boolean Series would be able to
accommodate missing data without implicitly casting to the float or
object NumPy dtypes, respectively. Since many users will have inserted
workarounds / data-massaging code because of such rough edges, this
may cause code breakage, or simply redundancy, in some cases. As
another example: we should probably remove the .ix indexing attribute
altogether. I'm sure many users are still using .ix, but it would be
worthwhile to go through that code and decide whether each use really
means .loc or .iloc.
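
Both of these are easy to see in current pandas; the snippet below
shows today's behavior (nothing hypothetical here -- these are exactly
the rough edges the 2.0 work would smooth out):

    import numpy as np
    import pandas as pd

    s = pd.Series([1, 2, 3])      # dtype: int64
    s.loc[1] = np.nan             # inserting NA forces a cast today
    print(s.dtype)                # float64; the proposal is to keep
                                  # int64 with a native missing-value
                                  # mask instead

    b = pd.Series([True, False])
    b.loc[0] = np.nan
    print(b.dtype)                # object, for the same reason

    df = pd.DataFrame({'a': [10, 20, 30]}, index=[2, 0, 1])
    df.ix[0]                      # label 0 or position 0? it depends
                                  # on the index type, which is why
                                  # .ix is so easy to misuse
    df.loc[0]                     # unambiguously label-based
    df.iloc[0]                    # unambiguously position-based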

My hope would be (being a deadline-motivated person) that we could see
a pandas 2.0 alpha release sometime in mid-2017 or the second half of
2017, with a target beta / pre-production QA release in early 2018 or
thereabouts. Part of this would be creating a 1.0-to-2.0 migration
guide for users.

My biggest concern with pandas in recent years has been how to avoid
being held back by strict backwards compatibility while still being
able to innovate and stay relevant into the 2020s.

For pandas 2.0, some of the most important issues I've been thinking about are:

- Logical type abstraction layer / decoupling. pandas-only data types
(Categorical, DatetimeTZ, Period, etc.) will become first-class
citizens alongside the data types that map 1-1 onto NumPy numeric
dtypes

- Decoupling physical storage to permit non-NumPy data structures inside Series

- Removal of BlockManager and 2D block consolidation in DataFrame, in
favor of a native C++ internal table (vector-of-arrays) data structure

- Consistent NA semantics across all data types

- Significantly improved handling of string/UTF8 data (performance,
memory use -- elimination of PyObject boxes). Building on the previous
two items, we could even make all string arrays internally categorical
(with the option to explicitly cast to categorical) -- in the database
world this is often called dictionary encoding (a sketch follows this
list).

- Refactor of most Cython algorithms into C++11/14 templates

- Copy-on-write for Series and DataFrame (a sketch follows this list)

- Removal of Panel, ndim > 3 data structures

- Analytical expression VM (for example, things like
df[boolean_arr].groupby(...).agg(...) could be evaluated by a small
Numexpr-like VM, not dissimilar to R's dplyr library, with
significantly improved memory use and perhaps better performance too;
a sketch follows this list)
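
On the string-handling / dictionary-encoding item, the effect is
already visible with today's Categorical type; this sketch just
measures it (the data is made up for illustration):

    import numpy as np
    import pandas as pd

    # One million strings drawn from a small set of distinct values,
    # stored today as an object array of per-element PyObject boxes.
    s = pd.Series(np.random.choice(['red', 'green', 'blue'],
                                   size=1000000))

    # Dictionary encoding: a small array of unique strings plus one
    # integer code per row.
    c = s.astype('category')

    print(s.memory_usage(deep=True))   # tens of MB of string objects
    print(c.memory_usage(deep=True))   # ~1 MB of int8 codes plus the
                                       # three category values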
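
On copy-on-write, the motivation is the view-versus-copy ambiguity we
live with today; this sketch shows the intended semantics in rough
outline (the precise rules are still to be designed):

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

    col = df['a']          # today: typically a view onto df's data
    col[0] = 100           # silently mutates df through that view

    sub = df[df['a'] > 1]
    sub['b'] = 0.0         # today: SettingWithCopyWarning, because
                           # pandas cannot always tell whether sub is
                           # a view or a copy

    # Under copy-on-write, col and sub would share memory with df for
    # reads, and the first write to either would copy only the data
    # being written -- df would never be modified through a derived
    # object.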
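
On the expression VM idea, pandas already has a small taste of it in
the numexpr-backed query/eval machinery; this sketch contrasts today's
eager pipeline with the kind of expression the VM would evaluate as a
whole (the fully fused evaluation is hypothetical):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'key': np.random.randint(0, 100, size=1000000),
                       'value': np.random.randn(1000000)})

    # Today: each step materializes a full intermediate result.
    out = df[df['value'] > 0].groupby('key')['value'].mean()

    # The numexpr-backed query() already fuses evaluation of the
    # predicate; an expression VM would extend that idea to the whole
    # filter -> groupby -> aggregate pipeline, skipping the
    # intermediate filtered DataFrame entirely.
    out2 = df.query('value > 0').groupby('key')['value'].mean()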

There's a lot to unpack here, but let me know what everyone thinks
about these things. We can tackle the "pandas 2.0" / internals revamp
discussion in a separate thread, or perhaps in a GitHub repo or a
design folder in the pandas codebase.

Thanks,
Wes
