Re: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
Here are some of my thoughts about the pandas roadmap / status and some responses to Wes's thoughts.

In the last few (and upcoming) major releases we have made the following changes:

- dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making these first class objects
- code refactoring to remove subclassing of ndarrays for Series & Index
- carving out / deprecating non-core parts of pandas
  - datareader
  - SparsePanel, WidePanel & other aliases (TimeSeries)
  - rpy, rplot, irow et al.
  - google-analytics
- API changes to make things more consistent
  - pd.rolling_* / pd.expanding_* -> .rolling / .expanding (this is in master now)
  - .resample becoming a fully deferred object, like groupby
  - multi-index slicing along any level (obviates the need for .xs) and allows assignment
  - .loc / .iloc, which for the most part obviate the use of .ix
  - .pipe & .assign
  - plotting accessors
  - fixing of the sorting API
- many performance enhancements, both micro & macro (e.g. releasing the GIL)

Some on-deck enhancements (meaning these are basically ready to go in):

- IntervalIndex (and eventually making PeriodIndex just a sub-class of this)
- RangeIndex

So: lots of changes, though nothing really earth-shaking, just more convenience, reducing magicness somewhat, and providing flexibility.

Of course we are getting an increasing number of issues, mostly bug reports (and lots of dupes), some edge-case enhancement requests which would add to the existing APIs, and of course requests to expand the (already) large codebase to other use cases. Balancing this are a good many pull requests from many different users, some even deep into the internals.

Here are some things that I have talked about and that could be considered for the roadmap. Disclaimer: I do work for Continuum, but these views are of course my own; furthermore, I am obviously a bit more familiar with some of the 'sponsored' open-source libraries, but I am always open to new things.
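[Editorial sketch: two of the API changes above, illustrated with toy data. The 0.18-era spellings shown here are the ones the list refers to; nothing else is implied about the API.]

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Old module-level spelling: pd.rolling_mean(s, window=2)
# New deferred-accessor spelling (the .rolling / .expanding change):
r = s.rolling(window=2).mean()   # [nan, 1.5, 2.5, 3.5]

# .loc / .iloc make the label-vs-position distinction explicit,
# which is what lets them replace most uses of .ix:
df = pd.DataFrame({"a": [10, 20, 30]}, index=["x", "y", "z"])
by_label = df.loc["y", "a"]      # label-based lookup
by_pos = df.iloc[1, 0]           # position-based lookup
assert by_label == by_pos == 20
```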
- integration with / automatic deferral to numba for JIT (this would be through .apply)
- automatic deferral to dask from groupby where appropriate / maybe a .to_parallel (to simply return a dask.DataFrame object)
- incorporation of quantities / units (as part of the dtype)
- use of DyND to allow missing values for int dtypes
- making Period a first class dtype
- providing some copy-on-write semantics to alleviate the chained-indexing issues which occasionally come up with misuse of the indexing API
- allowing a 'policy' to automatically provide column blocks for dict-like input (e.g. each column would be a block); this would allow a pass-thru API where you could put in numpy arrays where you have views and have them preserved rather than copied automatically. Note that this would also allow what I call 'split', where a passed-in multi-dim numpy array could be split up into individual blocks (which actually gives a nice perf boost after the splitting costs).

In working towards some of these goals, I have come to the opinion that it would make sense to have a neutral API protocol layer that would allow us to swap out different engines as needed, for particular dtypes, or *maybe* out-of-core type computations. E.g. imagine that we replaced the in-memory block structure with a bcolz / memmap type; in theory this should be 'easy' and just work. I could also see us adopting *some* of the SFrame code to allow easier interop with this API layer.

In practice, I think a nice API layer would need to be created to make this clean / nice.

So this comes around to Wes's point about creating a c++ library for the internals (and possibly even some of the indexing routines). In an ideal world, of course this would be desirable. Getting there is a bit non-trivial I think, and IMHO might not be worth the effort. I don't really see big performance bottlenecks. We *already* defer much of the computation to libraries like numexpr & bottleneck (where appropriate).
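[Editorial sketch: the view-preservation idea in the 'policy' bullet above, shown at the numpy level. pandas itself copies dict input today; this only illustrates the aliasing semantics a pass-thru policy would preserve.]

```python
import numpy as np

arr = np.arange(12, dtype=np.float64).reshape(3, 4)

# 'split': each column as its own 1-D block. Plain column slices are
# views -- they alias the parent array rather than copying it.
cols = [arr[:, i] for i in range(arr.shape[1])]
assert all(np.shares_memory(arr, c) for c in cols)

# A write through the parent is visible through the column view,
# which is the behavior a no-copy construction policy would keep.
arr[0, 0] = 99.0
assert cols[0][0] == 99.0
```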
Adding numba / dask to the list would be helpful.

I think that almost all performance issues are the result of:

a) gross misuse of the pandas API. How much code have you seen that does df.apply(lambda x: x.sum())?
b) routines which operate column-by-column rather than block-by-block and are in python space (e.g. we have an issue right now about .quantile)

So I am glossing over a big goal of having a c++ library that represents the pandas internals. This would by definition have a c-API, so that you *could* use pandas-like semantics in c/c++ and just have it work (and then pandas would be a thin wrapper around this library). I am not averse to this, but I think it would be quite a big effort, and not a huge perf boost IMHO. Further, there are a number of API issues w.r.t. indexing which need to be clarified / worked out (e.g. should we simply deprecate []?) that are much easier to test / figure out in python space.

I also think that we have quite a large number of contributors. Moving to c++ might make the internals a bit more impenetrable than the current internals (though this would allow c++ people to contribute, so that might balance out). We have a limited core of devs who right now are familiar with things. If someone happened to have a starting base for a c++ library, then I might change opinions here.

my 4c.

Jeff

On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
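[Editorial sketch: Jeff's point (a) above, made concrete with toy data. The per-column Python lambda and the vectorized reduction give identical results; the second avoids the interpreter-level round trip.]

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=list("abcd"))

# The misuse: a Python-level lambda invoked once per column.
slow = df.apply(lambda x: x.sum())

# The idiomatic spelling: one block-level reduction.
fast = df.sum()

assert slow.equals(fast)
```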
Deep thoughts during the holidays.
I might be out of line here, but the interpreter-heaviness of the inside of pandas objects is likely to be a long-term liability and source of performance problems and technical debt.
Has anyone put any thought into planning and beginning to execute on a rewrite that moves as much as possible of the internals into native / compiled code? I'm talking about:
- pandas/core/internals
- indexing and assignment
- much of pandas/core/common
- categorical and custom dtypes
- all indexing mechanisms
I'm concerned we've already exposed too much of the internals to users, so this might lead to a lot of API breakage, but it might be for the Greater Good. As a first step, beginning a partial migration of internals into some C++ classes that encapsulate the insides of DataFrame objects and implement indexing and block-level manipulations would be a good place to start. I think you could do this without too much disruption.
As part of this internal retooling we might give consideration to alternative data structures for representing data internal to pandas objects. Now in 2015/2016, continuing to be hamstrung by NumPy's limitations feels somewhat anachronistic. User code is riddled with workarounds for data type fidelity issues and the like. Like, really, why not add a bitndarray (similar to ilanschnell/bitarray) for storing nullness for problematic types and hide this from the user? =)
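[Editorial sketch: the bit-per-value null mask Wes describes, approximated with numpy's packbits. ilanschnell/bitarray provides a real bit-array type; the names here are purely illustrative.]

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5], dtype=np.int64)
valid = np.array([True, False, True, True, False])  # False marks a null

# Pack the validity flags down to one bit per element:
bitmap = np.packbits(valid)
assert bitmap.nbytes == 1          # 5 flags fit in a single byte

# Unpack when computing, and reduce over only the valid slots:
mask = np.unpackbits(bitmap, count=len(values)).astype(bool)
assert values[mask].sum() == 8     # 1 + 3 + 4
```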
Since we are now a NumFOCUS-sponsored project, I feel like we might consider establishing some formal governance over pandas and publishing roadmap documents describing plans for the project and meeting notes from committers. There's no real "committer culture" for NumFOCUS projects like there is with the Apache Software Foundation, but we might try leading by example!
Also, I believe pandas as a project has reached a level of importance where we ought to consider planning and executing larger-scale undertakings such as this one, for safeguarding the future.
As for myself, well, I have my hands full in Big Data-land. I wish I could be helping more with pandas, but there are quite a few fundamental issues (like data interoperability, nested data handling, and file format support, e.g. Parquet; see http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/ ) preventing Python from being more useful in industry analytics applications.
Aside: one of the bigger mistakes I made with pandas's API design was making it acceptable to call class constructors — like pandas.DataFrame — directly (versus factory functions). Sorry about that! If we could convince everyone to start writing pandas.data_frame or dataframe instead of using the class reference it would help a lot with code cleanup. It's hard to plan for these things — NumPy interoperability seemed a lot more important in 2008 than it does now, so I forgive myself.
cheers and best wishes for 2016,
Wes

_______________________________________________
Pandas-dev mailing list
Pandas-dev@python.org
https://mail.python.org/mailman/listinfo/pandas-dev
I will write a more detailed response to some of these things after the new year, but, in particular, re: missing values, can you or someone tell me why creating an object that contains a NumPy array and a bitmap is not sufficient? If we can add a lightweight C/C++ class layer between NumPy function calls (e.g. arithmetic) and pandas function calls, then I see no reason why we cannot have Int32Array->add and Float32Array->add do the right thing (the former would be responsible for bitmasking to propagate NA values; the latter would defer to NumPy). If we can put all the internals of pandas objects inside a black box, we can add layers of virtual function indirection without a performance penalty (whereas adding more abstraction layers in the interpreter does add up to a perf penalty).

I don't think this is too scary -- I would be willing to create a small POC C++ library to prototype something like what I'm talking about. Since pandas has limited points of contact with NumPy, I don't think this would end up being too onerous.

For the record, I'm pretty allergic to "advanced C++"; I think it is a useful tool: if you pick a sane 20% subset of the C++11 spec and follow Google C++ style, it's quite accessible to intermediate developers. More or less "C plus OOP and easier object lifetime management (shared/unique_ptr, etc.)". As soon as you add a lot of template metaprogramming, C++ library development quickly becomes inaccessible except to the C++-Jedi.

Maybe let's start a Google document on "pandas roadmap" where we can break down the 1-2 year goals and some of these infrastructure issues and have our discussion there? (obviously publish this someplace once we're done)

- Wes
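[Editorial sketch: Wes's Int32Array->add / Float32Array->add dispatch, rendered as a toy Python illustration. The class names come from his hypothetical C++ layer; nothing here is an existing pandas API.]

```python
import numpy as np

class Int32Array:
    """Values plus a validity bitmap; NA propagates through add."""
    def __init__(self, values, valid):
        self.values = np.asarray(values, dtype=np.int32)
        self.valid = np.asarray(valid, dtype=bool)

    def add(self, other):
        # AND the bitmaps: the result is NA wherever either input is NA.
        return Int32Array(self.values + other.values,
                          self.valid & other.valid)

class Float32Array:
    """Floats already have NaN, so addition just defers to NumPy."""
    def __init__(self, values):
        self.values = np.asarray(values, dtype=np.float32)

    def add(self, other):
        return Float32Array(self.values + other.values)

a = Int32Array([1, 2, 3], [True, True, False])
b = Int32Array([10, 20, 30], [True, False, True])
c = a.add(b)
assert list(c.values) == [11, 22, 33]
assert list(c.valid) == [True, False, False]
```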
Ok, certainly not averse to using bitfields. I agree that would solve the problem. In fact, Stephan Hoyer and I briefly discussed this w.r.t. IntervalIndex. It turns out it is just as easy to use a sentinel; in fact that was my original idea (for int NA), really similar to how we handle Datetime et al.

So I will create a google doc for discussion points.

I agree creating a minimalist c++ library is not too hard. But my original question stands: what are the use cases? I can enumerate some here:

- 1) performance (I am not convinced of this, but could be wrong)
- 2) a c-API, always a good thing, & other lang bindings

I suspect you are in the part 2 camp?
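[Editorial sketch: the sentinel approach Jeff mentions. pandas represents NaT internally as the minimum int64 value ("iNaT"), and an int-NA sentinel would work the same way; the function name here is illustrative.]

```python
import numpy as np

INT64_NA = np.iinfo(np.int64).min   # the same sentinel pandas uses for NaT

vals = np.array([5, INT64_NA, 7], dtype=np.int64)

def sentinel_sum(a):
    # Treat any element equal to the sentinel as missing.
    return a[a != INT64_NA].sum()

assert sentinel_sum(vals) == 12
```

The trade-off versus a bitmap: the sentinel consumes one value from the type's domain (int64-min can no longer be stored), whereas a bitmap costs extra memory but keeps the full range.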
Here's a link where we can discuss the roadmap: https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnU... On Tue, Dec 29, 2015 at 2:56 PM, Jeff Reback <jeffreback@gmail.com> wrote:
Ok certainly not averse to using bitfields. I agree that would solve the problem. In fact Stefan Hoyer and I briefly discussed this w.r.t. IntervalIndex. Turns out just as easy to use a sentinel. In fact that was my original idea (for int NA). really similar to how we handle Datetime et al.
So will create a google doc for discussion points.
I agree creating a minimalist c++ library is not too hard. But my original question stands, what are the use cases. I can enumerate some here:
- 1) performance (I am not convinced of this, but could be wrong) - 2) c-api always a good thing & other lang bindings
I suspect you are in the part 2 camp?
On Tue, Dec 29, 2015 at 2:49 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
I will write a more detailed response to some of these things after the new year, but, in particular, re: missing values, can you or someone tell me why creating an object that contains a NumPy array and a bitmap is not sufficient? If we we can add a lightweight C/C++ class layer between NumPy function calls (e.g. arithmetic) and pandas function calls, then I see no reason why we cannot have
Int32Array->add
and
Float32Array->add
do the right thing (the former would be responsible for bitmasking to propagate NA values; the latter would defer to NumPy). If we can put all the internals of pandas objects inside a black box, we can add layers of virtual function indirection without a performance penalty (e.g. adding more interpreter overhead with more abstraction layers does add up to a perf penalty).
I don't think this is too scary -- I would be willing to create a small POC C++ library to prototype something like what I'm talking about.
Since pandas has limited points of contact with NumPy I don't think this would end up being too onerous.
For the record, I'm pretty allergic to "advanced C++"; I think it is a useful tool if you pick a sane 20% subset of the C++11 spec and follow Google C++ style it's not very inaccessible to intermediate developers. More or less "C plus OOP and easier object lifetime management (shared/unique_ptr, etc.)". As soon as you add a lot of template metaprogramming C++ library development quickly becomes inaccessible except to the C++-Jedi.
Maybe let's start a Google document on "pandas roadmap" where we can break down the 1-2 year goals and some of these infrastructure issues and have our discussion there? (obviously publish this someplace once we're done)
- Wes
On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback <jeffreback@gmail.com> wrote:
Here are some of my thoughts about pandas Roadmap / status and some responses to Wes's thoughts.
In the last few (and upcoming) major releases we have made the following changes:
- dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making these first class objects
- code refactoring to remove subclassing of ndarrays for Series & Index
- carving out / deprecating non-core parts of pandas
  - datareader
  - SparsePanel, WidePanel & other aliases (TimeSeries)
  - rpy, rplot, irow et al.
  - google-analytics
- API changes to make things more consistent
  - pd.rolling/expanding * -> .rolling/expanding (this is in master now)
  - .resample becoming a fully deferred object like groupby
- multi-index slicing along any level (obviates need for .xs) and allows assignment
- .loc/.iloc - for the most part obviates use of .ix
- .pipe & .assign
- plotting accessors
- fixing of the sorting API
- many performance enhancements both micro & macro (e.g. release GIL)
Some on-deck enhancements (meaning these are basically ready to go in):
- IntervalIndex (and eventually make PeriodIndex just a sub-class of this)
- RangeIndex
So lots of changes, though nothing really earth-shaking; just more convenience, reducing magicness somewhat, and providing flexibility.
Of course we are getting increasing issues, mostly bug reports (and lots of dupes), some edge-case enhancements which can add to the existing APIs, and of course requests to expand the (already) large code to other use cases. Balancing this are a good many pull requests from many different users, some even deep into the internals.
Here are some things that I have talked about and could be considered for the roadmap. Disclaimer: I do work for Continuum but these views are of course my own; furthermore obviously I am a bit more familiar with some of the 'sponsored' open-source libraries, but always open to new things.
- integration / automatic deferral to numba for JIT (this would be thru .apply)
- automatic deferral to dask from groupby where appropriate / maybe a .to_parallel (to simply return a dask.DataFrame object)
- incorporation of quantities / units (as part of the dtype)
- use of DyND to allow missing values for int dtypes
- make Period a first class dtype
- provide some copy-on-write semantics to alleviate the chained-indexing issues which occasionally come up with the mis-use of the indexing API
- allow a 'policy' to automatically provide column blocks for dict-like input (e.g. each column would be a block); this would allow a pass-thru API where you could put in numpy arrays where you have views and have them preserved rather than copied automatically. Note that this would also allow what I call 'split', where a passed-in multi-dim numpy array could be split up into individual blocks (which actually gives a nice perf boost after the splitting costs).
In working towards some of these goals, I have come to the opinion that it would make sense to have a neutral API protocol layer that would allow us to swap out different engines as needed, for particular dtypes, or *maybe* out-of-core type computations. E.g. imagine that we replaced the in-memory block structure with a bcolz / memmap type; in theory this should be 'easy' and just work. I could also see us adopting *some* of the SFrame code to allow easier interop with this API layer.
In practice, I think a nice API layer would need to be created to make this clean / nice.
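For context, the chained-indexing misuse that copy-on-write semantics would alleviate looks like this (standard pandas usage, nothing new assumed):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Chained indexing: df[df.a > 1] may be a temporary copy, so assigning
# into it can silently fail to modify df (SettingWithCopyWarning).
# df[df.a > 1]["b"] = 0

# A single .loc call is unambiguous and always writes into df:
df.loc[df.a > 1, "b"] = 0
print(df["b"].tolist())  # [4, 0, 0]
```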
So this comes around to Wes's point about creating a C++ library for the internals (and possibly even some of the indexing routines). In an ideal world, of course this would be desirable. Getting there is a bit non-trivial I think, and IMHO might not be worth the effort. I don't really see big performance bottlenecks. We *already* defer much of the computation to libraries like numexpr & bottleneck (where appropriate). Adding numba / dask to the list would be helpful.
I think that almost all performance issues are the result of:
a) gross misuse of the pandas API (how much code have you seen that does df.apply(lambda x: x.sum())?)
b) routines which operate column-by-column rather than block-by-block and are in python space (e.g. we have an issue right now about .quantile)
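As an illustration of point (a): the lambda version dispatches a Python function per column, while the direct reduction runs block-wise in compiled code, and both produce the same result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("abc"))

slow = df.apply(lambda x: x.sum())  # Python-level call per column
fast = df.sum()                     # vectorized reduction

print(slow.equals(fast))  # True
print(fast.tolist())      # [18, 22, 26]
```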
So I am glossing over a big goal of having a C++ library that represents the pandas internals. This would by definition have a C API, so you *could* use pandas-like semantics in C/C++ and just have it work (and then pandas would be a thin wrapper around this library).
I am not averse to this, but I think it would be quite a big effort, and not a huge perf boost IMHO. Further, there are a number of API issues w.r.t. indexing which need to be clarified / worked out (e.g. should we simply deprecate []?) that are much easier to test / figure out in python space.
I also think that we have quite a large number of contributors. Moving to C++ might make the internals a bit more impenetrable than the current internals (though this would allow C++ people to contribute, so that might balance out).
We have a limited core of devs who right now are familiar with things. If someone happened to have a starting base for a C++ library, then I might change opinions here.
my 4c.
Jeff
On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Deep thoughts during the holidays.
I might be out of line here, but the interpreter-heaviness of the inside of pandas objects is likely to be a long-term liability and source of performance problems and technical debt.
Has anyone put any thought into planning and beginning to execute on a rewrite that moves as much as possible of the internals into native / compiled code? I'm talking about:
- pandas/core/internals
- indexing and assignment
- much of pandas/core/common
- categorical and custom dtypes
- all indexing mechanisms
I'm concerned we've already exposed too much of the internals to users, so this might lead to a lot of API breakage, but it might be for the Greater Good. As a first step, beginning a partial migration of internals into some C++ classes that encapsulate the insides of DataFrame objects and implement indexing and block-level manipulations would be a good place to start. I think you could do this without too much disruption.
As part of this internal retooling we might give consideration to alternative data structures for representing data internal to pandas objects. Now in 2015/2016, continuing to be hamstrung by NumPy's limitations feels somewhat anachronistic. User code is riddled with workarounds for data type fidelity issues and the like. Like, really, why not add a bitndarray (similar to ilanschnell/bitarray) for storing nullness for problematic types and hide this from the user? =)
Since we are now a NumFOCUS-sponsored project, I feel like we might consider establishing some formal governance over pandas and publishing roadmap documents describing plans for the project and meeting notes from committers. There's no real "committer culture" for NumFOCUS projects like there is with the Apache Software Foundation, but we might try leading by example!
Also, I believe pandas as a project has reached a level of importance where we ought to consider planning and execution on larger scale undertakings such as this for safeguarding the future.
As for myself, well, I have my hands full in Big Data-land. I wish I could be helping more with pandas, but there are quite a few fundamental issues (like data interoperability, nested data handling, and file format support — e.g. Parquet, see
http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/ )
preventing Python from being more useful in industry analytics applications.
Aside: one of the bigger mistakes I made with pandas's API design was making it acceptable to call class constructors — like pandas.DataFrame — directly (versus factory functions). Sorry about that! If we could convince everyone to start writing pandas.data_frame or dataframe instead of using the class reference it would help a lot with code cleanup. It's hard to plan for these things — NumPy interoperability seemed a lot more important in 2008 than it does now, so I forgive myself.
cheers and best wishes for 2016,
Wes
_______________________________________________
Pandas-dev mailing list
Pandas-dev@python.org
https://mail.python.org/mailman/listinfo/pandas-dev
Yeah, basically creating a "libpandas" with a C API for Series and DataFrame objects (and maybe a roadmap for more interchangeable internals) is definitely what I'm talking about. We can probably move a lot of the Cython guts there, too. I think better microperformance will fall out of this naturally, but the big goal is a more maintainable and extensible core.
I'll try to find some time to hack together a CMake file that creates a libpandas suitable for static linking with a Cython extension and that links dynamically with NumPy's multiarray.so and libpythonXX. The library setup is honestly the most tedious part.
Aside: I'm working a lot on nested / Parquet-type data these days. This is not a "pandas problem", but I want to make sure the tooling develops a reasonable C API so that interoperability between pandas and systems with different non-NumPy-like data models will have minimal performance overhead.
- Wes
On Tue, Dec 29, 2015 at 11:59 AM, Jeff Reback <jeffreback@gmail.com> wrote:
Here's a link where we can discuss the roadmap:
https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnU...
On Tue, Dec 29, 2015 at 2:56 PM, Jeff Reback <jeffreback@gmail.com> wrote:
Ok, certainly not averse to using bitfields. I agree that would solve the problem. In fact Stephan Hoyer and I briefly discussed this w.r.t. IntervalIndex; turns out it's just as easy to use a sentinel. In fact that was my original idea (for int NA), really similar to how we handle Datetime et al.
So will create a google doc for discussion points.
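For reference, the sentinel approach mentioned above might look like the following sketch (INT_NA is a hypothetical reserved value, analogous to how pandas stores NaT as a reserved int64; not an actual pandas API):

```python
import numpy as np

# Hypothetical sentinel meaning "missing" for an int64 column
INT_NA = np.iinfo(np.int64).min

def add_with_sentinel(a, b):
    a = np.asarray(a, dtype=np.int64)
    b = np.asarray(b, dtype=np.int64)
    out = a + b
    out[(a == INT_NA) | (b == INT_NA)] = INT_NA  # propagate NA
    return out

result = add_with_sentinel([1, INT_NA, 3], [10, 20, 30])
print((result == INT_NA).tolist())     # [False, True, False]
print(int(result[0]), int(result[2]))  # 11 33
```

The trade-off versus a bitmap is that the sentinel steals one value from the type's range but needs no extra storage.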
Maybe this is saying the same thing as Wes, but how far would something like this get us?

    // warning: things are probably not this simple
    struct data_array_t {
        void *primitive;               // scalar data
        data_array_t *nested;          // nested data
        boost::dynamic_bitset isnull;  // might have to create our own to avoid boost
        schema_t schema;               // not sure exactly what this looks like
    };

    typedef std::map<string, data_array_t> data_frame_t;  // probably not this simple

To answer Jeff’s use-case question: I think that the use cases are 1) freedom from numpy (mostly) and 2) no more block manager, which frees us from the limitations of the block memory layout. In particular, the ability to take advantage of memory-mapped IO would be a big win IMO.
Basically the approach is:
1) Base dtype type
2) Base array type with K >= 1 dimensions
3) Base scalar type
4) Base index type
5) "Wrapper" subclasses for all NumPy types fitting into categories #1, #2, #3, #4
6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
7) NDFrame, as cpcloud wrote, is just a list of these
Indexes and axis labels / column names can get layered on top.
After we do all this we can look at adding nested types (arrays, maps, structs) to better support JSON.
- Wes
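A toy Python rendering of that layering (all class names here are hypothetical placeholders, not the proposed C++ API):

```python
import numpy as np

class DataType:                 # 1) base dtype type
    pass

class Int64Type(DataType):      # 5) wrapper subclass for a NumPy type
    numpy_dtype = np.dtype("int64")

class Array:                    # 2) base array type, K >= 1 dimensions
    def __init__(self, values, dtype):
        self.dtype = dtype
        self.values = np.asarray(values, dtype=dtype.numpy_dtype)

class NDFrame:                  # 7) just a list of arrays; labels layer on top
    def __init__(self, arrays, columns):
        self.arrays = list(arrays)
        self.columns = list(columns)

frame = NDFrame([Array([1, 2, 3], Int64Type())], ["a"])
print(frame.columns)                    # ['a']
print(frame.arrays[0].values.tolist())  # [1, 2, 3]
```

Nested types (arrays, maps, structs) would then slot in as further DataType subclasses.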
Maybe this is saying the same thing as Wes, but how far would something like this get us?
// warning: things are probably not this simple
struct data_array_t {
    void *primitive;                 // scalar data
    data_array_t *nested;            // nested data
    boost::dynamic_bitset<> isnull;  // might have to create our own to avoid boost
    schema_t schema;                 // not sure exactly what this looks like
};

typedef std::map<std::string, data_array_t> data_frame_t;  // probably not this simple
To answer Jeff’s use-case question: I think that the use cases are 1) freedom from numpy (mostly) 2) no more block manager which frees us from the limitations of the block memory layout. In particular, the ability to take advantage of memory mapped IO would be a big win IMO.
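To make the memory-mapped IO point concrete, here is a hedged stdlib-only sketch (the file name and layout are invented for illustration): column bytes are read in place from a mapped file rather than being copied into an interpreter-managed buffer first.

```python
# Sketch: one float64 "column" persisted to disk, then read back via mmap.
import mmap
import os
import struct
import tempfile

path = os.path.join(tempfile.mkdtemp(), "col.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<3d", 1.0, 2.0, 3.0))  # little-endian float64 column

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # unpack_from reads straight out of the mapping; no intermediate
        # copy of the whole file into a Python bytes object
        values = struct.unpack_from("<3d", mm, 0)
```

A real columnar engine would of course keep the mapping open and hand out zero-copy array views over it; this only shows the mechanism.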
On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney <wesmckinn@gmail.com> wrote:
I will write a more detailed response to some of these things after the new year, but, in particular, re: missing values, can you or someone tell me why creating an object that contains a NumPy array and a bitmap is not sufficient? If we can add a lightweight C/C++ class layer between NumPy function calls (e.g. arithmetic) and pandas function calls, then I see no reason why we cannot have
Int32Array->add
and
Float32Array->add
do the right thing (the former would be responsible for bitmasking to propagate NA values; the latter would defer to NumPy). If we can put all the internals of pandas objects inside a black box, we can add layers of virtual function indirection without a performance penalty (whereas in interpreted Python, each added abstraction layer does add up to a perf penalty).
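The array-plus-bitmap idea can be sketched in a few lines of pure Python (class and method names here are invented for illustration, not a real pandas or NumPy API): the integer array carries a validity bitmap alongside the payload, and add() ANDs the bitmaps so NA propagates with no sentinel value needed.

```python
class Int32Array:
    def __init__(self, values, valid):
        self.values = list(values)  # raw integer payload, no NaN sentinel
        self.valid = list(valid)    # True where a slot holds real data

    def add(self, other):
        values = [a + b for a, b in zip(self.values, other.values)]
        # NA propagates: a slot is valid only if both inputs were valid
        valid = [u and v for u, v in zip(self.valid, other.valid)]
        return Int32Array(values, valid)

    def to_list(self):
        # Render NA slots as None; the payload there is garbage by design
        return [v if ok else None for v, ok in zip(self.values, self.valid)]

a = Int32Array([1, 2, 3], [True, False, True])
b = Int32Array([4, 5, 6], [True, True, True])
result = a.add(b).to_list()
```

A Float32Array.add would skip the bitmap work entirely and defer to NumPy, which is the point of the virtual dispatch.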
I don't think this is too scary -- I would be willing to create a small POC C++ library to prototype something like what I'm talking about.
Since pandas has limited points of contact with NumPy I don't think this would end up being too onerous.
For the record, I'm pretty allergic to "advanced C++"; I think it is a useful tool, and if you pick a sane 20% subset of the C++11 spec and follow Google C++ style it's not very inaccessible to intermediate developers. More or less "C plus OOP and easier object lifetime management (shared/unique_ptr, etc.)". As soon as you add a lot of template metaprogramming, C++ library development quickly becomes inaccessible to all but the C++ Jedi.
Maybe let's start a Google document on "pandas roadmap" where we can break down the 1-2 year goals and some of these infrastructure issues and have our discussion there? (obviously publish this someplace once we're done)
- Wes
On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback <jeffreback@gmail.com> wrote:
Here are some of my thoughts about pandas Roadmap / status and some responses to Wes's thoughts.
In the last few (and upcoming) major releases we have made the following changes:
- dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making these first class objects
- code refactoring to remove subclassing of ndarrays for Series & Index
- carving out / deprecating non-core parts of pandas
  - datareader
  - SparsePanel, WidePanel & other aliases (TimeSeries)
  - rpy, rplot, irow et al.
  - google-analytics
- API changes to make things more consistent
  - pd.rolling/expanding * -> .rolling/expanding (this is in master now)
  - .resample becoming a fully deferred object, like groupby
- multi-index slicing along any level (obviates need for .xs) and allows assignment
- .loc/.iloc - for the most part obviates use of .ix
- .pipe & .assign
- plotting accessors
- fixing of the sorting API
- many performance enhancements, both micro & macro (e.g. releasing the GIL)
Some on-deck enhancements (meaning these are basically ready to go in):

- IntervalIndex (and eventually make PeriodIndex just a sub-class of this)
- RangeIndex
So: lots of changes, though nothing really earth-shaking, just more convenience, somewhat reduced magicness, and more flexibility.
Of course we are getting an increasing number of issues, mostly bug reports (and lots of dupes), some edge-case enhancements which would add to the existing APIs, and of course requests to expand the (already) large codebase to other use cases. Balancing this are a good many pull requests from many different users, some even deep into the internals.
Here are some things that I have talked about and could be considered for the roadmap. Disclaimer: I do work for Continuum but these views are of course my own; furthermore obviously I am a bit more familiar with some of the 'sponsored' open-source libraries, but always open to new things.
- integration / automatic deferral to numba for JIT (this would be through .apply)
- automatic deferral to dask from groupby where appropriate / maybe a .to_parallel (to simply return a dask.DataFrame object)
- incorporation of quantities / units (as part of the dtype)
- use of DyND to allow missing values for int dtypes
- make Period a first class dtype
- provide some copy-on-write semantics to alleviate the chained-indexing issues which occasionally come up with misuse of the indexing API
- allow a 'policy' to automatically provide column blocks for dict-like input (e.g. each column would be a block); this would allow a pass-thru API where you could put in numpy arrays where you have views and have them preserved rather than copied automatically. Note that this would also allow what I call 'split', where a passed-in multi-dim numpy array could be split up into individual blocks (which actually gives a nice perf boost after the splitting costs).
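The copy-on-write item above can be sketched in pure Python (all names here are invented, not a pandas API): a derived frame shares its parent's block until the first write, which triggers a copy, so a chained-indexing write can never silently alias the parent's data.

```python
class Frame:
    def __init__(self, data, owns=True):
        self._data = data   # the shared "block"
        self._owns = owns   # do we own this block outright?

    def select(self):
        # A view-like child: shares the block, defers any copy
        return Frame(self._data, owns=False)

    def setitem(self, i, value):
        if not self._owns:
            # first write to a shared block: copy it, then take ownership
            self._data = list(self._data)
            self._owns = True
        self._data[i] = value

    def values(self):
        return list(self._data)

parent = Frame([1, 2, 3])
child = parent.select()
child.setitem(0, 99)   # copies on write; parent is untouched
```

The real design question is bookkeeping in the other direction too (writes to the parent while children exist), but the mechanism is the same.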
In working towards some of these goals, I have come to the opinion that it would make sense to have a neutral API protocol layer that would allow us to swap out different engines as needed, for particular dtypes, or *maybe* out-of-core computations. E.g. imagine that we replaced the in-memory block structure with a bcolz / memmap type; in theory this should be 'easy' and just work. I could also see us adopting *some* of the SFrame code to allow easier interop with this API layer.
In practice, I think a nice API layer would need to be created to make this clean / nice.
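One hedged sketch of what such a neutral layer could look like (every name here is invented): frame-level code programs against a small engine protocol, and an in-memory, bcolz-backed, or out-of-core implementation could be swapped in behind it.

```python
from abc import ABC, abstractmethod

class Engine(ABC):
    """Minimal protocol a storage/compute engine would implement."""

    @abstractmethod
    def get_column(self, name):
        """Return the column's values."""

    @abstractmethod
    def reduce(self, name, op):
        """Reduce one column with a named op, e.g. 'sum'."""

class InMemoryEngine(Engine):
    # The simplest engine: plain dict-of-lists storage
    def __init__(self, columns):
        self._columns = {k: list(v) for k, v in columns.items()}

    def get_column(self, name):
        return list(self._columns[name])

    def reduce(self, name, op):
        if op == "sum":
            return sum(self._columns[name])
        raise NotImplementedError(op)

engine = InMemoryEngine({"a": [1, 2, 3]})
```

The point is that the frame never touches block layout directly, so replacing InMemoryEngine with something memory-mapped or chunked is invisible above this line.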
So this comes around to Wes's point about creating a C++ library for the internals (and possibly even some of the indexing routines). In an ideal world, of course, this would be desirable. Getting there is a bit non-trivial I think, and IMHO might not be worth the effort. I don't really see big performance bottlenecks. We *already* defer much of the computation to libraries like numexpr & bottleneck (where appropriate). Adding numba / dask to the list would be helpful.
I think that almost all performance issues are the result of:
a) gross misuse of the pandas API. How much code have you seen that does df.apply(lambda x: x.sum())?
b) routines which operate column-by-column rather than block-by-block and are in Python space (e.g. we have an issue right now about .quantile)
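To illustrate point (a) without pandas itself: apply-with-a-lambda routes a built-in reduction through an extra user-level Python call per column, for exactly the answer the direct reduction already gives (in real pandas, df.apply(lambda x: x.sum()) vs df.sum(), where the latter dispatches to compiled code). A stdlib mock of the shape of the problem:

```python
columns = {"a": [1, 2, 3], "b": [4, 5, 6]}

lambda_calls = 0

def apply(frame, func):
    # mimic DataFrame.apply: invoke func once per column, in Python space
    return {name: func(col) for name, col in frame.items()}

def counting_sum(col):
    # stand-in for the user's lambda; counts its own invocations
    global lambda_calls
    lambda_calls += 1
    return sum(col)

via_apply = apply(columns, counting_sum)                     # the anti-pattern
direct = {name: sum(col) for name, col in columns.items()}   # the direct reduction
```

Same result, plus one Python-level round trip per column that a vectorized reduction never pays.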
So I am glossing over a big goal of having a C++ library that represents the pandas internals. This would by definition have a C API, so that you *could* use pandas-like semantics in C/C++ and just have it work (and then pandas would be a thin wrapper around this library).
I am not averse to this, but I think it would be quite a big effort, and not a huge perf boost IMHO. Further, there are a number of API issues w.r.t. indexing which need to be clarified / worked out (e.g. should we simply deprecate []?) that are much easier to test / figure out in Python space.
I also think that we have quite a large number of contributors. Moving to C++ might make the internals a bit more impenetrable than the current internals (though this would allow C++ people to contribute, so that might balance out).
We have a limited core of devs who are familiar with the internals right now. If someone happened to have a starting base for a C++ library, then I might change my opinion here.
my 4c.
Jeff
On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Deep thoughts during the holidays.
I might be out of line here, but the interpreter-heaviness of the inside of pandas objects is likely to be a long-term liability and source of performance problems and technical debt.
Has anyone put any thought into planning and beginning to execute on a rewrite that moves as much as possible of the internals into native / compiled code? I'm talking about:
- pandas/core/internals
- indexing and assignment
- much of pandas/core/common
- categorical and custom dtypes
- all indexing mechanisms
I'm concerned we've already exposed too much of the internals to users, so this might lead to a lot of API breakage, but it might be for the Greater Good. As a first step, beginning a partial migration of internals into some C++ classes that encapsulate the insides of DataFrame objects and implement indexing and block-level manipulations would be a good place to start. I think you could do this without too much disruption.
As part of this internal retooling we might give consideration to alternative data structures for representing data internal to pandas objects. Now in 2015/2016, continuing to be hamstrung by NumPy's limitations feels somewhat anachronistic. User code is riddled with workarounds for data type fidelity issues and the like. Like, really, why not add a bitndarray (similar to ilanschnell/bitarray) for storing nullness for problematic types and hide this from the user? =)
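The bitndarray idea above amounts to packing validity one bit per slot, an 8x memory saving over a byte-per-slot boolean mask. A hedged stdlib sketch (names invented):

```python
class BitMask:
    """Validity mask packed one bit per slot into a bytearray."""

    def __init__(self, n):
        self.n = n
        self._bits = bytearray((n + 7) // 8)  # 1 bit per slot, rounded up

    def set(self, i, flag=True):
        byte, bit = divmod(i, 8)
        if flag:
            self._bits[byte] |= 1 << bit
        else:
            self._bits[byte] &= ~(1 << bit)

    def get(self, i):
        byte, bit = divmod(i, 8)
        return bool(self._bits[byte] & (1 << bit))

mask = BitMask(10)   # 10 slots fit in 2 bytes, vs 10 bytes for a bool array
mask.set(3)
mask.set(9)
```

ilanschnell/bitarray and NumPy-backed variants do exactly this with vectorized bit ops; the win is both memory and cache behavior when masks ride alongside large columns.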
The other huge thing this will enable is copy-on-write for various kinds of views, which should cut down on some of the defensive copying in the library and reduce memory usage.
Wes your last is noted as well. I *think* we can actually do this now (well there is a PR out there). On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
The other huge thing this will enable is to do is copy-on-write for various kinds of views, which should cut down on some of the defensive copying in the library and reduce memory usage.
Basically the approach is
1) Base dtype type 2) Base array type with K >= 1 dimensions 3) Base scalar type 4) Base index type 5) "Wrapper" subclasses for all NumPy types fitting into categories #1, #2, #3, #4 6) Subclasses for pandas-specific types like category, datetimeTZ, etc. 7) NDFrame as cpcloud wrote is just a list of these
Indexes and axis labels / column names can get layered on top.
After we do all this we can look at adding nested types (arrays, maps, structs) to better support JSON.
- Wes
On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud <cpcloud@gmail.com> wrote:
Maybe this is saying the same thing as Wes, but how far would something
this get us?
// warning: things are probably not this simple
struct data_array_t { void *primitive; // scalar data data_array_t *nested; // nested data boost::dynamic_bitset isnull; // might have to create our own to avoid boost schema_t schema; // not sure exactly what this looks like };
typedef std::map<string, data_array_t> data_frame_t; // probably not
simple
To answer Jeff’s use-case question: I think that the use cases are 1) freedom from numpy (mostly) 2) no more block manager which frees us from the limitations of the block memory layout. In particular, the ability to take advantage of memory mapped IO would be a big win IMO.
On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney <wesmckinn@gmail.com> wrote:
I will write a more detailed response to some of these things after the new year, but, in particular, re: missing values, can you or someone tell me why creating an object that contains a NumPy array and a bitmap is not sufficient? If we we can add a lightweight C/C++ class layer between NumPy function calls (e.g. arithmetic) and pandas function calls, then I see no reason why we cannot have
Int32Array->add
and
Float32Array->add
do the right thing (the former would be responsible for bitmasking to propagate NA values; the latter would defer to NumPy). If we can put all the internals of pandas objects inside a black box, we can add layers of virtual function indirection without a performance penalty (e.g. adding more interpreter overhead with more abstraction layers does add up to a perf penalty).
I don't think this is too scary -- I would be willing to create a small POC C++ library to prototype something like what I'm talking about.
Since pandas has limited points of contact with NumPy I don't think this would end up being too onerous.
For the record, I'm pretty allergic to "advanced C++"; I think it is a useful tool if you pick a sane 20% subset of the C++11 spec and follow Google C++ style it's not very inaccessible to intermediate developers. More or less "C plus OOP and easier object lifetime management (shared/unique_ptr, etc.)". As soon as you add a lot of template metaprogramming C++ library development quickly becomes inaccessible except to the C++-Jedi.
Maybe let's start a Google document on "pandas roadmap" where we can break down the 1-2 year goals and some of these infrastructure issues and have our discussion there? (obviously publish this someplace once we're done)
- Wes
On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback <jeffreback@gmail.com>
wrote:
Here are some of my thoughts about pandas Roadmap / status and some responses to Wes's thoughts.
In the last few (and upcoming) major releases we have been made the following changes:
- dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making these first class objects - code refactoring to remove subclassing of ndarrays for Series & Index - carving out / deprecating non-core parts of pandas - datareader - SparsePanel, WidePanel & other aliases (TImeSeries) - rpy, rplot, irow et al. - google-analytics - API changes to make things more consistent - pd.rolling/expanding * -> .rolling/expanding (this is in master now) - .resample becoming a full defered like groupby. - multi-index slicing along any level (obviates need for .xs) and allows assignment - .loc/.iloc - for the most part obviates use of .ix - .pipe & .assign - plotting accessors - fixing of the sorting API - many performance enhancements both micro & macro (e.g. release GIL)
Some on-deck enhancements are (meaning these are basically ready to go in): - IntervalIndex (and eventually make PeriodIndex just a sub-class of this) - RangeIndex
so lots of changes, though nothing really earth shaking, just more convenience, reducing magicness somewhat and providing flexibility.
Of course we are getting increasing issues, mostly bug reports (and lots of dupes), some edge case enhancements which can add to the existing API's and of course, requests to expand the (already) large code to other usecases. Balancing this are a good many pull-requests from many different users, some even deep into the internals.
Here are some things that I have talked about and could be considered for the roadmap. Disclaimer: I do work for Continuum but these views are of course my own; furthermore obviously I am a bit more familiar with some of the 'sponsored' open-source libraries, but always open to new things.
- integration / automatic deferral to numba for JIT (this would be
.apply) - automatic deferal to dask from groubpy where appropriate / maybe a .to_parallel (to simply return a dask.DataFrame object) - incorporation of quantities / units (as part of the dtype) - use of DyND to allow missing values for int dtypes - make Period a first class dtype. - provide some copy-on-write semantics to alleviate the chained-indexing issues which occasionaly come up with the mis-use of the indexing API - allow a 'policy' to automatically provide column blocks for dict-like input (e.g. each column would be a block), this would allow a
API where you could put in numpy arrays where you have views and have them preserved rather than copied automatically. Note that this would also allow what I call 'split' where a passed in multi-dim numpy array could be split up to individual blocks (which actually gives a nice perf boost after the splitting costs).
In working towards some of these goals. I have come to the opinion
it would make sense to have a neutral API protocol layer that would allow us to swap out different engines as needed, for particular dtypes, or *maybe* out-of-core type computations. E.g. imagine that we replaced the in-memory block structure with a bclolz / memap type; in theory this should be 'easy' and just work. I could also see us adopting *some* of the SFrame code to allow easier interop with this API layer.
In practice, I think a nice API layer would need to be created to make this clean / nice.
So this comes around to Wes's point about creating a c++ library for
internals (and possibly even some of the indexing routines). In an ideal world, or course this would be desirable. Getting there is a bit non-trivial I think, and IMHO might not be worth the effort. I don't really see big performance bottlenecks. We *already* defer much of
computation to libraries like numexpr & bottleneck (where appropriate). Adding numba / dask to the list would be helpful.
I think that almost all performance issues are the result of:
a) gross misuse of the pandas API. How much code have you seen that does df.apply(lambda x: x.sum()) b) routines which operate column-by-column rather block-by-block and are in python space (e.g. we have an issue right now about .quantile)
So I am glossing over a big goal of having a c++ library that represents the pandas internals. This would by definition have a c-API that so you *could* use pandas like semantics in c/c++ and just have it work (and then pandas would be a thin wrapper around this library).
I am not averse to this, but I think would be quite a big effort, and not a huge perf boost IMHO. Further there are a number of API issues w.r.t. indexing which need to be clarified / worked out (e.g. should we simply deprecate []) that are much easier to test / figure out in python space.
I also thing that we have quite a large number of contributors. Moving to c++ might make the internals a bit more impenetrable that the current internals. (though this would allow c++ people to contribute, so that might balance out).
We have a limited core of devs whom right now are familar with
If someone happened to have a starting base for a c++ library, then I might change opinions here.
my 4c.
Jeff
On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Deep thoughts during the holidays.
I might be out of line here, but the interpreter-heaviness of the inside of pandas objects is likely to be a long-term liability and source of performance problems and technical debt.
Has anyone put any thought into planning and beginning to execute
on a
rewrite that moves as much as possible of the internals into native / compiled code? I'm talking about:
- pandas/core/internals
- indexing and assignment
- much of pandas/core/common
- categorical and custom dtypes
- all indexing mechanisms
I'm concerned we've already exposed too much internals to users, so this might lead to a lot of API breakage, but it might be for the Greater Good. As a first step, beginning a partial migration of internals into some C++ classes that encapsulate the insides of DataFrame objects and implement indexing and block-level manipulations would be a good place to start. I think you could do this without too much disruption.
As part of this internal retooling we might give consideration to alternative data structures for representing data internal to pandas objects. Now in 2015/2016, continuing to be hamstrung by NumPy's limitations feels somewhat anachronistic. User code is riddled with workarounds for data type fidelity issues and the like. Like, really, why not add a bitndarray (similar to ilanschnell/bitarray) for storing nullness for problematic types and hide this from the user? =)
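The bitmap-for-nullness idea can be sketched in a few lines of Python; the class name and methods here are hypothetical, just to show how a separate validity mask sidesteps NumPy's lack of an integer NA:

```python
import numpy as np

class NullableIntArray:
    """Hypothetical sketch: int64 values plus a validity bitmap (a bool array here)."""
    def __init__(self, values, valid):
        self.values = np.asarray(values, dtype=np.int64)
        self.valid = np.asarray(valid, dtype=bool)

    def sum(self):
        # reductions simply skip positions whose validity bit is unset
        return int(self.values[self.valid].sum())

    def tolist(self):
        # nulls surface as None at the boundary; storage stays int64 throughout
        return [int(v) if ok else None for v, ok in zip(self.values, self.valid)]

arr = NullableIntArray([1, 2, 999, 4], [True, True, False, True])
assert arr.sum() == 7                      # the masked 999 is ignored
assert arr.tolist() == [1, 2, None, 4]     # and hidden from the user
```

The point is that the values buffer never has to leave int64, which is exactly what NumPy alone cannot express.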
Since we are now a NumFOCUS-sponsored project, I feel like we might consider establishing some formal governance over pandas and publishing roadmap documents describing plans for the project and meeting notes from committers. There's no real "committer culture" for NumFOCUS projects like there is with the Apache Software Foundation, but we might try leading by example!
Also, I believe pandas as a project has reached a level of importance where we ought to consider planning and execution on larger scale undertakings such as this for safeguarding the future.
As for myself, well, I have my hands full in Big Data-land. I wish I could be helping more with pandas, but there are quite a few fundamental issues (like data interoperability, nested data handling and file format support — e.g. Parquet, see http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/ ) preventing Python from being more useful in industry analytics applications.
Aside: one of the bigger mistakes I made with pandas's API design was making it acceptable to call class constructors — like pandas.DataFrame — directly (versus factory functions). Sorry about that! If we could convince everyone to start writing pandas.data_frame or dataframe instead of using the class reference it would help a lot with code cleanup. It's hard to plan for these things — NumPy interoperability seemed a lot more important in 2008 than it does now, so I forgive myself.
cheers and best wishes for 2016, Wes _______________________________________________ Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
Can you link to the PR you're talking about?

I will see about spending a few hours setting up a libpandas.so as a C++ shared library where we can run some experiments and validate whether it can solve the integer-NA problem and be a place to put new data types (categorical and friends). I'm +1 on targeting

Would it also be worth making a wish list of APIs we might consider breaking in a pandas 1.0 release that also features this new "native core"? Might as well right some wrongs while we're doing some invasive work on the internals; some breakage might be unavoidable. We can always maintain a pandas legacy 0.x.x maintenance branch (providing a conda binary build) for legacy users where showstopper bugs can get fixed.

On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback <jeffreback@gmail.com> wrote:
Wes your last is noted as well. I *think* we can actually do this now (well there is a PR out there).
On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
The other huge thing this will enable is copy-on-write for various kinds of views, which should cut down on some of the defensive copying in the library and reduce memory usage.
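A toy version of that copy-on-write idea (all names hypothetical, not a proposed design): views share one buffer, and the first mutation is what pays for the copy, rather than copying defensively up front.

```python
import numpy as np

class COWArray:
    """Hypothetical copy-on-write wrapper: cheap views, copy deferred to first write."""
    def __init__(self, data, _shared=None):
        self._data = data
        # a mutable cell shared by all views, so any writer can see the view count
        self._shared = _shared if _shared is not None else [1]

    def view(self):
        self._shared[0] += 1
        return COWArray(self._data, self._shared)

    def __setitem__(self, key, value):
        if self._shared[0] > 1:        # another view still sees this buffer
            self._data = self._data.copy()
            self._shared[0] -= 1       # detach from the shared group
            self._shared = [1]
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

base = COWArray(np.arange(5))
v = base.view()           # no copy here
v[0] = 100                # copy happens now; base is untouched
assert base[0] == 0 and v[0] == 100
```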
On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Basically the approach is
1) Base dtype type
2) Base array type with K >= 1 dimensions
3) Base scalar type
4) Base index type
5) "Wrapper" subclasses for all NumPy types fitting into categories #1, #2, #3, #4
6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
7) NDFrame as cpcloud wrote is just a list of these
Indexes and axis labels / column names can get layered on top.
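In Python terms, that layering might look roughly like the sketch below; every class name is illustrative only, not a proposed API:

```python
import numpy as np

class DType:                      # 1) base dtype type
    name = "base"

class Int64DType(DType):          # 5) thin wrapper over a NumPy type
    name = "int64"
    numpy_dtype = np.dtype("int64")

class CategoryDType(DType):       # 6) pandas-specific type, no NumPy equivalent
    name = "category"
    def __init__(self, categories):
        self.categories = list(categories)

class Array:                      # 2) base array type, K >= 1 dimensions
    def __init__(self, data, dtype):
        self.data, self.dtype = data, dtype
    def __len__(self):
        return len(self.data)

class NDFrame:                    # 7) essentially just a list of arrays
    def __init__(self, arrays):
        self.arrays = list(arrays)

frame = NDFrame([
    Array(np.arange(3), Int64DType()),
    Array(np.array([0, 1, 0]), CategoryDType(["x", "y"])),
])
assert len(frame.arrays) == 2
assert frame.arrays[1].dtype.name == "category"
```

Indexes and axis labels would then wrap the NDFrame, as the message above says, rather than living inside the arrays.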
After we do all this we can look at adding nested types (arrays, maps, structs) to better support JSON.
- Wes
On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud <cpcloud@gmail.com> wrote:
Maybe this is saying the same thing as Wes, but how far would something like this get us?
// warning: things are probably not this simple
struct data_array_t {
    void *primitive;               // scalar data
    data_array_t *nested;          // nested data
    boost::dynamic_bitset isnull;  // might have to create our own to avoid boost
    schema_t schema;               // not sure exactly what this looks like
};
typedef std::map<string, data_array_t> data_frame_t; // probably not this simple
To answer Jeff’s use-case question: I think that the use cases are 1) freedom from numpy (mostly) 2) no more block manager which frees us from the limitations of the block memory layout. In particular, the ability to take advantage of memory mapped IO would be a big win IMO.
On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney <wesmckinn@gmail.com> wrote:
I will write a more detailed response to some of these things after the new year, but, in particular, re: missing values, can you or someone tell me why creating an object that contains a NumPy array and a bitmap is not sufficient? If we can add a lightweight C/C++ class layer between NumPy function calls (e.g. arithmetic) and pandas function calls, then I see no reason why we cannot have Int32Array->add and Float32Array->add do the right thing (the former would be responsible for bitmasking to propagate NA values; the latter would defer to NumPy). If we can put all the internals of pandas objects inside a black box, we can add layers of virtual function indirection without a performance penalty (whereas adding more abstraction layers in interpreter space does add up to a perf penalty).
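A minimal Python sketch of that dispatch (the class names and `add` methods are hypothetical stand-ins for the proposed C++ layer): the integer path propagates the NA bitmap, while the float path just defers to NumPy, where NaN already rides along.

```python
import numpy as np

class Int32Array:
    """Values plus a validity bitmap; NA propagates through arithmetic."""
    def __init__(self, values, valid):
        self.values = np.asarray(values, dtype=np.int32)
        self.valid = np.asarray(valid, dtype=bool)

    def add(self, other):
        # result is null wherever either operand was null
        return Int32Array(self.values + other.values, self.valid & other.valid)

class Float32Array:
    """No bitmap needed: NaN survives NumPy arithmetic on its own."""
    def __init__(self, values):
        self.values = np.asarray(values, dtype=np.float32)

    def add(self, other):
        return Float32Array(self.values + other.values)

a = Int32Array([1, 2, 3], [True, False, True])
b = Int32Array([10, 20, 30], [True, True, True])
c = a.add(b)
assert list(c.valid) == [True, False, True]   # NA propagated at position 1
assert c.values[0] == 11 and c.values[2] == 33
```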
I don't think this is too scary -- I would be willing to create a small POC C++ library to prototype something like what I'm talking about.
Since pandas has limited points of contact with NumPy I don't think this would end up being too onerous.
For the record, I'm pretty allergic to "advanced C++"; I think it is a useful tool: if you pick a sane 20% subset of the C++11 spec and follow Google C++ style, it's not inaccessible to intermediate developers. More or less "C plus OOP and easier object lifetime management (shared/unique_ptr, etc.)". As soon as you add a lot of template metaprogramming, C++ library development quickly becomes inaccessible except to the C++-Jedi.
Maybe let's start a Google document on "pandas roadmap" where we can break down the 1-2 year goals and some of these infrastructure issues and have our discussion there? (obviously publish this someplace once we're done)
- Wes
https://github.com/pydata/pandas/pull/11500. I annotated in the shared google doc as well. There is a section on some pandas 1.0 things to do.
Hit send by accident. I meant to say targeting pandas/core/internals.py with the initial explorations.
Hi Wes (and others), I've been following this conversation with interest. I do think it would be worth exploring DyND, rather than setting up yet another rewrite of NumPy-functionality. Especially because DyND is already an optional dependency of Pandas. For things like Integer NA and new dtypes, DyND is there and ready to do this. Irwin On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Can you link to the PR you're talking about?
I will see about spending a few hours setting up a libpandas.so as a C++ shared library where we can run some experiments and validate whether it can solve the integer-NA problem and be a place to put new data types (categorical and friends). I'm +1 on targeting
Would it also be worth making a wish list of APIs we might consider breaking in a pandas 1.0 release that also features this new "native core"? Might as well right some wrongs while we're doing some invasive work on the internals; some breakage might be unavoidable. We can always maintain a pandas legacy 0.x.x maintenance branch (providing a conda binary build) for legacy users where showstopper bugs can get fixed.
Wes your last is noted as well. I *think* we can actually do this now (well there is a PR out there).
On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
The other huge thing this will enable us to do is copy-on-write for various kinds of views, which should cut down on some of the defensive copying in the library and reduce memory usage.
On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Basically the approach is
1) Base dtype type
2) Base array type with K >= 1 dimensions
3) Base scalar type
4) Base index type
5) "Wrapper" subclasses for all NumPy types fitting into categories #1, #2, #3, #4
6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
7) NDFrame as cpcloud wrote is just a list of these
Indexes and axis labels / column names can get layered on top.
After we do all this we can look at adding nested types (arrays, maps, structs) to better support JSON.
- Wes
On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud <cpcloud@gmail.com> wrote:
Maybe this is saying the same thing as Wes, but how far would something like this get us?
// warning: things are probably not this simple

struct data_array_t {
    void *primitive;               // scalar data
    data_array_t *nested;          // nested data
    boost::dynamic_bitset isnull;  // might have to create our own to avoid boost
    schema_t schema;               // not sure exactly what this looks like
};

typedef std::map<string, data_array_t> data_frame_t; // probably not this simple
To answer Jeff’s use-case question: I think that the use cases are 1) freedom from numpy (mostly) 2) no more block manager which frees us from the limitations of the block memory layout. In particular, the ability to take advantage of memory mapped IO would be a big win IMO.
On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney <wesmckinn@gmail.com> wrote:
I will write a more detailed response to some of these things after the new year, but, in particular, re: missing values, can you or someone tell me why creating an object that contains a NumPy array and a bitmap is not sufficient? If we can add a lightweight C/C++ class layer between NumPy function calls (e.g. arithmetic) and pandas function calls, then I see no reason why we cannot have

Int32Array->add

and

Float32Array->add

do the right thing (the former would be responsible for bitmasking to propagate NA values; the latter would defer to NumPy). If we can put all the internals of pandas objects inside a black box, we can add layers of virtual function indirection without a performance penalty (e.g. adding more interpreter overhead with more abstraction layers does add up to a perf penalty).
I don't think this is too scary -- I would be willing to create a small POC C++ library to prototype something like what I'm talking about.
Since pandas has limited points of contact with NumPy I don't think this would end up being too onerous.
For the record, I'm pretty allergic to "advanced C++"; I think it is a useful tool, and if you pick a sane 20% subset of the C++11 spec and follow Google C++ style, it's not very inaccessible to intermediate developers. More or less "C plus OOP and easier object lifetime management (shared/unique_ptr, etc.)". As soon as you add a lot of template metaprogramming, C++ library development quickly becomes inaccessible except to the C++-Jedi.
Maybe let's start a Google document on "pandas roadmap" where we can break down the 1-2 year goals and some of these infrastructure issues and have our discussion there? (obviously publish this someplace once we're done)
- Wes
On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback <jeffreback@gmail.com> wrote:
> [...]
> Adding numba / dask to the list would be helpful.
>
> I think that almost all performance issues are the result of:
>
> a) gross misuse of the pandas API. How much code have you seen that does df.apply(lambda x: x.sum())
> b) routines which operate column-by-column rather than block-by-block and are in python space (e.g. we have an issue right now about .quantile)
>
> So I am glossing over a big goal of having a c++ library that represents the pandas internals. This would by definition have a C-API so that you *could* use pandas-like semantics in c/c++ and just have it work (and then pandas would be a thin wrapper around this library).
>
> I am not averse to this, but I think it would be quite a big effort, and not a huge perf boost IMHO. Further, there are a number of API issues w.r.t. indexing which need to be clarified / worked out (e.g. should we simply deprecate []) that are much easier to test / figure out in python space.
>
> I also think that we have quite a large number of contributors. Moving to c++ might make the internals a bit more impenetrable than the current internals (though this would allow c++ people to contribute, so that might balance out).
>
> We have a limited core of devs who right now are familiar with things. If someone happened to have a starting base for a c++ library, then I might change opinions here.
>
> my 4c.
>
> Jeff
>
> On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>
>> Deep thoughts during the holidays.
>>
>> I might be out of line here, but the interpreter-heaviness of the inside of pandas objects is likely to be a long-term liability and source of performance problems and technical debt.
>>
>> Has anyone put any thought into planning and beginning to execute on a rewrite that moves as much as possible of the internals into native / compiled code?
>> I'm talking about:
>>
>> - pandas/core/internals
>> - indexing and assignment
>> - much of pandas/core/common
>> - categorical and custom dtypes
>> - all indexing mechanisms
>>
>> I'm concerned we've already exposed too much of the internals to users, so this might lead to a lot of API breakage, but it might be for the Greater Good. As a first step, beginning a partial migration of internals into some C++ classes that encapsulate the insides of DataFrame objects and implement indexing and block-level manipulations would be a good place to start. I think you could do this without too much disruption.
>>
>> As part of this internal retooling we might give consideration to alternative data structures for representing data internal to pandas objects. Now in 2015/2016, continuing to be hamstrung by NumPy's limitations feels somewhat anachronistic. User code is riddled with workarounds for data type fidelity issues and the like. Like, really, why not add a bitndarray (similar to ilanschnell/bitarray) for storing nullness for problematic types and hide this from the user? =)
>>
>> Since we are now a NumFOCUS-sponsored project, I feel like we might consider establishing some formal governance over pandas and publishing meeting notes and roadmap documents describing plans for the project. There's no real "committer culture" for NumFOCUS projects like there is with the Apache Software Foundation, but we might try leading by example!
>>
>> Also, I believe pandas as a project has reached a level of importance where we ought to consider planning and execution on larger scale undertakings such as this for safeguarding the future.
>>
>> As for myself, well, I have my hands full in Big Data-land. I wish I could be helping more with pandas, but there are quite a few fundamental issues (like data interoperability, nested data handling, and file format support — e.g. Parquet, see http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/ ) preventing Python from being more useful in industry analytics applications.
>>
>> Aside: one of the bigger mistakes I made with pandas's API design was making it acceptable to call class constructors — like pandas.DataFrame — directly (versus factory functions). Sorry about that! If we could convince everyone to start writing pandas.data_frame or dataframe instead of using the class reference it would help a lot with code cleanup. It's hard to plan for these things — NumPy interoperability seemed a lot more important in 2008 than it does now, so I forgive myself.
>>
>> cheers and best wishes for 2016,
>> Wes
>>
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev@python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev
I'm not suggesting a rewrite of NumPy functionality but rather pandas functionality that is currently written in a mishmash of Cython and Python. Happy to experiment with changing the internal compute infrastructure and data representation to DyND after this first stage of cleanup is done. Even if we use DyND, a pretty extensive pandas wrapper layer will be necessary.
Yeah, that seems reasonable and I totally agree a pandas wrapper layer would be necessary. I'll keep an eye on this and I'd like to help if I can.

Irwin
>> > following changes: >> > >> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) & >> > making >> > these >> > first class objects >> > - code refactoring to remove subclassing of ndarrays for Series & >> > Index >> > - carving out / deprecating non-core parts of pandas >> > - datareader >> > - SparsePanel, WidePanel & other aliases (TImeSeries) >> > - rpy, rplot, irow et al. >> > - google-analytics >> > - API changes to make things more consistent >> > - pd.rolling/expanding * -> .rolling/expanding (this is in master >> > now) >> > - .resample becoming a full defered like groupby. >> > - multi-index slicing along any level (obviates need for .xs) and >> > allows >> > assignment >> > - .loc/.iloc - for the most part obviates use of .ix >> > - .pipe & .assign >> > - plotting accessors >> > - fixing of the sorting API >> > - many performance enhancements both micro & macro (e.g. release >> > GIL) >> > >> > Some on-deck enhancements are (meaning these are basically ready to >> > go >> > in): >> > - IntervalIndex (and eventually make PeriodIndex just a sub-class >> > of >> > this) >> > - RangeIndex >> > >> > so lots of changes, though nothing really earth shaking, just more >> > convenience, reducing magicness somewhat >> > and providing flexibility. >> > >> > Of course we are getting increasing issues, mostly bug reports (and >> > lots >> > of >> > dupes), some edge case enhancements >> > which can add to the existing API's and of course, requests to >> > expand >> > the >> > (already) large code to other usecases. >> > Balancing this are a good many pull-requests from many different >> > users, >> > some >> > even deep into the internals. >> > >> > Here are some things that I have talked about and could be >> > considered >> > for >> > the roadmap. 
Disclaimer: I do work for Continuum >> > but these views are of course my own; furthermore obviously I am a >> > bit >> > more >> > familiar with some of the 'sponsored' open-source >> > libraries, but always open to new things. >> > >> > - integration / automatic deferral to numba for JIT (this would be >> > thru >> > .apply) >> > - automatic deferal to dask from groubpy where appropriate / maybe a >> > .to_parallel (to simply return a dask.DataFrame object) >> > - incorporation of quantities / units (as part of the dtype) >> > - use of DyND to allow missing values for int dtypes >> > - make Period a first class dtype. >> > - provide some copy-on-write semantics to alleviate the >> > chained-indexing >> > issues which occasionaly come up with the mis-use of the indexing >> > API >> > - allow a 'policy' to automatically provide column blocks for >> > dict-like >> > input (e.g. each column would be a block), this would allow a >> > pass-thru >> > API >> > where you could >> > put in numpy arrays where you have views and have them preserved >> > rather >> > than >> > copied automatically. Note that this would also allow what I call >> > 'split' >> > where a passed in >> > multi-dim numpy array could be split up to individual blocks (which >> > actually >> > gives a nice perf boost after the splitting costs). >> > >> > In working towards some of these goals. I have come to the opinion >> > that >> > it >> > would make sense to have a neutral API protocol layer >> > that would allow us to swap out different engines as needed, for >> > particular >> > dtypes, or *maybe* out-of-core type computations. E.g. >> > imagine that we replaced the in-memory block structure with a bclolz >> > / >> > memap >> > type; in theory this should be 'easy' and just work. >> > I could also see us adopting *some* of the SFrame code to allow >> > easier >> > interop with this API layer. 
>> > >> > In practice, I think a nice API layer would need to be created to >> > make >> > this >> > clean / nice. >> > >> > So this comes around to Wes's point about creating a c++
>> > the >> > internals (and possibly even some of the indexing routines). >> > In an ideal world, or course this would be desirable. Getting
>> > is a >> > bit >> > non-trivial I think, and IMHO might not be worth the effort. I don't >> > really see big performance bottlenecks. We *already* defer much of >> > the >> > computation to libraries like numexpr & bottleneck (where >> > appropriate). >> > Adding numba / dask to the list would be helpful. >> > >> > I think that almost all performance issues are the result of: >> > >> > a) gross misuse of the pandas API. How much code have you seen
>> > does >> > df.apply(lambda x: x.sum()) >> > b) routines which operate column-by-column rather block-by-block and >> > are >> > in >> > python space (e.g. we have an issue right now about .quantile) >> > >> > So I am glossing over a big goal of having a c++ library that >> > represents >> > the >> > pandas internals. This would by definition have a c-API that so >> > you *could* use pandas like semantics in c/c++ and just have it work >> > (and >> > then pandas would be a thin wrapper around this library). >> > >> > I am not averse to this, but I think would be quite a big effort, >> > and >> > not a >> > huge perf boost IMHO. Further there are a number of API issues >> > w.r.t. >> > indexing >> > which need to be clarified / worked out (e.g. should we simply >> > deprecate >> > []) >> > that are much easier to test / figure out in python space. >> > >> > I also thing that we have quite a large number of contributors. >> > Moving >> > to >> > c++ might make the internals a bit more impenetrable that the >> > current >> > internals. >> > (though this would allow c++ people to contribute, so that might >> > balance >> > out). >> > >> > We have a limited core of devs whom right now are familar with >> > things. >> > If >> > someone happened to have a starting base for a c++ library,
>> > might >> > change >> > opinions here. >> > >> > >> > my 4c. >> > >> > Jeff >> > >> > >> > >> > >> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney < wesmckinn@gmail.com> >> > wrote: >> >> >> >> Deep thoughts during the holidays. >> >> >> >> I might be out of line here, but the interpreter-heaviness of
>> >> inside of pandas objects is likely to be a long-term liability and >> >> source of performance problems and technical debt. >> >> >> >> Has anyone put any thought into planning and beginning to execute >> >> on a >> >> rewrite that moves as much as possible of the internals into native >> >> / >> >> compiled code? I'm talking about: >> >> >> >> - pandas/core/internals >> >> - indexing and assignment >> >> - much of pandas/core/common >> >> - categorical and custom dtypes >> >> - all indexing mechanisms >> >> >> >> I'm concerned we've already exposed too much internals to users, so >> >> this might lead to a lot of API breakage, but it might be for
>> >> Greater Good. As a first step, beginning a partial migration of >> >> internals into some C++ classes that encapsulate the insides of >> >> DataFrame objects and implement indexing and block-level >> >> manipulations >> >> would be a good place to start. I think you could do this wouldn't >> >> too >> >> much disruption. >> >> >> >> As part of this internal retooling we might give consideration to >> >> alternative data structures for representing data internal to >> >> pandas >> >> objects. Now in 2015/2016, continuing to be hamstrung by NumPy's >> >> limitations feels somewhat anachronistic. User code is riddled with >> >> workarounds for data type fidelity issues and the like. Like, >> >> really, >> >> why not add a bitndarray (similar to ilanschnell/bitarray) for >> >> storing >> >> nullness for problematic types and hide this from the user? =) >> >> >> >> Since we are now a NumFOCUS-sponsored project, I feel like we might >> >> consider establishing some formal governance over pandas and >> >> publishing meetings notes and roadmap documents describing
>> >> for >> >> the project and meetings notes from committers. There's no real >> >> "committer culture" for NumFOCUS projects like there is with
On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback <jeffreback@gmail.com> wrote: 1) put penalty think the library for there that then I the the plans the
>> >> Apache Software Foundation, but we might try leading by example! >> >> >> >> Also, I believe pandas as a project has reached a level of >> >> importance >> >> where we ought to consider planning and execution on larger scale >> >> undertakings such as this for safeguarding the future. >> >> >> >> As for myself, well, I have my hands full in Big Data-land. I wish >> >> I >> >> could be helping more with pandas, but there a quite a few >> >> fundamental >> >> issues (like data interoperability nested data handling and file >> >> format support — e.g. Parquet, see >> >> >> >> >> >> >> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/ ) >> >> preventing Python from being more useful in industry analytics >> >> applications. >> >> >> >> Aside: one of the bigger mistakes I made with pandas's API design >> >> was >> >> making it acceptable to call class constructors — like >> >> pandas.DataFrame — directly (versus factory functions). Sorry about >> >> that! If we could convince everyone to start writing >> >> pandas.data_frame >> >> or dataframe instead of using the class reference it would help a >> >> lot >> >> with code cleanup. It's hard to plan for these things — NumPy >> >> interoperability seemed a lot more important in 2008 than it does >> >> now, >> >> so I forgive myself. >> >> >> >> cheers and best wishes for 2016, >> >> Wes >> >> _______________________________________________ >> >> Pandas-dev mailing list >> >> Pandas-dev@python.org >> >> https://mail.python.org/mailman/listinfo/pandas-dev >> > >> > >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev@python.org >> https://mail.python.org/mailman/listinfo/pandas-dev
Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
_______________________________________________ Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
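Wes's Int32Array / Float32Array distinction in the quoted message above can be sketched as a toy model in Python with NumPy. Nothing here is proposed code: a boolean mask stands in for the validity bitmap, the class names mirror Wes's examples but are otherwise made up, and the only point is the dispatch — the integer type propagates NA through its mask, while the float type just defers to NumPy:

```python
import numpy as np


class Int64Array:
    """Toy model: NumPy int64 values plus a boolean 'isnull' mask
    standing in for the bitmap Wes describes."""

    def __init__(self, values, isnull=None):
        self.values = np.asarray(values, dtype=np.int64)
        self.isnull = (np.zeros(len(self.values), dtype=bool)
                       if isnull is None
                       else np.asarray(isnull, dtype=bool))

    def add(self, other):
        # The integer type is responsible for propagating NA via the mask;
        # a null on either side yields a null in the result.
        return Int64Array(self.values + other.values,
                          self.isnull | other.isnull)


class Float64Array:
    """Toy model: floats defer entirely to NumPy (NaN plays the NA role)."""

    def __init__(self, values):
        self.values = np.asarray(values, dtype=np.float64)

    def add(self, other):
        return Float64Array(self.values + other.values)


a = Int64Array([1, 2, 3], isnull=[False, True, False])
b = Int64Array([10, 20, 30])
c = a.add(b)
print(c.values.tolist(), c.isnull.tolist())
# [11, 22, 33] [False, True, False]
```

Both classes expose the same `add` entry point, which is exactly the "black box" property Wes is after: callers never see whether nullness lives in a mask or in NaN payloads.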
I cobbled together an ugly start of a c++->cython->pandas toolchain here

https://github.com/wesm/pandas/tree/libpandas-native-core

I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a bit messy at the moment, but it should be sufficient to run some real experiments with a little more work. I reckon it's like a 6 month project to tear out the insides of Series and DataFrame and replace them with a new "native core", but we should be able to get enough info to see whether it's a viable plan within a month or so.

The end goal is to create "private" extension types in Cython that can be the new base classes for Series and NDFrame; these will hold a reference to a C++ object that contains wrappered NumPy arrays and other metadata (like pandas-only dtypes).

It might be too hard to try to replace a single usage of block manager as a first experiment, so I'll try to create a minimal "SeriesLite" that supports 3 dtypes

1) float64 with nans
2) int64 with a bitmask for NAs
3) category type for one of these

Just want to get a feel for the extensibility and offer an NA singleton Python object (a la None) for getting and setting NAs across these 3 dtypes.

If we end up going down this route, any way to place a moratorium on invasive work on pandas internals (outside bug fixes)?

Pedantic aside: I'd rather avoid shipping thirdparty C/C++ libraries like googletest and friends in pandas if we can. Cloudera folks have been working on a portable C++ library toolchain for Impala and other projects at https://github.com/cloudera/native-toolchain, but it is only being tested on Linux and OS X. Most google libraries should build out of the box on MSVC but it'll be something to keep an eye on.

BTW thanks to the libdynd developers for pioneering the c++ lib <-> python-c++ lib <-> cython toolchain; being able to build Cython extensions directly from cmake is a godsend.

HNY all,
Wes

On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid <izaid@continuum.io> wrote:
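The "SeriesLite" experiment described above — one NA singleton (a la None) usable across a NaN-backed float64 dtype and a bitmask-backed int64 dtype — could look something like the following. This is purely a sketch: `NA`, `NAType`, `SeriesLite`, and the storage choices are all invented here for illustration, not taken from the prototype branch:

```python
import math

import numpy as np


class NAType:
    """A singleton NA object, a la None, shared across dtypes."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self):
        return "NA"


NA = NAType()


class SeriesLite:
    """Toy series: float64 stores NA as NaN; int64 keeps a bitmask."""

    def __init__(self, values, dtype):
        self.dtype = dtype
        if dtype == "float64":
            self.values = np.array(
                [math.nan if v is NA else v for v in values], dtype=np.float64)
            self.mask = None
        elif dtype == "int64":
            self.mask = np.array([v is NA for v in values], dtype=bool)
            self.values = np.array(
                [0 if v is NA else v for v in values], dtype=np.int64)
        else:
            raise ValueError(dtype)

    def __getitem__(self, i):
        # The user-facing NA is the same object regardless of how
        # nullness is stored internally.
        if self.dtype == "float64":
            v = self.values[i]
            return NA if math.isnan(v) else v
        return NA if self.mask[i] else self.values[i]


s_f = SeriesLite([1.5, NA, 3.0], "float64")
s_i = SeriesLite([1, NA, 3], "int64")
print(s_f[1] is NA and s_i[1] is NA)  # True
```

The design point: getting and setting NAs goes through one identity-checkable object, so user code never branches on the dtype's internal null representation.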
Yeah, that seems reasonable and I totally agree a Pandas wrapper layer would be necessary.
I'll keep an eye on this and I'd like to help if I can.
Irwin
On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
I'm not suggesting a rewrite of NumPy functionality but rather pandas functionality that is currently written in a mishmash of Cython and Python. Happy to experiment with changing the internal compute infrastructure and data representation to DyND after this first stage of cleanup is done. Even if we use DyND a pretty extensive pandas wrapper layer will be necessary.
On Tuesday, December 29, 2015, Irwin Zaid <izaid@continuum.io> wrote:
Hi Wes (and others),
I've been following this conversation with interest. I do think it would be worth exploring DyND, rather than setting up yet another rewrite of NumPy-functionality. Especially because DyND is already an optional dependency of Pandas.
For things like Integer NA and new dtypes, DyND is there and ready to do this.
Irwin
On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Can you link to the PR you're talking about?
I will see about spending a few hours setting up a libpandas.so as a C++ shared library where we can run some experiments and validate whether it can solve the integer-NA problem and be a place to put new data types (categorical and friends). I'm +1 on targeting
Would it also be worth making a wish list of APIs we might consider breaking in a pandas 1.0 release that also features this new "native core"? Might as well right some wrongs while we're doing some invasive work on the internals; some breakage might be unavoidable. We can always maintain a pandas legacy 0.x.x maintenance branch (providing a conda binary build) for legacy users where showstopper bugs can get fixed.
On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback <jeffreback@gmail.com> wrote:
Wes your last is noted as well. I *think* we can actually do this now (well there is a PR out there).
On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
The other huge thing this will enable us to do is copy-on-write for various kinds of views, which should cut down on some of the defensive copying in the library and reduce memory usage.
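A minimal illustration of the copy-on-write idea (a hypothetical wrapper written for this sketch, not pandas code): views share the parent's buffer until one side writes, at which point only the writer takes a private copy:

```python
import numpy as np


class COWArray:
    """Toy copy-on-write wrapper: .view() shares the buffer; the first
    write to a shared participant triggers a private copy."""

    def __init__(self, data):
        self._data = np.asarray(data)
        self._shared = False

    def view(self):
        out = COWArray.__new__(COWArray)
        out._data = self._data      # no copy yet -- shared buffer
        out._shared = True
        self._shared = True         # parent is conservatively marked too
        return out

    def __setitem__(self, key, value):
        if self._shared:
            self._data = self._data.copy()  # copy only on first write
            self._shared = False
        self._data[key] = value

    def tolist(self):
        return self._data.tolist()


a = COWArray([1, 2, 3])
b = a.view()   # cheap: shares memory with `a`
b[0] = 99      # triggers the copy; `a` is untouched
print(a.tolist(), b.tolist())  # [1, 2, 3] [99, 2, 3]
```

This is the behavior that would replace the defensive copies: slicing stays O(1) and memory is only duplicated when a write actually happens.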
Jeff -- can you require log-in for editing on this document? https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnU...

There are a number of anonymous edits.

On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
I cobbled together an ugly start of a c++->cython->pandas toolchain here
https://github.com/wesm/pandas/tree/libpandas-native-core
I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a bit messy at the moment but it should be sufficient to run some real experiments with a little more work. I reckon it's like a 6 month project to tear out the insides of Series and DataFrame and replace it with a new "native core", but we should be able to get enough info to see whether it's a viable plan within a month or so.
The end goal is to create "private" extension types in Cython that can be the new base classes for Series and NDFrame; these will hold a reference to a C++ object that contains wrappered NumPy arrays and other metadata (like pandas-only dtypes).
It might be too hard to try to replace a single usage of block manager as a first experiment, so I'll try to create a minimal "SeriesLite" that supports 3 dtypes
1) float64 with nans 2) int64 with a bitmask for NAs 3) category type for one of these
Just want to get a feel for the extensibility and offer an NA singleton Python object (a la None) for getting and setting NAs across these 3 dtypes.
If we end up going down this route, any way to place a moratorium on invasive work on pandas internals (outside bug fixes)?
Pedantic aside: I'd rather avoid shipping thirdparty C/C++ libraries like googletest and friends in pandas if we can. Cloudera folks have been working on a portable C++ library toolchain for Impala and other projects at https://github.com/cloudera/native-toolchain, but it is only being tested on Linux and OS X. Most google libraries should build out of the box on MSVC but it'll be something to keep an eye on.
BTW thanks to the libdynd developers for pioneering the c++ lib <-> python-c++ lib <-> cython toolchain; being able to build Cython extensions directly from cmake is a godsend
HNY all Wes
On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid <izaid@continuum.io> wrote:
Yeah, that seems reasonable and I totally agree a Pandas wrapper layer would be necessary.
I'll keep an eye on this and I'd like to help if I can.
Irwin
On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
I'm not suggesting a rewrite of NumPy functionality but rather pandas functionality that is currently written in a mishmash of Cython and Python. Happy to experiment with changing the internal compute infrastructure and data representation to DyND after this first stage of cleanup is done. Even if we use DyND a pretty extensive pandas wrapper layer will be necessary.
On Tuesday, December 29, 2015, Irwin Zaid <izaid@continuum.io> wrote:
Hi Wes (and others),
I've been following this conversation with interest. I do think it would be worth exploring DyND rather than setting up yet another rewrite of NumPy functionality, especially because DyND is already an optional dependency of pandas.
For things like Integer NA and new dtypes, DyND is there and ready to do this.
Irwin
On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Can you link to the PR you're talking about?
I will see about spending a few hours setting up a libpandas.so as a C++ shared library where we can run some experiments and validate whether it can solve the integer-NA problem and be a place to put new data types (categorical and friends). I'm +1 on targeting
Would it also be worth making a wish list of APIs we might consider breaking in a pandas 1.0 release that also features this new "native core"? Might as well right some wrongs while we're doing some invasive work on the internals; some breakage might be unavoidable. We can always maintain a pandas legacy 0.x.x maintenance branch (providing a conda binary build) for legacy users where showstopper bugs can get fixed.
On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback <jeffreback@gmail.com> wrote:
Wes your last is noted as well. I *think* we can actually do this now (well there is a PR out there).
On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

The other huge thing this will enable is copy-on-write for various kinds of views, which should cut down on some of the defensive copying in the library and reduce memory usage.

On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

Basically the approach is

1) Base dtype type
2) Base array type with K >= 1 dimensions
3) Base scalar type
4) Base index type
5) "Wrapper" subclasses for all NumPy types fitting into categories #1, #2, #3, #4
6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
7) NDFrame as cpcloud wrote is just a list of these

Indexes and axis labels / column names can get layered on top.

After we do all this we can look at adding nested types (arrays, maps, structs) to better support JSON.

- Wes

On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud <cpcloud@gmail.com> wrote:

Maybe this is saying the same thing as Wes, but how far would something like this get us?

// warning: things are probably not this simple

struct data_array_t {
    void *primitive;               // scalar data
    data_array_t *nested;          // nested data
    boost::dynamic_bitset isnull;  // might have to create our own to avoid boost
    schema_t schema;               // not sure exactly what this looks like
};

typedef std::map<string, data_array_t> data_frame_t;  // probably not this simple

To answer Jeff's use-case question: I think that the use cases are 1) freedom from numpy (mostly) and 2) no more block manager, which frees us from the limitations of the block memory layout. In particular, the ability to take advantage of memory-mapped IO would be a big win IMO.

On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney <wesmckinn@gmail.com> wrote:

I will write a more detailed response to some of these things after the new year, but, in particular, re: missing values, can you or someone tell me why creating an object that contains a NumPy array and a bitmap is not sufficient? If we can add a lightweight C/C++ class layer between NumPy function calls (e.g. arithmetic) and pandas function calls, then I see no reason why we cannot have

Int32Array->add

and

Float32Array->add

do the right thing (the former would be responsible for bitmasking to propagate NA values; the latter would defer to NumPy). If we can put all the internals of pandas objects inside a black box, we can add layers of virtual function indirection without a performance penalty (whereas adding more interpreter overhead with more abstraction layers does add up to a perf penalty).

I don't think this is too scary -- I would be willing to create a small POC C++ library to prototype something like what I'm talking about.

Since pandas has limited points of contact with NumPy, I don't think this would end up being too onerous.

For the record, I'm pretty allergic to "advanced C++"; I think it is a useful tool: if you pick a sane 20% subset of the C++11 spec and follow Google C++ style, it's not very inaccessible to intermediate developers. More or less "C plus OOP and easier object lifetime management (shared/unique_ptr, etc.)". As soon as you add a lot of template metaprogramming, C++ library development quickly becomes inaccessible except to the C++ Jedi.
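The Int32Array->add / Float32Array->add distinction described above can be sketched in Python (hypothetical function names; the real layer would be C/C++): the integer path propagates NA by OR-ing validity bitmasks, while the float path simply defers to NumPy and lets NaN propagate on its own.

```python
# Hypothetical sketch of the two add paths: integer arrays carry an
# explicit null mask that must be propagated by hand, while float
# arrays defer to NumPy and let NaN do the work.
import numpy as np


def int32_add(left, left_null, right, right_null):
    """Add two int32 arrays; a result slot is null if either input is."""
    return left + right, left_null | right_null


def float32_add(left, right):
    """Float path: defer straight to NumPy; NaN propagates naturally."""
    return left + right


a = np.array([1, 2, 3], dtype=np.int32)
b = np.array([10, 20, 30], dtype=np.int32)
a_null = np.array([False, True, False])
b_null = np.array([False, False, True])

out, out_null = int32_add(a, a_null, b, b_null)
assert list(out_null) == [False, True, True]  # NA propagated via bitmask

f = float32_add(np.array([1.0, np.nan], dtype=np.float32),
                np.array([2.0, 2.0], dtype=np.float32))
assert f[0] == 3.0 and np.isnan(f[1])  # NaN propagated by NumPy itself
```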
Maybe let's start a Google document on "pandas roadmap" where we can break down the 1-2 year goals and some of these infrastructure issues and have our discussion there? (obviously publish this someplace once we're done)

- Wes

On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback <jeffreback@gmail.com> wrote:

[snip: Jeff's Dec 25 roadmap email, quoted here verbatim, is reproduced in full at the top of this thread]

Adding numba / dask to the list would be helpful.

I think that almost all performance issues are the result of:

a) gross misuse of the pandas API. How much code have you seen that does df.apply(lambda x: x.sum())
b) routines which operate column-by-column rather than block-by-block and are in python space (e.g. we have an issue right now about .quantile)

So I am glossing over a big goal of having a c++ library that represents the pandas internals. This would by definition have a c-API so that you *could* use pandas-like semantics in c/c++ and just have it work (and then pandas would be a thin wrapper around this library).

I am not averse to this, but I think it would be quite a big effort, and not a huge perf boost IMHO. Further, there are a number of API issues w.r.t. indexing which need to be clarified / worked out (e.g. should we simply deprecate []) that are much easier to test / figure out in python space.

I also think that we have quite a large number of contributors. Moving to c++ might make the internals a bit more impenetrable than the current internals (though this would allow c++ people to contribute, so that might balance out).

We have a limited core of devs who right now are familiar with things. If someone happened to have a starting base for a c++ library, then I might change opinions here.

my 4c.
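Jeff's point (a) is easy to demonstrate: df.apply(lambda x: x.sum()) produces the same column sums as the vectorized df.sum(), but routes every column through a Python-level lambda call. A small illustration:

```python
# Illustrating the API-misuse point: both expressions produce identical
# column sums, but df.sum() dispatches to compiled reductions while
# df.apply(...) invokes a Python lambda once per column.
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

slow = df.apply(lambda x: x.sum())  # per-column Python-level calls
fast = df.sum()                     # vectorized, same result

assert slow.equals(fast)
assert fast["a"] == 6 and fast["b"] == 15.0
```

On a 3-row frame the difference is invisible, but with many columns the per-column interpreter overhead of the apply version dominates.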
Jeff

On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

Deep thoughts during the holidays.

I might be out of line here, but the interpreter-heaviness of the inside of pandas objects is likely to be a long-term liability and source of performance problems and technical debt.

Has anyone put any thought into planning and beginning to execute on a rewrite that moves as much as possible of the internals into native / compiled code? I'm talking about:

- pandas/core/internals
- indexing and assignment
- much of pandas/core/common
- categorical and custom dtypes
- all indexing mechanisms

I'm concerned we've already exposed too much internals to users, so this might lead to a lot of API breakage, but it might be for the Greater Good. As a first step, beginning a partial migration of internals into some C++ classes that encapsulate the insides of DataFrame objects and implement indexing and block-level manipulations would be a good place to start. I think you could do this without too much disruption.

As part of this internal retooling we might give consideration to alternative data structures for representing data internal to pandas objects. Now in 2015/2016, continuing to be hamstrung by NumPy's limitations feels somewhat anachronistic. User code is riddled with workarounds for data type fidelity issues and the like. Like, really, why not add a bitndarray (similar to ilanschnell/bitarray) for storing nullness for problematic types and hide this from the user? =)

Since we are now a NumFOCUS-sponsored project, I feel like we might consider establishing some formal governance over pandas and publishing roadmap documents describing plans for the project and meeting notes from committers. There's no real "committer culture" for NumFOCUS projects like there is with the Apache Software Foundation, but we might try leading by example!

Also, I believe pandas as a project has reached a level of importance where we ought to consider planning and execution on larger-scale undertakings such as this for safeguarding the future.

As for myself, well, I have my hands full in Big Data-land. I wish I could be helping more with pandas, but there are quite a few fundamental issues (like data interoperability, nested data handling, and file format support — e.g. Parquet, see http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/) preventing Python from being more useful in industry analytics applications.

Aside: one of the bigger mistakes I made with pandas's API design was making it acceptable to call class constructors — like pandas.DataFrame — directly (versus factory functions). Sorry about that! If we could convince everyone to start writing pandas.data_frame or dataframe instead of using the class reference it would help a lot with code cleanup. It's hard to plan for these things — NumPy interoperability seemed a lot more important in 2008 than it does now, so I forgive myself.

cheers and best wishes for 2016,
Wes
I changed the doc so that the core dev people can edit. I *think* that everyone should be able to view/comment though. On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Jeff -- can you require log-in for editing on this document?
https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnU...
There are a number of anonymous edits.
I cobbled together an ugly start of a c++->cython->pandas toolchain here
https://github.com/wesm/pandas/tree/libpandas-native-core
I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a bit messy at the moment but it should be sufficient to run some real experiments with a little more work. I reckon it's like a 6 month project to tear out the insides of Series and DataFrame and replace it with a new "native core", but we should be able to get enough info to see whether it's a viable plan within a month or so.
The end goal is to create "private" extension types in Cython that can be the new base classes for Series and NDFrame; these will hold a reference to a C++ object that contains wrappered NumPy arrays and other metadata (like pandas-only dtypes).
It might be too hard to try to replace a single usage of block manager as a first experiment, so I'll try to create a minimal "SeriesLite" that supports 3 dtypes
1) float64 with nans 2) int64 with a bitmask for NAs 3) category type for one of these
Just want to get a feel for the extensibility and offer an NA singleton Python object (a la None) for getting and setting NAs across these 3 dtypes.
If we end up going down this route, any way to place a moratorium on invasive work on pandas internals (outside bug fixes)?
Pedantic aside: I'd rather avoid shipping thirdparty C/C++ libraries like googletest and friends in pandas if we can. Cloudera folks have been working on a portable C++ library toolchain for Impala and other projects at https://github.com/cloudera/native-toolchain, but it is only being tested on Linux and OS X. Most google libraries should build out of the box on MSVC but it'll be something to keep an eye on.
BTW thanks to the libdynd developers for pioneering the c++ lib <-> python-c++ lib <-> cython toolchain; being able to build Cython extensions directly from cmake is a godsend
HNY all Wes
On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid <izaid@continuum.io> wrote:
Yeah, that seems reasonable and I totally agree a Pandas wrapper layer would be necessary.
I'll keep an eye on this and I'd like to help if I can.
Irwin
On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
I'm not suggesting a rewrite of NumPy functionality but rather pandas functionality that is currently written in a mishmash of Cython and
Python.
Happy to experiment with changing the internal compute infrastructure and data representation to DyND after this first stage of cleanup is done. Even if we use DyND a pretty extensive pandas wrapper layer will be necessary.
On Tuesday, December 29, 2015, Irwin Zaid <izaid@continuum.io> wrote:
Hi Wes (and others),
I've been following this conversation with interest. I do think it
would
be worth exploring DyND, rather than setting up yet another rewrite of NumPy-functionality. Especially because DyND is already an optional dependency of Pandas.
For things like Integer NA and new dtypes, DyND is there and ready to do this.
Irwin
On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Can you link to the PR you're talking about?
I will see about spending a few hours setting up a libpandas.so as a
C++
shared library where we can run some experiments and validate whether it can solve the integer-NA problem and be a place to put new data types (categorical and friends). I'm +1 on targeting
Would it also be worth making a wish list of APIs we might consider breaking in a pandas 1.0 release that also features this new "native core"? Might as well right some wrongs while we're doing some invasive work on the internals; some breakage might be unavoidable. We can always
pandas legacy 0.x.x maintenance branch (providing a conda binary build) for legacy users where showstopper bugs can get fixed.
On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback <jeffreback@gmail.com> wrote: > Wes your last is noted as well. I *think* we can actually do this now > (well > there is a PR out there). > > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney <wesmckinn@gmail.com
> wrote: >> >> The other huge thing this will enable is to do is copy-on-write for >> various kinds of views, which should cut down on some of the >> defensive >> copying in the library and reduce memory usage. >> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney < wesmckinn@gmail.com> >> wrote: >> > Basically the approach is >> > >> > 1) Base dtype type >> > 2) Base array type with K >= 1 dimensions >> > 3) Base scalar type >> > 4) Base index type >> > 5) "Wrapper" subclasses for all NumPy types fitting into categories >> > #1, #2, #3, #4 >> > 6) Subclasses for pandas-specific types like category, datetimeTZ, >> > etc. >> > 7) NDFrame as cpcloud wrote is just a list of these >> > >> > Indexes and axis labels / column names can get layered on top. >> > >> > After we do all this we can look at adding nested types (arrays, >> > maps, >> > structs) to better support JSON. >> > >> > - Wes >> > >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud < cpcloud@gmail.com> >> > wrote: >> >> Maybe this is saying the same thing as Wes, but how far would >> >> something >> >> like >> >> this get us? >> >> >> >> // warning: things are probably not this simple >> >> >> >> struct data_array_t { >> >> void *primitive; // scalar data >> >> data_array_t *nested; // nested data >> >> boost::dynamic_bitset isnull; // might have to create our own >> >> to >> >> avoid >> >> boost >> >> schema_t schema; // not sure exactly what this looks like >> >> }; >> >> >> >> typedef std::map<string, data_array_t> data_frame_t; //
>> >> not >> >> this >> >> simple >> >> >> >> To answer Jeff’s use-case question: I think that the use cases are >> >> 1) >> >> freedom from numpy (mostly) 2) no more block manager which frees >> >> us >> >> from the >> >> limitations of the block memory layout. In particular, the ability >> >> to >> >> take >> >> advantage of memory mapped IO would be a big win IMO. >> >> >> >> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney < wesmckinn@gmail.com> >> >> wrote: >> >>> >> >>> I will write a more detailed response to some of these things >> >>> after >> >>> the new year, but, in particular, re: missing values, can you or >> >>> someone tell me why creating an object that contains a NumPy >> >>> array and >> >>> a bitmap is not sufficient? If we we can add a lightweight C/C++ >> >>> class >> >>> layer between NumPy function calls (e.g. arithmetic) and
>> >>> function calls, then I see no reason why we cannot have >> >>> >> >>> Int32Array->add >> >>> >> >>> and >> >>> >> >>> Float32Array->add >> >>> >> >>> do the right thing (the former would be responsible for >> >>> bitmasking to >> >>> propagate NA values; the latter would defer to NumPy). If we can >> >>> put >> >>> all the internals of pandas objects inside a black box, we can >> >>> add >> >>> layers of virtual function indirection without a performance >> >>> penalty >> >>> (e.g. adding more interpreter overhead with more abstraction >> >>> layers >> >>> does add up to a perf penalty). >> >>> >> >>> I don't think this is too scary -- I would be willing to create a >> >>> small POC C++ library to prototype something like what I'm >> >>> talking >> >>> about. >> >>> >> >>> Since pandas has limited points of contact with NumPy I don't >> >>> think >> >>> this would end up being too onerous. >> >>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I
>> >>> is a >> >>> useful tool if you pick a sane 20% subset of the C++11 spec and >> >>> follow >> >>> Google C++ style it's not very inaccessible to intermediate >> >>> developers. More or less "C plus OOP and easier object
>> >>> management (shared/unique_ptr, etc.)". As soon as you add a lot >> >>> of >> >>> template metaprogramming C++ library development quickly becomes >> >>> inaccessible except to the C++-Jedi. >> >>> >> >>> Maybe let's start a Google document on "pandas roadmap" where we >> >>> can >> >>> break down the 1-2 year goals and some of these infrastructure >> >>> issues >> >>> and have our discussion there? (obviously publish this someplace >> >>> once >> >>> we're done) >> >>> >> >>> - Wes >> >>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback >> >>> <jeffreback@gmail.com> >> >>> wrote: >> >>> > Here are some of my thoughts about pandas Roadmap / status and >> >>> > some >> >>> > responses to Wes's thoughts. >> >>> > >> >>> > In the last few (and upcoming) major releases we have been made >> >>> > the >> >>> > following changes: >> >>> > >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) & >> >>> > making >> >>> > these >> >>> > first class objects >> >>> > - code refactoring to remove subclassing of ndarrays for Series >> >>> > & >> >>> > Index >> >>> > - carving out / deprecating non-core parts of pandas >> >>> > - datareader >> >>> > - SparsePanel, WidePanel & other aliases (TImeSeries) >> >>> > - rpy, rplot, irow et al. >> >>> > - google-analytics >> >>> > - API changes to make things more consistent >> >>> > - pd.rolling/expanding * -> .rolling/expanding (this is in >> >>> > master >> >>> > now) >> >>> > - .resample becoming a full defered like groupby. >> >>> > - multi-index slicing along any level (obviates need for .xs) >> >>> > and >> >>> > allows >> >>> > assignment >> >>> > - .loc/.iloc - for the most part obviates use of .ix >> >>> > - .pipe & .assign >> >>> > - plotting accessors >> >>> > - fixing of the sorting API >> >>> > - many performance enhancements both micro & macro (e.g. 
>> >>> > release >> >>> > GIL) >> >>> > >> >>> > Some on-deck enhancements are (meaning these are basically >> >>> > ready to >> >>> > go >> >>> > in): >> >>> > - IntervalIndex (and eventually make PeriodIndex just a >> >>> > sub-class >> >>> > of >> >>> > this) >> >>> > - RangeIndex >> >>> > >> >>> > so lots of changes, though nothing really earth shaking, just >> >>> > more >> >>> > convenience, reducing magicness somewhat >> >>> > and providing flexibility. >> >>> > >> >>> > Of course we are getting increasing issues, mostly bug reports >> >>> > (and >> >>> > lots >> >>> > of >> >>> > dupes), some edge case enhancements >> >>> > which can add to the existing API's and of course, requests to >> >>> > expand >> >>> > the >> >>> > (already) large code to other usecases. >> >>> > Balancing this are a good many pull-requests from many >> >>> > different >> >>> > users, >> >>> > some >> >>> > even deep into the internals. >> >>> > >> >>> > Here are some things that I have talked about and could be >> >>> > considered >> >>> > for >> >>> > the roadmap. Disclaimer: I do work for Continuum >> >>> > but these views are of course my own; furthermore obviously I >> >>> > am a >> >>> > bit >> >>> > more >> >>> > familiar with some of the 'sponsored' open-source >> >>> > libraries, but always open to new things. >> >>> > >> >>> > - integration / automatic deferral to numba for JIT (this would >> >>> > be >> >>> > thru >> >>> > .apply) >> >>> > - automatic deferal to dask from groubpy where appropriate / >> >>> > maybe a >> >>> > .to_parallel (to simply return a dask.DataFrame object) >> >>> > - incorporation of quantities / units (as part of the dtype) >> >>> > - use of DyND to allow missing values for int dtypes >> >>> > - make Period a first class dtype. 
Adding numba / dask to the list would be helpful.

I think that almost all performance issues are the result of:

a) gross misuse of the pandas API. How much code have you seen that does df.apply(lambda x: x.sum())
b) routines which operate column-by-column rather than block-by-block and are in python space (e.g. we have an issue right now about .quantile)

So I am glossing over a big goal of having a c++ library that represents the pandas internals. This would by definition have a C API so that you *could* use pandas-like semantics in c/c++ and just have it work (and then pandas would be a thin wrapper around this library).

I am not averse to this, but I think it would be quite a big effort, and not a huge perf boost IMHO. Further there are a number of API issues w.r.t. indexing which need to be clarified / worked out (e.g. should we simply deprecate []) that are much easier to test / figure out in python space.

I also think that we have quite a large number of contributors. Moving to c++ might make the internals a bit more impenetrable than the current internals (though this would allow c++ people to contribute, so that might balance out).

We have a limited core of devs who right now are familiar with things. If someone happened to have a starting base for a c++ library, then I might change opinions here.

my 4c.

Jeff

On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

Deep thoughts during the holidays.

I might be out of line here, but the interpreter-heaviness of the inside of pandas objects is likely to be a long-term
liability
and source of performance problems and technical debt.

Has anyone put any thought into planning and beginning to execute on a rewrite that moves as much as possible of the internals into native / compiled code? I'm talking about:

- pandas/core/internals
- indexing and assignment
- much of pandas/core/common
- categorical and custom dtypes
- all indexing mechanisms

I'm concerned we've already exposed too much internals to users, so this might lead to a lot of API breakage, but it might be for the Greater Good. As a first step, beginning a partial migration of internals into some C++ classes that encapsulate the insides of DataFrame objects and implement indexing and block-level manipulations would be a good place to start. I think you could do this without too much disruption.

As part of this internal retooling we might give consideration to alternative data structures for representing data internal to pandas objects. Now in 2015/2016, continuing to be hamstrung by NumPy's limitations feels somewhat anachronistic. User code is riddled with workarounds for data type fidelity issues and the like. Like, really, why not add a bitndarray (similar to ilanschnell/bitarray) for storing nullness for problematic types and hide this from the user? =)

Since we are now a NumFOCUS-sponsored project, I feel like we might consider establishing some formal governance over pandas and publishing roadmap documents describing plans for the project and meeting notes from committers. There's no real "committer culture" for NumFOCUS projects like there is with the Apache Software Foundation, but we might try leading by example!

Also, I believe pandas as a project has reached a level of importance where we ought to consider planning and execution on larger scale undertakings such as this for safeguarding the future.

As for myself, well, I have my hands full in Big Data-land. I wish I could be helping more with pandas, but there are quite a few fundamental issues (like data interoperability, nested data handling, and file format support — e.g. Parquet, see http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/ ) preventing Python from being more useful in industry analytics applications.

Aside: one of the bigger mistakes I made with pandas's API design was making it acceptable to call class constructors — like pandas.DataFrame — directly (versus factory functions). Sorry about that! If we could convince everyone to start writing pandas.data_frame or dataframe instead of using the class reference it would help a lot with code cleanup. It's hard to plan for these things — NumPy interoperability seemed a lot more important in 2008 than it does now, so I forgive myself.

cheers and best wishes for 2016,
Wes
_______________________________________________
Pandas-dev mailing list
Pandas-dev@python.org
https://mail.python.org/mailman/listinfo/pandas-dev
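Jeff's point (a) above is easy to reproduce: a Python-level lambda in .apply re-does, column by column in interpreter space, what a single vectorized reduction already provides. A minimal sketch (the frame here is made up for illustration; both calls produce the same result):

```python
import numpy as np
import pandas as pd

# a throwaway integer frame so the two results compare exactly
df = pd.DataFrame(np.arange(400_000).reshape(100_000, 4),
                  columns=list("abcd"))

# (a) the misuse: a Python lambda invoked once per column
slow = df.apply(lambda x: x.sum())

# the direct vectorized reduction does the same work internally
fast = df.sum()

assert slow.equals(fast)
```

The answers match; the difference is purely how much time is spent in the interpreter versus in compiled reduction code.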
Thanks Jeff. Can you create and share a shared Drive folder containing this where I can put other auxiliary / follow up documents?

On Fri, Jan 1, 2016 at 5:23 PM, Jeff Reback <jeffreback@gmail.com> wrote:
I changed the doc so that the core dev people can edit. I *think* that everyone should be able to view/comment though.
On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Jeff -- can you require log-in for editing on this document?
https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnU...
There are a number of anonymous edits.
On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
I cobbled together an ugly start of a c++->cython->pandas toolchain here
https://github.com/wesm/pandas/tree/libpandas-native-core
I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a bit messy at the moment but it should be sufficient to run some real experiments with a little more work. I reckon it's like a 6 month project to tear out the insides of Series and DataFrame and replace it with a new "native core", but we should be able to get enough info to see whether it's a viable plan within a month or so.
The end goal is to create "private" extension types in Cython that can be the new base classes for Series and NDFrame; these will hold a reference to a C++ object that contains wrappered NumPy arrays and other metadata (like pandas-only dtypes).
It might be too hard to try to replace a single usage of block manager as a first experiment, so I'll try to create a minimal "SeriesLite" that supports 3 dtypes
1) float64 with nans
2) int64 with a bitmask for NAs
3) category type for one of these
Just want to get a feel for the extensibility and offer an NA singleton Python object (a la None) for getting and setting NAs across these 3 dtypes.
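A rough Python sketch of what that shared NA singleton might look like from the user's side, across the first two dtypes (class names and the bool mask are hypothetical stand-ins, not the actual design):

```python
import numpy as np

NA = object()  # hypothetical NA singleton, a la None

class Float64Lite:
    """float64 with nans: NaN doubles as the NA marker."""
    def __init__(self, values):
        self.values = np.asarray(values, dtype=np.float64)
    def __getitem__(self, i):
        v = self.values[i]
        return NA if np.isnan(v) else float(v)
    def __setitem__(self, i, v):
        self.values[i] = np.nan if v is NA else v

class Int64Lite:
    """int64 with a separate null mask (a bool array here for brevity,
    where the real thing would be a packed bitmask)."""
    def __init__(self, values):
        self.values = np.asarray(values, dtype=np.int64)
        self.isnull = np.zeros(len(self.values), dtype=bool)
    def __getitem__(self, i):
        return NA if self.isnull[i] else int(self.values[i])
    def __setitem__(self, i, v):
        if v is NA:
            self.isnull[i] = True
        else:
            self.isnull[i] = False
            self.values[i] = v

# the same NA object round-trips through both representations
f = Float64Lite([1.0, 2.0]); f[0] = NA
n = Int64Lite([1, 2]);       n[1] = NA
assert f[0] is NA and n[1] is NA and n[0] == 1
```

The point is only that get/set of NA can present one object to the user while the storage strategy differs per dtype.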
If we end up going down this route, any way to place a moratorium on invasive work on pandas internals (outside bug fixes)?
Pedantic aside: I'd rather avoid shipping third-party C/C++ libraries like googletest and friends in pandas if we can. Cloudera folks have been working on a portable C++ library toolchain for Impala and other projects at https://github.com/cloudera/native-toolchain, but it is only being tested on Linux and OS X. Most Google libraries should build out of the box on MSVC but it'll be something to keep an eye on.
BTW thanks to the libdynd developers for pioneering the c++ lib <-> python-c++ lib <-> cython toolchain; being able to build Cython extensions directly from cmake is a godsend
HNY all Wes
On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid <izaid@continuum.io> wrote:
Yeah, that seems reasonable and I totally agree a Pandas wrapper layer would be necessary.
I'll keep an eye on this and I'd like to help if I can.
Irwin
On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
I'm not suggesting a rewrite of NumPy functionality but rather pandas functionality that is currently written in a mishmash of Cython and Python. Happy to experiment with changing the internal compute infrastructure and data representation to DyND after this first stage of cleanup is done. Even if we use DyND a pretty extensive pandas wrapper layer will be necessary.
On Tuesday, December 29, 2015, Irwin Zaid <izaid@continuum.io> wrote:
Hi Wes (and others),
I've been following this conversation with interest. I do think it would be worth exploring DyND, rather than setting up yet another rewrite of NumPy-functionality. Especially because DyND is already an optional dependency of Pandas.
For things like Integer NA and new dtypes, DyND is there and ready to do this.
Irwin
On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

Can you link to the PR you're talking about?

I will see about spending a few hours setting up a libpandas.so as a C++ shared library where we can run some experiments and validate whether it can solve the integer-NA problem and be a place to put new data types (categorical and friends). I'm +1 on targeting

Would it also be worth making a wish list of APIs we might consider breaking in a pandas 1.0 release that also features this new "native core"? Might as well right some wrongs while we're doing some invasive work on the internals; some breakage might be unavoidable. We can always maintain a pandas legacy 0.x.x maintenance branch (providing a conda binary build) for legacy users where showstopper bugs can get fixed.

On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback <jeffreback@gmail.com> wrote:

Wes your last is noted as well. I *think* we can actually do this now (well there is a PR out there).

On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

The other huge thing this will enable us to do is copy-on-write for various kinds of views, which should cut down on some of the defensive copying in the library and reduce memory usage.

On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

Basically the approach is

1) Base dtype type
2) Base array type with K >= 1 dimensions
3) Base scalar type
4) Base index type
5) "Wrapper" subclasses for all NumPy types fitting into categories #1, #2, #3, #4
6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
7) NDFrame as cpcloud wrote is just a list of these

Indexes and axis labels / column names can get layered on top.

After we do all this we can look at adding nested types (arrays, maps, structs) to better support JSON.

- Wes

On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud <cpcloud@gmail.com> wrote:

Maybe this is saying the same thing as Wes, but how far would something like this get us?

// warning: things are probably not this simple

struct data_array_t {
  void *primitive;               // scalar data
  data_array_t *nested;          // nested data
  boost::dynamic_bitset isnull;  // might have to create our own to avoid boost
  schema_t schema;               // not sure exactly what this looks like
};

typedef std::map<string, data_array_t> data_frame_t;  // probably not this simple

To answer Jeff's use-case question: I think that the use cases are 1) freedom from numpy (mostly) 2) no more block manager which frees us from the limitations of the block memory layout. In particular, the ability to take advantage of memory mapped IO would be a big win IMO.

On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney <wesmckinn@gmail.com> wrote:

I will write a more detailed response to some of these things after the new year, but, in particular, re: missing values, can you or someone tell me why creating an object that contains a NumPy array and a bitmap is not sufficient? If we can add a lightweight C/C++ class layer between NumPy function calls (e.g. arithmetic) and pandas function calls, then I see no reason why we cannot have

Int32Array->add

and

Float32Array->add

do the right thing (the former would be responsible for bitmasking to propagate NA values; the latter would defer to NumPy). If we can put all the internals of pandas objects inside a black box, we can add layers of virtual function indirection without a performance penalty (e.g. adding more interpreter overhead with more abstraction layers does add up to a perf penalty).

I don't think this is too scary -- I would be willing to create a small POC C++ library to prototype something like what I'm talking about.

Since pandas has limited points of contact with NumPy I don't think this would end up being too onerous.

For the record, I'm pretty allergic to "advanced C++"; I think it is a useful tool if you pick a sane 20% subset of the C++11 spec and follow Google C++ style it's not very inaccessible to intermediate developers. More or less "C plus OOP and easier object lifetime management (shared/unique_ptr, etc.)". As soon as you add a lot of template metaprogramming C++ library development quickly becomes inaccessible except to the C++-Jedi.

Maybe let's start a Google document on "pandas roadmap" where we can break down the 1-2 year goals and some of these infrastructure issues and have our discussion there? (obviously publish this someplace once we're done)

- Wes
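The Int32Array->add / Float32Array->add dispatch Wes describes above can be sketched in a few lines of Python (a toy model only; the class names mirror his examples but nothing here is the proposed C++ API): the integer array ORs the null bitmasks through arithmetic, while the float array defers to NumPy, whose NaNs already propagate.

```python
import numpy as np

class Float32Array:
    # floats: defer entirely to NumPy; NaN already propagates as NA
    def __init__(self, values):
        self.values = np.asarray(values, dtype=np.float32)
    def add(self, other):
        return Float32Array(self.values + other.values)

class Int32Array:
    # ints: carry an explicit null mask (bool array standing in for a
    # bitmap) and OR it through arithmetic to propagate NA
    def __init__(self, values, isnull=None):
        self.values = np.asarray(values, dtype=np.int32)
        self.isnull = (np.zeros(len(self.values), dtype=bool)
                       if isnull is None else np.asarray(isnull, dtype=bool))
    def add(self, other):
        return Int32Array(self.values + other.values,
                          self.isnull | other.isnull)

a = Int32Array([1, 2, 3], isnull=[False, True, False])
b = Int32Array([10, 20, 30])
c = a.add(b)
assert c.isnull.tolist() == [False, True, False]  # NA propagated

x = Float32Array([1.0, np.nan]).add(Float32Array([2.0, 2.0]))
assert x.values[0] == 3.0 and np.isnan(x.values[1])
```

Both arrays expose the same add entry point, so callers never branch on dtype; that is the "black box" indirection being argued for.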
ok I moved the document to the Pandas folder, where the same group should be able to edit/upload/etc. lmk if any issues On Fri, Jan 1, 2016 at 8:48 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Thanks Jeff. Can you create and share a shared Drive folder containing this where I can put other auxiliary / follow up documents?
I changed the doc so that the core dev people can edit. I *think* that everyone should be able to view/comment though.
On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Jeff -- can you require log-in for editing on this document?
https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnU...
There are a number of anonymous edits.
On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney <wesmckinn@gmail.com>
wrote:
I cobbled together an ugly start of a c++->cython->pandas toolchain here
https://github.com/wesm/pandas/tree/libpandas-native-core
I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a bit messy at the moment but it should be sufficient to run some real experiments with a little more work. I reckon it's like a 6 month project to tear out the insides of Series and DataFrame and replace it with a new "native core", but we should be able to get enough info to see whether it's a viable plan within a month or so.
The end goal is to create "private" extension types in Cython that can be the new base classes for Series and NDFrame; these will hold a reference to a C++ object that contains wrappered NumPy arrays and other metadata (like pandas-only dtypes).
It might be too hard to try to replace a single usage of block manager as a first experiment, so I'll try to create a minimal "SeriesLite" that supports 3 dtypes
1) float64 with nans 2) int64 with a bitmask for NAs 3) category type for one of these
Just want to get a feel for the extensibility and offer an NA singleton Python object (a la None) for getting and setting NAs across these 3 dtypes.
If we end up going down this route, any way to place a moratorium on invasive work on pandas internals (outside bug fixes)?
Pedantic aside: I'd rather avoid shipping thirdparty C/C++ libraries like googletest and friends in pandas if we can. Cloudera folks have been working on a portable C++ library toolchain for Impala and other projects at https://github.com/cloudera/native-toolchain, but it is only being tested on Linux and OS X. Most google libraries should build out of the box on MSVC but it'll be something to keep an eye on.
BTW thanks to the libdynd developers for pioneering the c++ lib <-> python-c++ lib <-> cython toolchain; being able to build Cython extensions directly from cmake is a godsend
HNY all Wes
On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid <izaid@continuum.io> wrote:
Yeah, that seems reasonable and I totally agree a Pandas wrapper layer would be necessary.
I'll keep an eye on this and I'd like to help if I can.
Irwin
On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
I'm not suggesting a rewrite of NumPy functionality but rather functionality that is currently written in a mishmash of Cython and Python. Happy to experiment with changing the internal compute infrastructure and data representation to DyND after this first stage of cleanup is done. Even if we use DyND, a pretty extensive pandas wrapper layer will be necessary.
On Tuesday, December 29, 2015, Irwin Zaid <izaid@continuum.io> wrote:

Hi Wes (and others),

I've been following this conversation with interest. I do think it would be worth exploring DyND, rather than setting up yet another rewrite of NumPy-functionality. Especially because DyND is already an optional dependency of Pandas.

For things like Integer NA and new dtypes, DyND is there and ready to do this.

Irwin

On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

Can you link to the PR you're talking about?

I will see about spending a few hours setting up a libpandas.so as a C++ shared library where we can run some experiments and validate whether it can solve the integer-NA problem and be a place to put new data types (categorical and friends). I'm +1 on targeting

Would it also be worth making a wish list of APIs we might consider breaking in a pandas 1.0 release that also features this new "native core"? Might as well right some wrongs while we're doing some invasive work on the internals; some breakage might be unavoidable. We can always maintain a pandas legacy 0.x.x maintenance branch (providing a conda binary build) for legacy users where showstopper bugs can get fixed.

On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback <jeffreback@gmail.com> wrote:

Wes your last is noted as well. I *think* we can actually do this now (well there is a PR out there).

On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

The other huge thing this will enable is copy-on-write for various kinds of views, which should cut down on some of the defensive copying in the library and reduce memory usage.

On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

Basically the approach is

1) Base dtype type
2) Base array type with K >= 1 dimensions
3) Base scalar type
4) Base index type
5) "Wrapper" subclasses for all NumPy types fitting into categories #1, #2, #3, #4
6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
7) NDFrame as cpcloud wrote is just a list of these

Indexes and axis labels / column names can get layered on top.

After we do all this we can look at adding nested types (arrays, maps, structs) to better support JSON.

- Wes

On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud <cpcloud@gmail.com> wrote:

Maybe this is saying the same thing as Wes, but how far would something like this get us?

    // warning: things are probably not this simple

    struct data_array_t {
      void *primitive;               // scalar data
      data_array_t *nested;          // nested data
      boost::dynamic_bitset isnull;  // might have to create our own to avoid boost
      schema_t schema;               // not sure exactly what this looks like
    };

    typedef std::map<string, data_array_t> data_frame_t;  // probably not this simple

To answer Jeff's use-case question: I think that the use cases are 1) freedom from numpy (mostly) 2) no more block manager which frees us from the limitations of the block memory layout. In particular, the ability to take advantage of memory mapped IO would be a big win IMO.

On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney <wesmckinn@gmail.com> wrote:

I will write a more detailed response to some of these things after the new year, but, in particular, re: missing values, can you or someone tell me why creating an object that contains a NumPy array and a bitmap is not sufficient? If we can add a lightweight C/C++ class layer between NumPy function calls (e.g. arithmetic) and pandas function calls, then I see no reason why we cannot have

    Int32Array->add

and

    Float32Array->add

do the right thing (the former would be responsible for bitmasking to propagate NA values; the latter would defer to NumPy). If we can put all the internals of pandas objects inside a black box, we can add layers of virtual function indirection without a performance penalty (e.g. adding more interpreter overhead with more abstraction layers does add up to a perf penalty).

I don't think this is too scary -- I would be willing to create a small POC C++ library to prototype something like what I'm talking about.

Since pandas has limited points of contact with NumPy I don't think this would end up being too onerous.

For the record, I'm pretty allergic to "advanced C++"; I think it is a useful tool if you pick a sane 20% subset of the C++11 spec and follow Google C++ style, it's not very inaccessible to intermediate developers. More or less "C plus OOP and easier object lifetime management (shared/unique_ptr, etc.)". As soon as you add a lot of template metaprogramming, C++ library development quickly becomes inaccessible except to the C++-Jedi.

Maybe let's start a Google document on "pandas roadmap" where we can break down the 1-2 year goals and some of these infrastructure issues and have our discussion there? (obviously publish this someplace once we're done)

- Wes

On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback <jeffreback@gmail.com> wrote:

Here are some of my thoughts about pandas Roadmap / status and some responses to Wes's thoughts.

We *already* defer much of the computation to libraries like numexpr & bottleneck (where appropriate). Adding numba / dask to the list would be helpful.

I think that almost all performance issues are the result of:

a) gross misuse of the pandas API. How much code have you seen that does df.apply(lambda x: x.sum())
b) routines which operate column-by-column rather than block-by-block and are in python space (e.g. we have an issue right now about .quantile)

So I am glossing over a big goal of having a c++ library that represents the pandas internals. This would by definition have a c-API so that you *could* use pandas-like semantics in c/c++ and just have it work (and then pandas would be a thin wrapper around this library).

I am not averse to this, but I think it would be quite a big effort, and not a huge perf boost IMHO. Further there are a number of API issues w.r.t. indexing which need to be clarified / worked out (e.g. should we simply deprecate []) that are much easier to test / figure out in python space.

I also think that we have quite a large number of contributors. Moving to c++ might make the internals a bit more impenetrable than the current internals (though this would allow c++ people to contribute, so this might balance out).

We have a limited core of devs who right now are familiar with things. If someone happened to have a starting base for a c++ library then I might change opinions here.

my 4c.

Jeff

On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

Deep thoughts during the holidays.

I might be out of line here, but the interpreter-heaviness of the inside of pandas objects is likely to be a long-term liability and source of performance problems and technical debt.

Has anyone put any thought into planning and beginning to execute on a rewrite that moves as much as possible of the internals into native / compiled code? I'm talking about:

- pandas/core/internals
- indexing and assignment
- much of pandas/core/common
- categorical and custom dtypes
- all indexing mechanisms

I'm concerned we've already exposed too much internals to users, so this might lead to a lot of API breakage, but it might be for the Greater Good. As a first step, beginning a partial migration of internals into some C++ classes that encapsulate the insides of DataFrame objects and implement indexing and block-level manipulations would be a good place to start. I think you could do this in a way that wouldn't be too much disruption.

As part of this internal retooling we might give consideration to alternative data structures for representing data internal to pandas objects. Now in 2015/2016, continuing to be hamstrung by NumPy's limitations feels somewhat anachronistic. User code is riddled with workarounds for data type fidelity issues and the like. Like, really, why not add a bitndarray (similar to ilanschnell/bitarray) for storing nullness for problematic types and hide this from the user? =)

Since we are now a NumFOCUS-sponsored project, I feel like we might consider establishing some formal governance over pandas and publishing meetings notes and roadmap documents describing plans for the project and meetings notes from committers. There's no real "committer culture" for NumFOCUS projects like there is with the Apache Software Foundation, but we might try leading by example!

Also, I believe pandas as a project has reached a level of importance where we ought to consider planning and execution on larger scale undertakings such as this for safeguarding the future.

As for myself, well, I have my hands full in Big Data-land. I wish I could be helping more with pandas, but there are quite a few fundamental issues (like data interoperability, nested data handling, and file format support — e.g. Parquet, see http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/ ) preventing Python from being more useful in industry analytics applications.

Aside: one of the bigger mistakes I made with pandas's API design was making it acceptable to call class constructors — like pandas.DataFrame — directly (versus factory functions). Sorry about that! If we could convince everyone to start writing pandas.data_frame or dataframe instead of using the class reference it would help a lot with code cleanup. It's hard to plan for these things — NumPy interoperability seemed a lot more important in 2008 than it does now, so I forgive myself.

cheers and best wishes for 2016,
Wes

_______________________________________________
Pandas-dev mailing list
Pandas-dev@python.org
https://mail.python.org/mailman/listinfo/pandas-dev
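The Int32Array->add / Float32Array->add dispatch described in the quoted thread can be sketched as follows (hypothetical names, not a real libpandas API): a common Array base class is the black box, and each concrete dtype decides how its add treats NAs, with the integer type masking and the float type simply letting NaN carry the missingness:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical sketch (invented names, not a real libpandas API): a black-box
// Array base class; each concrete dtype decides how "add" handles NAs.
class Array {
 public:
  virtual ~Array() = default;
  virtual bool IsNull(std::size_t i) const = 0;
};

class Float32Array : public Array {
 public:
  explicit Float32Array(std::vector<float> v) : data_(std::move(v)) {}
  bool IsNull(std::size_t i) const override { return std::isnan(data_[i]); }
  float Value(std::size_t i) const { return data_[i]; }
  // Defers to plain arithmetic: NaN + x == NaN, so NA falls out for free.
  std::unique_ptr<Float32Array> Add(const Float32Array& other) const {
    std::vector<float> out(data_.size());
    for (std::size_t i = 0; i < data_.size(); ++i)
      out[i] = data_[i] + other.data_[i];
    return std::make_unique<Float32Array>(std::move(out));
  }
 private:
  std::vector<float> data_;
};

class Int32Array : public Array {
 public:
  Int32Array(std::vector<int32_t> v, std::vector<bool> null)
      : data_(std::move(v)), null_(std::move(null)) {}
  bool IsNull(std::size_t i) const override { return null_[i]; }
  int32_t Value(std::size_t i) const { return data_[i]; }
  // Responsible for its own masking: result is NA if either input is NA.
  std::unique_ptr<Int32Array> Add(const Int32Array& other) const {
    std::vector<int32_t> out(data_.size(), 0);
    std::vector<bool> null(data_.size());
    for (std::size_t i = 0; i < data_.size(); ++i) {
      null[i] = null_[i] || other.null_[i];
      if (!null[i]) out[i] = data_[i] + other.data_[i];
    }
    return std::make_unique<Int32Array>(std::move(out), std::move(null));
  }
 private:
  std::vector<int32_t> data_;
  std::vector<bool> null_;  // stand-in for a packed validity bitmask
};
```

The point of the sketch is that the NA policy lives entirely inside each array type, so code above the abstraction never special-cases dtypes.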
I was asked about this off list, so I'll belatedly share my thoughts.

First of all, I am really excited by Wes's renewed engagement in the project and his interest in rewriting pandas internals. This is quite an ambitious plan and nobody is better positioned to tackle it than Wes. I have mixed feelings about the details of the rewrite itself.

+1 on the simpler internal data model. The block manager is confusing and leads to hard-to-predict performance issues related to copying data. If we can do all column additions/removals/re-orderings without a copy it will be a clear win.

+0 on moving internals to C++. I do like the performance benefits, but it seems like a lot of work, and it may make pandas less friendly to new contributors.

-0 on writing a brand new dtype system just for pandas -- this stuff really belongs in NumPy (or another array library like DyND), and I am skeptical that pandas can do a complete enough job to be useful without replicating all that functionality.

More broadly, I am concerned that this rewrite may improve the tabular computation ecosystem at the cost of inter-operability with the array-based ecosystem (numpy, scipy, sklearn, xarray, etc.). The latter has been one of the strengths of pandas and it would be a shame to see that go away. We're already starting to struggle with inter-operability with the new pandas dtypes, and a further rewrite would make this even harder. For example, see categoricals and scikit-learn in Tom's recent post [1], or the fact that .values no longer always returns a numpy array. This has also been a challenge for xarray, which can't handle these new dtypes because we lack a suitable array backend for them.

Personally, I would much rather leverage a full featured library like an improved NumPy or DyND for new dtypes, because that could also be used by the array-based ecosystem. At the very least, it would be good to think about zero-copy inter-operability with array-based tools.
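The zero-copy inter-operability point can be made concrete with a small sketch (hypothetical, not an actual pandas or xarray interface): if a column owns one contiguous buffer, an array-based consumer can wrap a raw (pointer, length) view of it without copying:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: a column exposing its contiguous buffer as a
// (pointer, length) view, so an array-based consumer (e.g. something
// wrapping it as a NumPy array) can read it with zero copies.
class Float64Column {
 public:
  explicit Float64Column(std::vector<double> v) : data_(std::move(v)) {}

  struct View {
    const double* ptr;  // borrowed, not owned: valid while the column lives
    std::size_t len;
  };

  View GetView() const { return {data_.data(), data_.size()}; }

  void Set(std::size_t i, double v) { data_[i] = v; }

 private:
  std::vector<double> data_;
};
```

Because no copy is made, a consumer holding the View observes in-place updates to the column immediately; that sharing (and who is allowed to write) is exactly what an interop protocol would have to pin down.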
On the other hand, I wonder if maybe it would be better to write a native in-memory backend for Ibis instead of rewriting pandas. Ibis does seem to have an improved/simplified API which resolves many of pandas's warts. That said, it's a pretty big change from the "DataFrame as matrix" model, and pandas won't be going away anytime soon. I do like that it would force users to be more explicit about converting between tables and arrays, which might also make distinctions between the tabular and array-oriented ecosystems easier to swallow.

Just my two cents, from someone who has lots of opinions but who will likely stay on the sidelines for most of this work.

Cheers,
Stephan

[1] http://tomaugspurger.github.io/categorical-pipelines.html

On Fri, Jan 1, 2016 at 6:06 PM, Jeff Reback <jeffreback@gmail.com> wrote:
ok I moved the document to the Pandas folder, where the same group should be able to edit/upload/etc. lmk if any issues
On Fri, Jan 1, 2016 at 8:48 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Thanks Jeff. Can you create and share a shared Drive folder containing this where I can put other auxiliary / follow up documents?
I changed the doc so that the core dev people can edit. I *think* that everyone should be able to view/comment though.
On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Jeff -- can you require log-in for editing on this document?
https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnU...
There are a number of anonymous edits.
On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney <wesmckinn@gmail.com>
wrote:
I cobbled together an ugly start of a c++->cython->pandas toolchain here
https://github.com/wesm/pandas/tree/libpandas-native-core
I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a bit messy at the moment but it should be sufficient to run some real experiments with a little more work. I reckon it's like a 6 month project to tear out the insides of Series and DataFrame and replace it with a new "native core", but we should be able to get enough info to see whether it's a viable plan within a month or so.
The end goal is to create "private" extension types in Cython that can be the new base classes for Series and NDFrame; these will hold a reference to a C++ object that contains wrappered NumPy arrays and other metadata (like pandas-only dtypes).
It might be too hard to try to replace a single usage of block manager as a first experiment, so I'll try to create a minimal "SeriesLite" that supports 3 dtypes
1) float64 with nans 2) int64 with a bitmask for NAs 3) category type for one of these
Just want to get a feel for the extensibility and offer an NA singleton Python object (a la None) for getting and setting NAs across these 3 dtypes.
If we end up going down this route, any way to place a moratorium on invasive work on pandas internals (outside bug fixes)?
Pedantic aside: I'd rather avoid shipping thirdparty C/C++ libraries like googletest and friends in pandas if we can. Cloudera folks have been working on a portable C++ library toolchain for Impala and other projects at https://github.com/cloudera/native-toolchain, but it is only being tested on Linux and OS X. Most google libraries should build out of the box on MSVC but it'll be something to keep an eye on.
BTW thanks to the libdynd developers for pioneering the c++ lib <-> python-c++ lib <-> cython toolchain; being able to build Cython extensions directly from cmake is a godsend
HNY all Wes
On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid <izaid@continuum.io> wrote:
Yeah, that seems reasonable and I totally agree a Pandas wrapper layer would be necessary.
I'll keep an eye on this and I'd like to help if I can.
Irwin
On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney <wesmckinn@gmail.com> wrote: > > I'm not suggesting a rewrite of NumPy functionality but rather
> functionality that is currently written in a mishmash of Cython and > Python. > Happy to experiment with changing the internal compute infrastructure > and > data representation to DyND after this first stage of cleanup is done. > Even > if we use DyND a pretty extensive pandas wrapper layer will be > necessary. > > > On Tuesday, December 29, 2015, Irwin Zaid <izaid@continuum.io> wrote: >> >> Hi Wes (and others), >> >> I've been following this conversation with interest. I do think it >> would >> be worth exploring DyND, rather than setting up yet another rewrite >> of >> NumPy-functionality. Especially because DyND is already an
>> dependency of Pandas. >> >> For things like Integer NA and new dtypes, DyND is there and ready to >> do >> this. >> >> Irwin >> >> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney < wesmckinn@gmail.com> >> wrote: >>> >>> Can you link to the PR you're talking about? >>> >>> I will see about spending a few hours setting up a libpandas.so as a >>> C++ >>> shared library where we can run some experiments and validate >>> whether it can >>> solve the integer-NA problem and be a place to put new data types >>> (categorical and friends). I'm +1 on targeting >>> >>> Would it also be worth making a wish list of APIs we might consider >>> breaking in a pandas 1.0 release that also features this new "native >>> core"? >>> Might as well right some wrongs while we're doing some invasive work >>> on the >>> internals; some breakage might be unavoidable. We can always >>> maintain a >>> pandas legacy 0.x.x maintenance branch (providing a conda binary >>> build) for >>> legacy users where showstopper bugs can get fixed. >>> >>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback < jeffreback@gmail.com> >>> wrote: >>> > Wes your last is noted as well. I *think* we can actually do
>>> > now >>> > (well >>> > there is a PR out there). >>> > >>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney >>> > <wesmckinn@gmail.com> >>> > wrote: >>> >> >>> >> The other huge thing this will enable is to do is copy-on-write >>> >> for >>> >> various kinds of views, which should cut down on some of the >>> >> defensive >>> >> copying in the library and reduce memory usage. >>> >> >>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney >>> >> <wesmckinn@gmail.com> >>> >> wrote: >>> >> > Basically the approach is >>> >> > >>> >> > 1) Base dtype type >>> >> > 2) Base array type with K >= 1 dimensions >>> >> > 3) Base scalar type >>> >> > 4) Base index type >>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into >>> >> > categories >>> >> > #1, #2, #3, #4 >>> >> > 6) Subclasses for pandas-specific types like category, >>> >> > datetimeTZ, >>> >> > etc. >>> >> > 7) NDFrame as cpcloud wrote is just a list of these >>> >> > >>> >> > Indexes and axis labels / column names can get layered on top. >>> >> > >>> >> > After we do all this we can look at adding nested types >>> >> > (arrays, >>> >> > maps, >>> >> > structs) to better support JSON. >>> >> > >>> >> > - Wes >>> >> > >>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud >>> >> > <cpcloud@gmail.com> >>> >> > wrote: >>> >> >> Maybe this is saying the same thing as Wes, but how far would >>> >> >> something >>> >> >> like >>> >> >> this get us? >>> >> >> >>> >> >> // warning: things are probably not this simple >>> >> >> >>> >> >> struct data_array_t { >>> >> >> void *primitive; // scalar data >>> >> >> data_array_t *nested; // nested data >>> >> >> boost::dynamic_bitset isnull; // might have to create our >>> >> >> own >>> >> >> to >>> >> >> avoid >>> >> >> boost >>> >> >> schema_t schema; // not sure exactly what this looks
>>> >> >> }; >>> >> >> >>> >> >> typedef std::map<string, data_array_t> data_frame_t; // >>> >> >> probably >>> >> >> not >>> >> >> this >>> >> >> simple >>> >> >> >>> >> >> To answer Jeff’s use-case question: I think that the use cases >>> >> >> are >>> >> >> 1) >>> >> >> freedom from numpy (mostly) 2) no more block manager which >>> >> >> frees >>> >> >> us >>> >> >> from the >>> >> >> limitations of the block memory layout. In particular, the >>> >> >> ability >>> >> >> to >>> >> >> take >>> >> >> advantage of memory mapped IO would be a big win IMO. >>> >> >> >>> >> >> >>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney >>> >> >> <wesmckinn@gmail.com> >>> >> >> wrote: >>> >> >>> >>> >> >>> I will write a more detailed response to some of these
>>> >> >>> after >>> >> >>> the new year, but, in particular, re: missing values, can you >>> >> >>> or >>> >> >>> someone tell me why creating an object that contains a NumPy >>> >> >>> array and >>> >> >>> a bitmap is not sufficient? If we we can add a lightweight >>> >> >>> C/C++ >>> >> >>> class >>> >> >>> layer between NumPy function calls (e.g. arithmetic) and >>> >> >>> pandas >>> >> >>> function calls, then I see no reason why we cannot have >>> >> >>> >>> >> >>> Int32Array->add >>> >> >>> >>> >> >>> and >>> >> >>> >>> >> >>> Float32Array->add >>> >> >>> >>> >> >>> do the right thing (the former would be responsible for >>> >> >>> bitmasking to >>> >> >>> propagate NA values; the latter would defer to NumPy). If we >>> >> >>> can >>> >> >>> put >>> >> >>> all the internals of pandas objects inside a black box, we >>> >> >>> can >>> >> >>> add >>> >> >>> layers of virtual function indirection without a
>>> >> >>> penalty >>> >> >>> (e.g. adding more interpreter overhead with more abstraction >>> >> >>> layers >>> >> >>> does add up to a perf penalty). >>> >> >>> >>> >> >>> I don't think this is too scary -- I would be willing to >>> >> >>> create a >>> >> >>> small POC C++ library to prototype something like what I'm >>> >> >>> talking >>> >> >>> about. >>> >> >>> >>> >> >>> Since pandas has limited points of contact with NumPy I don't >>> >> >>> think >>> >> >>> this would end up being too onerous. >>> >> >>> >>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I >>> >> >>> think it >>> >> >>> is a >>> >> >>> useful tool if you pick a sane 20% subset of the C++11 spec >>> >> >>> and >>> >> >>> follow >>> >> >>> Google C++ style it's not very inaccessible to intermediate >>> >> >>> developers. More or less "C plus OOP and easier object >>> >> >>> lifetime >>> >> >>> management (shared/unique_ptr, etc.)". As soon as you add a >>> >> >>> lot >>> >> >>> of >>> >> >>> template metaprogramming C++ library development quickly >>> >> >>> becomes >>> >> >>> inaccessible except to the C++-Jedi. >>> >> >>> >>> >> >>> Maybe let's start a Google document on "pandas roadmap" where >>> >> >>> we >>> >> >>> can >>> >> >>> break down the 1-2 year goals and some of these >>> >> >>> infrastructure >>> >> >>> issues >>> >> >>> and have our discussion there? (obviously publish this >>> >> >>> someplace >>> >> >>> once >>> >> >>> we're done) >>> >> >>> >>> >> >>> - Wes >>> >> >>> >>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback >>> >> >>> <jeffreback@gmail.com> >>> >> >>> wrote: >>> >> >>> > Here are some of my thoughts about pandas Roadmap / status >>> >> >>> > and >>> >> >>> > some >>> >> >>> > responses to Wes's thoughts. 
[snip -- Jeff's roadmap summary, quoted in full further up the thread]
Adding numba / dask to the list would be helpful.

I think that almost all performance issues are the result of:

a) gross misuse of the pandas API. How much code have you seen that does df.apply(lambda x: x.sum())?
b) routines which operate column-by-column rather than block-by-block and are in python space (e.g. we have an issue right now about .quantile)

So I am glossing over a big goal of having a c++ library that represents the pandas internals. This would by definition have a c-API, so that you *could* use pandas-like semantics in c/c++ and just have it work (and then pandas would be a thin wrapper around this library).

I am not averse to this, but I think it would be quite a big effort, and not a huge perf boost IMHO. Further, there are a number of API issues w.r.t. indexing which need to be clarified / worked out (e.g. should we simply deprecate []?) that are much easier to test / figure out in python space.

I also think that we have quite a large number of contributors. Moving to c++ might make the internals a bit more impenetrable than the current internals (though this would allow c++ people to contribute, so this might balance out).

We have a limited core of devs who right now are familiar with things. If someone happened to have a starting base for a c++ library, then I might change opinions here.

my 4c.

Jeff

On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

Deep thoughts during the holidays.

I might be out of line here, but the interpreter-heaviness of the inside of pandas objects is likely to be a long-term liability and source of performance problems and technical debt.

Has anyone put any thought into planning and beginning to execute on a rewrite that moves as much as possible of the internals into native / compiled code? I'm talking about:

- pandas/core/internals
- indexing and assignment
- much of pandas/core/common
- categorical and custom dtypes
- all indexing mechanisms

I'm concerned we've already exposed too much internals to users, so this might lead to a lot of API breakage, but it might be for the Greater Good. As a first step, beginning a partial migration of internals into some C++ classes that encapsulate the insides of DataFrame objects and implement indexing and block-level manipulations would be a good place to start. I think you could do this without too much disruption.

As part of this internal retooling we might give consideration to alternative data structures for representing data internal to pandas objects. Now in 2015/2016, continuing to be hamstrung by NumPy's limitations feels somewhat anachronistic. User code is riddled with workarounds for data type fidelity issues and the like. Like, really, why not add a bitndarray (similar to ilanschnell/bitarray) for storing nullness for problematic types and hide this from the user? =)

Since we are now a NumFOCUS-sponsored project, I feel like we might consider establishing some formal governance over pandas and publishing roadmap documents describing plans for the project and meeting notes from committers. There's no real "committer culture" for NumFOCUS projects like there is with the Apache Software Foundation, but we might try leading by example!

Also, I believe pandas as a project has reached a level of importance where we ought to consider planning and execution on larger scale undertakings such as this for safeguarding the future.

As for myself, well, I have my hands full in Big Data-land. I wish I could be helping more with pandas, but there are quite a few fundamental issues (like data interoperability, nested data handling, and file format support -- e.g. Parquet, see http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/ ) preventing Python from being more useful in industry analytics applications.

Aside: one of the bigger mistakes I made with pandas's API design was making it acceptable to call class constructors -- like pandas.DataFrame -- directly (versus factory functions). Sorry about that! If we could convince everyone to start writing pandas.data_frame or dataframe instead of using the class reference it would help a lot with code cleanup. It's hard to plan for these things -- NumPy interoperability seemed a lot more important in 2008 than it does now, so I forgive myself.

cheers and best wishes for 2016,
Wes

_______________________________________________
Pandas-dev mailing list
Pandas-dev@python.org
https://mail.python.org/mailman/listinfo/pandas-dev
hey Stephan,

Thanks for all the thoughts. Let me make a few off-the-cuff comments.

On Wed, Jan 6, 2016 at 10:11 AM, Stephan Hoyer <shoyer@gmail.com> wrote:
I was asked about this off list, so I'll belatedly share my thoughts.
First of all, I am really excited by Wes's renewed engagement in the project and his interest in rewriting pandas internals. This is quite an ambitious plan and nobody is better positioned to tackle it than Wes.
I have mixed feelings about the details of the rewrite itself.
+1 on the simpler internal data model. The block manager is confusing and leads to hard to predict performance issues related to copying data. If we can do all column additions/removals/re-orderings without a copy it will be a clear win.
+0 on moving internals to C++. I do like the performance benefits, but it seems like a lot of work, and it may make pandas less friendly to new contributors.
It really goes beyond performance benefits. If you go back to my 2013 talk http://www.slideshare.net/wesm/practical-medium-data-analytics-with-python there's a long list of architectural problems that now in 2016 haven't found solutions. The only way (that I can fully reason through -- I am happy to look at alternate proposals) to move the internals of pandas closer to the metal is to give Series and DataFrame a C/C++ API -- this is the "libpandas native core" as I've been describing.
-0 on writing a brand new dtype system just for pandas -- this stuff really belongs in NumPy (or another array library like DyND), and I am skeptical that pandas can do a complete enough job to be useful without replicating all that functionality.
I'm curious what "a brand new dtype system" means to you. pandas already has its own data type system, but it's a potpourri of inconsistencies and rough edges with self-evident problems for both users and developers. Some indicators:

- Some pandas types use NaN for missing data, others None (or both), others nothing at all. We lose data (integers) or bloat memory (booleans) by upcasting to float-NaN or object-None.
- Internal functions full of is_XXX_dtype checks: pandas.core.common, pandas.core.algorithms, etc.
- Series.values on synthetic dtypes like Categorical
- We use arrays of Python objects for string data

The biggest cause IMHO is that pandas is too tightly coupled to NumPy, but it's coupled in a way that makes development and extensibility difficult. We've already allowed NumPy-specific details to taint the pandas user API in many unpleasant ways. This isn't to say "NumPy is bad" but rather "pandas tries to layer domain-specific functionality [that NumPy was not designed for] on top".

Some things I'm advocating with the internals refactor:

1) First-class "pandas type" objects. This is not the same as a NumPy dtype, which has some pretty loaded implications -- in particular, NumPy dtypes are implicitly coupled to an array computing framework (see the function table that is attached to the PyArray_Descr object).

2) pandas array container types that map user-land API calls to implementation-land API calls (in NumPy, DyND, or pandas-native code like pandas.core.algorithms etc.). This will make it much easier to leverage innovations in NumPy and DyND without those implementation details spilling over into the pandas user API.

3) Adding a single pandas.NA singleton to have one library-wide notion of a scalar null value (obviously, we can automatically map NaN and None to NA for backwards compatibility).

4) Layering a bitmask internally on NumPy arrays (especially integer and boolean) to add null-ness to types that need it.
Note that this does not prevent us from switching to DyND arrays with option dtype in the future. If the details of how we are implementing NULL are visible to the user, we have failed.

5) Removing the block manager in favor of simpler pandas Array (1D) and Table (2D -- a vector of Arrays) data structures.

I believe you can do all this without harming interoperability with the ecosystem of projects that people currently use in conjunction with pandas.
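As an illustration of what point 5 buys us (a hypothetical sketch, not the proposed implementation): if a Table is just an ordered mapping of column name to 1-D array, then column insertion, removal, and reordering only touch the mapping and never copy or consolidate the data, unlike the block manager:

```python
import numpy as np

# Hypothetical sketch: Table as an ordered mapping of name -> 1-D array.
# Column-level operations manipulate the mapping only; the arrays
# themselves are stored by reference and never copied.

class Table:
    def __init__(self):
        self.columns = {}  # dicts preserve insertion order in Python 3.7+

    def add_column(self, name, values):
        self.columns[name] = values    # stores a reference; no copy

    def drop_column(self, name):
        del self.columns[name]         # other columns are untouched

x = np.arange(3)
t = Table()
t.add_column("x", x)
t.add_column("y", np.ones(3))
t.drop_column("y")
assert t.columns["x"] is x  # still the caller's array, not a copy
```

This also naturally supports the "split" / pass-through input policy Jeff describes earlier in the thread, since user-provided arrays can be stored as views.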
More broadly, I am concerned that this rewrite may improve the tabular computation ecosystem at the cost of inter-operability with the array-based ecosystem (numpy, scipy, sklearn, xarray, etc.). The latter has been one of the strengths of pandas and it would be a shame to see that go away.
I have no intention of letting this happen. What I am asking from you (and others reading) is to help define what constitutes interoperability. What guarantees do we make the user? For example, we should have very strict guidelines for the output of:

    np.asarray(pandas_obj)

For example:

    In [3]: s = pd.Series([1,2,3]*10).astype('category')

    In [4]: np.asarray(s)
    Out[4]:
    array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3,
           1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3])

I see no reason why this should necessarily behave any differently. The problem will come in when there is pandas data that is not precisely representable in a NumPy array. Example:

    In [5]: s = pd.Series([1, 2, 3, 4])

    In [6]: s.dtype
    Out[6]: dtype('int64')

    In [7]: s2 = s.reindex(np.arange(10))

    In [8]: s2.dtype
    Out[8]: dtype('float64')

    In [9]: np.asarray(s2)
    Out[9]: array([ 1.,  2.,  3.,  4., nan, nan, nan, nan, nan, nan])

With the "new internals", s2 will still be int64 type, but we may decide that np.asarray(s2) should raise an exception rather than implicitly make a decision about how to perform a "lossy" conversion to a NumPy array. If you are using DyND with pandas, then the equivalent function would be able to implicitly convert without data loss.
We're already starting to struggle with inter-operability with the new pandas dtypes and a further rewrite would make this even harder. For example, see categoricals and scikit-learn in Tom's recent post [1], or the fact that .values no longer always returns a numpy array. This has also been a challenge for xarray, which can't handle these new dtypes because we lack a suitable array backend for them.
I'm definitely motivated in this initiative by these challenges. The idea here is that with the new internals, Series.values will always return the same type of object, and there will be one consistent code path for getting a NumPy array out. For example, rather than:

    if isinstance(s.values, Categorical):
        # pandas
        ...
    else:
        # NumPy
        ...

we could have (just an idea):

    s.values.to_numpy()

or simply:

    np.asarray(s.values)
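For instance (just a sketch with hypothetical names), if every .values object implements NumPy's __array__ protocol, np.asarray(s.values) becomes the single conversion path regardless of the backing store:

```python
import numpy as np

# Hypothetical sketch: a non-NumPy-backed values object that still
# supports np.asarray() via the __array__ protocol, so consumers need
# only one code path for all column types.

class CategoricalValues:
    def __init__(self, codes, categories):
        self.codes = np.asarray(codes)
        self.categories = np.asarray(categories)

    def to_numpy(self):
        # Materialize the category labels; this is the lossy/costly step.
        return self.categories[self.codes]

    def __array__(self, dtype=None, copy=None):
        out = self.to_numpy()
        return out if dtype is None else out.astype(dtype)

vals = CategoricalValues([0, 1, 0], categories=np.array(["a", "b"]))
arr = np.asarray(vals)  # same call works on a plain NumPy-backed column
# list(arr) == ['a', 'b', 'a']
```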
Personally, I would much rather leverage a full featured library like an improved NumPy or DyND for new dtypes, because that could also be used by the array-based ecosystem. At the very least, it would be good to think about zero-copy inter-operability with array-based tools.
I'm all for zero-copy interoperability when possible, but my gut feeling is that exposing the data type system of an array library (the choice of which is an implementation detail) to pandas users is an inherently leaky abstraction that will continue to cause problems if we plan to keep innovating inside pandas. By better hiding NumPy details and types from the user we will make it much easier to swap out new low-level array data structures and compute components (e.g. DyND), or add custom data structures or out-of-core tools (memory maps, bcolz, etc.).

I'm additionally offering to do nearly all of this replumbing of pandas internals myself, and completely in my free time. What I will expect in return from you all is to help enumerate our contracts with the pandas user (i.e. interoperability) and to hold me accountable to not break them.

I know I haven't been committing code on pandas since mid-2013 (after a 5-year marathon), but these architectural problems have been on my mind almost constantly since then; I just haven't had the bandwidth to start tackling them.

cheers,
Wes
On the other hand, I wonder if maybe it would be better to write a native in-memory backend for Ibis instead of rewriting pandas. Ibis does seem to have an improved/simplified API which resolves many of pandas's warts. That said, it's a pretty big change from the "DataFrame as matrix" model, and pandas won't be going away anytime soon. I do like that it would force users to be more explicit about converting between tables and arrays, which might also make distinctions between the tabular and array-oriented ecosystems easier to swallow.
Just my two cents, from someone who has lots of opinions but who will likely stay on the sidelines for most of this work.
Cheers, Stephan
[1] http://tomaugspurger.github.io/categorical-pipelines.html
On Fri, Jan 1, 2016 at 6:06 PM, Jeff Reback <jeffreback@gmail.com> wrote:
ok I moved the document to the Pandas folder, where the same group should be able to edit/upload/etc. lmk if any issues
On Fri, Jan 1, 2016 at 8:48 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Thanks Jeff. Can you create and share a shared Drive folder containing this where I can put other auxiliary / follow up documents?
On Fri, Jan 1, 2016 at 5:23 PM, Jeff Reback <jeffreback@gmail.com> wrote:
I changed the doc so that the core dev people can edit. I *think* that everyone should be able to view/comment though.
On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Jeff -- can you require log-in for editing on this document?
https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnU...
There are a number of anonymous edits.
On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
I cobbled together an ugly start of a c++->cython->pandas toolchain here
https://github.com/wesm/pandas/tree/libpandas-native-core
I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a bit messy at the moment, but it should be sufficient to run some real experiments with a little more work. I reckon it's like a 6-month project to tear out the insides of Series and DataFrame and replace them with a new "native core", but we should be able to get enough info to see whether it's a viable plan within a month or so.
The end goal is to create "private" extension types in Cython that can be the new base classes for Series and NDFrame; these will hold a reference to a C++ object that contains wrapped NumPy arrays and other metadata (like pandas-only dtypes).
It might be too hard to try to replace a single usage of the block manager as a first experiment, so I'll try to create a minimal "SeriesLite" that supports 3 dtypes:
1) float64 with nans
2) int64 with a bitmask for NAs
3) category type for one of these
Just want to get a feel for the extensibility and offer an NA singleton Python object (a la None) for getting and setting NAs across these 3 dtypes.
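A hedged sketch of what getting and setting the NA singleton could look like for the first two dtypes (all names hypothetical; the category case would work analogously via its codes):

```python
import numpy as np

# Hypothetical sketch of the "SeriesLite" idea: one NA singleton, with
# __getitem__/__setitem__ translating to each storage's representation
# (a bitmask for int64, NaN for float64). Not a proposed pandas API.

class NAType:
    def __repr__(self):
        return "NA"

NA = NAType()

class IntSeriesLite:
    def __init__(self, values):
        self.values = np.asarray(values, dtype=np.int64)
        self.valid = np.ones(len(self.values), dtype=bool)

    def __getitem__(self, i):
        return int(self.values[i]) if self.valid[i] else NA

    def __setitem__(self, i, value):
        if value is NA:
            self.valid[i] = False   # flip the mask; storage stays int64
        else:
            self.values[i] = value
            self.valid[i] = True

class FloatSeriesLite:
    def __init__(self, values):
        self.values = np.asarray(values, dtype=np.float64)

    def __getitem__(self, i):
        v = self.values[i]
        return NA if np.isnan(v) else float(v)

    def __setitem__(self, i, value):
        self.values[i] = np.nan if value is NA else value

s = IntSeriesLite([1, 2, 3])
s[1] = NA           # s[1] is NA; s.values keeps its int64 dtype
f = FloatSeriesLite([1.0, 2.0])
f[0] = NA           # stored as NaN under the hood, surfaced as NA
```

The user-facing behavior is identical across both containers, which is the extensibility property being tested.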
If we end up going down this route, any way to place a moratorium on invasive work on pandas internals (outside bug fixes)?
Pedantic aside: I'd rather avoid shipping thirdparty C/C++ libraries like googletest and friends in pandas if we can. Cloudera folks have been working on a portable C++ library toolchain for Impala and other projects at https://github.com/cloudera/native-toolchain, but it is only being tested on Linux and OS X. Most google libraries should build out of the box on MSVC but it'll be something to keep an eye on.
BTW thanks to the libdynd developers for pioneering the c++ lib <-> python-c++ lib <-> cython toolchain; being able to build Cython extensions directly from cmake is a godsend
HNY all Wes
On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid <izaid@continuum.io> wrote:

Yeah, that seems reasonable and I totally agree a Pandas wrapper layer would be necessary.

I'll keep an eye on this and I'd like to help if I can.

Irwin

On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

I'm not suggesting a rewrite of NumPy functionality but rather pandas functionality that is currently written in a mishmash of Cython and Python. Happy to experiment with changing the internal compute infrastructure and data representation to DyND after this first stage of cleanup is done. Even if we use DyND, a pretty extensive pandas wrapper layer will be necessary.

On Tuesday, December 29, 2015, Irwin Zaid <izaid@continuum.io> wrote:

Hi Wes (and others),

I've been following this conversation with interest. I do think it would be worth exploring DyND, rather than setting up yet another rewrite of NumPy functionality, especially because DyND is already an optional dependency of Pandas.

For things like Integer NA and new dtypes, DyND is there and ready to do this.

Irwin

On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

Can you link to the PR you're talking about?

I will see about spending a few hours setting up a libpandas.so as a C++ shared library where we can run some experiments and validate whether it can solve the integer-NA problem and be a place to put new data types (categorical and friends). I'm +1 on targeting

Would it also be worth making a wish list of APIs we might consider breaking in a pandas 1.0 release that also features this new "native core"?
Might as well right some wrongs while we're doing some invasive work on the internals; some breakage might be unavoidable. We can always maintain a pandas legacy 0.x.x maintenance branch (providing a conda binary build) for legacy users where showstopper bugs can get fixed.

On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback <jeffreback@gmail.com> wrote:

Wes your last is noted as well. I *think* we can actually do this now (well, there is a PR out there).

On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

The other huge thing this will enable us to do is copy-on-write for various kinds of views, which should cut down on some of the defensive copying in the library and reduce memory usage.

On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

Basically the approach is:

1) Base dtype type
2) Base array type with K >= 1 dimensions
3) Base scalar type
4) Base index type
5) "Wrapper" subclasses for all NumPy types fitting into categories #1, #2, #3, #4
6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
7) NDFrame as cpcloud wrote is just a list of these

Indexes and axis labels / column names can get layered on top.

After we do all this we can look at adding nested types (arrays, maps, structs) to better support JSON.

- Wes

On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud <cpcloud@gmail.com> wrote:

Maybe this is saying the same thing as Wes, but how far would something like this get us?
    // warning: things are probably not this simple

    struct data_array_t {
        void *primitive;               // scalar data
        data_array_t *nested;          // nested data
        boost::dynamic_bitset isnull;  // might have to create our own to avoid boost
        schema_t schema;               // not sure exactly what this looks like
    };

    typedef std::map<string, data_array_t> data_frame_t;  // probably not this simple

To answer Jeff's use-case question: I think that the use cases are 1) freedom from numpy (mostly) and 2) no more block manager, which frees us from the limitations of the block memory layout. In particular, the ability to take advantage of memory-mapped IO would be a big win IMO.

On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney <wesmckinn@gmail.com> wrote:
[snip]
Disclaimer: I do work for Continuum >>>> >> >>> > but these views are of course my own; furthermore >>>> >> >>> > obviously >>>> >> >>> > I >>>> >> >>> > am a >>>> >> >>> > bit >>>> >> >>> > more >>>> >> >>> > familiar with some of the 'sponsored' open-source >>>> >> >>> > libraries, but always open to new things. >>>> >> >>> > >>>> >> >>> > - integration / automatic deferral to numba for JIT >>>> >> >>> > (this >>>> >> >>> > would >>>> >> >>> > be >>>> >> >>> > thru >>>> >> >>> > .apply) >>>> >> >>> > - automatic deferal to dask from groubpy where >>>> >> >>> > appropriate >>>> >> >>> > / >>>> >> >>> > maybe a >>>> >> >>> > .to_parallel (to simply return a dask.DataFrame object) >>>> >> >>> > - incorporation of quantities / units (as part of the >>>> >> >>> > dtype) >>>> >> >>> > - use of DyND to allow missing values for int dtypes >>>> >> >>> > - make Period a first class dtype. >>>> >> >>> > - provide some copy-on-write semantics to alleviate the >>>> >> >>> > chained-indexing >>>> >> >>> > issues which occasionaly come up with the mis-use of >>>> >> >>> > the >>>> >> >>> > indexing >>>> >> >>> > API >>>> >> >>> > - allow a 'policy' to automatically provide column >>>> >> >>> > blocks >>>> >> >>> > for >>>> >> >>> > dict-like >>>> >> >>> > input (e.g. each column would be a block), this would >>>> >> >>> > allow >>>> >> >>> > a >>>> >> >>> > pass-thru >>>> >> >>> > API >>>> >> >>> > where you could >>>> >> >>> > put in numpy arrays where you have views and have them >>>> >> >>> > preserved >>>> >> >>> > rather >>>> >> >>> > than >>>> >> >>> > copied automatically. Note that this would also allow >>>> >> >>> > what >>>> >> >>> > I >>>> >> >>> > call >>>> >> >>> > 'split' >>>> >> >>> > where a passed in >>>> >> >>> > multi-dim numpy array could be split up to individual >>>> >> >>> > blocks >>>> >> >>> > (which >>>> >> >>> > actually >>>> >> >>> > gives a nice perf boost after the splitting costs). 
>>>> >> >>> > >>>> >> >>> > In working towards some of these goals. I have come to >>>> >> >>> > the >>>> >> >>> > opinion >>>> >> >>> > that >>>> >> >>> > it >>>> >> >>> > would make sense to have a neutral API protocol layer >>>> >> >>> > that would allow us to swap out different engines as >>>> >> >>> > needed, >>>> >> >>> > for >>>> >> >>> > particular >>>> >> >>> > dtypes, or *maybe* out-of-core type computations. E.g. >>>> >> >>> > imagine that we replaced the in-memory block structure >>>> >> >>> > with >>>> >> >>> > a >>>> >> >>> > bclolz >>>> >> >>> > / >>>> >> >>> > memap >>>> >> >>> > type; in theory this should be 'easy' and just work. >>>> >> >>> > I could also see us adopting *some* of the SFrame code >>>> >> >>> > to >>>> >> >>> > allow >>>> >> >>> > easier >>>> >> >>> > interop with this API layer. >>>> >> >>> > >>>> >> >>> > In practice, I think a nice API layer would need to be >>>> >> >>> > created >>>> >> >>> > to >>>> >> >>> > make >>>> >> >>> > this >>>> >> >>> > clean / nice. >>>> >> >>> > >>>> >> >>> > So this comes around to Wes's point about creating a >>>> >> >>> > c++ >>>> >> >>> > library for >>>> >> >>> > the >>>> >> >>> > internals (and possibly even some of the indexing >>>> >> >>> > routines). >>>> >> >>> > In an ideal world, or course this would be desirable. >>>> >> >>> > Getting >>>> >> >>> > there >>>> >> >>> > is a >>>> >> >>> > bit >>>> >> >>> > non-trivial I think, and IMHO might not be worth the >>>> >> >>> > effort. I >>>> >> >>> > don't >>>> >> >>> > really see big performance bottlenecks. We *already* >>>> >> >>> > defer >>>> >> >>> > much >>>> >> >>> > of >>>> >> >>> > the >>>> >> >>> > computation to libraries like numexpr & bottleneck >>>> >> >>> > (where >>>> >> >>> > appropriate). >>>> >> >>> > Adding numba / dask to the list would be helpful. 
>>>> >> >>> > >>>> >> >>> > I think that almost all performance issues are the >>>> >> >>> > result >>>> >> >>> > of: >>>> >> >>> > >>>> >> >>> > a) gross misuse of the pandas API. How much code have >>>> >> >>> > you >>>> >> >>> > seen >>>> >> >>> > that >>>> >> >>> > does >>>> >> >>> > df.apply(lambda x: x.sum()) >>>> >> >>> > b) routines which operate column-by-column rather >>>> >> >>> > block-by-block and >>>> >> >>> > are >>>> >> >>> > in >>>> >> >>> > python space (e.g. we have an issue right now about >>>> >> >>> > .quantile) >>>> >> >>> > >>>> >> >>> > So I am glossing over a big goal of having a c++ >>>> >> >>> > library >>>> >> >>> > that >>>> >> >>> > represents >>>> >> >>> > the >>>> >> >>> > pandas internals. This would by definition have a c-API >>>> >> >>> > that so >>>> >> >>> > you *could* use pandas like semantics in c/c++ and just >>>> >> >>> > have it >>>> >> >>> > work >>>> >> >>> > (and >>>> >> >>> > then pandas would be a thin wrapper around this >>>> >> >>> > library). >>>> >> >>> > >>>> >> >>> > I am not averse to this, but I think would be quite a >>>> >> >>> > big >>>> >> >>> > effort, >>>> >> >>> > and >>>> >> >>> > not a >>>> >> >>> > huge perf boost IMHO. Further there are a number of API >>>> >> >>> > issues >>>> >> >>> > w.r.t. >>>> >> >>> > indexing >>>> >> >>> > which need to be clarified / worked out (e.g. should we >>>> >> >>> > simply >>>> >> >>> > deprecate >>>> >> >>> > []) >>>> >> >>> > that are much easier to test / figure out in python >>>> >> >>> > space. >>>> >> >>> > >>>> >> >>> > I also thing that we have quite a large number of >>>> >> >>> > contributors. >>>> >> >>> > Moving >>>> >> >>> > to >>>> >> >>> > c++ might make the internals a bit more impenetrable >>>> >> >>> > that >>>> >> >>> > the >>>> >> >>> > current >>>> >> >>> > internals. >>>> >> >>> > (though this would allow c++ people to contribute, so >>>> >> >>> > that >>>> >> >>> > might >>>> >> >>> > balance >>>> >> >>> > out). 
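To make (a) concrete: the lambda version produces results identical to the built-in reduction; it just pays Python interpreter overhead, once per column (and far worse with axis=1, once per row), to get them. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Anti-pattern: runs a Python lambda for each column,
# re-dispatching into pandas from interpreter space every time.
slow = df.apply(lambda x: x.sum())

# Idiomatic: one call that reduces each column in compiled code
# (numpy / bottleneck under the hood).
fast = df.sum()

assert slow.equals(fast)
```

Same numbers either way; the only difference is where the loop runs.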
We have a limited core of devs who right now are familiar with things. If someone happened to have a starting base for a C++ library, then I might change opinions here.

my 4c.

Jeff

On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

Deep thoughts during the holidays.

I might be out of line here, but the interpreter-heaviness of the inside of pandas objects is likely to be a long-term liability and source of performance problems and technical debt.

Has anyone put any thought into planning and beginning to execute on a rewrite that moves as much as possible of the internals into native / compiled code? I'm talking about:

- pandas/core/internals
- indexing and assignment
- much of pandas/core/common
- categorical and custom dtypes
- all indexing mechanisms

I'm concerned we've already exposed too much of the internals to users, so this might lead to a lot of API breakage, but it might be for the Greater Good. As a first step, beginning a partial migration of internals into some C++ classes that encapsulate the insides of DataFrame objects and implement indexing and block-level manipulations would be a good place to start. I think you could do this without too much disruption.

As part of this internal retooling we might give consideration to alternative data structures for representing data internal to pandas objects. Now in 2015/2016, continuing to be hamstrung by NumPy's limitations feels somewhat anachronistic. User code is riddled with workarounds for data type fidelity issues and the like. Like, really, why not add a bitndarray (similar to ilanschnell/bitarray) for storing nullness for problematic types and hide this from the user? =)

Since we are now a NumFOCUS-sponsored project, I feel like we might consider establishing some formal governance over pandas and publishing roadmap documents describing plans for the project and meeting notes from committers. There's no real "committer culture" for NumFOCUS projects like there is with the Apache Software Foundation, but we might try leading by example!

Also, I believe pandas as a project has reached a level of importance where we ought to consider planning and executing larger-scale undertakings such as this, for safeguarding the future.

As for myself, well, I have my hands full in Big Data-land. I wish I could be helping more with pandas, but there are quite a few fundamental issues (like data interoperability, nested data handling, and file format support -- e.g. Parquet, see http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/) preventing Python from being more useful in industry analytics applications.

Aside: one of the bigger mistakes I made with pandas's API design was making it acceptable to call class constructors -- like pandas.DataFrame -- directly (versus factory functions). Sorry about that! If we could convince everyone to start writing pandas.data_frame or dataframe instead of using the class reference, it would help a lot with code cleanup. It's hard to plan for these things -- NumPy interoperability seemed a lot more important in 2008 than it does now, so I forgive myself.

cheers and best wishes for 2016,
Wes
_______________________________________________
Pandas-dev mailing list
Pandas-dev@python.org
https://mail.python.org/mailman/listinfo/pandas-dev
On Wed, Jan 6, 2016 at 11:26 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
hey Stephan,
Thanks for all the thoughts. Let me make a few off-the-cuff comments.
On Wed, Jan 6, 2016 at 10:11 AM, Stephan Hoyer <shoyer@gmail.com> wrote:
I was asked about this off list, so I'll belatedly share my thoughts.
First of all, I am really excited by Wes's renewed engagement in the project and his interest in rewriting pandas internals. This is quite an ambitious plan and nobody is better positioned to tackle it than Wes.
I have mixed feelings about the details of the rewrite itself.
+1 on the simpler internal data model. The block manager is confusing and leads to hard to predict performance issues related to copying data. If we can do all column additions/removals/re-orderings without a copy it will be a clear win.
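The win described here follows directly from a "mapping of name to 1-D array" model: column addition, removal, and reordering become pointer manipulation rather than block consolidation. A toy sketch (SimpleTable is illustrative only, not a proposed API):

```python
import numpy as np

class SimpleTable:
    """Toy column store: one 1-D array per column, no 2-D blocks."""

    def __init__(self):
        self._columns = {}  # name -> 1-D ndarray, insertion-ordered

    def add_column(self, name, values):
        # Stores a reference: no copy, no consolidation step.
        self._columns[name] = values

    def drop_column(self, name):
        # O(1) dict deletion; no data is moved or copied.
        del self._columns[name]

    def __getitem__(self, name):
        return self._columns[name]

arr = np.arange(5)
table = SimpleTable()
table.add_column("a", arr)
table.add_column("b", arr * 2.0)  # mixed dtypes coexist without blocks
table.drop_column("b")

# The caller's array is preserved, not defensively copied.
assert np.shares_memory(arr, table["a"])
```

With consolidated 2-D blocks, by contrast, inserting a column of an existing dtype can trigger reallocation and copying of the whole block.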
+0 on moving internals to C++. I do like the performance benefits, but it seems like a lot of work, and it may make pandas less friendly to new contributors.
It really goes beyond performance benefits. If you go back to my 2013 talk http://www.slideshare.net/wesm/practical-medium-data-analytics-with-python there's a long list of architectural problems that now in 2016 haven't found solutions. The only way (that I can fully reason through -- I am happy to look at alternate proposals) to move the internals of pandas closer to the metal is to give Series and DataFrame a C/C++ API -- this is the "libpandas native core" as I've been describing.
I should point out that the main thing that's changed since that preso is "synthetic" data types like Categorical. But seeing what it took for Jeff et al. to build that is a prime motivation for this internals refactoring plan.
-0 on writing a brand new dtype system just for pandas -- this stuff really belongs in NumPy (or another array library like DyND), and I am skeptical that pandas can do a complete enough job to be useful without replicating all that functionality.
I'm curious what "a brand new dtype system" means to you. pandas already has its own data type system, but it's a potpourri of inconsistencies and rough edges with self-evident problems for both users and developers. Some indicators:
- Some pandas types use NaN for missing data, others None (or both), others nothing at all. We lose data (integers) or bloat memory (booleans) by upcasting to float-NaN or object-None.
- Internal functions full of is_XXX_dtype checks: pandas.core.common, pandas.core.algorithms, etc.
- Series.values on synthetic dtypes like Categorical
- We use arrays of Python objects for string data
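The upcasting problem described above is easy to demonstrate: introducing a single missing value silently changes the dtype of integer and boolean Series.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
assert s.dtype == np.int64

# Reindexing introduces a missing slot; the ints are upcast to
# float64 so NaN can represent it -- exact integers become floats.
s2 = s.reindex([0, 1, 2, 3])
assert s2.dtype == np.float64

# Booleans fare worse: they upcast to object dtype,
# storing a Python object per element.
b = pd.Series([True, False])
b2 = b.reindex([0, 1, 2])
assert b2.dtype == object
```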
The biggest cause IMHO is that pandas is too tightly coupled to NumPy, but it's coupled in a way that makes development and extensibility difficult. We've already allowed NumPy-specific details to taint the pandas user API in many unpleasant ways. This isn't to say "NumPy is bad" but rather "pandas tries to layer domain-specific functionality [that NumPy was not designed for] on top".
Some things I'm advocating with the internals refactor:
1) First class "pandas type" objects. This is not the same as a NumPy dtype which has some pretty loaded implications -- in particular, NumPy dtypes are implicitly coupled to an array computing framework (see the function table that is attached to the PyArray_Descr object)
2) Pandas array container types that map user-land API calls to implementation-land API calls (in NumPy, DyND, or pandas-native code like pandas.core.algorithms etc.). This will make it much easier to leverage innovations in NumPy and DyND without those implementation details spilling over into the pandas user API
3) Adding a single pandas.NA singleton to have one library-wide notion of a scalar null value (obviously, we can automatically map NaN and None to NA for backwards compatibility).
4) Layering a bitmask internally on NumPy arrays (especially integer and boolean) to add null-ness to types that need it. Note that this does not prevent us from switching to DyND arrays with option dtype in the future. If the details of how we are implementing NULL are visible to the user, we have failed.
5) Removing the block manager in favor of simpler pandas Array (1D) and Table (2D -- vector of Array) data structures
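The bitmask idea in point 4 can be sketched in a few lines: pair a plain NumPy integer array with a validity mask, and have arithmetic operate on the data while OR-ing the masks, so the user never sees the mask. This is a toy illustration (the class and method names are made up for this sketch, and a boolean array stands in for a packed bitmask):

```python
import numpy as np

class Int64Array:
    """Toy int64 array with null support via a separate mask."""

    def __init__(self, values, isnull=None):
        self.values = np.asarray(values, dtype=np.int64)
        self.isnull = (np.zeros(len(self.values), dtype=bool)
                       if isnull is None
                       else np.asarray(isnull, dtype=bool))

    def add(self, other):
        # Data stays int64 throughout; null-ness propagates
        # by OR-ing the two masks.
        return Int64Array(self.values + other.values,
                          self.isnull | other.isnull)

a = Int64Array([1, 2, 3], isnull=[False, True, False])
b = Int64Array([10, 20, 30])
c = a.add(b)

assert c.values.dtype == np.int64               # no upcast to float
assert list(c.isnull) == [False, True, False]   # NA propagated
```

This is exactly the Int32Array->add / Float32Array->add split described earlier in the thread: masked types do the bitmask bookkeeping, float types can simply defer to NumPy's NaN.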
I believe you can do all this without harming interoperability with the ecosystem of projects that people currently use in conjunction with pandas.
More broadly, I am concerned that this rewrite may improve the tabular computation ecosystem at the cost of inter-operability with the array-based ecosystem (numpy, scipy, sklearn, xarray, etc.). The latter has been one of the strengths of pandas and it would be a shame to see that go away.
I have no intention of letting this happen. What I am asking from you (and others reading) is to help define what constitutes interoperability. What guarantees do we make to the user?
For example, we should have very strict guidelines for the output of:
np.asarray(pandas_obj)
For example
In [3]: s = pd.Series([1,2,3]*10).astype('category')
In [4]: np.asarray(s)
Out[4]:
array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1,
       2, 3, 1, 2, 3, 1, 2, 3])
I see no reason why this should necessarily behave any differently. The problem will come in when there is pandas data that is not precisely representable in a NumPy array. Example:
In [5]: s = pd.Series([1,2,3, 4])
In [6]: s.dtype
Out[6]: dtype('int64')
In [7]: s2 = s.reindex(np.arange(10))
In [8]: s2.dtype
Out[8]: dtype('float64')

In [9]: np.asarray(s2)
Out[9]: array([  1.,   2.,   3.,   4.,  nan,  nan,  nan,  nan,  nan,  nan])
With the "new internals", s2 will still be int64 type, but we may decide that np.asarray(s2) should raise an exception rather than implicitly make a decision about how to perform a "lossy" conversion to a NumPy array. If you are using DyND with pandas, then the equivalent function would be able to implicitly convert without data loss.
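The "raise rather than silently upcast" policy could look roughly like this; to_numpy_strict is a hypothetical name, shown only to make the semantics concrete for the data-plus-mask representation sketched in this thread:

```python
import numpy as np

def to_numpy_strict(values, isnull):
    """Convert data + null mask to a NumPy array, refusing lossy casts.

    values : ndarray of the stored data (e.g. int64)
    isnull : boolean ndarray marking missing slots
    """
    if isnull.any() and values.dtype.kind in "iub":
        # No NumPy integer/unsigned/bool representation of NULL
        # exists; converting would force a lossy cast to
        # float-NaN or object-None.
        raise ValueError(
            "cannot losslessly convert data with nulls to a NumPy "
            f"array of dtype {values.dtype}")
    return values.copy()

vals = np.array([1, 2, 3, 4], dtype=np.int64)
mask = np.array([False, False, False, True])
try:
    to_numpy_strict(vals, mask)
except ValueError:
    pass  # lossy conversion refused, as proposed

# Without nulls, the conversion is exact and allowed.
ok = to_numpy_strict(vals, np.zeros(4, dtype=bool))
```

A DyND-backed implementation with an option dtype could instead convert without data loss, which is the distinction Wes draws above.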
We're already starting to struggle with inter-operability with the new pandas dtypes and a further rewrite would make this even harder. For example, see categoricals and scikit-learn in Tom's recent post [1], or the fact that .values no longer always returns a numpy array. This has also been a challenge for xarray, which can't handle these new dtypes because we lack a suitable array backend for them.
I'm definitely motivated in this initiative by these challenges. The idea here is that with the new internals, Series.values will always return the same type of object, and there will be one consistent code path for getting a NumPy array out. For example, rather than:
if isinstance(s.values, Categorical):
    # pandas
    ...
else:
    # NumPy
    ...
We could have (just an idea)
s.values.to_numpy()
Or simply
np.asarray(s.values)
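Both spellings could be made to agree by having the wrapper implement the __array__ protocol alongside an explicit to_numpy(); the class below is a hypothetical stand-in used to illustrate the idea, not pandas code:

```python
import numpy as np

class CategoricalValues:
    """Hypothetical stand-in for a pandas-internal array wrapper."""

    def __init__(self, codes, categories):
        self.codes = np.asarray(codes)
        self.categories = np.asarray(categories)

    def to_numpy(self):
        # The one explicit, documented conversion path:
        # materialize codes back into category values.
        return self.categories[self.codes]

    def __array__(self, dtype=None, copy=None):
        # np.asarray(obj) delegates here, so both spellings
        # share a single code path.
        out = self.to_numpy()
        return out if dtype is None else out.astype(dtype)

v = CategoricalValues([0, 1, 0], categories=["a", "b"])
assert list(v.to_numpy()) == ["a", "b", "a"]
assert list(np.asarray(v)) == ["a", "b", "a"]
```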
Personally, I would much rather leverage a full featured library like an improved NumPy or DyND for new dtypes, because that could also be used by the array-based ecosystem. At the very least, it would be good to think about zero-copy inter-operability with array-based tools.
I'm all for zero-copy interoperability when possible, but my gut feeling is that exposing the data type system of an array library (the choice of which is an implementation detail) to pandas users is an inherently leaky abstraction that will continue to cause problems if we plan to keep innovating inside pandas. By better hiding NumPy details and types from the user we will make it much easier to swap out new low-level array data structures and compute components (e.g. DyND), or add custom data structures or out-of-core tools (memory maps, bcolz, etc.)
I'm additionally offering to do nearly all of this replumbing of pandas internals myself, and completely in my free time. What I will expect in return from you all is to help enumerate our contracts with the pandas user (i.e. interoperability) and to hold me accountable to not break them. I know I haven't been committing code on pandas since mid-2013 (after a 5-year marathon), but these architectural problems have been on my mind almost constantly since then; I just haven't had the bandwidth to start tackling them.
cheers, Wes
On the other hand, I wonder if maybe it would be better to write a native in-memory backend for Ibis instead of rewriting pandas. Ibis does seem to have an improved/simplified API which resolves many of pandas's warts. That said, it's a pretty big change from the "DataFrame as matrix" model, and pandas won't be going away anytime soon. I do like that it would force users to be more explicit about converting between tables and arrays, which might also make distinctions between the tabular and array-oriented ecosystems easier to swallow.
Just my two cents, from someone who has lots of opinions but who will likely stay on the sidelines for most of this work.
Cheers, Stephan
[1] http://tomaugspurger.github.io/categorical-pipelines.html
On Fri, Jan 1, 2016 at 6:06 PM, Jeff Reback <jeffreback@gmail.com> wrote:
ok I moved the document to the Pandas folder, where the same group should be able to edit/upload/etc. lmk if any issues
On Fri, Jan 1, 2016 at 8:48 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Thanks Jeff. Can you create and share a shared Drive folder containing this where I can put other auxiliary / follow up documents?
On Fri, Jan 1, 2016 at 5:23 PM, Jeff Reback <jeffreback@gmail.com> wrote:
I changed the doc so that the core dev people can edit. I *think* that everyone should be able to view/comment though.
On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Jeff -- can you require log-in for editing on this document?
https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnU...
There are a number of anonymous edits.
On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

I cobbled together an ugly start of a c++->cython->pandas toolchain here:

https://github.com/wesm/pandas/tree/libpandas-native-core

I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a bit messy at the moment, but it should be sufficient to run some real experiments with a little more work. I reckon it's like a 6-month project to tear out the insides of Series and DataFrame and replace them with a new "native core", but we should be able to get enough info to see whether it's a viable plan within a month or so.

The end goal is to create "private" extension types in Cython that can be the new base classes for Series and NDFrame; these will hold a reference to a C++ object that contains wrapped NumPy arrays and other metadata (like pandas-only dtypes).

It might be too hard to try to replace a single usage of the block manager as a first experiment, so I'll try to create a minimal "SeriesLite" that supports 3 dtypes:

1) float64 with nans
2) int64 with a bitmask for NAs
3) category type for one of these

Just want to get a feel for the extensibility and offer an NA singleton Python object (a la None) for getting and setting NAs across these 3 dtypes.

If we end up going down this route, is there any way to place a moratorium on invasive work on pandas internals (outside bug fixes)?

Pedantic aside: I'd rather avoid shipping third-party C/C++ libraries like googletest and friends in pandas if we can. Cloudera folks have been working on a portable C++ library toolchain for Impala and other projects at https://github.com/cloudera/native-toolchain, but it is only being tested on Linux and OS X. Most Google libraries should build out of the box on MSVC, but it'll be something to keep an eye on.

BTW thanks to the libdynd developers for pioneering the c++ lib <-> python-c++ lib <-> cython toolchain; being able to build Cython extensions directly from cmake is a godsend.

HNY all
Wes

On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid <izaid@continuum.io> wrote:

Yeah, that seems reasonable and I totally agree a Pandas wrapper layer would be necessary.

I'll keep an eye on this and I'd like to help if I can.

Irwin

On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

I'm not suggesting a rewrite of NumPy functionality but rather pandas functionality that is currently written in a mishmash of Cython and Python. Happy to experiment with changing the internal compute infrastructure and data representation to DyND after this first stage of cleanup is done. Even if we use DyND, a pretty extensive pandas wrapper layer will be necessary.

On Tuesday, December 29, 2015, Irwin Zaid <izaid@continuum.io> wrote:

Hi Wes (and others),

I've been following this conversation with interest. I do think it would be worth exploring DyND, rather than setting up yet another rewrite of NumPy functionality -- especially because DyND is already an optional dependency of Pandas.

For things like integer NA and new dtypes, DyND is there and ready to do this.

Irwin

On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

Can you link to the PR you're talking about?

I will see about spending a few hours setting up a libpandas.so as a C++ shared library where we can run some experiments and validate whether it can solve the integer-NA problem and be a place to put new data types (categorical and friends). I'm +1 on targeting

Would it also be worth making a wish list of APIs we might consider breaking in a pandas 1.0 release that also features this new "native core"? Might as well right some wrongs while we're doing some invasive work on the internals; some breakage might be unavoidable. We can always maintain a pandas legacy 0.x.x maintenance branch (providing a conda binary build) for legacy users where showstopper bugs can get fixed.

On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback <jeffreback@gmail.com> wrote:

Wes, your last is noted as well. I *think* we can actually do this now (well, there is a PR out there).

On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

The other huge thing this will enable is copy-on-write for various kinds of views, which should cut down on some of the defensive copying in the library and reduce memory usage.

On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

Basically the approach is

1) Base dtype type
2) Base array type with K >= 1 dimensions
3) Base scalar type
4) Base index type
5) "Wrapper" subclasses for all NumPy types fitting into categories #1, #2, #3, #4
6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
7) NDFrame, as cpcloud wrote, is just a list of these

Indexes and axis labels / column names can get layered on top.

After we do all this we can look at adding nested types (arrays, maps, structs) to better support JSON.
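The seven-point layering above can be sketched as a handful of Python base classes; all names here are illustrative only, not a proposed API:

```python
from abc import ABC

class PandasType(ABC):      # 1) base dtype type
    name = None

class PandasArray(ABC):     # 2) base array type with K >= 1 dimensions
    pass

class PandasScalar(ABC):    # 3) base scalar type
    pass

class PandasIndex(ABC):     # 4) base index type
    pass

# 5)/6) "Wrapper" subclasses for NumPy-backed types and for
# pandas-specific types like category or datetimeTZ.
class Int64Type(PandasType):
    name = "int64"

class CategoryType(PandasType):
    name = "category"

# 7) An NDFrame is then just an ordered collection of arrays;
# indexes and axis labels / column names get layered on top.
class NDFrame:
    def __init__(self, columns):
        self.columns = dict(columns)  # name -> array

assert issubclass(CategoryType, PandasType)
```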
>>>>> >> > >>>>> >> > - Wes >>>>> >> > >>>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud >>>>> >> > <cpcloud@gmail.com> >>>>> >> > wrote: >>>>> >> >> Maybe this is saying the same thing as Wes, but how far >>>>> >> >> would >>>>> >> >> something >>>>> >> >> like >>>>> >> >> this get us? >>>>> >> >> >>>>> >> >> // warning: things are probably not this simple >>>>> >> >> >>>>> >> >> struct data_array_t { >>>>> >> >> void *primitive; // scalar data >>>>> >> >> data_array_t *nested; // nested data >>>>> >> >> boost::dynamic_bitset isnull; // might have to create >>>>> >> >> our >>>>> >> >> own >>>>> >> >> to >>>>> >> >> avoid >>>>> >> >> boost >>>>> >> >> schema_t schema; // not sure exactly what this looks >>>>> >> >> like >>>>> >> >> }; >>>>> >> >> >>>>> >> >> typedef std::map<string, data_array_t> data_frame_t; // >>>>> >> >> probably >>>>> >> >> not >>>>> >> >> this >>>>> >> >> simple >>>>> >> >> >>>>> >> >> To answer Jeff’s use-case question: I think that the use >>>>> >> >> cases >>>>> >> >> are >>>>> >> >> 1) >>>>> >> >> freedom from numpy (mostly) 2) no more block manager which >>>>> >> >> frees >>>>> >> >> us >>>>> >> >> from the >>>>> >> >> limitations of the block memory layout. In particular, the >>>>> >> >> ability >>>>> >> >> to >>>>> >> >> take >>>>> >> >> advantage of memory mapped IO would be a big win IMO. >>>>> >> >> >>>>> >> >> >>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney >>>>> >> >> <wesmckinn@gmail.com> >>>>> >> >> wrote: >>>>> >> >>> >>>>> >> >>> I will write a more detailed response to some of these >>>>> >> >>> things >>>>> >> >>> after >>>>> >> >>> the new year, but, in particular, re: missing values, can >>>>> >> >>> you >>>>> >> >>> or >>>>> >> >>> someone tell me why creating an object that contains a >>>>> >> >>> NumPy >>>>> >> >>> array and >>>>> >> >>> a bitmap is not sufficient? If we we can add a >>>>> >> >>> lightweight >>>>> >> >>> C/C++ >>>>> >> >>> class >>>>> >> >>> layer between NumPy function calls (e.g. 
arithmetic) and >>>>> >> >>> pandas >>>>> >> >>> function calls, then I see no reason why we cannot have >>>>> >> >>> >>>>> >> >>> Int32Array->add >>>>> >> >>> >>>>> >> >>> and >>>>> >> >>> >>>>> >> >>> Float32Array->add >>>>> >> >>> >>>>> >> >>> do the right thing (the former would be responsible for >>>>> >> >>> bitmasking to >>>>> >> >>> propagate NA values; the latter would defer to NumPy). If >>>>> >> >>> we >>>>> >> >>> can >>>>> >> >>> put >>>>> >> >>> all the internals of pandas objects inside a black box, >>>>> >> >>> we >>>>> >> >>> can >>>>> >> >>> add >>>>> >> >>> layers of virtual function indirection without a >>>>> >> >>> performance >>>>> >> >>> penalty >>>>> >> >>> (e.g. adding more interpreter overhead with more >>>>> >> >>> abstraction >>>>> >> >>> layers >>>>> >> >>> does add up to a perf penalty). >>>>> >> >>> >>>>> >> >>> I don't think this is too scary -- I would be willing to >>>>> >> >>> create a >>>>> >> >>> small POC C++ library to prototype something like what >>>>> >> >>> I'm >>>>> >> >>> talking >>>>> >> >>> about. >>>>> >> >>> >>>>> >> >>> Since pandas has limited points of contact with NumPy I >>>>> >> >>> don't >>>>> >> >>> think >>>>> >> >>> this would end up being too onerous. >>>>> >> >>> >>>>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I >>>>> >> >>> think it >>>>> >> >>> is a >>>>> >> >>> useful tool if you pick a sane 20% subset of the C++11 >>>>> >> >>> spec >>>>> >> >>> and >>>>> >> >>> follow >>>>> >> >>> Google C++ style it's not very inaccessible to >>>>> >> >>> intermediate >>>>> >> >>> developers. More or less "C plus OOP and easier object >>>>> >> >>> lifetime >>>>> >> >>> management (shared/unique_ptr, etc.)". As soon as you add >>>>> >> >>> a >>>>> >> >>> lot >>>>> >> >>> of >>>>> >> >>> template metaprogramming C++ library development quickly >>>>> >> >>> becomes >>>>> >> >>> inaccessible except to the C++-Jedi. 
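The Int32Array->add / Float32Array->add split Wes describes could be sketched as follows (Python used purely for illustration; the class names and the validity-bitmap layout here are hypothetical assumptions, not an existing pandas API):

```python
import numpy as np

class Int32Array:
    """Hypothetical integer array: NumPy data plus a validity bitmap.

    True in `valid` means the value is present; False means NA.
    """
    def __init__(self, data, valid=None):
        self.data = np.asarray(data, dtype=np.int32)
        self.valid = (np.ones(self.data.shape, dtype=bool) if valid is None
                      else np.asarray(valid, dtype=bool))

    def add(self, other):
        # The integer type is responsible for bitmasking: the result is
        # valid only where both inputs were valid, so NA propagates.
        return Int32Array(self.data + other.data, self.valid & other.valid)

class Float32Array:
    """Hypothetical float array: NaN already encodes NA, so defer to NumPy."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float32)

    def add(self, other):
        return Float32Array(self.data + other.data)

a = Int32Array([1, 2, 3], valid=[True, False, True])
b = Int32Array([10, 20, 30])
c = a.add(b)
print(c.data.tolist(), c.valid.tolist())  # [11, 22, 33] [True, False, True]
```

The point is that NA propagation lives entirely behind the array type's method, so callers never touch the bitmask directly.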
Maybe let's start a Google document on "pandas roadmap" where we can break down the 1-2 year goals and some of these infrastructure issues and have our discussion there? (obviously publish this someplace once we're done)

- Wes

On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback <jeffreback@gmail.com> wrote:

> Here are some of my thoughts about pandas Roadmap / status and some responses to Wes's thoughts.
>
> [...]
>
> We *already* defer much of the computation to libraries like numexpr & bottleneck (where appropriate). Adding numba / dask to the list would be helpful.
>
> I think that almost all performance issues are the result of:
>
> a) gross misuse of the pandas API. How much code have you seen that does df.apply(lambda x: x.sum())
> b) routines which operate column-by-column rather than block-by-block and are in python space (e.g. we have an issue right now about .quantile)
>
> So I am glossing over a big goal of having a c++ library that represents the pandas internals. This would by definition have a c-API, so that you *could* use pandas-like semantics in c/c++ and just have it work (and then pandas would be a thin wrapper around this library).
>
> I am not averse to this, but I think it would be quite a big effort, and not a huge perf boost IMHO. Further, there are a number of API issues w.r.t. indexing which need to be clarified / worked out (e.g. should we simply deprecate []) that are much easier to test / figure out in python space.
>
> I also think that we have quite a large number of contributors. Moving to c++ might make the internals a bit more impenetrable than the current internals (though this would allow c++ people to contribute, so that might balance out).
>
> We have a limited core of devs who right now are familiar with things. If someone happened to have a starting base for a c++ library, then I might change opinions here.
>
> my 4c.
>
> Jeff

On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

Deep thoughts during the holidays.

I might be out of line here, but the interpreter-heaviness of the inside of pandas objects is likely to be a long-term liability and source of performance problems and technical debt.

Has anyone put any thought into planning and beginning to execute on a rewrite that moves as much as possible of the internals into native / compiled code?
I'm talking about:

- pandas/core/internals
- indexing and assignment
- much of pandas/core/common
- categorical and custom dtypes
- all indexing mechanisms

I'm concerned we've already exposed too much internals to users, so this might lead to a lot of API breakage, but it might be for the Greater Good. As a first step, beginning a partial migration of internals into some C++ classes that encapsulate the insides of DataFrame objects and implement indexing and block-level manipulations would be a good place to start. I think you could do this without too much disruption.

As part of this internal retooling we might give consideration to alternative data structures for representing data internal to pandas objects. Now in 2015/2016, continuing to be hamstrung by NumPy's limitations feels somewhat anachronistic. User code is riddled with workarounds for data type fidelity issues and the like. Like, really, why not add a bitndarray (similar to ilanschnell/bitarray) for storing nullness for problematic types and hide this from the user? =)

Since we are now a NumFOCUS-sponsored project, I feel like we might consider establishing some formal governance over pandas and publishing roadmap documents describing plans for the project and meeting notes from committers. There's no real "committer culture" for NumFOCUS projects like there is with the Apache Software Foundation, but we might try leading by example!

Also, I believe pandas as a project has reached a level of importance where we ought to consider planning and execution on larger-scale undertakings such as this for safeguarding the future.

As for myself, well, I have my hands full in Big Data-land. I wish I could be helping more with pandas, but there are quite a few fundamental issues (like data interoperability, nested data handling, and file format support — e.g. Parquet, see http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/) preventing Python from being more useful in industry analytics applications.

Aside: one of the bigger mistakes I made with pandas's API design was making it acceptable to call class constructors — like pandas.DataFrame — directly (versus factory functions). Sorry about that! If we could convince everyone to start writing pandas.data_frame or dataframe instead of using the class reference it would help a lot with code cleanup. It's hard to plan for these things — NumPy interoperability seemed a lot more important in 2008 than it does now, so I forgive myself.

cheers and best wishes for 2016,
Wes

_______________________________________________
Pandas-dev mailing list
Pandas-dev@python.org
https://mail.python.org/mailman/listinfo/pandas-dev
I'll just apologize right up front! hahah. No, I think I have been pushing on these extras in pandas to help move it forward.

I have commented a bit on Stephan's issue here <https://github.com/pydata/pandas/issues/8350> about why I didn't push for these in numpy. numpy is fairly slow moving (though it moves faster lately; I suspect the pace when Wes was developing pandas was not much faster). So pandas was essentially 'fixing' lots of bug / compat issues in numpy.

To the extent that we can keep the current user-facing API the same (high likelihood I think), I am willing to accept *some* breakage with the pandas->duck-like array container API in order to provide swappable containers. For example, I recall that in doing datetime w/tz, we wanted Series.values to return a numpy array (which it DOES!) but it is actually lossy (it loses the tz). Same thing with the Categorical example Wes gave. I don't think these requirements should hold pandas back!

People are increasingly using pandas as the API for their work. That makes it very important that we can handle lots of input properly, w/o the handcuffs of numpy.

All this said, I'll reiterate Wes's (and others') points: back-compat is extremely important. (I in fact try to bend over backwards to provide this; sometimes it's too much of course!) E.g. take the resample changes to API: I was originally going to just do a hard break, but this turns off people when they have to update their code or else.

my 4c (incrementing!)

Jeff

On Wed, Jan 6, 2016 at 2:37 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
hey Stephan,
Thanks for all the thoughts. Let me make a few off-the-cuff comments.
On Wed, Jan 6, 2016 at 10:11 AM, Stephan Hoyer <shoyer@gmail.com> wrote:
I was asked about this off list, so I'll belatedly share my thoughts.
First of all, I am really excited by Wes's renewed engagement in the project and his interest in rewriting pandas internals. This is quite an ambitious plan and nobody is better positioned to tackle it than Wes.
I have mixed feelings about the details of the rewrite itself.
+1 on the simpler internal data model. The block manager is confusing and leads to hard to predict performance issues related to copying data. If we can do all column additions/removals/re-orderings without a copy it will be a clear win.
+0 on moving internals to C++. I do like the performance benefits, but it seems like a lot of work, and it may make pandas less friendly to new contributors.
It really goes beyond performance benefits. If you go back to my 2013 talk http://www.slideshare.net/wesm/practical-medium-data-analytics-with-python there's a long list of architectural problems that now in 2016 haven't found solutions. The only way (that I can fully reason through -- I am happy to look at alternate proposals) to move the internals of pandas closer to the metal is to give Series and DataFrame a C/C++ API -- this is the "libpandas native core" as I've been describing.
I should point out that the main thing that's changed since that preso is "synthetic" data types like Categorical. But seeing what it took for Jeff et al. to build that is a prime motivation for this internals refactoring plan.
-0 on writing a brand new dtype system just for pandas -- this stuff
belongs in NumPy (or another array library like DyND), and I am skeptical that pandas can do a complete enough job to be useful without replicating all that functionality.
I'm curious what "a brand new dtype system" means to you. pandas already has its own data type system, but it's a potpourri of inconsistencies and rough edges with self-evident problems for both users and developers. Some indicators:
- Some pandas types use NaN for missing data, others None (or both), others nothing at all. We lose data (integers) or bloat memory (booleans) by upcasting to float-NaN or object-None.
- Internal functions full of is_XXX_dtype functions: pandas.core.common, pandas.core.algorithms, etc.
- Series.values on synthetic dtypes like Categorical
- We use arrays of Python objects for string data
The biggest cause IMHO is that pandas is too tightly coupled to NumPy, but it's coupled in a way that makes development and extensibility difficult. We've already allowed NumPy-specific details to taint the pandas user API in many unpleasant ways. This isn't to say "NumPy is bad" but rather "pandas tries to layer domain-specific functionality [that NumPy was not designed for] on top".
Some things I'm advocating with the internals refactor:
1) First class "pandas type" objects. This is not the same as a NumPy dtype which has some pretty loaded implications -- in particular, NumPy dtypes are implicitly coupled to an array computing framework (see the function table that is attached to the PyArray_Descr object)
2) Pandas array container types that map user-land API calls to implementation-land API calls (in NumPy, DyND, or pandas-native code like pandas.core.algorithms etc.). This will make it much easier to leverage innovations in NumPy and DyND without those implementation details spilling over into the pandas user API
3) Adding a single pandas.NA singleton to have one library-wide notion of a scalar null value (obviously, we can automatically map NaN and None to NA for backwards compatibility).
4) Layering a bitmask internally on NumPy arrays (especially integer and boolean) to add null-ness to types that need it. Note that this does not prevent us from switching to DyND arrays with option dtype in the future. If the details of how we are implementing NULL are visible to the user, we have failed.
5) Removing the block manager in favor of simpler pandas Array (1D) and Table (2D -- vector of Array) data structures
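Point 5 might look something like the following minimal sketch (all names hypothetical; this ignores axis labels, indexes, and anything beyond plain NumPy dtypes). The property Stephan asks for, column additions/removals/re-orderings without a copy, falls out naturally because the Table only shuffles references:

```python
import numpy as np

class Table:
    """Hypothetical 2-D structure: an ordered mapping of names to 1-D arrays.

    Unlike the block manager, adding, removing, or reordering columns just
    moves references around -- no consolidation, no data copies.
    """
    def __init__(self):
        self._columns = {}  # dict preserves insertion order (Python 3.7+)

    def __setitem__(self, name, values):
        self._columns[name] = np.asarray(values)  # stores a reference only

    def __getitem__(self, name):
        return self._columns[name]

    def __delitem__(self, name):
        del self._columns[name]

    def reorder(self, names):
        # Rebuild the mapping in the new order; array data is untouched.
        self._columns = {n: self._columns[n] for n in names}

    @property
    def columns(self):
        return list(self._columns)

t = Table()
arr = np.array([1, 2, 3])
t["a"] = arr
t["b"] = [4.0, 5.0, 6.0]
t.reorder(["b", "a"])
print(t.columns)      # ['b', 'a']
print(t["a"] is arr)  # True -- the caller's array was never copied
```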
I believe you can do all this without harming interoperability with the ecosystem of projects that people currently use in conjunction with pandas.
More broadly, I am concerned that this rewrite may improve the tabular computation ecosystem at the cost of inter-operability with the array-based ecosystem (numpy, scipy, sklearn, xarray, etc.). The later has been one of the strengths of pandas and it would be a shame to see that go away.
I have no intention of letting this happen. What I am asking from you (and others reading) is to help define what constitutes interoperability. What guarantees do we make the user?
For example, we should have very strict guidelines for the output of:
np.asarray(pandas_obj)
For example
In [3]: s = pd.Series([1,2,3]*10).astype('category')
In [4]: np.asarray(s) Out[4]: array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3])
I see no reason why this should necessarily behave any differently. The problem will come in when there is pandas data that is not precisely representable in a NumPy array. Example:
In [5]: s = pd.Series([1,2,3, 4])
In [6]: s.dtype Out[6]: dtype('int64')
In [7]: s2 = s.reindex(np.arange(10))
In [8]: s2.dtype Out[8]: dtype('float64')
In [9]: np.asarray(s2) Out[9]: array([ 1., 2., 3., 4., nan, nan, nan, nan, nan, nan])
With the "new internals", s2 will still be int64 type, but we may decide that np.asarray(s2) should raise an exception rather than implicitly make a decision about how to perform a "lossy" conversion to a NumPy array. If you are using DyND with pandas, then the equivalent function would be able to implicitly convert without data loss.
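One way such a strict conversion policy could look (a sketch only; MaskedInt64Array and the raising behavior are illustrative assumptions, not a settled design) is to put the lossless check inside the array's NumPy conversion hook:

```python
import numpy as np

class MaskedInt64Array:
    """Hypothetical int64 array with a validity bitmap instead of float-NaN."""
    def __init__(self, data, valid):
        self.data = np.asarray(data, dtype=np.int64)
        self.valid = np.asarray(valid, dtype=bool)

    def __array__(self, dtype=None, copy=None):
        # Strict policy sketch: np.asarray() succeeds only when conversion
        # is lossless; with NAs present, raise instead of silently
        # upcasting to float64 + NaN.
        if not self.valid.all():
            raise ValueError("cannot losslessly convert int64 data "
                             "containing NA values to a NumPy array")
        return self.data

print(np.asarray(MaskedInt64Array([1, 2], valid=[True, True])))  # [1 2]
s2 = MaskedInt64Array([1, 2, 3, 4, 0, 0], valid=[1, 1, 1, 1, 0, 0])
try:
    np.asarray(s2)
except ValueError as exc:
    print("refused:", exc)
```

A DyND-backed (or otherwise NA-aware) consumer could bypass this hook and receive the data plus bitmap without loss.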
We're already starting to struggle with inter-operability with the new pandas dtypes and a further rewrite would make this even harder. For example, see categoricals and scikit-learn in Tom's recent post [1], or the fact that .values no longer always returns a numpy array. This has also been a challenge for xarray, which can't handle these new dtypes because we lack a suitable array backend for them.
I'm definitely motivated in this initiative by these challenges. The idea here is that with the new internals, Series.values will always return the same type of object, and there will be one consistent code path for getting a NumPy array out. For example, rather than:
    if isinstance(s.values, Categorical):
        ...  # pandas path
    else:
        ...  # NumPy path
We could have (just an idea)
s.values.to_numpy()
Or simply
np.asarray(s.values)
Personally, I would much rather leverage a full featured library like an improved NumPy or DyND for new dtypes, because that could also be used by the array-based ecosystem. At the very least, it would be good to think about zero-copy inter-operability with array-based tools.
I'm all for zero-copy interoperability when possible, but my gut feeling is that exposing the data type system of an array library (the choice of which is an implementation detail) to pandas users is an inherent leaky abstraction that will continue to cause problems if we plan to keep innovating inside pandas. By better hiding NumPy details and types from the user we will make it much easier to swap out new low level array data structures and compute components (e.g. DyND), or add custom data structures or out-of-core tools (memory maps, bcolz, etc.)
I'm additionally offering to do nearly all of this replumbing of pandas internals myself, and completely in my free time. What I will expect in return from you all is to help enumerate our contracts with the pandas user (i.e. interoperability) and to hold me accountable to not break them. I know I haven't been committing code on pandas since mid-2013 (after a 5 year marathon), but these architectural problems have been on my mind almost constantly since then, I just haven't had the bandwidth to start tackling them.
cheers, Wes
On the other hand, I wonder if maybe it would be better to write a native in-memory backend for Ibis instead of rewriting pandas. Ibis does seem to have improved/simplified API which resolves many of pandas's warts. That said, it's a pretty big change from the "DataFrame as matrix" model, and pandas won't be going away anytime soon. I do like that it would force users to be more explicit about converting between tables and arrays, which might also make distinctions between the tabular and array oriented ecosystems easier to swallow.
Just my two cents, from someone who has lots of opinions but who will
stay on the sidelines for most of this work.
Cheers, Stephan
[1] http://tomaugspurger.github.io/categorical-pipelines.html
On Fri, Jan 1, 2016 at 6:06 PM, Jeff Reback <jeffreback@gmail.com> wrote:
ok I moved the document to the Pandas folder, where the same group
should
be able to edit/upload/etc. lmk if any issues
On Fri, Jan 1, 2016 at 8:48 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Thanks Jeff. Can you create and share a shared Drive folder containing this where I can put other auxiliary / follow up documents?
On Fri, Jan 1, 2016 at 5:23 PM, Jeff Reback <jeffreback@gmail.com>
wrote:
I changed the doc so that the core dev people can edit. I *think*
everyone should be able to view/comment though.
On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney <wesmckinn@gmail.com> wrote: > > Jeff -- can you require log-in for editing on this document? > > > https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnU... > > There are a number of anonymous edits. > > On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney <wesmckinn@gmail.com
> wrote: > > I cobbled together an ugly start of a c++->cython->pandas toolchain > > here > > > > https://github.com/wesm/pandas/tree/libpandas-native-core > > > > I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's > > a > > bit messy at the moment but it should be sufficient to run some real > > experiments with a little more work. I reckon it's like a 6 month > > project to tear out the insides of Series and DataFrame and replace > > it > > with a new "native core", but we should be able to get enough info > > to > > see whether it's a viable plan within a month or so. > > > > The end goal is to create "private" extension types in Cython
> > can > > be the new base classes for Series and NDFrame; these will hold a > > reference to a C++ object that contains wrappered NumPy arrays and > > other metadata (like pandas-only dtypes). > > > > It might be too hard to try to replace a single usage of block > > manager > > as a first experiment, so I'll try to create a minimal "SeriesLite" > > that supports 3 dtypes > > > > 1) float64 with nans > > 2) int64 with a bitmask for NAs > > 3) category type for one of these > > > > Just want to get a feel for the extensibility and offer an NA > > singleton Python object (a la None) for getting and setting NAs > > across > > these 3 dtypes. > > > > If we end up going down this route, any way to place a moratorium on > > invasive work on pandas internals (outside bug fixes)? > > > > Pedantic aside: I'd rather avoid shipping thirdparty C/C++
> > like googletest and friends in pandas if we can. Cloudera folks have > > been working on a portable C++ library toolchain for Impala and > > other > > projects at https://github.com/cloudera/native-toolchain, but it is > > only being tested on Linux and OS X. Most google libraries should > > build out of the box on MSVC but it'll be something to keep an eye > > on. > > > > BTW thanks to the libdynd developers for pioneering the c++ lib <-> > > python-c++ lib <-> cython toolchain; being able to build Cython > > extensions directly from cmake is a godsend > > > > HNY all > > Wes > > > > On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid <izaid@continuum.io> > > wrote: > >> Yeah, that seems reasonable and I totally agree a Pandas wrapper > >> layer > >> would > >> be necessary. > >> > >> I'll keep an eye on this and I'd like to help if I can. > >> > >> Irwin > >> > >> > >> On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney < wesmckinn@gmail.com> > >> wrote: > >>> > >>> I'm not suggesting a rewrite of NumPy functionality but rather > >>> pandas > >>> functionality that is currently written in a mishmash of Cython > >>> and > >>> Python. > >>> Happy to experiment with changing the internal compute > >>> infrastructure > >>> and > >>> data representation to DyND after this first stage of cleanup is > >>> done. > >>> Even > >>> if we use DyND a pretty extensive pandas wrapper layer will be > >>> necessary. > >>> > >>> > >>> On Tuesday, December 29, 2015, Irwin Zaid <izaid@continuum.io> > >>> wrote: > >>>> > >>>> Hi Wes (and others), > >>>> > >>>> I've been following this conversation with interest. I do
> >>>> it > >>>> would > >>>> be worth exploring DyND, rather than setting up yet another > >>>> rewrite > >>>> of > >>>> NumPy-functionality. Especially because DyND is already an > >>>> optional > >>>> dependency of Pandas. > >>>> > >>>> For things like Integer NA and new dtypes, DyND is there and > >>>> ready to > >>>> do > >>>> this. > >>>> > >>>> Irwin > >>>> > >>>> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney > >>>> <wesmckinn@gmail.com> > >>>> wrote: > >>>>> > >>>>> Can you link to the PR you're talking about? > >>>>> > >>>>> I will see about spending a few hours setting up a
> >>>>> libpandas.so
> >>>>> as a C++ shared library where we can run some experiments and validate
> >>>>> whether it can solve the integer-NA problem and be a place to put new
> >>>>> data types (categorical and friends). I'm +1 on targeting
> >>>>>
> >>>>> Would it also be worth making a wish list of APIs we might consider
> >>>>> breaking in a pandas 1.0 release that also features this new "native
> >>>>> core"? Might as well right some wrongs while we're doing some invasive
> >>>>> work on the internals; some breakage might be unavoidable. We can always
> >>>>> maintain a pandas legacy 0.x.x maintenance branch (providing a conda
> >>>>> binary build) for legacy users where showstopper bugs can get fixed.
> >>>>>
> >>>>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback <jeffreback@gmail.com> wrote:
> >>>>> > Wes your last is noted as well. I *think* we can actually do this now
> >>>>> > (well there is a PR out there).
> >>>>> >
> >>>>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
> >>>>> >>
> >>>>> >> The other huge thing this will enable us to do is copy-on-write for
> >>>>> >> various kinds of views, which should cut down on some of
> >>>>> >> the
> >>>>> >> defensive > >>>>> >> copying in the library and reduce memory usage. > >>>>> >> > >>>>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney > >>>>> >> <wesmckinn@gmail.com> > >>>>> >> wrote: > >>>>> >> > Basically the approach is > >>>>> >> > > >>>>> >> > 1) Base dtype type > >>>>> >> > 2) Base array type with K >= 1 dimensions > >>>>> >> > 3) Base scalar type > >>>>> >> > 4) Base index type > >>>>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into > >>>>> >> > categories > >>>>> >> > #1, #2, #3, #4 > >>>>> >> > 6) Subclasses for pandas-specific types like category, > >>>>> >> > datetimeTZ, > >>>>> >> > etc. > >>>>> >> > 7) NDFrame as cpcloud wrote is just a list of these > >>>>> >> > > >>>>> >> > Indexes and axis labels / column names can get layered on > >>>>> >> > top. > >>>>> >> > > >>>>> >> > After we do all this we can look at adding nested types > >>>>> >> > (arrays, > >>>>> >> > maps, > >>>>> >> > structs) to better support JSON. > >>>>> >> > > >>>>> >> > - Wes > >>>>> >> > > >>>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud > >>>>> >> > <cpcloud@gmail.com> > >>>>> >> > wrote: > >>>>> >> >> Maybe this is saying the same thing as Wes, but how far > >>>>> >> >> would > >>>>> >> >> something > >>>>> >> >> like > >>>>> >> >> this get us? 
> >>>>> >> >>
> >>>>> >> >> // warning: things are probably not this simple
> >>>>> >> >>
> >>>>> >> >> struct data_array_t {
> >>>>> >> >>     void *primitive;              // scalar data
> >>>>> >> >>     data_array_t *nested;         // nested data
> >>>>> >> >>     boost::dynamic_bitset isnull; // might have to create our own to avoid boost
> >>>>> >> >>     schema_t schema;              // not sure exactly what this looks like
> >>>>> >> >> };
> >>>>> >> >>
> >>>>> >> >> typedef std::map<string, data_array_t> data_frame_t; // probably not this simple
> >>>>> >> >>
> >>>>> >> >> To answer Jeff’s use-case question: I think that the use cases are
> >>>>> >> >> 1) freedom from numpy (mostly) 2) no more block manager which frees
> >>>>> >> >> us from the limitations of the block memory layout. In particular,
> >>>>> >> >> the
> >>>>> >> >> ability to take advantage of memory mapped IO would be a big win IMO.
> >>>>> >> >>
> >>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney <wesmckinn@gmail.com> wrote:
> >>>>> >> >>>
> >>>>> >> >>> I will write a more detailed response to some of these things after
> >>>>> >> >>> the new year, but, in particular, re: missing values, can you or
> >>>>> >> >>> someone tell me why creating an object that contains a NumPy array
> >>>>> >> >>> and a bitmap is not sufficient? If we can add a lightweight C/C++
> >>>>> >> >>> class layer between NumPy function calls (e.g. arithmetic) and
> >>>>> >> >>> pandas function calls, then I see no reason why we cannot have
> >>>>> >> >>>
> >>>>> >> >>> Int32Array->add
> >>>>> >> >>>
> >>>>> >> >>> and
> >>>>> >> >>>
> >>>>> >> >>> Float32Array->add
> >>>>> >> >>>
> >>>>> >> >>> do the right thing (the former would be responsible for bitmasking
> >>>>> >> >>> to propagate NA values; the latter would defer to NumPy). If we can
> >>>>> >> >>> put all the internals of pandas objects inside a black box, we can
> >>>>> >> >>> add layers of virtual function indirection without a performance
> >>>>> >> >>> penalty (whereas adding more interpreter overhead with more
> >>>>> >> >>> abstraction layers does add up to a perf penalty).
> >>>>> >> >>>
> >>>>> >> >>> I don't think this is too scary -- I would be willing to create a
> >>>>> >> >>> small POC C++ library to prototype something like what I'm talking
> >>>>> >> >>> about.
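[Editorial sketch] The Int32Array->add / Float32Array->add dispatch described above can be made concrete in a few lines, here in Python for brevity even though the proposal is a C++ layer. All class and method names are illustrative only, not an actual pandas or libpandas API.

```python
import numpy as np

class Int32Array:
    """Integer array with an explicit validity bitmask (True = valid)."""
    def __init__(self, values, valid):
        self.values = np.asarray(values, dtype=np.int32)
        self.valid = np.asarray(valid, dtype=bool)

    def add(self, other):
        # NA propagation: a result slot is valid only if both inputs are valid
        return Int32Array(self.values + other.values, self.valid & other.valid)

class Float64Array:
    """Float array: no bitmask needed, NaN already encodes NA."""
    def __init__(self, values):
        self.values = np.asarray(values, dtype=np.float64)

    def add(self, other):
        # defer entirely to NumPy; NaN propagates on its own
        return Float64Array(self.values + other.values)
```

Both types expose the same `add` entry point, so a thin wrapper layer can call it without knowing which representation sits underneath, which is the "black box" point above.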
> >>>>> >> >>> > >>>>> >> >>> Since pandas has limited points of contact with NumPy I > >>>>> >> >>> don't > >>>>> >> >>> think > >>>>> >> >>> this would end up being too onerous. > >>>>> >> >>> > >>>>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I > >>>>> >> >>> think it > >>>>> >> >>> is a > >>>>> >> >>> useful tool if you pick a sane 20% subset of the C++11 > >>>>> >> >>> spec > >>>>> >> >>> and > >>>>> >> >>> follow > >>>>> >> >>> Google C++ style it's not very inaccessible to > >>>>> >> >>> intermediate > >>>>> >> >>> developers. More or less "C plus OOP and easier object > >>>>> >> >>> lifetime > >>>>> >> >>> management (shared/unique_ptr, etc.)". As soon as you add > >>>>> >> >>> a > >>>>> >> >>> lot > >>>>> >> >>> of > >>>>> >> >>> template metaprogramming C++ library development quickly > >>>>> >> >>> becomes > >>>>> >> >>> inaccessible except to the C++-Jedi. > >>>>> >> >>> > >>>>> >> >>> Maybe let's start a Google document on "pandas roadmap" > >>>>> >> >>> where > >>>>> >> >>> we > >>>>> >> >>> can > >>>>> >> >>> break down the 1-2 year goals and some of these > >>>>> >> >>> infrastructure > >>>>> >> >>> issues > >>>>> >> >>> and have our discussion there? (obviously publish this > >>>>> >> >>> someplace > >>>>> >> >>> once > >>>>> >> >>> we're done) > >>>>> >> >>> > >>>>> >> >>> - Wes > >>>>> >> >>> > >>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback > >>>>> >> >>> <jeffreback@gmail.com> > >>>>> >> >>> wrote: > >>>>> >> >>> > Here are some of my thoughts about pandas Roadmap / > >>>>> >> >>> > status > >>>>> >> >>> > and > >>>>> >> >>> > some > >>>>> >> >>> > responses to Wes's thoughts. 
> >>>>> >> >>> > In the last few (and upcoming) major releases we have made the
> >>>>> >> >>> > following changes:
> >>>>> >> >>> >
> >>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making these first-class objects
> >>>>> >> >>> > - code refactoring to remove subclassing of ndarrays for Series & Index
> >>>>> >> >>> > - carving out / deprecating non-core parts of pandas
> >>>>> >> >>> >   - datareader
> >>>>> >> >>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
> >>>>> >> >>> >   - rpy, rplot, irow et al.
> >>>>> >> >>> >   - google-analytics
> >>>>> >> >>> > - API changes to make things more consistent
> >>>>> >> >>> >   - pd.rolling/expanding * -> .rolling/expanding (this is in master now)
> >>>>> >> >>> >   - .resample becoming fully deferred, like groupby
> >>>>> >> >>> > - multi-index slicing along any level (obviates need for .xs) and allows assignment
> >>>>> >> >>> > - .loc/.iloc - for the most part obviates use of .ix
> >>>>> >> >>> > - .pipe & .assign
> >>>>> >> >>> > - plotting accessors
> >>>>> >> >>> > - fixing of the sorting API
> >>>>> >> >>> > - many performance enhancements, both micro & macro (e.g. release GIL)
> >>>>> >> >>> >
> >>>>> >> >>> > Some on-deck enhancements are (meaning these are basically ready to go in):
> >>>>> >> >>> >
> >>>>> >> >>> > - IntervalIndex (and eventually make PeriodIndex just a sub-class of this)
> >>>>> >> >>> > - RangeIndex
> >>>>> >> >>> >
> >>>>> >> >>> > so lots of changes, though nothing really earth-shaking, just more
> >>>>> >> >>> > convenience, reducing magicness somewhat and providing flexibility.
> >>>>> >> >>> >
> >>>>> >> >>> > Of course we are getting increasing issues, mostly bug reports (and
> >>>>> >> >>> > lots of dupes), some edge-case enhancements which can add to the
> >>>>> >> >>> > existing APIs and, of course, requests to expand the (already) large
> >>>>> >> >>> > code base to other use cases.
> >>>>> >> >>> > Balancing this are a good many pull-requests from many different
> >>>>> >> >>> > users, some even deep into the internals.
> >>>>> >> >>> >
> >>>>> >> >>> > Here are some things that I have talked about and could be
> >>>>> >> >>> > considered for the roadmap. Disclaimer: I do work for Continuum
> >>>>> >> >>> > but these views are of course my own; furthermore obviously I am a
> >>>>> >> >>> > bit more familiar with some of the 'sponsored' open-source
> >>>>> >> >>> > libraries, but always open to new things.
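[Editorial sketch] Of the on-deck items above, RangeIndex is easy to motivate concretely: an index described by start/stop/step needs constant memory and can answer positional and label lookups with arithmetic instead of a hash table. A toy version (not the actual pandas implementation; names are illustrative):

```python
class ToyRangeIndex:
    """Lazy integer index: stores only start/stop/step, never materializes."""
    def __init__(self, start, stop, step=1):
        self.start, self.stop, self.step = start, stop, step

    def __len__(self):
        # ceil((stop - start) / step) for a positive step, floored at zero
        return max(0, -(-(self.stop - self.start) // self.step))

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.start + i * self.step

    def get_loc(self, label):
        # arithmetic lookup instead of a hash-table probe
        offset = label - self.start
        if offset % self.step or not 0 <= offset // self.step < len(self):
            raise KeyError(label)
        return offset // self.step
```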
> >>>>> >> >>> > - integration / automatic deferral to numba for JIT (this would be thru .apply)
> >>>>> >> >>> > - automatic deferral to dask from groupby where appropriate / maybe
> >>>>> >> >>> >   a .to_parallel (to simply return a dask.DataFrame object)
> >>>>> >> >>> > - incorporation of quantities / units (as part of the dtype)
> >>>>> >> >>> > - use of DyND to allow missing values for int dtypes
> >>>>> >> >>> > - make Period a first-class dtype
> >>>>> >> >>> > - provide some copy-on-write semantics to alleviate the
> >>>>> >> >>> >   chained-indexing issues which occasionally come up with the
> >>>>> >> >>> >   misuse of the indexing API
> >>>>> >> >>> > - allow a 'policy' to automatically provide column blocks for
> >>>>> >> >>> >   dict-like input (e.g. each column would be a block); this would
> >>>>> >> >>> >   allow a pass-thru API where you could put in numpy arrays where
> >>>>> >> >>> >   you have views and have them preserved rather than copied
> >>>>> >> >>> >   automatically. Note that this would also allow what I call
> >>>>> >> >>> >   'split', where a passed-in multi-dim numpy array could be split
> >>>>> >> >>> >   up into individual blocks (which actually gives a nice perf
> >>>>> >> >>> >   boost after the splitting costs).
> >>>>> >> >>> >
> >>>>> >> >>> > In working towards some of these goals, I have come to the opinion
> >>>>> >> >>> > that it would make sense to have a neutral API protocol layer that
> >>>>> >> >>> > would allow us to swap out different engines as needed, for
> >>>>> >> >>> > particular dtypes, or *maybe* out-of-core type computations. E.g.
> >>>>> >> >>> > imagine that we replaced the in-memory block structure with a
> >>>>> >> >>> > bcolz / memmap type; in theory this should be 'easy' and just work.
> >>>>> >> >>> > I could also see us adopting *some* of the SFrame code to allow
> >>>>> >> >>> > easier interop with this API layer.
> >>>>> >> >>> >
> >>>>> >> >>> > In practice, I think a nice API layer would need to be created to
> >>>>> >> >>> > make this clean / nice.
> >>>>> >> >>> >
> >>>>> >> >>> > So this comes around to Wes's point about creating a c++ library
> >>>>> >> >>> > for the internals (and possibly even some of the indexing
> >>>>> >> >>> > routines). In an ideal world, of course this would be desirable.
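[Editorial sketch] The 'split' / pass-thru policy described above is easy to illustrate in NumPy: per-column blocks can remain views on caller-provided memory rather than consolidated copies. The policy and function names here are made up for illustration.

```python
import numpy as np

def split_policy(values):
    """Hypothetical 'split' policy: one single-column block per column, as views."""
    return [values[:, j] for j in range(values.shape[1])]

data = np.arange(6.0).reshape(3, 2)
blocks = split_policy(data)
# the blocks are views, so a write through the original array is visible in them
data[0, 0] = 99.0
```

This is the behavior a pass-thru API would preserve, in contrast to the consolidation-by-dtype that copies input arrays into shared blocks.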
> >>>>> >> >>> > Getting there is a bit non-trivial I think, and IMHO might not be
> >>>>> >> >>> > worth the effort. I don't really see big performance bottlenecks.
> >>>>> >> >>> > We *already* defer much of the computation to libraries like
> >>>>> >> >>> > numexpr & bottleneck (where appropriate). Adding numba / dask to
> >>>>> >> >>> > the list would be helpful.
> >>>>> >> >>> >
> >>>>> >> >>> > I think that almost all performance issues are the result of:
> >>>>> >> >>> >
> >>>>> >> >>> > a) gross misuse of the pandas API. How much code have you seen
> >>>>> >> >>> >    that does df.apply(lambda x: x.sum())
> >>>>> >> >>> > b) routines which operate column-by-column rather than
> >>>>> >> >>> >    block-by-block and are in python space (e.g. we have an issue
> >>>>> >> >>> >    right now about .quantile)
> >>>>> >> >>> >
> >>>>> >> >>> > So I am glossing over a big goal of having a c++ library that
> >>>>> >> >>> > represents the pandas internals. This would by definition have a
> >>>>> >> >>> > c-API so that you *could* use pandas-like semantics in c/c++ and
> >>>>> >> >>> > just have it work (and then pandas would be a thin wrapper around
> >>>>> >> >>> > this library).
> >>>>> >> >>> >
> >>>>> >> >>> > I am not averse to this, but I think it would be quite a big
> >>>>> >> >>> > effort, and not a huge perf boost IMHO. Further, there are a
> >>>>> >> >>> > number of API issues w.r.t. indexing which need to be clarified /
> >>>>> >> >>> > worked out (e.g. should we simply deprecate []) that are much
> >>>>> >> >>> > easier to test / figure out in python space.
> >>>>> >> >>> >
> >>>>> >> >>> > I also think that we have quite a large number of contributors.
> >>>>> >> >>> > Moving to c++ might make the internals a bit more impenetrable
> >>>>> >> >>> > than the current internals (though this would allow c++ people to
> >>>>> >> >>> > contribute, so that might balance out).
> >>>>> >> >>> >
> >>>>> >> >>> > We have a limited core of devs who right now are familiar with
> >>>>> >> >>> > things. If someone happened to have a starting base for a c++
> >>>>> >> >>> > library, then I might change opinions here.
> >>>>> >> >>> >
> >>>>> >> >>> > my 4c.
> >>>>> >> >>> >
> >>>>> >> >>> > Jeff
> >>>>> >> >>> >
> >>>>> >> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
> >>>>> >> >>> >>
> >>>>> >> >>> >> Deep thoughts during the holidays.
> >>>>> >> >>> >>
> >>>>> >> >>> >> I might be out of line here, but the interpreter-heaviness of
> >>>>> >> >>> >> the inside of pandas objects is likely to be a long-term
> >>>>> >> >>> >> liability and source of performance problems and technical debt.
> >>>>> >> >>> >>
> >>>>> >> >>> >> Has anyone put any thought into planning and beginning to
> >>>>> >> >>> >> execute on a rewrite that moves as much as possible of the
> >>>>> >> >>> >> internals into native / compiled code? I'm talking about:
> >>>>> >> >>> >>
> >>>>> >> >>> >> - pandas/core/internals
> >>>>> >> >>> >> - indexing and assignment
> >>>>> >> >>> >> - much of pandas/core/common
> >>>>> >> >>> >> - categorical and custom dtypes
> >>>>> >> >>> >> - all indexing mechanisms
> >>>>> >> >>> >>
> >>>>> >> >>> >> I'm concerned we've already exposed too much of the internals to
> >>>>> >> >>> >> users, so this might lead to a lot of API breakage, but it might
> >>>>> >> >>> >> be for the Greater Good. As a first step, beginning a partial
> >>>>> >> >>> >> migration of internals into some C++ classes that encapsulate the
> >>>>> >> >>> >> insides of DataFrame objects and implement indexing and
> >>>>> >> >>> >> block-level manipulations would be a good place to start. I think
> >>>>> >> >>> >> you could do this without too much disruption.
> >>>>> >> >>> >>
> >>>>> >> >>> >> As part of this internal retooling we might give consideration
> >>>>> >> >>> >> to alternative data structures for representing data internal to
> >>>>> >> >>> >> pandas objects. Now in 2015/2016, continuing to be hamstrung by
> >>>>> >> >>> >> NumPy's limitations feels somewhat anachronistic. User code is
> >>>>> >> >>> >> riddled with workarounds for data type fidelity issues and the
> >>>>> >> >>> >> like. Like, really, why not add a bitndarray (similar to
> >>>>> >> >>> >> ilanschnell/bitarray) for storing nullness for problematic types
> >>>>> >> >>> >> and hide this from
> >>>>> >> >>> >> the
> >>>>> >> >>> >> user? =)
> >>>>> >> >>> >>
> >>>>> >> >>> >> Since we are now a NumFOCUS-sponsored project, I feel like we
> >>>>> >> >>> >> might consider establishing some formal governance over pandas
> >>>>> >> >>> >> and publishing roadmap documents describing plans for the
> >>>>> >> >>> >> project and meeting notes from committers. There's no real
> >>>>> >> >>> >> "committer culture" for NumFOCUS projects like there is with the
> >>>>> >> >>> >> Apache Software Foundation, but we might try leading by example!
> >>>>> >> >>> >>
> >>>>> >> >>> >> Also, I believe pandas as a project has reached a level of
> >>>>> >> >>> >> importance where we ought to consider planning and execution on
> >>>>> >> >>> >> larger-scale undertakings such as this for safeguarding the
> >>>>> >> >>> >> future.
> >>>>> >> >>> >>
> >>>>> >> >>> >> As for myself, well, I have my hands full in Big Data-land. I
> >>>>> >> >>> >> wish I could be helping more with pandas, but there are quite a
> >>>>> >> >>> >> few fundamental issues (like data interoperability, nested data
> >>>>> >> >>> >> handling, and file format support — e.g. Parquet, see
> >>>>> >> >>> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/ )
> >>>>> >> >>> >> preventing Python from being more useful in industry analytics
> >>>>> >> >>> >> applications.
> >>>>> >> >>> >>
> >>>>> >> >>> >> Aside: one of the bigger mistakes I made with pandas's API
> >>>>> >> >>> >> design was making it acceptable to call class constructors —
> >>>>> >> >>> >> like pandas.DataFrame — directly (versus factory functions).
> >>>>> >> >>> >> Sorry about that! If we could convince everyone to start writing
> >>>>> >> >>> >> pandas.data_frame or dataframe instead of using the class
> >>>>> >> >>> >> reference it would help a lot with code cleanup. It's hard to
> >>>>> >> >>> >> plan for these
> >>>>> >> >>> >> things
> >>>>> >> >>> >> — NumPy interoperability seemed a lot more important in 2008
> >>>>> >> >>> >> than it does now, so I forgive myself.
> >>>>> >> >>> >>
> >>>>> >> >>> >> cheers and best wishes for 2016,
> >>>>> >> >>> >> Wes
_______________________________________________ Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
I also will add that there is an ideology that has existed in the scientific Python community since 2011 at least, which is this: pandas should not have existed; it should be part of NumPy instead. In my opinion, that misses the point of pandas, both then and now.

There's a large and mostly new class of Python users working on domain-specific industry analytics problems for whom pandas is the most important tool that they use on a daily basis. Their knowledge of NumPy is limited, beyond the aspects of the ndarray API that are the same in pandas. High-level APIs and accessibility for them is extremely important. But their skill sets and the problems they are solving are not, on the whole, the same ones you would have heard discussed at SciPy 2010.

Sometime in 2015, "Python for Data Analysis" sold its 100,000th copy. I have 5 foreign translations sitting on my shelf -- this represents a very large group of people that we have all collectively enabled by developing pandas -- for a lot of people, pandas is the main reason they use Python!

So the summary of all this is: pandas is much more important as a project now than it was 5 years ago. Our relationship with our library dependencies like NumPy should reflect that. Downstream pandas consumers should similarly eventually concern themselves more with pandas compatibility (rather than always assuming that NumPy arrays are the only intermediary). This is a philosophical shift, but one that will ultimately benefit the usability of the stack.

On Wed, Jan 6, 2016 at 11:45 AM, Jeff Reback <jeffreback@gmail.com> wrote:
I'll just apologize right up front! hahah.
No, I think I have been pushing on these extras in pandas to help move it forward. I have commented a bit on Stephan's issue here about why I didn't push for these in numpy. numpy is fairly slow-moving (though it moves faster lately; I suspect the pace when Wes was developing pandas was not much faster).
So pandas was essentially 'fixing' lots of bug / compat issues in numpy.
To the extent that we can keep the current user-facing API the same (high likelihood, I think), I am willing to accept *some* breakage with the pandas -> duck-like array container API in order to provide swappable containers.
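[Editorial sketch] Mechanically, "swappable containers" might look like a Series-level wrapper that relies only on a tiny duck-typed contract (here just `__len__` and `to_numpy`), so differently backed containers can be exchanged freely. Everything below is hypothetical, not a proposed pandas API.

```python
import numpy as np

class NumpyBacked:
    """Container backed directly by an ndarray."""
    def __init__(self, values):
        self._values = np.asarray(values)
    def __len__(self):
        return len(self._values)
    def to_numpy(self):
        return self._values

class ListBacked:
    """A different backing store satisfying the same minimal contract."""
    def __init__(self, values):
        self._values = list(values)
    def __len__(self):
        return len(self._values)
    def to_numpy(self):
        return np.asarray(self._values)

class SeriesLike:
    """Wrapper that only relies on __len__ and to_numpy() of its container."""
    def __init__(self, container):
        self.container = container
    def __len__(self):
        return len(self.container)
    def sum(self):
        return self.container.to_numpy().sum()
```

The wrapper never inspects which container it holds; that neutrality is what would let a bcolz-, memmap-, or DyND-style engine be dropped in behind the same pandas API.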
For example, I recall that in doing datetime w/tz, we wanted Series.values to return a numpy array (which it DOES!) but it is actually lossy (it loses the tz). Same thing with the Categorical example Wes gave. I don't think these requirements should hold pandas back!
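[Editorial sketch] The tz example can be made concrete without pandas: if a container stores UTC epoch values plus a tz attribute, exporting to a bare ndarray necessarily drops the tz, so a faithful export has to be a separate, explicit path. All names below are made up for illustration, not pandas API.

```python
import numpy as np

class TZDatetimeArray:
    def __init__(self, utc_ns, tz):
        self.utc_ns = np.asarray(utc_ns, dtype=np.int64)  # ns since epoch (UTC)
        self.tz = tz  # e.g. "US/Eastern"

    def to_values(self):
        # mimics Series.values: a plain ndarray comes back, and the tz is gone
        return self.utc_ns.astype("datetime64[ns]")

    def export(self):
        # a non-lossy export has to carry the tz alongside the data
        return {"values": self.utc_ns.copy(), "tz": self.tz}
```

A swappable-container design would probably want both paths: the lossy ndarray view for backward compatibility, and the explicit, faithful export.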
People are increasingly using pandas as the API for their work. That makes it very important that we can handle lots of input properly, w/o the handcuffs of numpy.
All this said, I'll reiterate Wes's (and others') points: back-compat is extremely important. (I in fact try to bend over backwards to provide this; sometimes it's too much, of course!) E.g. take the resample changes to the API.
I was originally going to just do a hard break, but this turns off people when they have to update their code or else.
my 4c (incrementing!)
Jeff
Hi Wes,

You raise some important points. I agree that pandas's patched version of the numpy dtype system is a mess. But despite its issues, its leaky abstraction on top of NumPy provides benefits. In particular, it makes pandas easy to emulate (e.g., xarray), extend (e.g., geopandas) and integrate with other libraries (e.g., patsy, Scikit-Learn, matplotlib). You are right that pandas has started to supplant numpy as a high-level API for data analysis, but of course the robust (and often numpy-based) Python ecosystem is part of what has made pandas so successful. In practice, ecosystem projects often want to work with more primitive objects than series/dataframes in their internal data structures, and without numpy this becomes more difficult. For example, how do you concatenate a list of categoricals? If these were numpy arrays, we could use np.concatenate, but the current implementation of categorical would require a custom solution. First-class compatibility with pandas is harder when pandas data cannot be used with a full ndarray API.

Likewise, hiding implementation details retains some flexibility for us (as developers), but in an ideal world, we would know we have the right abstraction, and then could expose the implementation as an advanced API! This is the case for some very mature projects, such as NumPy. Pandas is not really there yet (with the block manager), but it might be something to strive towards in this rewrite.

At this point, I suppose the ship has sailed on full numpy compatibility (e.g., with categorical in .values). So we absolutely do need explicit interfaces for converting to NumPy, rather than the current implicit guarantees about .values -- which we violated with categorical. Something like your suggested .to_numpy() method would indeed be an improvement over the current state, where we half-pretend that NumPy could be used as an advanced API for pandas, even though it doesn't really work.
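[Editorial sketch] The categorical-concatenation example above is worth spelling out: with codes-plus-categories pairs, concatenating the codes alone is wrong whenever the category sets differ, so a custom remapping step is required. A sketch of that custom solution (the function name and representation are hypothetical; -1 is used as the NA code):

```python
import numpy as np

def concat_categoricals(pieces):
    """Concatenate (codes, categories) pairs by unioning the categories."""
    # build the union of categories, preserving first-seen order
    union, seen = [], {}
    for _, cats in pieces:
        for c in cats:
            if c not in seen:
                seen[c] = len(union)
                union.append(c)
    # remap each piece's codes into the union (-1 stays -1, i.e. NA)
    out = []
    for codes, cats in pieces:
        table = np.array([seen[c] for c in cats], dtype=np.int64)
        codes = np.asarray(codes)
        out.append(np.where(codes >= 0, table[codes], -1))
    return np.concatenate(out), union
```

A plain np.concatenate on the codes would silently conflate, say, code 0 meaning "red" in one piece with code 0 meaning "blue" in another, which is exactly the point about categoricals needing custom treatment.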
I'm sure you would agree that -- at least in theory -- it would be nice to push dtype improvements upstream to numpy, but that is obviously more work (for a variety of reasons) than starting from scratch in pandas. Of course, I think pandas has a need and right to exist as a separate library. But I do think building off of NumPy made it stronger, and pushing improvements upstream would be a better way to go. This has been my approach, and is why I've worked on both pandas and NumPy. The bottom line is that I don't agree that this is the most productive path forward -- I would opt for improving NumPy or DyND instead, which I believe would cause much less pain downstream -- but given that I'm not going to be the person doing the work, I will defer to your judgment. Pandas is certainly in need of holistic improvements and the maturity of a v1.0 release, and that's not something I'm in a position to push myself. Best, Stephan P.S. apologies for the delay -- it's been a busy week. On Wed, Jan 6, 2016 at 12:15 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
I also will add that there is an ideology that has existed in the scientific Python community since 2011 at least which is this: pandas should not have existed; it should be part of NumPy instead.
In my opinion, that misses the point of pandas, both then and now.
There's a large and mostly new class of Python users working on domain-specific industry analytics problems for whom pandas is the most important tool that they use on a daily basis. Their knowledge of NumPy is limited, beyond the aspects of the ndarray API that are the same in pandas. High level APIs and accessibility for them is extremely important. But their skill sets and problems they are solving are not the same ones on the whole that you would have heard discussed at SciPy 2010.
Sometime in 2015, "Python for Data Analysis" sold it's 100,000th copy. I have 5 foreign translations sitting on my shelf -- this represents a very large group of people that we have all collectively enabled by developing pandas -- for a lot of people, pandas is the main reason they use Python!
So the summary of all this is: pandas is much more important as a project now than it was 5 years ago. Our relationship with our library dependencies like NumPy should reflect that. Downstream pandas consumers should similarly eventually concern themselves more with pandas compatibility (rather than always assuming that NumPy arrays are the only intermediary). This is a philosophical shift, but one that will ultimately benefit the usability of the stack.
On Wed, Jan 6, 2016 at 11:45 AM, Jeff Reback <jeffreback@gmail.com> wrote:
I'll just apologize right up front! hahah.
No, I think I have been pushing on these extras in pandas to help move it forward. I have commented a bit on Stephan's issue about why I didn't push for these in numpy: numpy is fairly slow-moving (though it moves faster lately; I suspect the pace when Wes was developing pandas was not much faster).
So pandas was essentially 'fixing' lots of bug / compat issues in numpy.
To the extent that we can keep the current user-facing API the same (high likelihood, I think), I am willing to accept *some* breakage in the pandas -> duck-like array container API in order to provide swappable containers.
For example, I recall that in doing datetime w/tz, we wanted Series.values to return a numpy array (which it DOES!), but it is actually lossy (it loses the tz). Same thing with the Categorical example Wes gave. I don't think these requirements should hold pandas back!
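A quick sketch of the lossiness being described here (behavior as of recent pandas, where datetime-with-tz is a first-class dtype):

```python
import pandas as pd

# a tz-aware datetime Series: the dtype carries the timezone
s = pd.Series(pd.date_range("2016-01-06", periods=3, tz="US/Eastern"))
print(s.dtype)  # datetime64[ns, US/Eastern]

# .values hands back a plain numpy array: the instants survive
# (converted to UTC), but the timezone itself is gone
arr = s.values
print(arr.dtype)  # datetime64[ns]
```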
People are increasingly using pandas as the API for their work. That makes it very important that we can handle lots of input properly, without the handcuffs of numpy.
All this said, I'll reiterate Wes's (and others') point: back-compat is extremely important. (I in fact try to bend over backwards to provide it; sometimes it's too much, of course!) E.g. take the resample API changes:
I was originally going to just do a hard break, but that turns people off when they have to update their code or else.
my 4c (incrementing!)
Jeff
On Mon, Jan 11, 2016 at 9:36 AM, Stephan Hoyer <shoyer@gmail.com> wrote:
Hi Wes,
You raise some important points.
I agree that pandas's patched version of the numpy dtype system is a mess. But despite its issues, its leaky abstraction on top of NumPy provides benefits. In particular, it makes pandas easy to emulate (e.g., xarray), extend (e.g., geopandas) and integrate with other libraries (e.g., patsy, Scikit-Learn, matplotlib).
You are right that pandas has started to supplant numpy as a high level API for data analysis, but of course the robust (and often numpy based) Python ecosystem is part of what has made pandas so successful. In practice, ecosystem projects often want to work with more primitive objects than series/dataframes in their internal data structures and without numpy this becomes more difficult. For example, how do you concatenate a list of categoricals? If these were numpy arrays, we could use np.concatenate, but the current implementation of categorical would require a custom solution. First class compatibility with pandas is harder when pandas data cannot be used with a full ndarray API.
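To make the concatenation example concrete: going through NumPy coerces to object dtype and drops the categorical machinery (codes + categories), so a pandas-specific routine is needed. The sketch below uses `union_categoricals`, which pandas later added (in 0.19) as exactly this kind of custom solution:

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(["x", "y"], categories=["x", "y", "z"])
b = pd.Categorical(["z", "x"], categories=["x", "y", "z"])

# np.concatenate cannot preserve the categorical dtype: each input
# is coerced via np.asarray, so the result is a plain object array
flat = np.concatenate([np.asarray(a), np.asarray(b)])
print(flat.dtype)  # object

# the custom, pandas-specific solution keeps the dtype intact
merged = union_categoricals([a, b])
print(list(merged.categories))  # ['x', 'y', 'z']
```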
Likewise, hiding implementation details retains some flexibility for us (as developers), but in an ideal world, we would know we have the right abstraction, and then could expose the implementation as an advanced API! This is the case for some very mature projects, such as NumPy. Pandas is not really here yet (with the block manager), but it might be something to strive towards in this rewrite.
At this point, I suppose the ship has sailed (e.g., with categorical in .values) on full numpy compatibility. So we absolutely do need explicit interfaces for converting to NumPy, rather than the current implicit guarantees about .values -- which we violated with categorical. Something like your suggested .to_numpy() method would indeed be an improvement over the current state, where we half-pretend that NumPy could be used as an advanced API for pandas, even though it doesn't really work.
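For reference, the suggested explicit interface did eventually ship (as `Series.to_numpy()` in pandas 0.24); a small sketch of the contrast with `.values`:

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "a"]))

# the violated implicit guarantee: .values is not an ndarray here
print(type(s.values).__name__)  # Categorical

# .to_numpy() makes the conversion -- and its lossiness -- explicit
arr = s.to_numpy()
print(arr.dtype, list(arr))  # object ['a', 'b', 'a']
```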
The bottom line is that I don't agree that this is the most productive path forward -- I would opt for improving NumPy or DyND instead, which I believe would cause much less pain downstream -- but given that I'm not going to be the person doing the work, I will defer to your judgment. Pandas is certainly in need of holistic improvements and the maturity of a v1.0 release, and that's not something I'm in a position to push myself.
This seems like a false dichotomy to me. I'm not arguing for forging a NumPy-free or DyND-free path, but rather making DyND's or NumPy's physical memory representation and array computing infrastructure more clearly implementation details of pandas that have limited user-visibility (except when using NumPy / DyND-based tools is necessary).

The main problems we have faced with NumPy are:

- Much more difficult to extend
- Legacy code makes major changes difficult or impossible
- pandas users likely represent a minority (but perhaps a plurality, at this point) of NumPy's users

DyND's scope, as I understand it, is broader than serving as an internal detail of pandas objects. It doesn't have the legacy baggage, but it will face similar challenges around being a general purpose array library versus a more domain-specific analytics and data preparation library.

pandas already has what can be called a "logical type system" (see e.g. https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md for other examples of logical type representations). We use NumPy dtypes for the physical memory representation along with various conventions for pandas-specific behavior like missing data, but they are weakly abstracted in a way that's definitely harmful for users. What I am arguing is:

1) Introduce a proper (from a software engineering perspective) logical data type abstraction that models the way that pandas already works, but cleaning up all the mess (implicit upcasts, lack of a real "NA" scalar value, making pandas-specific methods like unique, factorize, match, etc. true "array methods")

2) Use NumPy physical dtypes (for now) as the primary target physical representation

3) Layer new machinery (like bitmasks) on top of raw NumPy arrays to add new features to pandas

4) Give pandas objects a real C API so that users can manipulate and create pandas objects with their own native (C/C++/Cython) code.

5) Yes, absolutely improve NumPy and DyND and transition to improved NumPy and DyND facilities as soon as they are available and shipped

I don't see alternative ways for pandas to have a truly healthy relationship with more general purpose array / scientific computing libraries without being able to add new pandas functionality in a clean way, and without requiring us to get patches accepted (and released) in NumPy or DyND.

Can you clarify what aspects of this plan are disagreeable / contentious? Are you arguing for pandas becoming more of a companion tool / user interface layer for NumPy or DyND?

cheers,
Wes
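The bitmask idea in Wes's list above can be illustrated with a small sketch (hypothetical names, not a pandas API): a validity mask, here a boolean array standing in for a bitmask, layered over a raw int64 NumPy array gives missing-value support without the implicit upcast to float + NaN:

```python
import numpy as np

class MaskedInt64:
    """Hypothetical sketch: an int64 array plus a validity mask,
    layered on top of raw NumPy storage."""

    def __init__(self, values, valid):
        self.values = np.asarray(values, dtype=np.int64)
        self.valid = np.asarray(valid, dtype=bool)  # True = present

    def sum(self):
        # reductions consult the mask instead of upcasting to float
        return int(self.values[self.valid].sum())

    def isna(self):
        return ~self.valid

arr = MaskedInt64([1, 2, 3, 4], [True, True, False, True])
print(arr.sum())            # 7 -- the masked-out 3 is skipped
print(arr.isna().tolist())  # [False, False, True, False]
```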
On Mon, Jan 11, 2016 at 11:33 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
Just to be clear on my stance re: pushing more code upstream into array libraries: if we introduce the right level of coupling / abstraction between pandas and NumPy/DyND, it will be much easier for us to use libpandas as a staging area for code that we are proposing to push upstream into one of those libraries. That's not really possible right now because pandas's internals are not easily portable to other C/C++ codebases (being written in a mix of pure Python and Cython).
Yep, also agreed. I think DyND is probably a better target than NumPy here, if only because it's also written in C++. NumPy, of course, has been a beast to extend.
I don't see alternative ways for pandas to have a truly healthy relationship with more general purpose array / scientific computing libraries without being able to add new pandas functionality in a clean way, and without requiring us to get patches accepted (and released) in NumPy or DyND.
Indeed, I think my disagreement is mostly about the order in which we approach these problems.
Can you clarify what aspects of this plan are disagreeable / contentious?
See my comments below.
Are you arguing for pandas becoming more of a companion tool / user interface layer for NumPy or DyND?
Not quite. Pandas has some fantastic and highly usable data structures (Series, DataFrame, Index). These certainly don't belong in NumPy or DyND. However, the array-based ecosystem certainly could use improvements to dtypes (e.g., datetime and categorical) and dtype-specific methods (e.g., for strings) just as much as pandas. I do firmly believe that pushing these types of improvements upstream, rather than implementing them independently for pandas, would yield benefits for the broader ecosystem. With the right infrastructure, generalizing things to arrays is not much more work.

I'd like to see pandas itself focus more on the data-structures and less on the data types. This would let us share more work with the "general purpose array / scientific computing libraries".

1) Introduce a proper (from a software engineering perspective) logical data type abstraction that models the way that pandas already works, but cleaning up all the mess (implicit upcasts, lack of a real "NA" scalar value, making pandas-specific methods like unique, factorize, match, etc. true "array methods")

New abstractions have a cost. A new logical data type abstraction is better than no proper abstraction at all, but (in principle), one data type abstraction should be enough to share. A proper logical data type abstraction would be an improvement over the current situation, but if there's a way we could introduce one less abstraction (by improving things upstream in a general purpose array library), that would help even more. For example, we could imagine pushing to make DyND the new core for pandas. This could be enough of a push to make DyND generally useful -- I know it still has a few kinks to work out.

4) Give pandas objects a real C API so that users can manipulate and create pandas objects with their own native (C/C++/Cython) code.

5) Yes, absolutely improve NumPy and DyND and transition to improved NumPy and DyND facilities as soon as they are available and shipped

I like the sound of both of these.
I am in favor of the Wes refactoring, but for some slightly different reasons. I am including some in-line comments.

On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
I don't see alternative ways for pandas to have a truly healthy
relationship with more general purpose array / scientific computing libraries without being able to add new pandas functionality in a clean way, and without requiring us to get patches accepted (and released) in NumPy or DyND.
Indeed, I think my disagreement is mostly about the order in which we approach these problems.
I agree here. I had started on *some* of this, to enable swapping numpy for DyND to support IntNA (all in Python, but the fundamental change was to provide an API layer to the back-end).
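The kind of API layer described here might look something like the following (a hypothetical sketch, not actual pandas code): containers talk to a small back-end interface, so a NumPy-backed block could later be swapped for a DyND- or bcolz-backed one without touching the user-facing API:

```python
import numpy as np

class Backend:
    """Hypothetical back-end protocol: the minimal surface a
    container needs, independent of the storage engine."""
    def take(self, indexer): raise NotImplementedError
    def isna(self): raise NotImplementedError

class NumpyBackend(Backend):
    def __init__(self, values):
        self.values = np.asarray(values)

    def take(self, indexer):
        return NumpyBackend(self.values.take(indexer))

    def isna(self):
        # float-NaN convention; an IntNA back-end would consult
        # a validity mask here instead
        return np.isnan(self.values)

col = NumpyBackend([1.0, np.nan, 3.0])
print(col.isna().tolist())               # [False, True, False]
print(col.take([2, 0]).values.tolist())  # [3.0, 1.0]
```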
Can you clarify what aspects of this plan are disagreeable / contentious?
See my comments below.
Are you arguing for pandas becoming more of a companion tool / user interface layer for NumPy or DyND?
Not quite. Pandas has some fantastic and highly useable data (Series, DataFrame, Index). These certainly don't belong in NumPy or DyND.
However, the array-based ecosystem certainly could use improvements to dtypes (e.g., datetime and categorical) and dtype specific methods (e.g., for strings) just as much as pandas. I do firmly believe that pushing these types of improvements upstream, rather than implementing them independently for pandas, would yield benefits for the broader ecosystem. With the right infrastructure, generalizing things to arrays is not much more work.
I don't think Wes or I disagree here at all. The problem was (and is) the pace of change in the underlying libraries: it is simply too slow for pandas development efforts. I think the pandas efforts (and other libraries) can result in more powerful fundamental libraries that get pushed upstream. However, it would not benefit ANYONE to slow down downstream efforts. I am not sure why you suggest that we WAIT for the upstream libraries to change? We have been waiting forever for that. Now we have a concrete implementation of certain data types that are useful. They (upstream) can take this and build on it (or throw it away and make a better one, or whatever). But I don't think it benefits anyone to WAIT for someone to change numpy first. Look at how long it took them to (partially) fix datetimes.

xarray in particular has done the same thing to pandas, e.g. you have added additional selection operators and syntax (e.g. passing dicts of named axes). These changes are in fact propagating to pandas. This has taken time (but much, much less than it took for any of pandas's changes to reach numpy). Further, look at how long you have advocated (correctly) for labeled arrays in numpy (for which we are still waiting).
I'd like to see pandas itself focus more on the data-structures and less on the data types. This would let us share more work with the "general purpose array / scientific computing libraries".
Pandas IS about specifying the correct data types. It is simply incorrect to decouple this problem from the data-structures. A lot of effort over the years has gone into making all dtypes play nicely with each other and within pandas.
1) Introduce a proper (from a software engineering perspective)
logical data type abstraction that models the way that pandas already works, but cleaning up all the mess (implicit upcasts, lack of a real "NA" scalar value, making pandas-specific methods like unique, factorize, match, etc. true "array methods")
New abstractions have a cost. A new logical data type abstraction is better than no proper abstraction at all, but (in principle), one data type abstraction should be enough to share.
A proper logical data type abstraction would be an improvement over the current situation, but if there's a way we could introduce one less abstraction (by improving things upstream in a general purpose array library) that would help even more.
This is just pushing the problem upstream, which ultimately, given the track record of numpy, won't be solved at all. We will be here 1 year from now having the exact same discussion. Why are we waiting on upstream for anything? As I said above, if something is created which upstream finds useful on a general level, great. The great cost here is time.
For example, we could imagine pushing to make DyND the new core for pandas. This could be enough of a push to make DyND generally useful -- I know it still has a few kinks to work out.
Maybe, but DyND would have to have full compat with what is currently out there (soonish). Then I agree this could be possible. But wouldn't it be even better for pandas to be able to swap back-ends? Why limit ourselves to a particular back-end if it's not that difficult?
4) Give pandas objects a real C API so that users can manipulate and
create pandas objects with their own native (C/C++/Cython) code.
5) Yes, absolutely improve NumPy and DyND and transition to improved
NumPy and DyND facilities as soon as they are available and shipped
I like the sound of both of these.
Further, you made a point above:

You are right that pandas has started to supplant numpy as a high level API for data analysis, but of course the robust (and often numpy based) Python ecosystem is part of what has made pandas so successful. In practice, ecosystem projects often want to work with more primitive objects than series/dataframes in their internal data structures and without numpy this becomes more difficult. For example, how do you concatenate a list of categoricals? If these were numpy arrays, we could use np.concatenate, but the current implementation of categorical would require a custom solution. First class compatibility with pandas is harder when pandas data cannot be used with a full ndarray API.
I disagree entirely here. I think that Series/DataFrame ARE becoming primitive objects. Look at seaborn, statsmodels, and xarray: these are first-class users of these structures, who need the additional meta-data attached. Yes, categoricals would be useful in numpy, and it should support them. But lots of libraries can simply use pandas and do lots of really useful stuff. Why reinvent the wheel with numpy, when you have DataFrames?
From a user's point of view, I don't think they even care about numpy (or whatever drives pandas). Pandas solves a very general problem of working with labeled data.
Jeff
On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback <jeffreback@gmail.com> wrote:
I think the pandas efforts (and other libraries) can result in more powerful fundamental libraries that get pushed upstream. However, it would not benefit ANYONE to slow down downstream efforts. I am not sure why you suggest that we WAIT for the upstream libraries to change? We have been waiting forever for that. Now we have a concrete implementation of certain data types that are useful. They (upstream) can take this and build on (or throw it away and make a better one or whatever). But I don't think it benefits anyone to WAIT for someone to change numpy first. Look at how long it took them to (partially) fix datetimes.
I agree, it is insane to wait on upstream improvements to spontaneously happen on their own. We (interested downstream developers) would need to push them through. I started on this recently, making datetime64 timezone-naive (https://github.com/numpy/numpy/pull/6453) -- though of course, this is one of the easier issues. Of course, this being open source, my suggestions require someone interested in doing all the hard work. And given that that is not me, perhaps I should just shut up :). If the best we think we can realistically do is Wes writing our own data type system, then I'll be a little sad, but it would still be a win.
xarray in particular has done the same thing to pandas, e.g. you have added additional selection operators and syntax (e.g. passing dicts of named axes). These changes are in fact propogating to pandas. This has taken time (but much much less that this took for any of pandas changes to numpy). Further look at how long you have advocated (correctly) for labeled arrays in numpy (which we are still waiting).
I'm actually not convinced NumPy needs labeled arrays. In my mind, libraries like pandas and xarray solve the labeled array problem very well downstream of NumPy. There are costs to making the basic libraries label aware.
I'd like to see pandas itself focus more on the data-structures and less
on the data types. This would let us share more work with the "general purpose array / scientific computing libraries".
Pandas IS about specifying the correct data types. It is simply incorrect to decouple this problem from the data-structures. A lot of effort over the years has gone into making all dtypes playing nice with each other and within pandas.
Yes, a lot of effort has gone into dtypes in pandas. This is great! But wouldn't it be even better if we had a viable path for pushing this stuff upstream? ;)
maybe, but DyND has to have full compat with what currently is out there (soonish). Then I agree this could be possible. But wouldn't it be even better for pandas to be able to swap back-ends? Why limit ourselves to a particular backend if it's not that difficult?
Well, Irwin, what do you say? :) I'm just saying that in my ideal world, we would not invent a new dtype standard for pandas (insert obligatory xkcd reference here).

I disagree entirely here. I think that Series/DataFrame ARE becoming primitive objects. Look at seaborn, statsmodels, and xarray. These are first-class users of these structures, which need the additional meta-data attached.
Seaborn does use Series/DataFrame internally as first class data structures. But for xarray and statsmodels it is the other way around -- pandas objects are accepted as input, but coerced into NumPy arrays internally for storage and manipulation. This presents issues for new types with metadata like categorical. Best, Stephan
Care to elaborate on the xarray decision to keep data as numpy arrays, rather than Series, in DataArray? (as you do keep the Index objects intact)

On Mon, Jan 11, 2016 at 6:35 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
_______________________________________________ Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
Sure -- the main point of xarray is that we need N-dimensional data structures, so we definitely need to support NumPy as a backend. Xarray operations are defined in terms of NumPy (or dask) arrays. In principle, we could store data as a Series, but for the sake of sanity we would need to convert to NumPy arrays before doing any operations. Duck typing compatibility is nice in theory, but in practice lots of subtle issues tend to come up.

The alternative is to write our own ndarray abstraction internally to xarray that could handle special types like Categorical, but I'm pretty reluctant to do that. It seems like a lot of work, and numpy is "good enough" in most cases. And, of course, I'd rather solve those problems upstream :).

Stephan
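The design Stephan describes can be sketched in a few lines (purely illustrative, not xarray's actual code): keep the pandas Index objects for the labels, but coerce the data itself to a plain ndarray so every operation is defined in NumPy terms.

```python
import numpy as np
import pandas as pd

class ToyDataArray:
    """Illustrative only: labels stay as a pandas Index, data becomes ndarray."""

    def __init__(self, data, index=None):
        if index is None and isinstance(data, pd.Series):
            index = data.index          # keep the Index object intact
        self.values = np.asarray(data)  # Series in -> plain ndarray stored
        self.index = index

    def __add__(self, other):
        # operations are defined on .values, i.e. in NumPy terms
        return ToyDataArray(self.values + other, self.index)

s = pd.Series([1, 2, 3], index=["a", "b", "c"])
da = ToyDataArray(s)
print(type(da.values))  # <class 'numpy.ndarray'>
```

The trade-off is exactly the one in the email: pandas-only dtypes like Categorical don't survive the `np.asarray` coercion.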
On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback <jeffreback@gmail.com> wrote:
I am in favor of the Wes refactoring, but for some slightly different reasons.
I am including some in-line comments.
On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
I don't see alternative ways for pandas to have a truly healthy relationship with more general purpose array / scientific computing libraries without being able to add new pandas functionality in a clean way, and without requiring us to get patches accepted (and released) in NumPy or DyND.
Indeed, I think my disagreement is mostly about the order in which we approach these problems.
I agree here. I had started on *some* of this to enable swappable numpy to DyND to support IntNA (all in python, but the fundamental change was to provide an API layer to the back-end).
Can you clarify what aspects of this plan are disagreeable / contentious?
See my comments below.
Are you arguing for pandas becoming more of a companion tool / user interface layer for NumPy or DyND?
Not quite. Pandas has some fantastic and highly usable data structures (Series, DataFrame, Index). These certainly don't belong in NumPy or DyND.
However, the array-based ecosystem certainly could use improvements to dtypes (e.g., datetime and categorical) and dtype specific methods (e.g., for strings) just as much as pandas. I do firmly believe that pushing these types of improvements upstream, rather than implementing them independently for pandas, would yield benefits for the broader ecosystem. With the right infrastructure, generalizing things to arrays is not much more work.
I don't think Wes or I disagree here at all. The problem was (and is) the pace of change in the underlying libraries. It is simply too slow for pandas development efforts.
I think the pandas efforts (and other libraries) can result in more powerful fundamental libraries that get pushed upstream. However, it would not benefit ANYONE to slow down downstream efforts. I am not sure why you suggest that we WAIT for the upstream libraries to change? We have been waiting forever for that. Now we have a concrete implementation of certain data types that are useful. They (upstream) can take this and build on (or throw it away and make a better one or whatever). But I don't think it benefits anyone to WAIT for someone to change numpy first. Look at how long it took them to (partially) fix datetimes.
xarray in particular has done the same thing to pandas, e.g. you have added additional selection operators and syntax (e.g. passing dicts of named axes). These changes are in fact propagating to pandas. This has taken time (but much, much less than it took for any of pandas's changes to reach numpy). Further, look at how long you have advocated (correctly) for labeled arrays in numpy (for which we are still waiting).
I'd like to see pandas itself focus more on the data-structures and less on the data types. This would let us share more work with the "general purpose array / scientific computing libraries".
Pandas IS about specifying the correct data types. It is simply incorrect to decouple this problem from the data structures. A lot of effort over the years has gone into making all dtypes play nicely with each other and within pandas.
1) Introduce a proper (from a software engineering perspective) logical data type abstraction that models the way that pandas already works, but cleaning up all the mess (implicit upcasts, lack of a real "NA" scalar value, making pandas-specific methods like unique, factorize, match, etc. true "array methods")
New abstractions have a cost. A new logical data type abstraction is better than no proper abstraction at all, but (in principle), one data type abstraction should be enough to share.
A proper logical data type abstraction would be an improvement over the current situation, but if there's a way we could introduce one less abstraction (by improving things upstream in a general purpose array library) that would help even more.
This is just pushing a problem upstream, which ultimately, given the track record of numpy, won't be solved at all. We will be here 1 year from now with the exact same discussion. Why are we waiting on upstream for anything? As I said above, if something is created which upstream finds useful on a general level, great. The great cost here is time.
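Wes's point (1) above is about an interface, not an implementation. A rough sketch of what such a logical data type abstraction might look like (hypothetical names throughout; the mask-backed int array shows how the interface would permit integer NA without the float upcast):

```python
from abc import ABC, abstractmethod
import numpy as np

class LogicalArray(ABC):
    """Hypothetical sketch: a dtype-aware array that owns its own missing-data
    semantics and the pandas-style 'array methods' (unique, factorize, ...)."""

    @abstractmethod
    def isna(self):
        """Boolean ndarray marking missing values (no sentinel leaks out)."""

    @abstractmethod
    def factorize(self):
        """Return (integer codes, uniques), with -1 marking NA."""

class MaskedIntArray(LogicalArray):
    """int64 values + boolean mask: a real NA scalar without upcasting."""

    def __init__(self, values, mask):
        self.values = np.asarray(values, dtype="int64")
        self.mask = np.asarray(mask, dtype=bool)

    def isna(self):
        return self.mask.copy()

    def factorize(self):
        codes = np.full(len(self.values), -1, dtype="int64")
        # np.unique sorts its uniques (unlike pandas' order-of-appearance)
        uniques, inverse = np.unique(self.values[~self.mask], return_inverse=True)
        codes[~self.mask] = inverse
        return codes, MaskedIntArray(uniques, np.zeros(len(uniques), dtype=bool))

arr = MaskedIntArray([1, 2, 1], [False, True, False])
codes, uniques = arr.factorize()
print(codes.tolist())  # [0, -1, 0]
```

Any backend (NumPy, DyND, something else) could then sit behind this interface, which is the "swap back-ends" point made elsewhere in the thread.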
For example, we could imagine pushing to make DyND the new core for pandas. This could be enough of a push to make DyND generally useful -- I know it still has a few kinks to work out.
maybe, but DyND has to have full compat with what currently is out there (soonish). Then I agree this could be possible. But wouldn't it be even better for pandas to be able to swap back-ends. Why limit ourselves to a particular backend if its not that difficult.
I think Jeff and I are on the same page here. 5 years ago we were having the *exact same* discussions around NumPy and adding new data type functionality. 5 years is a staggering amount of time in open source. It was less than 5 years between pandas not existing and being a super popular project with 2/3 of a best-selling O'Reilly book written about it. To wit, DyND exists in large part because of the difficulty in making progress within NumPy.

Now, as 5 years ago, I think we should be acting in the best interests of pandas users, and what I've been describing is intended as a straightforward (though definitely labor intensive) and relatively low-risk plan that will "future-proof" the pandas user API for at least the next few years, and probably much longer. If we find that enabling some internals to use DyND is the right choice, we can do that in a non-invasive way while carefully minding data interoperability. Meaningful performance benefits would be a clear motivation.

To be 100% open and transparent (in the spirit of pandas's new governance docs): Before committing to using DyND in any binding way (i.e. required, as opposed to opt-in) in pandas, I'd really like to see more evidence from 3rd parties without direct financial interest (i.e. employment or equity from Continuum) that DyND is "the future of Python array computing"; in the absence of significant user and community code contribution, it still feels like a political quagmire leftover from the Continuum-Enthought rift in 2011.

- Wes
4) Give pandas objects a real C API so that users can manipulate and create pandas objects with their own native (C/C++/Cython) code.
5) Yes, absolutely improve NumPy and DyND and transition to improved NumPy and DyND facilities as soon as they are available and shipped
I like the sound of both of these.
Further you made a point above
You are right that pandas has started to supplant numpy as a high level API for data analysis, but of course the robust (and often numpy based) Python ecosystem is part of what has made pandas so successful. In practice, ecosystem projects often want to work with more primitive objects than series/dataframes in their internal data structures and without numpy this becomes more difficult. For example, how do you concatenate a list of categoricals? If these were numpy arrays, we could use np.concatenate, but the current implementation of categorical would require a custom solution. First class compatibility with pandas is harder when pandas data cannot be used with a full ndarray API.
I disagree entirely here. I think that Series/DataFrame ARE becoming primitive objects. Look at seaborn, statsmodels, and xarray. These are first-class users of these structures, which need the additional meta-data attached.
Yes, categoricals are useful, and numpy should support them. But lots of libraries can simply use pandas and do lots of really useful stuff. However, why reinvent the wheel with numpy, when you have DataFrames?

From a user's point of view, I don't think they even care about numpy (or whatever drives pandas). Pandas solves a very general problem of working with labeled data.
Jeff
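Stephan's concatenation example above is easy to demonstrate. Going through np.concatenate silently loses the categorical dtype; pandas needed a categorical-aware helper to preserve it (union_categoricals, which was added to pandas after this thread took place):

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(["a", "b"])
b = pd.Categorical(["b", "c"])

# NumPy coerces each Categorical to a plain object array first,
# so the categorical dtype is silently lost
via_numpy = np.concatenate([a, b])
print(via_numpy.dtype)  # object

# the custom, categorical-aware solution keeps dtype and categories
combined = union_categoricals([a, b])
print(list(combined.categories))  # ['a', 'b', 'c']
```

This is exactly the "custom solution" cost being debated: every ecosystem library either pays it or drops the metadata.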
Hi all, Stephan Hoyer asked me to comment on DyND and its relation to the changes in Pandas that we're discussing here, so I'd like to do that. But, before I do, I want to clear up some misconceptions about DyND's history from Wes' most recent email.

To be 100% open and transparent (in the spirit of pandas's new governance docs): Before committing to using DyND in any binding way (i.e. required, as opposed to opt-in) in pandas, I'd really like to see more evidence from 3rd parties without direct financial interest (i.e. employment or equity from Continuum) that DyND is "the future of Python array computing"; in the absence of significant user and community code contribution, it still feels like a political quagmire leftover from the Continuum-Enthought rift in 2011.
Let's be very clear about the history (and present) of DyND -- and I think Travis Oliphant captured it well in his email to the NumPy list some months ago: https://mail.scipy.org/pipermail/numpy-discussion/2015-August/073412.html

DyND was started as a personal project of Mark Wiebe in September 2011, and you can see the first commit at https://github.com/libdynd/libdynd/commit/768ac9a30cdb4619d09f4656bfd895ab2b.... At the time, Mark was at the University of British Columbia. He joined Continuum part-time when it was founded in January 2012, and later became full-time in the spring of 2012. DyND, therefore, predates Continuum and never had any relationship with Enthought. As Travis said in his email to the NumPy list (link above), after that "Continuum supported DyND with some fraction of Mark's time". Mark can speak more about this if he wishes, but the point is that DyND's origins are not "a political quagmire leftover from the Continuum-Enthought rift in 2011". Also, Mark left Continuum in December 2014, so everything contributed after that had nothing to do with Continuum.

Now let's move to the other main DyND developers, me and Ian Henriksen. Until June 29, 2015, I had no relationship with Continuum, Enthought, or even the people we're speaking about in this thread. I knew Mark and that was it. I started working on DyND in January 2014, meaning I contributed to it just by choice for 1.5 years. And, if you look at my commit contributions at https://github.com/libdynd/libdynd/graphs/contributors, you'll see that represents about 50% of all of my contributions. And I've contributed a lot.

Ian was originally a Google Summer of Code student that DyND applied for as an open-source project, through NumFOCUS, in the summer of 2015. He started on May 25, 2015 and went until the end of August. Anything he contributed in this time had nothing to do with Continuum. He formally joined Continuum on September 1, 2015.
So, basically, a majority of DyND's commits were given freely by Mark, myself, and Ian.

Now, at present, both Ian and I are sponsored by Continuum. And, yes, they are very graciously supporting us to work on DyND, like they did in the past with Mark. While I understand that, in theory, that could potentially be a conflict of interest, let me be very clear about one thing: Continuum has always approached DyND in a very balanced way, letting it grow as it needs while encouraging interaction with Pandas and other open-source projects in the ecosystem. The decisions we make for DyND have been decisions we've taken for the good of the project.

And, yes, the eventual goal of DyND is to move from incubation at Continuum to a NumFOCUS-sponsored project. And we'll do that as soon as we can.

Irwin
I think I'm mostly on the same page as well. Five years has certainly been too long.

I agree that it would be premature to commit to using DyND in a binding way in pandas. A lot seems to be up in the air with regards to dtypes in Python right now (yes, particularly from projects sponsored by Continuum).

So I would advocate for proceeding with the refactor for now (which will have numerous other benefits), and see how the situation evolves. If it seems like we are in a plausible position to unify the dtype system with a tool like DyND, then let's seriously consider that down the road. Either way, explicit interfaces (e.g., to_numpy(), to_dynd()) will help.
+1 -- I think our long term goal should be to have a common physical memory representation. If pandas internally stays slightly malleable (in a non-user-visible way) we can conform to a standard (presuming one develops) with less user-land disruption. If a standard does not develop we can just shrug our shoulders and do what's best for pandas. We'll have to think about how this will affect pandas's future C API (zero-copy interop guarantees): we might make the C API in the first release more clearly not-for-production use.

Aside: There doesn't even seem to be consensus at the moment on missing data representation. Sentinels, for example, cause interoperability problems with ODBC / databases, and Apache ecosystem projects (e.g. HDFS file formats, Thrift, Spark, Kafka, etc.). If we build a C interface to Avro or Parquet in pandas right now we'll have to convert bitmasks to pandas's bespoke sentinels. To be clear, R has this problem too. I see good arguments for even nixing NaN in floating point arrays, as heretical as that might sound. Ironically I used to be in favor of sentinels but I realized it was an isolationist view.

-W
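The cost of sentinel-based missing data that Wes alludes to shows up even at the dtype level: introducing the NaN sentinel into an integer Series forces an upcast to float64, whereas a bitmask representation (as used by Parquet and the Apache ecosystem formats he mentions) keeps the storage type intact. A quick illustration with plain pandas:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], dtype="int64")
print(s.dtype)  # int64

# reindexing introduces a missing value; the NaN sentinel has no
# representation in int64, so the whole column upcasts to float64
s2 = s.reindex([0, 1, 2, 3])
print(s2.dtype)  # float64
```

A bitmask scheme would instead store the int64 values unchanged alongside a separate validity mask, which is why converting between the two representations at an I/O boundary is lossy and expensive.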
On Mon, Jan 11, 2016 at 4:23 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback <jeffreback@gmail.com> wrote:
I am in favor of the Wes refactoring, but for some slightly different reasons.
I am including some in-line comments.
On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
I don't see alternative ways for pandas to have a truly healthy relationship with more general purpose array / scientific computing libraries without being able to add new pandas functionality in a clean way, and without requiring us to get patches accepted (and released) in NumPy or DyND.
Indeed, I think my disagreement is mostly about the order in which we approach these problems.
I agree here. I had started on *some* of this to enable swappable numpy to DyND to support IntNA (all in python, but the fundamental change was to provide an API layer to the back-end).
Can you clarify what aspects of this plan are disagreeable / contentious?
See my comments below.
Are you arguing for pandas becoming more of a companion tool / user interface layer for NumPy or DyND?
Not quite. Pandas has some fantastic and highly useable data (Series, DataFrame, Index). These certainly don't belong in NumPy or DyND.
However, the array-based ecosystem certainly could use improvements to dtypes (e.g., datetime and categorical) and dtype specific methods (e.g., for strings) just as much as pandas. I do firmly believe that pushing these types of improvements upstream, rather than implementing them independently for pandas, would yield benefits for the broader ecosystem. With the right infrastructure, generalizing things to arrays is not much more work.
I dont' think Wes nor I disagree here at all. The problem was (and is), the pace of change in the underlying libraries. It is simply too slow for pandas development efforts.
I think the pandas efforts (and other libraries) can result in more powerful fundamental libraries that get pushed upstream. However, it would not benefit ANYONE to slow down downstream efforts. I am not sure why you suggest that we WAIT for the upstream libraries to change? We have been waiting forever for that. Now we have a concrete implementation of certain data types that are useful. They (upstream) can take this and build on (or throw it away and make a better one or whatever). But I don't think it benefits anyone to WAIT for someone to change numpy first. Look at how long it took them to (partially) fix datetimes.
xarray in particular has done the same thing to pandas, e.g. you have added additional selection operators and syntax (e.g. passing dicts of named axes). These changes are in fact propogating to pandas. This has taken time (but much much less that this took for any of pandas changes to numpy). Further look at how long you have advocated (correctly) for labeled arrays in numpy (which we are still waiting).
I'd like to see pandas itself focus more on the data-structures and less on the data types. This would let us share more work with the "general purpose array / scientific computing libraries".
Pandas IS about specifying the correct data types. It is simply incorrect to decouple this problem from the data-structures. A lot of effort over the years has gone into making all dtypes playing nice with each other and within pandas.
1) Introduce a proper (from a software engineering perspective) logical data type abstraction that models the way that pandas already works, but cleaning up all the mess (implicit upcasts, lack of a real "NA" scalar value, making pandas-specific methods like unique, factorize, match, etc. true "array methods")
New abstractions have a cost. A new logical data type abstraction is better than no proper abstraction at all, but (in principle), one data type abstraction should be enough to share.
A proper logical data type abstraction would be an improvement over the current situation, but if there's a way we could introduce one less abstraction (by improving things upstream in a general purpose array library) that would help even more.
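The logical data type abstraction under discussion can be sketched concretely; pandas did eventually grow something along these lines (the ExtensionDtype/ExtensionArray interface in later releases). Below is a minimal, hypothetical sketch of the idea — an integer array that carries its own validity mask and exposes `unique` and `factorize` as true array methods. All names here are illustrative, not pandas's actual API:

```python
import numpy as np

class NullableIntArray:
    """Hypothetical logical-dtype array: int64 values plus a validity
    mask, so integers can hold NA without upcasting to float64."""

    def __init__(self, values, mask=None):
        self.values = np.asarray(values, dtype="int64")
        # mask[i] == True marks a missing (NA) slot
        if mask is None:
            mask = np.zeros(len(self.values), dtype=bool)
        self.mask = np.asarray(mask, dtype=bool)

    def unique(self):
        # An "array method" rather than a free function: NA slots are
        # excluded explicitly instead of leaking a sentinel value.
        return np.unique(self.values[~self.mask])

    def factorize(self):
        # Integer codes into the (sorted) array of uniques; NA -> -1.
        # searchsorted is valid here only because unique() is sorted.
        uniques = self.unique()
        codes = np.searchsorted(uniques, self.values)
        codes[self.mask] = -1
        return codes, uniques

arr = NullableIntArray([3, 1, 3, 2], mask=[False, False, False, True])
codes, uniques = arr.factorize()
```

The point of the sketch is that NA handling and the "pandas-specific" methods live on the array itself, so no implicit upcast to float is ever needed.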
This is just pushing a problem upstream, which ultimately, given the track record of numpy, won't be solved at all. We will be here 1 year from now having the exact same discussion. Why are we waiting on upstream for anything? As I said above, if something is created which upstream finds useful on a general level, great. The great cost here is time.
For example, we could imagine pushing to make DyND the new core for pandas. This could be enough of a push to make DyND generally useful -- I know it still has a few kinks to work out.
Maybe, but DyND has to have full compat with what currently is out there (soonish). Then I agree this could be possible. But wouldn't it be even better for pandas to be able to swap back-ends? Why limit ourselves to a particular backend if it's not that difficult?
I think Jeff and I are on the same page here. 5 years ago we were having the *exact same* discussions around NumPy and adding new data type functionality. 5 years is a staggering amount of time in open source. It was less than 5 years between pandas not existing and being a super popular project with 2/3 of a best-selling O'Reilly book written about it. To wit, DyND exists in large part because of the difficulty of making progress within NumPy.
Now, as 5 years ago, I think we should be acting in the best interests of pandas users, and what I've been describing is intended as a straightforward (though definitely labor intensive) and relatively low-risk plan that will "future-proof" the pandas user API for at least the next few years, and probably much longer. If we find that enabling some internals to use DyND is the right choice, we can do that in a non-invasive way while carefully minding data interoperability. Meaningful performance benefits would be a clear motivation.
To be 100% open and transparent (in the spirit of pandas's new governance docs): Before committing to using DyND in any binding way (i.e. required, as opposed to opt-in) in pandas, I'd really like to see more evidence from 3rd parties without direct financial interest (i.e. employment or equity from Continuum) that DyND is "the future of Python array computing"; in the absence of significant user and community code contribution, it still feels like a political quagmire leftover from the Continuum-Enthought rift in 2011.
- Wes
4) Give pandas objects a real C API so that users can manipulate and create pandas objects with their own native (C/C++/Cython) code.
5) Yes, absolutely improve NumPy and DyND and transition to improved NumPy and DyND facilities as soon as they are available and shipped
I like the sound of both of these.
Further you made a point above
You are right that pandas has started to supplant numpy as a high level API for data analysis, but of course the robust (and often numpy based) Python ecosystem is part of what has made pandas so successful. In practice, ecosystem projects often want to work with more primitive objects than series/dataframes in their internal data structures and without numpy this becomes more difficult. For example, how do you concatenate a list of categoricals? If these were numpy arrays, we could use np.concatenate, but the current implementation of categorical would require a custom solution. First class compatibility with pandas is harder when pandas data cannot be used with a full ndarray API.
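The concatenation question raised here is concrete: `np.concatenate` on the underlying codes would be wrong, because each Categorical's integer codes are only meaningful relative to its own categories. The "custom solution" pandas later shipped is `union_categoricals` (available via `pandas.api.types` in releases after this thread):

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(["x", "y"])
b = pd.Categorical(["y", "z"])

# Naively concatenating a.codes and b.codes would conflate
# a's "y" (code 1) with b's "z" (code 1).  union_categoricals
# unions the categories and remaps the codes consistently.
combined = union_categoricals([a, b])
```

Here `combined` holds the values x, y, y, z with the unioned categories x, y, z — exactly the operation that has no direct ndarray analogue.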
I disagree entirely here. I think that Series/DataFrame ARE becoming primitive objects. Look at seaborn, statsmodels, and xarray. These are first-class users of these structures, which need the additional metadata attached.
Yes, categoricals are useful in numpy, and they should support them. But lots of libraries can simply use pandas and do lots of really useful stuff. However, why reinvent the wheel and use numpy when you have DataFrames?
From a user's point of view, I don't think they even care about numpy (or whatever drives pandas). Pandas solves a very general problem of working with labeled data.
Jeff
After taking a step back and starting a new job, I am coming around to Wes's perspective here. The lack of integer NAs and the overly complex/unpredictable internal memory model are major shortcomings (along with the indexing API) for using pandas in production software. Compatibility with the rest of the SciPy ecosystem is important, but it shouldn't hold pandas back. There's no good reason why pandas needs to be built on a library for strided n-dimensional arrays -- that's a lot more complexity than we need.
Best, Stephan
On Tue, Jan 12, 2016 at 5:42 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
On Tue, Jan 12, 2016 at 4:06 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
I think I'm mostly on the same page as well. Five years has certainly been too long.
I agree that it would be premature to commit to using DyND in a binding way in pandas. A lot seems to be up in the air with regards to dtypes in Python right now (yes, particularly from projects sponsored by Continuum).
So I would advocate for proceeding with the refactor for now (which will have numerous other benefits), and see how the situation evolves. If it seems like we are in a plausible position to unify the dtype system with a tool like DyND, then let's seriously consider that down the road. Either way, explicit interfaces (e.g., to_numpy(), to_dynd()) will help.
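Explicit conversion interfaces like these are what make a backend swap conceivable: the frontend data structure only depends on a small export protocol, not on any one array library. A hypothetical sketch of that boundary — `to_numpy()` did later become a real pandas method, but the `Column`/backend classes here are purely illustrative:

```python
import numpy as np

class NumpyBackend:
    """Illustrative storage backend; a DyND-backed class could
    implement the same export() protocol."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def export(self):
        # Copy on export: the explicit interface is also an explicit
        # ownership boundary between pandas and its backend.
        return self._data.copy()

class Column:
    """Hypothetical backend-agnostic column: the frontend assumes
    only that the backend can export a NumPy array."""

    def __init__(self, backend):
        self._backend = backend

    def to_numpy(self):
        return self._backend.export()

col = Column(NumpyBackend([1, 2, 3]))
out = col.to_numpy()
```

With this shape, "unify the dtype system with a tool like DyND down the road" reduces to adding another backend class, with no change to the user-facing API.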
+1 -- I think our long term goal should be to have a common physical memory representation. If pandas internally stays slightly malleable (in a non-user-visible way) we can conform to a standard (presuming one develops) with less user-land disruption. If a standard does not develop we can just shrug our shoulders and do what's best for pandas. We'll have to think about how this will affect pandas's future C API (zero-copy interop guarantees): we might make the C API in the first release more clearly not-for-production use.
Aside: There doesn't even seem to be consensus at the moment on missing data representation. Sentinels, for example, cause interoperability problems with ODBC / databases and Apache ecosystem projects (e.g. HDFS file formats, Thrift, Spark, Kafka, etc.). If we build a C interface to Avro or Parquet in pandas right now we'll have to convert bitmasks to pandas's bespoke sentinels. To be clear, R has this problem too. I see good arguments for even nixing NaN in floating point arrays, as heretical as that might sound. Ironically, I used to be in favor of sentinels, but I realized it was an isolationist view.
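The sentinel-vs-bitmask tradeoff shows up directly in everyday pandas: with NaN as the only sentinel, introducing a single missing value forces integer data to float. A small demonstration — the mask variant below is a sketch of the bitmask approach (as used by Arrow-style formats), not pandas's implementation at the time:

```python
import numpy as np
import pandas as pd

# Sentinel approach: one missing slot silently upcasts int64 -> float64,
# because NaN is a floating point value.
s = pd.Series([1, 2, 3])              # dtype: int64
s2 = s.reindex([0, 1, 2, 3])          # index 3 is missing -> NaN sentinel

# Bitmask approach: the values stay int64; validity lives in a
# separate boolean mask, so no dtype change is needed.
values = np.array([1, 2, 3, 0], dtype="int64")   # slot 3 is a placeholder
valid = np.array([True, True, True, False])
```

This is the same conversion cost described above: a C interface to Parquet or Avro (which use validity bitmasks) has to translate to and from the sentinel representation at the boundary.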
-W
On Mon, Jan 11, 2016 at 4:23 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback <jeffreback@gmail.com>
I am in favor of the Wes refactoring, but for some slightly different reasons.
I am including some in-line comments.
On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
I don't see alternative ways for pandas to have a truly healthy relationship with more general purpose array / scientific computing libraries without being able to add new pandas functionality in a clean way, and without requiring us to get patches accepted (and released) in NumPy or DyND.
Indeed, I think my disagreement is mostly about the order in which we approach these problems.
I agree here. I had started on *some* of this to enable swappable numpy to DyND to support IntNA (all in python, but the fundamental change was to provide an API layer to the back-end).
Can you clarify what aspects of this plan are disagreeable / contentious?
See my comments below.
Are you arguing for pandas becoming more of a companion tool / user interface layer for NumPy or DyND?
Not quite. Pandas has some fantastic and highly usable data structures (Series, DataFrame, Index). These certainly don't belong in NumPy or DyND.
https://github.com/apache/arrow/tree/master/python/pyarrow is looking pretty good. I assume there is a notion of an extension dtype (to support dtypes/schemas that other systems may not), in order to implement things like categorical / datetime with tz, etc.? Then libpandas becomes a pretty thin wrapper around this.
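A type like categorical maps naturally onto Arrow's dictionary-encoded representation: integer codes plus a dictionary of distinct values, which is why a thin wrapper is plausible. A minimal pure-Python sketch of that shared encoding (illustrative only — not pyarrow's API):

```python
def dictionary_encode(values):
    """Encode values as (codes, dictionary): the layout shared by
    pandas Categorical and Arrow's dictionary type."""
    dictionary = []   # distinct values, in order of first appearance
    index = {}        # value -> code
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return codes, dictionary

codes, dictionary = dictionary_encode(["a", "b", "a", "c"])
```

Because both libraries store (codes, dictionary), a categorical column can in principle cross the pandas/Arrow boundary without re-encoding the values themselves.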
On Mar 16, 2016, at 12:44 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
_______________________________________________ Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
If it is of interest, I'll just mention that we recently split off DyND's type system to be its own independent library -- libdyndt (for the types) and libdynd (for the callables and array). Distributing them separately still needs work, but the binaries are there. DyND / Arrow compatibility would be interesting, and I'm always keen to avoid duplication of effort.
At the moment, I don't have plans for PyArrow that extend beyond being the point of contact with systems that use Arrow natively. For example, pandas users will soon be able to read and write Parquet format files via pyarrow (which will handle low-level conversion to/from pandas's NumPy memory representation).
I'd like to continue the pandas refactoring / reorganization effort (+ organizing deprecations) to be able to encapsulate pandas's interactions with NumPy so that alternate backends can be conceivable at all (possibly in 2017-2018). I don't have a lot of bandwidth for this until the 2nd half of April at the earliest, though. Happy to respond to inquiries in the meantime.
- Wes
On Wed, Mar 23, 2016 at 2:22 PM, Irwin Zaid <izaid@continuum.io> wrote:
participants (5)
- Irwin Zaid
- Jeff Reback
- Phillip Cloud
- Stephan Hoyer
- Wes McKinney