Re: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks
> We actually *have* prototypes: the prototype of the split-policy discussed
AFAICT that is a 5 year old branch. Is there a version of this based off of master that you can show asv results for?
> Also, if performance is in the end the decisive criterion, I repeat my earlier remark in this thread: we need to be clearer about what we want / expect.
In principle, this is pretty much exactly what the asvs are supposed to represent.
----
You have demonstrated that you are willing to repeat yourself more than I am, to the point that I find pandas interactions more frustrating than fulfilling. I'm going to step away for a little while.
On Thu, Jun 11, 2020 at 1:29 PM Daniel Scott <scottdaniel539@yahoo.com> wrote:
On Thu, Jun 11, 2020 at 3:10 PM, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
_______________________________________________
Pandas-dev mailing list
Pandas-dev@python.org
https://mail.python.org/mailman/listinfo/pandas-dev
On Thu, 11 Jun 2020 at 23:35, Brock Mendel <jbrockmendel@gmail.com> wrote:
>> We actually *have* prototypes: the prototype of the split-policy discussed
> AFAICT that is a 5 year old branch. Is there a version of this based off of master that you can show asv results for?
A correction here: that branch has been updated several times over the last 5 years, most recently two weeks ago when I started this thread, as I explained in the GitHub issue comment I linked to: https://github.com/pandas-dev/pandas/issues/10556#issuecomment-633703160
>> Also, if performance is in the end the decisive criterion, I repeat my earlier remark in this thread: we need to be clearer about what we want / expect.
> In principle, this is pretty much exactly what the asvs are supposed to represent.
Well, I am repeating myself... but I already mentioned that I am not sure ASV is fully useful for this, as that requires a complete working replacement, which is IMO too much to ask for an initial prototype.
But OK, the message is clear: we need a more concrete implementation / prototype. So let's put this discussion aside for a moment, and focus on that instead. I will try to look at that in the coming weeks, but any help is welcome (and I will try to get it running with ASV, or at least a part of it).
> ----
> You have demonstrated that you are willing to repeat yourself more than I am, to the point that I find pandas interactions more frustrating than fulfilling. I'm going to step away for a little while.
I am sincerely sorry that you find this discussion frustrating. Healthy disagreement and discussion are an essential part of (open source) collaborative projects, but we also need to avoid getting tired of it. So maybe we should evaluate at some point the way this discussion went (including my own interactions) or how to improve our discussions in general.
Joris
On Fri, 12 Jun 2020 at 22:34, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
> But OK, the message is clear: we need a more concrete implementation / prototype. So let's put this discussion aside for a moment, and focus on that instead. I will try to look at that in the coming weeks, but any help is welcome (and I will try to get it running with ASV, or at least a part of it).
To come back to this: I cleaned up a proof-of-concept implementation that I started after the discussion above, and put it in a PR to view/discuss: https://github.com/pandas-dev/pandas/pull/36010
On Mon, 31 Aug 2020 at 16:20, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
> To come back to this: I cleaned up a proof-of-concept implementation that I started after the discussion above, and put it in a PR to view/discuss: https://github.com/pandas-dev/pandas/pull/36010
Another follow-up: the proof-of-concept is now merged into the master branch, and I am currently working on making it more feature complete (see https://github.com/pandas-dev/pandas/issues/39146 for an overview issue).
Joris
And to give another update on this topic: the development branch of pandas now contains an experimental version of this "columnar store" (using an ArrayManager class instead of the BlockManager under the hood, which stores the columns as a list of 1D arrays), which is almost feature-complete (the biggest missing links are JSON and PyTables IO).
At the moment, there is an option to enable it for experimenting with it (not yet documented, as it might still see behaviour changes):
    # set the default manager to ArrayManager
    pd.options.mode.data_manager = "array"
    # when creating a DataFrame, you will now get one with an ArrayManager instead of BlockManager
    df = pd.DataFrame(...)
    df = pd.read_csv(...)
There are still some remaining work items (more IO, ironing out some known bugs/todo's, checking performance), see https://github.com/pandas-dev/pandas/issues/39146 to keep track of this.
Best,
Joris
On Tue, 9 Feb 2021 at 19:17, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
> Another follow-up: the proof-of-concept is now merged into the master branch, and I am currently working on making it more feature complete (see https://github.com/pandas-dev/pandas/issues/39146 for an overview issue).
> Joris
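As a rough illustration of the storage layout mentioned in this thread (columns kept as a list of independent 1D arrays), the pure-Python sketch below may help. `ColumnStore` and its method names are hypothetical and chosen only for this example. It is not the pandas ArrayManager, which holds NumPy or extension arrays and does far more.

```python
from array import array


class ColumnStore:
    """Toy column store: one independent 1D array per column (hypothetical)."""

    def __init__(self, columns):
        # columns: dict mapping column name -> 1D array.array
        self.names = list(columns)
        self.arrays = [columns[name] for name in self.names]

    def get_column(self, name):
        # Column access returns the stored 1D array itself: no copy is made.
        return self.arrays[self.names.index(name)]

    def add_column(self, name, values):
        # Adding a column appends one array; existing columns are untouched.
        self.names.append(name)
        self.arrays.append(values)

    def get_row(self, i):
        # A row has to be gathered from every column, so cost grows with width.
        return [arr[i] for arr in self.arrays]

    def to_rows(self):
        # Rough analogue of building a single 2D array: every value is copied.
        return [list(row) for row in zip(*self.arrays)]


store = ColumnStore({
    "a": array("q", [1, 2, 3]),        # 64-bit integers
    "b": array("d", [0.5, 1.5, 2.5]),  # doubles
})
store.add_column("c", array("q", [10, 20, 30]))
col_a = store.get_column("a")
row_1 = store.get_row(1)
table = store.to_rows()
```

In this toy version, each column keeps its own type code and is never merged with its neighbours, which is the "non-consolidating" idea in the thread title. Column access is a lookup, while `get_row` and `to_rows` have to visit every column.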
Another update on this topic: over the last weeks I have been updating the status of this project (and fixing some regressions), and rerunning the benchmarks.
You can find an overview of the results of our ASV benchmarks at https://github.com/pandas-dev/pandas/issues/39146#issuecomment-988002256. Some general points about those benchmark results:
- The cases that show big slowdowns are mostly cases where we do `df.values` or equivalent, i.e. converting the DataFrame to a single 2D array (`.values`, `to_numpy`, `transpose`, ..). Another subset of cases involves row-wise operations (reductions with axis=1, selecting a single row as a Series). I think those are the expected cases where a 1D-column store will always be slower.
- Many of our ASV benchmarks use wide dataframes (e.g. an often-used shape is (1000, 1000), so a square dataframe). While it's of course important to cover this, I also think this is not the most common shape of dataframes, and in any case it gives a somewhat biased view.
- Our ASV benchmarks are mostly micro-benchmarks, or at least benchmarks that generally take at most 1 to 100 ms (by using small enough data to limit the runtime to this). While this is important to keep the benchmark suite usable, it also has the consequence that many of those benchmarks are partly or largely measuring "overhead" which doesn't necessarily increase with the data size (more rows). The ArrayManager will typically increase this overhead, but as long as this overhead is in the "milliseconds" range, it does not necessarily have much influence on larger data workflows (depending on the exact workflow of course).
Overall, I find the results quite reassuring: they identify the cases where a slowdown is to be expected (and we will need to judge whether we find this acceptable), highlight some areas that can use improvement, and also show that many of the benchmarks are not (or not much) impacted. But I think it also shows that we will need to seek more real-world feedback, either by constructing some macro benchmarks, or by getting user feedback from their real-world workflows.
For the first option (macro benchmarks), I quickly cleaned up and pushed an experiment I did over a year ago, which is to run one query of one of the industry-standard benchmark suites (TPC) using pandas (https://nbviewer.org/github/jorisvandenbossche/pandas-benchmarks/blob/main/t...). This shows basically no difference between BlockManager and ArrayManager. This is of course also only one single workflow (with narrow long dataframes, doing mostly groupby and merge, and the overall time is dominated by e.g. the factorize algos, which aren't affected by the dataframe layout), but this is something we could maybe expand with other benchmark cases.
---
We now have a prototype implementation people can experiment with + we have an overview of ASV benchmark results. Given this, I think it is a good point to discuss again how we want to move forward with this, and whether we want to communicate the _intent_ to make this the default in some next pandas version (emphasizing "intent", since it will always depend on the feedback we get).
Joris
On Wed, 7 Apr 2021 at 16:28, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
> And to give another update on this topic: the development branch of pandas now contains an experimental version of this "columnar store" (using an ArrayManager class instead of the BlockManager under the hood, which stores the columns as a list of 1D arrays), which is almost feature-complete (the biggest missing links are JSON and PyTables IO).
> Joris
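The remark about micro-benchmarks largely measuring fixed overhead can be made concrete with a toy cost model. The sketch below assumes that the time of an operation is a fixed per-call overhead plus a per-row cost times the number of rows. The function name and every number in it are made up for illustration and are not measurements from the ASV suite.

```python
def relative_slowdown(n_rows, overhead_a, overhead_b, cost_per_row):
    """Return time_b / time_a under a hypothetical linear cost model.

    Each time is modelled as: fixed overhead + cost_per_row * n_rows.
    """
    time_a = overhead_a + cost_per_row * n_rows
    time_b = overhead_b + cost_per_row * n_rows
    return time_b / time_a


# Hypothetical values in arbitrary time units: manager B has three times the
# fixed overhead of manager A, but both do the same amount of per-row work.
small = relative_slowdown(1_000, overhead_a=100.0, overhead_b=300.0, cost_per_row=0.01)
large = relative_slowdown(10_000_000, overhead_a=100.0, overhead_b=300.0, cost_per_row=0.01)
```

Under these assumed numbers the small case is dominated by overhead and shows a large ratio, while the large case approaches a ratio of 1. Real workflows need not follow a linear model, so this only illustrates the shape of the argument.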
We have planned a video meeting about this topic next week Wednesday, December 22, at 19:00 UTC. The meeting has been added to the pandas development calendar visible at https://pandas.pydata.org/docs/development/meeting.html, and the zoom meeting link is https://us06web.zoom.us/j/81798190900?pwd=ZEo4SnlGMGZxZkVNRkpOLzg0dld3dz09
Joris
On Tue, 7 Dec 2021 at 19:01, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
> Another update on this topic: over the last weeks I have been updating the status of this project (and fixing some regressions), and rerunning the benchmarks.
participants (2)
- Brock Mendel
- Joris Van den Bossche