Another update on this topic: over the last weeks I have been updating the status of this project (and fixing some regressions), and rerunning the benchmarks.

You can find an overview of the results of our ASV benchmarks at https://github.com/pandas-dev/pandas/issues/39146#issuecomment-988002256. Some general points about those benchmark results:

- The cases that show big slowdowns mostly involve `df.values` or an equivalent, i.e. converting the DataFrame to a single 2D array (`.values`, `to_numpy`, `transpose`, ..). Another subset of cases involves row-wise operations (reductions with axis=1, selecting a single row as a Series). I think those are the expected cases where a 1D-column store will always be slower.
- Many of our ASV benchmarks use wide dataframes (e.g. an often-used shape is (1000, 1000), so a square dataframe). While it's of course important to cover this, I don't think this is the most common shape of dataframes, so it gives a somewhat biased view.
- Our ASV benchmarks are mostly micro-benchmarks, or at least benchmarks that generally take at most 1 to 100 ms (by using small enough data to limit the runtime). While this is important to keep the benchmark suite usable, it also means that many of those benchmarks partly or largely measure "overhead" that doesn't necessarily grow with the data size (more rows). The ArrayManager will typically increase this overhead, but as long as it stays in the "milliseconds" range, it does not necessarily have much influence on larger data workflows (depending on the exact workflow of course).
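
To make the first point a bit more concrete, here is a minimal sketch in plain NumPy. It does not use the actual pandas internals: the names `block` and `columns` are made up for illustration, and the layout is simplified. It only shows why getting a single 2D array or a single row is cheap when the data already lives in one 2D array, and requires gathering from every column when it is stored as separate 1D arrays.

```python
import numpy as np

n_rows, n_cols = 1000, 4

# Block-style layout (illustrative only): all columns of one dtype
# are stored together in a single 2D array.
block = np.arange(n_rows * n_cols, dtype="float64").reshape(n_rows, n_cols)

# Column-style layout (illustrative only): one independent 1D array per column.
columns = [block[:, i].copy() for i in range(n_cols)]

# "df.values"-like access for the block layout: the 2D array already exists,
# so it can be handed out without copying.
values_from_block = block

# "df.values"-like access for the column layout: the 1D arrays have to be
# copied into a newly allocated 2D array.
values_from_columns = np.column_stack(columns)

# Selecting a single row: one slice of the block, versus gathering
# one element from each column.
row_from_block = block[10]
row_from_columns = np.array([col[10] for col in columns])

# The block result shares memory with the stored data; the stacked result does not.
print(np.shares_memory(values_from_block, block))
print(any(np.shares_memory(values_from_columns, col) for col in columns))
```

Both layouts give the same values; the difference is only in how much work (allocation and copying) is needed to produce them, which is what those benchmark cases end up measuring.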

Overall, I find the results quite reassuring: they identify the cases where a slowdown is to be expected (and we will need to judge whether we find this acceptable), highlight some areas that can use improvement, and also show that many of the benchmarks are not (or not much) impacted.
But I think it also shows that we will need to seek more real-world feedback, either by constructing some macro benchmarks, or by getting user feedback from their real-world workflows.

For the first option (macro benchmarks), I quickly cleaned up and pushed an experiment I did over a year ago, which is to run one query of one of the industry-standard benchmark suites (TPC) using pandas (https://nbviewer.org/github/jorisvandenbossche/pandas-benchmarks/blob/main/tpc-ds/query-1.ipynb#Time-the-full-query). This shows basically no difference between BlockManager and ArrayManager. This is of course only a single workflow (with narrow, long dataframes, doing mostly groupby and merge, where the overall time is dominated by e.g. the factorize algos, which aren't affected by the dataframe layout), but this is something we could maybe expand with other benchmark cases.

---

We now have a prototype implementation people can experiment with + we have an overview of ASV benchmark results. Given this, I think it is a good point to discuss again how we want to move forward with this, and whether we want to communicate the _intent_ to make this the default in some next pandas version (emphasizing "intent", since it will always depend on the feedback we get).

Joris


On Wed, 7 Apr 2021 at 16:28, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
And to give another update on this topic: the development branch of pandas now contains an experimental version of this "columnar store" (using an ArrayManager class instead of the BlockManager under the hood, which stores the columns as a list of 1D arrays), which is almost feature-complete (the biggest missing links are JSON and PyTables IO).

At the moment, there is an option to enable it for experimentation (not yet documented, as it might still see behaviour changes):

import pandas as pd

# set the default manager to ArrayManager
pd.options.mode.data_manager = "array"

# when creating a DataFrame, you will now get one with an ArrayManager instead of BlockManager
df = pd.DataFrame(...)
df = pd.read_csv(...)

There are still some remaining work items (more IO, ironing out some known bugs/todo's, checking performance), see https://github.com/pandas-dev/pandas/issues/39146 to keep track of this.

Best,
Joris

On Tue, 9 Feb 2021 at 19:17, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:

On Mon, 31 Aug 2020 at 16:20, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:


On Fri, 12 Jun 2020 at 22:34, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
On Thu, 11 Jun 2020 at 23:35, Brock Mendel <jbrockmendel@gmail.com> wrote:
> We actually have prototypes: the prototype of the split-policy discussed

AFAICT that is a 5 year old branch.  Is there a version of this based off of master that you can show asv results for?

A correction here: that branch has been updated several times over the last 5 years, most recently two weeks ago when I started this thread, as I explained in the github issue comment I linked to: https://github.com/pandas-dev/pandas/issues/10556#issuecomment-633703160
 
> Also, if performance is in the end the decisive criterion, I repeat my earlier remark in this thread: we need to be clearer about what we want / expect.

In principle, this is pretty much exactly what the asvs are supposed to represent.

Well, I am repeating myself .. but I already mentioned that I am not sure ASV is fully useful for this, as that requires a complete working replacement, which is IMO too much to ask for an initial prototype.

But OK, the message is clear: we need a more concrete implementation / prototype. So let's put this discussion aside for a moment, and focus on that instead. I will try to look at that in the coming weeks, but any help is welcome (and I will try to get it running with ASV, or at least a part of it).
 
To come back to this: I cleaned up a proof-of-concept implementation that I started after the discussion above, and put it in a PR to view/discuss: https://github.com/pandas-dev/pandas/pull/36010
 

Another follow-up: the proof-of-concept is now merged in the master branch, and I am currently working on making it more feature-complete (see https://github.com/pandas-dev/pandas/issues/39146 for an overview issue).

Joris