A case for a simplified (non-consolidating) BlockManager with 1D blocks
Hi list,

Rewriting the BlockManager based on a simpler collection of 1D arrays is actually on our roadmap (see here <https://pandas.pydata.org/docs/dev/development/roadmap.html#block-manager-re...>), and I also touched on it in a mailing list discussion about pandas 2.0 earlier this year (see here <https://mail.python.org/pipermail/pandas-dev/2020-February/001180.html>).

But since the topic came up again at the last online dev meeting (and Uwe Korn also wrote a nice blog post <https://uwekorn.com/2020/05/24/the-one-pandas-internal.html> about this yesterday), I thought I would write up why I think we should actually move towards a simpler, non-consolidating BlockManager with 1D blocks.

*Simplification of the internals*

The many special cases we currently have for 1D ExtensionArrays (EAs) in the internals are regularly brought up as a reason to have 2D EAs. But to be clear: the additional complexity does not come from 1D EAs in themselves, it comes from the fact that we have a mixture of 2D and 1D blocks. Solving this requires a consistent block dimension, and thus removing this added complexity can be done in two ways: have all 1D blocks, or have all 2D blocks. Just to say: IMO, this is not an argument in favor of 2D blocks / consolidation.

Moreover, when going with all 1D blocks, we can not only remove the added complexity from dealing with the mixture of 1D/2D blocks, we will *also* be able to reduce the complexity of dealing with 2D blocks. A BlockManager with 2D blocks is inherently more complex than one with 1D blocks, as one needs to deal with proper alignment of the blocks, a more complex "placement" logic of the blocks, etc.

I think we would be able to simplify the internals a lot by going with a BlockManager as a store of 1D arrays.

*Performance*

Performance is typically given as a reason to have consolidated, 2D blocks. And of course, certain operations (especially row-wise operations, or operations on dataframes with more columns than rows) will always be faster when done on a 2D numpy array under the hood. However, based on recent experimentation with this (e.g. triggered by the block-wise frame ops PR <https://github.com/pandas-dev/pandas/pull/32779>, and see also some benchmarks I just posted in #10556 <https://github.com/pandas-dev/pandas/issues/10556#issuecomment-633703160> / this gist <https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c>), I also think that for many operations and with decent-sized dataframes, this performance penalty is actually quite OK.

Further, there are also operations that will *benefit* from 1D blocks. First, operations that now involve aligning/splitting blocks, re-consolidation, etc. will benefit (e.g. a large part of the slowdown when doing frame/frame operations column-wise is currently due to the consolidation at the end). And operations like adding a column, concatenating (with axis=1) or merging dataframes will be much faster when no consolidation is needed.

Personally, I am convinced that with some effort, we can get on-par or sometimes even better performance with 1D blocks compared to the performance we have now, for those cases that 90+% of our users care about:

- With limited effort optimizing the column-wise code paths in the internals, we can get a long way.
- After that, if needed, we can still consider whether parts of the internals could be cythonized to further improve certain bottlenecks (and cythonizing this will actually also be simpler for a simpler, non-consolidating block manager).

*Possibility to get better copy/view semantics*

Pandas is notorious for how much it copies ("you need 10x the memory available as the size of your dataframe"), and having 1D blocks will allow us to address part of those concerns.

*No consolidation = less copying.* Regularly consolidating introduces copies, and thus removing consolidation will mean fewer copies. For example, this would make it possible to actually add a single column to a dataframe without having to copy the full dataframe.

*Copy / view semantics* Recently there has been discussion again around whether selecting columns should be a copy or a view, and some other issues were opened with questions about views/copies when slicing columns. In the consolidated 2D block layout this will always be inherently messy and unpredictable (meaning: it depends on the actual block layout, which in practice makes it unpredictable for a user who is unaware of the block layout). Going with a non-consolidated BlockManager should at least allow us to get better / more understandable semantics around this.

------------------------------

*So what are the reasons to have 2D blocks?*

I personally don't directly see reasons to have 2D blocks *for pandas itself* (apart from performance in certain row-wise use cases, and except for the fact that we have "always done it like this"). But quite likely I am missing reasons, so please bring them up.

I do think there are certainly use cases where 2D blocks can be useful, but these are typically "external" (though nonetheless important) use cases: conversion to/from numpy, xarray, etc. A typical example that has recently come up is scikit-learn, where they want to have a cheap dataframe <-> numpy array roundtrip for use in their pipelines. However, I personally think there are possible ways that we can still accommodate those use cases, with some effort, while still having 1D blocks in pandas itself. So IMO this is not sufficient to warrant the complexity of 2D blocks in pandas. (But I will stop here, as this mail is already getting long ...)

Joris
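As a rough illustration of the copying and copy/view points in the mail above, here is a minimal sketch that uses plain NumPy arrays as a stand-in for pandas blocks. The variable names, the `(n_columns, n_rows)` block shape and the dict-of-arrays layout are assumptions made for the example, not the actual pandas internals.

```python
import numpy as np

n_rows = 5

# Assumed "consolidated" layout: three float64 columns stored together in one
# 2D block of shape (n_columns, n_rows).
block = np.arange(3 * n_rows, dtype="float64").reshape(3, n_rows)
new_col = np.zeros(n_rows)

# Adding a column to the block allocates a new array and copies all old values.
bigger_block = np.vstack([block, new_col[np.newaxis, :]])
print(bigger_block.shape)                      # (4, 5)
print(np.shares_memory(block, bigger_block))   # False

# Assumed "non-consolidated" layout: a dict of independent 1D arrays.
columns = {f"c{i}": block[i].copy() for i in range(3)}
existing = dict(columns)          # remember the array objects we had before
columns["c3"] = new_col           # adding a column touches nothing else
print(all(columns[name] is arr for name, arr in existing.items()))  # True

# Whether selecting columns from the 2D block gives a view or a copy depends
# on where those columns happen to sit inside the block:
adjacent = block[0:2]             # basic slice -> view on the block
scattered = block[[0, 2]]         # fancy indexing -> copy
print(np.shares_memory(block, adjacent))    # True
print(np.shares_memory(block, scattered))   # False
```

With one array per column, the same selection can be given a single consistent behaviour (always reference the existing arrays, or always copy them), independent of any block layout.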
Thanks for writing this up, Joris. Assuming we go down this path, do you have an idea of how we get from here to there incrementally? I.e., presumably this won't be just one massive PR.

On Mon, May 25, 2020 at 2:39 PM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
[...]
On Tue, 26 May 2020 at 00:46, Brock Mendel <jbrockmendel@gmail.com> wrote:
Thanks for writing this up, Joris. Assuming we go down this path, do you have an idea of how we get from here to there incrementally? I.e., presumably this won't be just one massive PR.
Yes, this is certainly not a one-PR change. I think there are multiple options for working towards this that are worth discussing.

But personally, I would first like to focus on the "assuming we go down this path" part. Let's discuss the pros and cons and trade-offs, and try to turn assumptions into an agreed-upon roadmap. (And of course, it's not because something is on our roadmap that it can't be questioned and discussed again in the future, as we are also doing now.)

---

Some thoughts on possible options:

- We briefly discussed before the idea of using (nullable) extension dtypes for all dtypes by default in pandas 2.0. If we strive towards that, and assuming we keep the current 1D restriction on ExtensionBlock, then we would "automatically" get a BlockManager with 1D blocks. And we could then focus on optimizing some code paths (e.g. constructing a new block) specifically for the case of 1D ExtensionBlocks.
- A "consolidation policy" option, similar to the one in the branch discussed in https://github.com/pandas-dev/pandas/issues/10556. Right now, that branch still uses 2D blocks (but separate 2D blocks of shape (1, n) per column) and not actually 1D blocks. So we could add 1D versions of our numeric blocks as well. But that would probably add a lot of complexity, although temporary, to the Blocks, so it is maybe not an ideal path forward.
- Add a version of the ExtensionBlock that can work with numpy arrays instead of extension arrays, or actually use "PandasArrays" to store them in the existing ExtensionBlock (so as to already start using the existing 1D blocks without requiring all dtypes to be extension dtypes).

Those are all about a BlockManager with 1D blocks. Once we only have 1D blocks, I suppose there are many things we could simplify in the current BlockManager. The intermediate step of the current BlockManager with 1D blocks might not be an optimal situation, but it seems the easiest intermediate goal in practice.
It probably also depends on how much "backwards compatibility" or "transition period" we want to provide.
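To make the difference between "separate 2D blocks of shape (1, n) per column" and actual 1D blocks a bit more tangible, here is a small sketch with plain NumPy arrays. It only assumes that a block's values are a NumPy array; the names `column` and `block_2d` are made up for the example and do not refer to pandas classes.

```python
import numpy as np

column = np.array([1.0, 2.0, 3.0, 4.0])    # a true 1D block: shape (n,)

# The same data as a per-column 2D block of shape (1, n).
block_2d = column.reshape(1, -1)
print(block_2d.shape)                        # (1, 4)
print(np.shares_memory(column, block_2d))    # True: the reshape is a view

# Code written for 2D blocks reduces over axis=1 and gets an array back...
print(block_2d.sum(axis=1))                  # [10.]
# ...while code written for 1D blocks gets a scalar.
print(column.sum())                          # 10.0
```

The data is identical in both cases, but every code path has to know which of the two shapes it is handling, which is the kind of temporary extra complexity the second option would bring.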
Hello all,

thanks Joris for starting this thread. For myself, I struggle a bit to understand the cases that are made for the BlockManager benefits. The examples are mostly operations that act on two full DataFrames, like "df1 + df2", or come from the fact that one wants to keep a single-type 2D matrix together with column labels but not actually make use of pandas functionality afterwards.

In the code I write on a day-to-day basis, we don't have these use cases, so I'm struggling to understand the real-world benefit of having these operations supported as efficiently as possible in pandas. Even when using scikit-learn pipelines, we keep heterogeneously typed DataFrames for as long as possible and only convert to a single-type matrix as late as possible.

Thus, can anyone enlighten me as to which real-world use cases need this to be supported in pandas?

Best
Uwe

On Tue, May 26, 2020 at 10:55, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
[...]
_______________________________________________ Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
Hi Joris, Thanks for the summary. I think another missing point is the roundtrip conversion to/from sparse matrices. There are some benchmarks and discussion here; https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-6154230... and here's some discussion on the pandas issue tracker: https://github.com/pandas-dev/pandas/issues/33182 and some benchmark by Tom, assuming pandas would accept a 2D sparse array: https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-6154408... What do you think of these usecases? Thanks, Adrin On Mon, May 25, 2020 at 11:39 PM Joris Van den Bossche < jorisvandenbossche@gmail.com> wrote:
Hi list,
Rewriting the BlockManager based on a simpler collection of 1D-arrays is actually on our roadmap (see here <https://pandas.pydata.org/docs/dev/development/roadmap.html#block-manager-re...>), and I also touched on it in a mailing list discussion about pandas 2.0 earlier this year (see here <https://mail.python.org/pipermail/pandas-dev/2020-February/001180.html>).
But since the topic came up again recently at the last online dev meeting (and also Uwe Korn who wrote a nice blog post <https://uwekorn.com/2020/05/24/the-one-pandas-internal.html> about this yesterday), I thought to do a write-up of my thoughts on why I think we should actually move towards a simpler, non-consolidating BlockManager with 1D blocks.
*Simplication of the internals*
It's regularly brought up as a reason to have 2D EextensionArrays (EAs) because right now we have a lot of special cases for 1D EAs in the internals. But to be clear: the additional complexity does not come from 1D EAs in itself, it comes from the fact that we have a mixture of 2D and 1D blocks. Solving this would require a consistent block dimension, and thus removing this added complexity can be done in two ways: have all 1D blocks, or have all 2D blocks. Just to say: IMO, this is not an argument in favor of 2D blocks / consolidation.
Moreover, when going with all 1D blocks, we cannot only remove the added complexity from dealing with the mixture of 1D/2D blocks, we will *also* be able to reduce the complexity of dealing with 2D blocks. A BlockManager with 2D blocks is inherently more complex than with 1D blocks, as one needs to deal with proper alignment of the blocks, a more complex "placement" logic of the blocks, etc.
I think we would be able to simplify the internals a lot by going with a BlockManager as a store of 1D arrays.
*Performance*
Performance is typically given as a reason to have consolidated, 2D blocks. And of course, certain operations (especially row-wise operations, or operations on dataframes with more columns than rows) will always be faster when done on a 2D numpy array under the hood. However, based on recent experimentation with this (e.g. triggered by the block-wise frame ops PR <https://github.com/pandas-dev/pandas/pull/32779>, and see also some benchmarks I just posted in #10556 <https://github.com/pandas-dev/pandas/issues/10556#issuecomment-633703160> / this gist <https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c>), I also think that for many operations and with decent-sized dataframes, this performance penalty is actually quite OK.
Further, there are also operations that will *benefit* from 1D blocks. First, operations that now involve aligning/splitting blocks, re-consolidation, etc. will benefit (e.g. a large part of the slowdown when doing frame/frame operations column-wise is currently due to the consolidation at the end). And operations like adding a column, concatenating (with axis=1) or merging dataframes will be much faster when no consolidation is needed.
Personally, I am convinced that with some effort, we can get on-par or sometimes even better performance with 1D blocks compared to the performance we have now for those cases that 90+% of our users care about:
- With limited effort optimizing the column-wise code paths in the internals, we can get a long way.
- After that, if needed, we can still consider if parts of the internals could be cythonized to further improve certain bottlenecks (and actually cythonizing this will also be simpler for a simpler non-consolidating block manager).
*Possibility to get better copy/view semantics*
Pandas is notorious for how much it copies ("you need 10x the memory available as the size of your dataframe"), and having 1D blocks will allow us to address part of those concerns.
*No consolidation = less copying.* Regularly consolidating introduces copies, and thus removing consolidation will mean fewer copies. For example, this would make it possible to actually add a single column to a dataframe without having to copy the full dataframe.
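To make the copying argument concrete, here is a minimal NumPy-only sketch. It does not use pandas internals: the 2D `block` and the `columns` dict are hypothetical stand-ins for a consolidated block and a column store, and all names are assumptions made for illustration.

```python
import numpy as np

n_rows = 1_000

# Consolidated layout (stand-in): three float64 columns in one 2D block,
# shaped (n_columns, n_rows).
block = np.zeros((3, n_rows))

# Adding a fourth column to the same block means allocating a larger
# array and copying all existing data into it.
new_block = np.vstack([block, np.ones((1, n_rows))])
consolidated_reuses_memory = np.shares_memory(block, new_block)

# Column store (stand-in): every column is its own 1D array.
columns = {"a": np.zeros(n_rows), "b": np.zeros(n_rows), "c": np.zeros(n_rows)}

# Adding a column only adds one dict entry; the existing arrays are reused.
new_columns = {**columns, "d": np.ones(n_rows)}
column_store_reuses_memory = all(new_columns[k] is columns[k] for k in columns)

print(consolidated_reuses_memory)   # False: the old columns were copied
print(column_store_reuses_memory)   # True: the old columns are the same objects
```

The sketch only shows the memory behaviour of the two layouts; how much this matters in practice depends on how often consolidation is triggered.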
*Copy / view semantics* Recently there has been discussion again around whether selecting columns should be a copy or a view, and some other issues were opened with questions about views/copies when slicing columns. In the consolidated 2D block layout this will always be inherently messy, and unpredictable (meaning: depending on the actual block layout, which means in practice unpredictable for the user unaware of the block layout). Going with a non-consolidated BlockManager should at least allow us to get better / more understandable semantics around this.
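As a rough illustration of why the 2D layout makes this hard to predict, the following NumPy-only sketch shows that whether a selection from a single 2D array is a view or a copy depends on how the selection maps onto that array. The `block` and `columns` objects are hypothetical stand-ins, not the actual BlockManager.

```python
import numpy as np

# Stand-in for one consolidated block holding columns "a", "b", "c",
# shaped (n_columns, n_rows).
block = np.arange(12, dtype="float64").reshape(3, 4)

# A contiguous range of columns is a basic slice: NumPy returns a view.
sliced = block[0:2]
# A list of positions is fancy indexing: NumPy returns a copy.
picked = block[[0, 2]]

slice_is_view = np.shares_memory(block, sliced)   # True
pick_is_view = np.shares_memory(block, picked)    # False

# With a dict of 1D arrays, any subset of columns can reuse the same arrays,
# independent of which columns are selected.
columns = {"a": block[0].copy(), "b": block[1].copy(), "c": block[2].copy()}
subset = {name: columns[name] for name in ["a", "c"]}
subset_reuses_arrays = all(subset[k] is columns[k] for k in subset)

print(slice_is_view, pick_is_view, subset_reuses_arrays)
```

In a real dataframe the user additionally does not know which columns share a block, which is what makes the outcome unpredictable in practice.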
------------------------------
*So what are the reasons to have 2D blocks?*
I personally don't directly see reasons to have 2D blocks *for pandas itself* (apart from performance in certain row-wise use cases, and except for the fact that we have "always done it like this"). But quite likely I am missing reasons, so please bring them up.
I do think there are certainly use cases where 2D blocks can be useful, but these are typically "external" (though nonetheless important) use cases: conversion to/from numpy, xarray, etc. A typical example that has recently come up is scikit-learn, where they want to have a cheap dataframe <-> numpy array roundtrip for use in their pipelines. However, I personally think there are possible ways that we can still accommodate those use cases, with some effort, while still having 1D blocks in pandas itself. So IMO this is not sufficient to warrant the complexity of 2D blocks in pandas. (But I will stop here, as this mail is already getting long.)
Joris _______________________________________________ Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
Thanks for those links!
Personally, I see the "roundtrip conversion to/from sparse matrices" as being in the same bucket as conversion to/from a 2D numpy array. Yes, both are important use cases. But the question we need to ask ourselves is still: is this important enough to hugely complicate the pandas internals and block several other improvements? It's a trade-off that we need to make.
Moreover, I think that we could also accommodate the important part of those use cases with a column-store DataFrame, with some effort (but with less complexity than a consolidated BlockManager).
Focusing on scikit-learn: in the end, you mostly care about cheap roundtripping of a 2D numpy array or sparse matrix to/from a pandas DataFrame to carry feature labels in between steps of a pipeline, correct? Such cheap roundtripping is only possible anyway if you have a single dtype for all columns (which is typically the case after some transformation step). So you don't necessarily need consolidated blocks specifically, but rather the ability to store a *single* 2D array/matrix in a DataFrame (so kind of a single 2D block).
Thinking out loud here, didn't try anything in code:
- We could make the DataFrame construction from a 2D array/matrix kind of "lazy" (or have an option to do it like this): upon construction just store the 2D array as is, and only once you perform an actual operation on it, convert to a columnar store. That would make it possible to still get the 2D array back with zero-copy, if all you did was pass this DataFrame to the next step of the pipeline.
- We could take the above a step further and try to preserve the 2D array under the hood in some "easy" operations (but again, limited to a single 2D block/array, not multiple consolidated blocks). This is actually similar to the DataMatrix that pandas had a very long time ago.
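A minimal sketch of what the first, "lazy" option could look like, using only NumPy. `LazyFrame` and all of its methods are hypothetical names invented for this illustration; nothing like it exists in pandas, and a real implementation would have to handle indexes, dtypes and many more operations.

```python
import numpy as np


class LazyFrame:
    """Hypothetical container: keeps a 2D array as-is until columns are needed."""

    def __init__(self, values, names):
        values = np.asarray(values)
        if values.ndim != 2 or values.shape[1] != len(names):
            raise ValueError("expected a 2D array with one name per column")
        self._values = values      # stored without copying
        self._names = list(names)
        self._columns = None       # columnar store, built on first use

    def to_numpy(self):
        # Zero-copy as long as no columnar operation has happened.
        if self._columns is None:
            return self._values
        return np.column_stack([self._columns[n] for n in self._names])

    def _materialize(self):
        # Split the 2D array into contiguous 1D arrays, exactly once.
        if self._columns is None:
            self._columns = {
                n: np.ascontiguousarray(self._values[:, i])
                for i, n in enumerate(self._names)
            }
            self._values = None
        return self._columns

    def __getitem__(self, name):
        return self._materialize()[name]

    def __setitem__(self, name, column):
        cols = self._materialize()
        if name not in cols:
            self._names.append(name)
        cols[name] = np.asarray(column)


arr = np.arange(6.0).reshape(3, 2)
frame = LazyFrame(arr, ["x", "y"])

# Passing the frame along untouched gives the original array back.
roundtrip_is_zero_copy = frame.to_numpy() is arr

# The first real operation converts to the columnar store.
frame["z"] = frame["x"] + frame["y"]
after_op = frame.to_numpy()
```

The design choice here is that the split happens once and is never undone, so the 2D path stays limited to the "construct and pass along" case described above.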
Of course this adds back complexity, so this would need some more exploration to see whether and how this would be possible (without duplicating a lot), and some buy-in from people interested in this.
I think the first option should be fairly easy to do, and should solve a large part of the concerns for scikit-learn (I think?).
I think the second idea is also interesting: IMO such a data structure would be useful to have somewhere in the PyData ecosystem, and it is a worthwhile discussion to think about where this could fit. Maybe the answer is simply: use xarray for this use case (although there are still differences)? Those are interesting discussions, but personally I would not complicate the core pandas data model for heterogeneous dataframes to accommodate the single-dtype + fixed number of columns use case.
Joris
On Tue, 26 May 2020 at 09:50, Adrin <adrin.jalali@gmail.com> wrote:
Hi Joris,
Thanks for the summary. I think another missing point is the roundtrip conversion to/from sparse matrices. There are some benchmarks and discussion here: https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-6154230... and here's some discussion on the pandas issue tracker: https://github.com/pandas-dev/pandas/issues/33182 and some benchmarks by Tom, assuming pandas would accept a 2D sparse array: https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-6154408...
What do you think of these use cases?
Thanks, Adrin
On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche < jorisvandenbossche@gmail.com> wrote:
I think the first option should be fairly easy to do, and should solve a large part of the concerns for scikit-learn (I think?).
I think the first option would solve that use case for scikit-learn. It sounds feasible, but I'm not sure how easy it would be.
I think the second idea is also interesting: IMO such a data structure would be useful to have somewhere in the PyData ecosystem, and a worthwhile discussion to think about where this could fit. Maybe the answer is simply: use xarray for this use case (although there are still differences) ? That are interesting discussions, but personally I would not complicate the core pandas data model for heterogeneous dataframes to accommodate the single-dtype + fixed number of columns use case.
The current prototype[1] accepts and preserves both xarray and pandas data structures.
[1]: https://github.com/scikit-learn/scikit-learn/pull/16772
A little historical perspective
10 years ago the standard input to a DataFrame was a single-dtype 2D numpy array. This provided the following nice properties:
- zero-cost construction: you can simply wrap a DataFrame around the input with very little overhead. This provides a labeled array interface, gaining pandas users
- very fast reductions: the block is passed to numpy directly for the reductions; numpy can then reduce with aligned memory access
- almost all operations in pandas coerced to float64
The block manager is optimized for this case, as this was the original DataMatrix. It serves its purpose pretty well.
In the last few years things have changed in the following ways:
- a dict of 1D numpy arrays is by far the most common construction
- heterogeneous dtypes have grown quite a bit, e.g. it's now very common to use int8 or float32; these are also preserved pretty well by pandas operations
- non-numpy-backed dtypes are increasingly common
To me, removing the block manager is not about performance, but rather about simplifying the code and the mental model. We should be mindful, though, that construction from 2D inputs will require splitting and thus will not be cheap (note that you can view the 1D slices, but these are not memory aligned). This is a typical trap that folks get into: 1D looks all rosy, but it all depends on the use case.
I think it would be OK for pandas to move to a dict of columns and simply document the non-performing cases (e.g. very wide single dtypes or 2D construction). I suppose it's also possible to reinvent the DataMatrix in a limited form, but that of course adds complexity, and I would like to see that after a refactor.
my 3c
Jeff
On May 26, 2020, at 7:22 AM, Tom Augspurger <tom.augspurger88@gmail.com> wrote:
On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote: Thanks for those links!
I think the first option should be fairly easy to do, and should solve a large part of the concerns for scikit-learn (I think?).
I think the first option would solve that use case for scikit-learn. It sounds feasible, but I'm not sure how easy it would be.
I think the second idea is also interesting: IMO such a data structure would be useful to have somewhere in the PyData ecosystem, and a worthwhile discussion to think about where this could fit. Maybe the answer is simply: use xarray for this use case (although there are still differences) ? That are interesting discussions, but personally I would not complicate the core pandas data model for heterogeneous dataframes to accommodate the single-dtype + fixed number of columns use case.
The current prototype[1] accepts and preserves both xarray and pandas data structures.
[1]: https://github.com/scikit-learn/scikit-learn/pull/16772
Joris
On Tue, 26 May 2020 at 09:50, Adrin <adrin.jalali@gmail.com> wrote: Hi Joris,
Thanks for the summary. I think another missing point is the roundtrip conversion to/from sparse matrices. There are some benchmarks and discussion here; https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-6154230... and here's some discussion on the pandas issue tracker: https://github.com/pandas-dev/pandas/issues/33182 and some benchmark by Tom, assuming pandas would accept a 2D sparse array: https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-6154408...
What do you think of these usecases?
Thanks, Adrin
On Mon, May 25, 2020 at 11:39 PM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote: Hi list,
Rewriting the BlockManager based on a simpler collection of 1D-arrays is actually on our roadmap (see here), and I also touched on it in a mailing list discussion about pandas 2.0 earlier this year (see here).
But since the topic came up again recently at the last online dev meeting (and also Uwe Korn who wrote a nice blog post about this yesterday), I thought to do a write-up of my thoughts on why I think we should actually move towards a simpler, non-consolidating BlockManager with 1D blocks.
Simplication of the internals
It's regularly brought up as a reason to have 2D EextensionArrays (EAs) because right now we have a lot of special cases for 1D EAs in the internals. But to be clear: the additional complexity does not come from 1D EAs in itself, it comes from the fact that we have a mixture of 2D and 1D blocks. Solving this would require a consistent block dimension, and thus removing this added complexity can be done in two ways: have all 1D blocks, or have all 2D blocks. Just to say: IMO, this is not an argument in favor of 2D blocks / consolidation.
Moreover, when going with all 1D blocks, we cannot only remove the added complexity from dealing with the mixture of 1D/2D blocks, we will also be able to reduce the complexity of dealing with 2D blocks. A BlockManager with 2D blocks is inherently more complex than with 1D blocks, as one needs to deal with proper alignment of the blocks, a more complex "placement" logic of the blocks, etc.
I think we would be able to simplify the internals a lot by going with a BlockManager as a store of 1D arrays.
Performance
Performance is typically given as a reason to have consolidated, 2D blocks. And of course, certain operations (especially row-wise operations, or on dataframes with more columns as rows) will always be faster when done on a 2D numpy array under the hood. However, based on recent experimentation with this (eg triggered by the block-wise frame ops PR, and see also some benchmarks I justed posted in #10556 / this gist), I also think that for many operations and with decent-sized dataframes, this performance penalty is actually quite OK.
Further, there are also operations that will benefit from 1D blocks. First, operations that now involve aligning/splitting blocks, re-consolidation, .. will benefit (e.g. a large part of the slowdown doing frame/frame operations column-wise is currently due to the consolidation in the end). And operations like adding a column, concatting (with axis=1) or merging dataframes will be much faster when no consolidation is needed.
Personally, I am convinced that with some effort, we can get on-par or sometimes even better performance with 1D blocks compared to the performance we have now for those cases that 90+% of our users care about:
With limited effort optimizing the column-wise code paths in the internals, we can get a long way. After that, if needed, we can still consider if parts of the internals could be cythonized to further improve certain bottlenecks (and actually cythonizing this will also be simpler for a simpler non-consolidating block manager).
Possibility to get better copy/view semantics
Pandas is badly known for how much it copies ("you need 10x the memory available as the size of your dataframe"), and having 1D blocks will allow us to address part of those concerns.
No consolidation = less copying. Regularly consolidating introduces copies, and thus removing consolidation will mean less copies. For example, this would enable that you can actually add a single column to a dataframe without having to copy to the full dataframe.
Copy / view semantics Recently there has been discussion again around whether selecting columns should be a copy or a view, and some other issues were opened with questions about views/copies when slicing columns. In the consolidated 2D block layout this will always be inherently messy, and unpredictable (meaning: depending on the actual block layout, which means in practice unpredictable for the user unaware of the block layout). Going with a non-consolidated BlockManager should at least allow us to get better / more understandable semantics around this.
So what are the reasons to have 2D blocks?
I personally don't directly see reasons to have 2D blocks for pandas itself (apart from performance in certain row-wise use cases, and except for the fact that we have "always done it like this"). But quite likely I am missing reasons, so please bring them up.
But I think there are certainly use cases where 2D blocks can be useful, but typically "external" (but nonetheless important) use cases: conversion to/from numpy, xarray, etc. A typical example that has recently come up is scikit-learn, where they want to have a cheap dataframe <-> numpy array roundtrip for use in their pipelines. However, I personally think there are possible ways that we can still accommodate for those use cases, with some effort, while still having 1D Blocks in pandas itself. So IMO this is not sufficient to warrant the complexity of 2D blocks in pandas. (but will stop here, as this mail is getting already long ..).
Joris
_______________________________________________ Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
Assuming we go down this path, do you have an idea of how we get from here to there incrementally? i.e. presumably this won't just be one massive PR [...]
I would first like to focus on the "assuming we go down this path" part. Let's discuss the pros and cons and trade-offs, and try to turn assumptions into an agreed-upon roadmap. [...]
I think understanding the difficulty/feasibility of the implementation is a pretty important part of the pros/cons. Looking back at #10556, I'm wondering if we could disable _most_ consolidation, e.g. only consolidate when making copies anyway, which might be a never-break-views policy. From a user standpoint would that achieve much/most of the benefits here?
On Tue, May 26, 2020 at 5:17 AM Jeff Reback <jeffreback@gmail.com> wrote:
A little historical perspective
10 years ago the standard input to a DataFrame was a single-dtype 2D numpy array. This provides the following nice properties:
- 0-cost construction: you can simply wrap a DataFrame around the input with very little overhead. This provides a labeled array interface, gaining pandas users
- very fast reductions: the block is passed to numpy directly for the reductions; numpy can then reduce with aligned memory access
- almost all operations in pandas coerced to float64
The block manager is optimized for this case as this was the original DataMatrix. It serves its purpose pretty well.
In the last few years things have changed in the following ways:
- dict of 1D numpy arrays is by far the most common construction
- heterogeneous dtypes have grown quite a bit, eg it’s now very common to use int8, float32; these are also preserved pretty well by pandas operations
- non numpy backed dtypes are increasingly common
To me, removing the block manager is not about performance but about simplifying the code and the mental model. We should be mindful, though, that construction from 2D inputs will require splitting and thus will not be cheap (note that you can view the 1D slices, but these are not memory aligned). This is a typical trap that folks get into: 1D looks all rosy, but it all depends on the use case.
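A small NumPy sketch of the splitting cost mentioned above. It assumes a default C-ordered (row-major) input and reads "not memory aligned" as "strided rather than contiguous"; that reading is an interpretation, not something stated in the thread.

```python
import numpy as np

arr = np.zeros((1_000, 4))  # C-ordered (row-major) by default

col = arr[:, 0]  # a view: no data is copied
print(np.shares_memory(col, arr))   # True
print(col.flags["C_CONTIGUOUS"])    # False
print(col.strides)                  # (32,): 4 float64 values between consecutive elements

# Turning the view into a contiguous 1D column requires a copy.
contiguous_col = np.ascontiguousarray(col)
print(contiguous_col.strides)                 # (8,)
print(np.shares_memory(contiguous_col, arr))  # False
```

So a column store built from a row-major 2D input has to choose between cheap but strided views and one copy per column.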
I think it would be ok for pandas to move to a dict of columns and simply document the non-performant cases (eg very wide single-dtype frames or 2D construction).
I suppose it’s also possible to reinvent the DataMatrix in a limited form, but that of course adds complexity, and I would like to see that only after a refactor.
my 3c
Jeff
On May 26, 2020, at 7:22 AM, Tom Augspurger <tom.augspurger88@gmail.com> wrote:
On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche < jorisvandenbossche@gmail.com> wrote:
Thanks for those links!
Personally, I see the "roundtrip conversion to/from sparse matrices" a bit as in the same bucket as conversion to/from a 2D numpy array. Yes, both are important use cases. But the question we need to ask ourselves is still: is this important enough to hugely complicate the pandas internals and block several other improvements? It's a trade-off that we need to make.
Moreover, I think that we could accommodate the important part of those use cases also with a column-store DataFrame, with some effort (but with less complexity than a consolidated BlockManager).
Focusing on scikit-learn: in the end, you mostly care about cheap roundtripping of 2D numpy array or sparse matrix to/from a pandas DataFrame to carry feature labels in between steps of a pipeline, correct? Such cheap roundtripping is only possible anyway if you have a single dtype for all columns (which is typically the case after some transformation step). So you don't necessarily need consolidated blocks specifically, but rather the ability to store a *single* 2D array/matrix in a DataFrame (so kind of a single 2D block).
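The single-dtype condition can be checked with plain NumPy. This is a sketch with made-up example arrays: columns of different dtypes cannot be packed into one 2D array without conversion and copying, while a homogeneous 2D array can be exposed column by column without copying.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype="int8")
floats = np.array([0.5, 1.5, 2.5], dtype="float32")

# Heterogeneous columns: a single 2D array needs one common dtype,
# so the data is converted and copied.
mixed = np.column_stack([ints, floats])
print(mixed.dtype)                      # float32
print(np.shares_memory(mixed, ints))    # False
print(np.shares_memory(mixed, floats))  # False

# Homogeneous 2D input: every column can be a view on the same buffer.
single = np.zeros((3, 2), dtype="float32")
cols = [single[:, j] for j in range(single.shape[1])]
print(all(np.shares_memory(c, single) for c in cols))  # True
```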
Thinking out loud here, didn't try anything in code:
- We could make the DataFrame construction from a 2D array/matrix kind of "lazy" (or have an option to do it like this): upon construction just store the 2D array as is, and only once you perform an actual operation on it, convert to a columnar store. That would make it possible to still get the 2D array back with zero-copy, if all you did was pass this DataFrame to the next step of the pipeline.
- We could take the above a step further and try to preserve the 2D array under the hood in some "easy" operations (but again, limited to a single 2D block/array, not multiple consolidated blocks). This is actually similar to the DataMatrix that pandas had a very long time ago. Of course this adds back complexity, so this would need some more exploration to see if and how this would be possible (without duplicating a lot), and some buy-in from people interested in this.
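A toy version of the first ("lazy") option could look like the sketch below. Everything in it is hypothetical: `LazyColumns` and its methods are invented for illustration and are not pandas API; it only shows the idea of keeping the 2D array untouched until a columnar operation is requested.

```python
import numpy as np


class LazyColumns:
    """Hypothetical container: keeps a 2D array as-is until columns are needed."""

    def __init__(self, values):
        self._values = values   # original 2D array, stored without copying
        self._columns = None    # columnar store, created on first use

    def to_numpy(self):
        # Zero-copy round trip as long as no columnar operation has run.
        if self._columns is None:
            return self._values
        return np.column_stack(list(self._columns.values()))

    def _materialize(self):
        if self._columns is None:
            n_cols = self._values.shape[1]
            # One contiguous 1D copy per column.
            self._columns = {j: self._values[:, j].copy() for j in range(n_cols)}
            self._values = None
        return self._columns

    def mean(self):
        # Any "real" operation triggers the conversion to a columnar store.
        return {j: float(col.mean()) for j, col in self._materialize().items()}


arr = np.arange(12, dtype="float64").reshape(4, 3)
frame = LazyColumns(arr)
print(frame.to_numpy() is arr)  # True: nothing was copied yet
print(frame.mean())             # {0: 4.5, 1: 5.5, 2: 6.5}
print(frame.to_numpy() is arr)  # False: the data now lives in 1D columns
```

The extra state is confined to two attributes, which is the kind of localized complexity this option trades for the zero-copy pass-through.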
I think the first option should be fairly easy to do, and should solve a large part of the concerns for scikit-learn (I think?).
I think the first option would solve that use case for scikit-learn. It sounds feasible, but I'm not sure how easy it would be.
I think the second idea is also interesting: IMO such a data structure would be useful to have somewhere in the PyData ecosystem, and it is worthwhile to think about where this could fit. Maybe the answer is simply: use xarray for this use case (although there are still differences)? Those are interesting discussions, but personally I would not complicate the core pandas data model for heterogeneous dataframes to accommodate the single-dtype + fixed number of columns use case.
The current prototype[1] accepts and preserves both xarray and pandas data structures.
[1]: https://github.com/scikit-learn/scikit-learn/pull/16772
Joris
On Tue, 26 May 2020 at 09:50, Adrin <adrin.jalali@gmail.com> wrote:
Hi Joris,
Thanks for the summary. I think another missing point is the roundtrip conversion to/from sparse matrices. There are some benchmarks and discussion here: https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-6154230...
and here's some discussion on the pandas issue tracker: https://github.com/pandas-dev/pandas/issues/33182
and some benchmark by Tom, assuming pandas would accept a 2D sparse array: https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-6154408...
What do you think of these use cases?
Thanks, Adrin
On Mon, May 25, 2020 at 11:39 PM Joris Van den Bossche < jorisvandenbossche@gmail.com> wrote:
[...]
On Tue, 26 May 2020 at 16:14, Brock Mendel <jbrockmendel@gmail.com> wrote:
Assuming we go down this path, do you have an idea of how we get from here to there incrementally? i.e. presumably this won't just be one massive PR [...] I would first like to focus on the "assuming we go down this path" part. Let's discuss the pros and cons and trade-offs, and try to turn assumptions into an agreed-upon roadmap. [...]
I think understanding the difficulty/feasibility of the implementation is a pretty important part of the pros/cons.
That's true. Personally I think there are enough options to do it to not have to worry about the "how" too much, but for sure it will be a lot of work to do it properly (so the question is rather "who is going to do this").
Looking back at #10556, I'm wondering if we could disable _most_ consolidation, e.g. only consolidate when making copies anyway, which might be a never-break-views policy. From a user standpoint would that achieve much/most of the benefits here?
That could certainly alleviate some of the drawbacks of the consolidated BlockManager regarding its copying behaviour (but not necessarily regarding the transparency / understandability of it, I would say). But for the "complexity of the internals" argument, for example, I think this would rather make it worse. Currently, you at least know (after ensuring consolidation) that you have only a single block for a certain dtype. Still having many, potentially-but-not-always consolidated 2D blocks will make it more difficult to optimize the situation of non-consolidated / 1D blocks.
On Tue, 26 May 2020 at 13:21, Tom Augspurger <tom.augspurger88@gmail.com> wrote:
On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche < jorisvandenbossche@gmail.com> wrote:
- We could make the DataFrame construction from a 2D array/matrix kind of "lazy" (or have an option to do it like this): upon construction just store the 2D array as is, and only once you perform an actual operation on it, convert to a columnar store. And that would make it possible to still get the 2D array back with zero-copy, if all you did was passing this DataFrame to the next step of the pipeline.
I think the first option should be fairly easy to do, and should solve a large part of the concerns for scikit-learn (I think?).
I think the first option would solve that use case for scikit-learn. It sounds feasible, but I'm not sure how easy it would be.
A quick, ugly proof-of-concept: https://github.com/pandas-dev/pandas/commit/cf387dced4803b81ec8709eeaf624369...
It allows creating a "DataFrame" from an ndarray without creating a BlockManager, and it allows accessing this original ndarray:
In [1]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3)))
In [2]: df._mgr_data
Out[2]:
(array([[ 1.52971972e-01, -5.69204971e-01,  5.54430115e-01],
        [-1.09916133e+00, -1.16315362e+00, -1.51071081e+00],
        [ 7.05185110e-01, -1.53009348e-03,  1.54260335e+00],
        [-4.60590231e-01, -3.85364427e-01,  1.80760103e+00]]),
 RangeIndex(start=0, stop=4, step=1),
 RangeIndex(start=0, stop=3, step=1))
And once you do something with the dataframe, such as printing or calculating something, the BlockManager only gets created at that step:
In [3]: df
Out[3]: Initializing !!!
          0         1         2
0  0.152972 -0.569205  0.554430
1 -1.099161 -1.163154 -1.510711
2  0.705185 -0.001530  1.542603
3 -0.460590 -0.385364  1.807601
In [4]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3)))
In [5]: df.mean()
Initializing !!!
Out[5]:
0    0.397243
1    0.269996
2   -0.454929
dtype: float64
There are of course many things missing (validation of the input to init_lazy, potentially being able to access df.index/df.columns without initializing the block manager, hooking this up in __array__, what to do with pickling, ...). But just to illustrate the idea.
Thanks for verifying the feasibility. Validation is a bit tricky, but I'd hope that we can delay everything except the splitting / forming of blocks. That may result in some non-obvious performance quirks, but at least for the simple case of `data` being an ndarray and index / columns not forcing any reindexing, I'm hopeful that it's not too bad.
On Tue, May 26, 2020 at 2:35 PM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
[...]
Something to add here (in favor of removing the BM) -- and apologies if it's already mentioned in a different form:
It is very, very difficult for third party code to construct heterogeneously-typed DataFrames without triggering a memory doubling. To give you an example of what I mean, in Apache Arrow, we painstakingly implemented block consolidation in C++ [1] so that we can construct a DataFrame that won't suddenly double memory the first time that a user interacts with it. So the possibility of users having an OOM on their first interaction with an object they created is not great. If avoiding it for library developers were easy then perhaps it would be less of an issue, but avoiding the doubling requires advanced knowledge of pandas's internals.
Looking back 9-10 years, the primary motivations I had for creating the BlockManager in the first place don't persuade me anymore:
* pandas's success was still very much coupled to vectorized operations on wide row-major data (e.g. as present in certain sectors of the financial industry). I don't think this represents the majority of pandas users now
* In 2011 I was uncomfortable writing significant compiled code. Many of the performance issues that the BM tried to ameliorate are non-issues if you're OK writing non-trivial C/C++ code to deal with row-level interactions.
Even if there were a 50% performance regression on some of these operations that are faster with 2D blocks because of row-major vs. column-major memory layout, that still seems worth it for the vast code simplification and the memory-use-predictability benefits that others have articulated already.
- Wes
[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_pa...
On Tue, May 26, 2020 at 2:35 PM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
[...]
It allows to create a "DataFrame" from an ndarray without creating a BlockManager, and it allows accessing this original ndarray:
This is a neat proof of concept, but it cuts against the "decreases complexity" argument. Is there a viable way to quantify (even very roughly) the complexity effect of going all-1D?
A couple ideas for ways to simplify this decision-making problem:
1) ATM there are a handful of places outside of core.internals where we call consolidate/consolidate_inplace. If we can refactor those away, we can focus on the BlockManager in (closer-to-)isolation.
2) IIUC going all-1D will cause column indexing to always return views. Elsewhere you have noted that this is a breaking API change which merited discussion in its own right. xref #33780 <https://github.com/pandas-dev/pandas/issues/33780>. My takeaway from this part of the last dev call was that people were generally positive on the all-views idea, but were wary of how to handle the potential deprecation.
On Tue, May 26, 2020 at 12:49 PM Wes McKinney <wesmckinn@gmail.com> wrote:
[...]
On Tue, May 26, 2020 at 3:50 PM Brock Mendel <jbrockmendel@gmail.com> wrote:
It allows to create a "DataFrame" from an ndarray without creating a BlockManager, and it allows accessing this original ndarray:
This is a neat proof of concept, but it cuts against the "decreases complexity" argument. Is there a viable way to quantify (even very roughly) the complexity effect of going all-1D?
That complexity is at least localized to a single attribute. That's quite different from the 1D & 2D blocks situation, where many methods (though fewer than a year ago) need to be concerned with whether the array in a block is 1D or 2D, or whether the DataFrame is consolidated, homogenous, ...
A couple ideas for ways to simplify this decision-making problem:
1) ATM there are a handful of places outside of core.internals where we call consolidate/consolidate_inplace. If we can refactor those away, we can focus on the BlockManager in (closer-to-)isolation.
If possible, isolating consolidation to `core.internals` sounds like a generally useful cleanup, regardless of whether we pursue the larger changes.
2) IIUC going all-1D will cause column indexing to always return views. Elsewhere you have noted that this is a breaking API change which merited discussion in its own right. xref #33780 <https://github.com/pandas-dev/pandas/issues/33780>. My takeaway from this part of the last dev call was that people were generally positive on the all-views idea, but were wary of how to handle the potential deprecation.
This type of change would merit a major version bump. If possible, we'd ideally have some kind of option to disable consolidation / enable splitting, which would allow for users to test their code on older versions.
On Tue, May 26, 2020 at 12:49 PM Wes McKinney <wesmckinn@gmail.com> wrote:
Something to add here (in favor of removing the BM) -- and apologies if it's already mentioned in a different form:
It is very, very difficult for third party code to construct heterogeneously-typed DataFrames without triggering a memory doubling. To give you an example what I mean, in Apache Arrow, we painstakingly implemented block consolidation in C++ [1] so that we can construct a DataFrame that won't suddenly double memory the first time that a user interacts with it. So the possibility of users having an OOM on their first interaction with an object they created is not great. If avoiding it for library developers were easy then perhaps it would be less of an issue, but avoiding the doubling requires advanced knowledge of pandas's internals.
Looking back 9-10 years, the primary motivations I had for creating the BlockManager in the first place don't persuade me anymore:
* pandas's success was still very much coupled to vectorized operations on wide row-major data (e.g. as present in certain sectors of the financial industry). I don't think this represents the majority of pandas users now * In 2011 I was uncomfortable writing significant compiled code. Many of the performance issues that the BM tried to ameliorate are non-issues if you're OK writing non-trivial C/C++ code to deal with row-level interactions. Even if there were a 50% performance regression on some of these operations that are faster with 2D blocks because of row-major vs. column-major memory layout, that still seems worth it for the vast code simplification and the memory-use-predictability benefits that others have articulated already.
- Wes
[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_pa...
On Tue, May 26, 2020 at 2:35 PM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
On Tue, 26 May 2020 at 13:21, Tom Augspurger <tom.augspurger88@gmail.com> wrote:
On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
- We could make the DataFrame construction from a 2D array/matrix kind of "lazy" (or have an option to do it like this): upon construction just store the 2D array as is, and only once you perform an actual operation on it, convert to a columnar store. And that would make it possible to still get the 2D array back with zero-copy, if all you did was passing this DataFrame to the next step of the pipeline.
I think the first option should be fairly easy to do, and should solve a large part of the concerns for scikit-learn (I think?).
I think the first option would solve that use case for scikit-learn. It sounds feasible, but I'm not sure how easy it would be.
A quick, ugly proof-of-concept: https://github.com/pandas-dev/pandas/commit/cf387dced4803b81ec8709eeaf624369...
It allows to create a "DataFrame" from an ndarray without creating a BlockManager, and it allows accessing this original ndarray:
In [1]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3)))
In [2]: df._mgr_data
Out[2]:
(array([[ 1.52971972e-01, -5.69204971e-01,  5.54430115e-01],
        [-1.09916133e+00, -1.16315362e+00, -1.51071081e+00],
        [ 7.05185110e-01, -1.53009348e-03,  1.54260335e+00],
        [-4.60590231e-01, -3.85364427e-01,  1.80760103e+00]]),
 RangeIndex(start=0, stop=4, step=1),
 RangeIndex(start=0, stop=3, step=1))
And once you do something with the dataframe, such as printing or calculating something, the BlockManager gets only created at this step:
In [3]: df
Out[3]: Initializing !!!
          0         1         2
0  0.152972 -0.569205  0.554430
1 -1.099161 -1.163154 -1.510711
2  0.705185 -0.001530  1.542603
3 -0.460590 -0.385364  1.807601
In [4]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3)))
In [5]: df.mean()
Initializing !!!
Out[5]:
0    0.397243
1    0.269996
2   -0.454929
dtype: float64
There are of course many things missing (validation of the input to init_lazy, potentially being able to access df.index/df.columns without initializing the block manager, hooking this up in __array__, what with pickling?, ...)
But just to illustrate the idea. _______________________________________________ Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
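For readers who want to play with the lazy-construction idea outside of pandas, here is a small standalone sketch. It is not the proof-of-concept linked above: `LazyTable` and all of its methods are hypothetical names, and a plain dict of 1D arrays stands in for the columnar store.

```python
import numpy as np


class LazyTable:
    """Hypothetical stand-in for the lazy-construction idea; not pandas API."""

    def __init__(self, values, column_names):
        if values.ndim != 2 or values.shape[1] != len(column_names):
            raise ValueError("values must be 2D with one column per name")
        self._raw = values              # the 2D array, stored as is
        self._names = list(column_names)
        self._columns = None            # columnar store, built on first use

    @property
    def initialized(self):
        return self._columns is not None

    def to_numpy(self):
        """Hand back the original 2D array, zero-copy, if nothing touched it."""
        if self._columns is None:
            return self._raw
        return np.column_stack([self._columns[name] for name in self._names])

    def _materialize(self):
        if self._columns is None:
            # One contiguous 1D array per column (this is where the copy happens).
            self._columns = {
                name: np.ascontiguousarray(self._raw[:, i])
                for i, name in enumerate(self._names)
            }
            self._raw = None
        return self._columns

    def mean(self):
        return {name: float(col.mean()) for name, col in self._materialize().items()}


arr = np.arange(12, dtype="float64").reshape(4, 3)
table = LazyTable(arr, ["a", "b", "c"])

passthrough = table.to_numpy()      # pipeline step that only passes data along
was_lazy = not table.initialized    # nothing has been converted yet
means = table.mean()                # first real operation triggers conversion
```

Passing the object straight through returns the very same ndarray, while the first real operation pays the one-time conversion to 1D columns.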
On Tue, 26 May 2020 at 23:00, Tom Augspurger <tom.augspurger88@gmail.com> wrote:
On Tue, May 26, 2020 at 3:50 PM Brock Mendel <jbrockmendel@gmail.com> wrote:
It allows to create a "DataFrame" from an ndarray without creating a BlockManager, and it allows accessing this original ndarray:
This is a neat proof of concept, but it cuts against the "decreases complexity" argument. Is there a viable way to quantify (even very roughly) the complexity effect of going all-1D?
That complexity is at least localized to a single attribute. That's quite different from the 1D & 2D blocks situation, where many methods (though fewer than a year ago) need to be concerned with whether the array in a block is 1D or 2D, or whether the DataFrame is consolidated, homogenous, ...
I don't think this "lazy _mgr attribute" is comparable in complexity with the consolidated BlockManager. Furthermore: it's targeted to a very specific and limited use case (and eg also doesn't need to be the default, I think).
Now, exactly quantifying the effect of going all-1D, that's of course hard. But just one example: all code that deals with blknos/blklocs (the mapping between the position in the consolidated blocks and the position in the dataframe), which is a significant part of managers.py, could be simplified considerably.
But anyway: I think it is clear that a BlockManager with only 1D arrays/blocks *can* be simpler than one with interleaved/consolidated blocks. But this is also only one of the arguments. Complexity alone is not a reason to not do something; it's the general trade-off with what you gain or lose with it.
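To make the blknos/blklocs bookkeeping concrete, here is a toy sketch in plain Python. It assumes the simplest possible consolidation rule (all columns of one dtype share a single block, numbered in order of first appearance); the function names are invented and the real code in managers.py is considerably more involved.

```python
from collections import defaultdict


def consolidated_layout(dtypes):
    """Return (blknos, blklocs, blocks) for a toy consolidated layout.

    blknos[i]  = which block holds column i
    blklocs[i] = position of column i inside that block
    """
    block_of_dtype = {}          # dtype -> block number
    members = defaultdict(list)  # block number -> column positions stored in it
    blknos, blklocs = [], []
    for col_pos, dtype in enumerate(dtypes):
        blkno = block_of_dtype.setdefault(dtype, len(block_of_dtype))
        blknos.append(blkno)
        blklocs.append(len(members[blkno]))
        members[blkno].append(col_pos)
    return blknos, blklocs, dict(members)


def one_d_layout(dtypes):
    """With one 1D array per column the mapping is trivial."""
    n = len(dtypes)
    return list(range(n)), [0] * n


dtypes = ["float64", "int64", "float64", "object", "int64"]
blknos, blklocs, blocks = consolidated_layout(dtypes)
flat_blknos, flat_blklocs = one_d_layout(dtypes)

print(blknos, blklocs, blocks)
print(flat_blknos, flat_blklocs)
```

In the consolidated layout every column insertion, deletion or reordering has to keep both arrays and the block contents in sync; in the all-1D layout the column position is the only index there is.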
A couple ideas for ways to simplify this decision-making problem:
2) IIUC going all-1D will cause column indexing to always return views. Elsewhere you have noted that this is a breaking API change which merited discussion in its own right. xref #33780 <https://github.com/pandas-dev/pandas/issues/33780>. My takeaway from this part of the last dev call was that people were generally positive on the all-views idea, but were wary of how to handle the potential deprecation.
This type of change would merit a major version bump. If possible, we'd ideally have some kind of option to disable consolidation / enable splitting, which would allow for users to test their code on older versions.
Yes, going to an all-1D-BlockManager would be something for a major version bump, eg pandas 2.0. So I think that is the perfect opportunity to do such a change of making column selections always views.
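As a rough illustration of why a store of 1D arrays makes "column selection is always a view" easy to guarantee, consider the following numpy-only sketch. The dict-of-arrays store and `select_column` are stand-ins invented for this example, and `np.vstack` again plays the role of consolidation.

```python
import numpy as np

# Columnar store: one independent 1D array per column.
store = {"a": np.array([1.0, 2.0, 3.0]), "b": np.array([4.0, 5.0, 6.0])}


def select_column(store, name):
    # No copy: the caller gets the very array held by the store.
    return store[name]


col = select_column(store, "a")
col[0] = 100.0                      # writes through to the store

# Adding a column is a dict insertion; existing columns are not touched.
a_before = store["a"]
store["c"] = np.array([7.0, 8.0, 9.0])
still_same_object = store["a"] is a_before

# By contrast, stacking columns into a 2D block copies the data, so a
# previously selected column no longer refers to the stored values.
block = np.vstack([store["a"], store["b"]])
stale = not np.shares_memory(col, block)
```

With consolidation, whether an earlier selection is still a view depends on whether a consolidation happened in between, which is hard to explain to users.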
I don't think this "lazy _mgr attribute" is comparable in complexity with the consolidated BlockManager
Not on its own, no. But my prior is that this isn't the last thing that will merit its own special case.
I think it is clear that a BlockManager with only 1D arrays/blocks *can* be simpler than one with interleaved/consolidated blocks.
Absolutely agree. I've spent a big chunk of the last year dealing with BlockManager code and have no great love for it.
But this is also only one of the arguments. Complexity alone is not a reason to not do something; it's the general trade-off with what you gain or lose with it.
The main upsides I see are a) internal complexity reduction, b) downstream library upsides, c) clearer view vs copy semantics, d) perf improvements from making fewer copies, e) clear "dict of Series" data model.
The main downside is potential performance degradation (at the extreme end e.g. 3000x <https://github.com/pandas-dev/pandas/issues/24990> for arithmetic). As Wes commented some of that can be ameliorated with compiled code but that cuts against the complexity reduction.
I am looking for ways to quantify these tradeoffs so we can make an informed decision.
On Wed, 27 May 2020 at 23:07, Brock Mendel <jbrockmendel@gmail.com> wrote:
The main downside is potential performance degradation (at the extreme end e.g. 3000x <https://github.com/pandas-dev/pandas/issues/24990> for arithmetic). As Wes commented some of that can be ameliorated with compiled code but that cuts against the complexity reduction.
That number is not correct. That was comparing the block-wise operation to a very inefficient convert-each-column-to-a-series operation. We can optimize this column-wise operation a lot (as I already did on master for some cases), and then a slowdown will still be present in such extreme cases, but *much* less.
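The distinction between "block-wise" and an efficient "column-wise" operation can be sketched with plain numpy. This is not the code path in pandas; it only assumes that the column-wise variant loops over raw 1D arrays directly, without wrapping each column in a Series first.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 1_000, 5
left_block = rng.standard_normal((n_rows, n_cols))
right_block = rng.standard_normal((n_rows, n_cols))

# Block-wise: one vectorized call on the 2D arrays.
blockwise = left_block + right_block

# Column-wise: loop over plain 1D arrays, with no per-column wrapper objects.
left_cols = [np.ascontiguousarray(left_block[:, i]) for i in range(n_cols)]
right_cols = [np.ascontiguousarray(right_block[:, i]) for i in range(n_cols)]
columnwise = [l + r for l, r in zip(left_cols, right_cols)]

# Both strategies compute exactly the same values.
same = all(np.array_equal(columnwise[i], blockwise[:, i]) for i in range(n_cols))
```

The per-column overhead is then one Python-level loop iteration plus one numpy call, which is what remains to be paid in the extreme wide cases.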
I am looking for ways to quantify these tradeoffs so we can make an informed decision.
On Wed, 27 May 2020 at 23:07, Brock Mendel <jbrockmendel@gmail.com> wrote:
I am looking for ways to quantify these tradeoffs so we can make an informed decision.
Can you try to explain a bit more what kind of quantification you are looking for?
- Complexity: I think we agree a non-consolidating block manager *can* be simpler? (and it's not only the internals, also eg the algos become simpler). But I am not sure this can be expressed in a number.
- Clearer view vs copy semantics: this is partly an issue of making pandas easier to understand (both as developer and user), which again seems hard to quantify. And partly an issue of performance / memory usage. This is something that could potentially be measured (eg the memory usage of some typical workflows). But this is probably also something that might only show effect after a refactor / implementation of new semantics.
- Potential performance degradation: here you can measure things, and I actually did that for some cases, see https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c (the notebook that I posted in #10556 <https://github.com/pandas-dev/pandas/issues/10556> a few days ago).
However: 1) a lot depends on what kind of dataframe you take for your benchmarks (number of rows vs number of columns), 2) there are of course a lot of potential operations to test, 3) there will be a set of operations that will always be slower with a columnar dataframe, whatever the optimization, and 4) we would be testing with current pandas, which is often not yet optimized for column-wise operations.
I would be fine with choosing a set of example datasets with example operations, on which we can have some comparisons. My notebook linked above is already something like that (in a limited form), I think. From this set of timings, I personally don't see any insurmountable performance degradations.
But I also deliberately chose a dataframe where n_rows >> n_columns, because I personally would be fine if operations on wide dataframes (n_rows < n_columns) show a slowdown. But that is of course something to discuss / agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we care about a performance degradation?).
Joris
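A comparison across a fixed set of shapes could start from something as small as the harness below. It is a sketch only: it times raw numpy addition rather than pandas operations, the shapes are arbitrary picks with the same number of elements, and `time_add` is a made-up helper. No particular timings are implied.

```python
import timeit

import numpy as np


def time_add(n_rows, n_cols, repeat=3, number=5):
    """Best-of-`repeat` seconds for block-wise and column-wise addition."""
    block = np.ones((n_rows, n_cols))
    cols = [np.ones(n_rows) for _ in range(n_cols)]
    t_block = min(timeit.repeat(lambda: block + block, number=number, repeat=repeat))
    t_cols = min(
        timeit.repeat(lambda: [c + c for c in cols], number=number, repeat=repeat)
    )
    return {"shape": (n_rows, n_cols), "block": t_block, "columns": t_cols}


# Same number of elements, very different rows/columns ratio.
shapes = [(10_000, 10), (100, 1_000), (10, 10_000)]
results = [time_add(n_rows, n_cols) for n_rows, n_cols in shapes]

for res in results:
    ratio = res["columns"] / res["block"]
    print(res["shape"], f"columns/block time ratio = {ratio:.1f}")
```

Agreeing on the list of shapes and operations up front is the part that needs discussion; the measurement itself is cheap to repeat once that list exists.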
Hi Joris, You said:
But I also deliberately choose a dataframe where n_rows >> n_columns, because I personally would be fine if operations on wide dataframes (n_rows < n_columns) show a slowdown. But that is of course something to discuss / agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we care about a performance degradation?).
This is an (the) important use case for us and probably for a lot of use in finance in general. I can easily imagine many other areas that store data for 1000’s of elements (sensors, items, people) on a grid of time scales of minutes or more. (n*1000 x m*1000 data with n, m ~ 10 .. 100)
Why do you think this use case is no longer important?
We already have to drop into numpy on occasion to make the performance sufficient. I would really prefer for Pandas to improve in this area, not slide back.
Have a great weekend, Maarten
Hi Maarten, Thanks a lot for the feedback! On Fri, 29 May 2020 at 20:31, Maarten Ballintijn <maartenb@xs4all.nl> wrote:
Why do you think this use case is no longer important?
To be clear up front: I think wide dataframes are still an important use case.
But to put my comment from above in more context: we had a performance regression reported (#24990 <https://github.com/pandas-dev/pandas/issues/24990>, which Brock referenced in his last mail) which was about a DataFrame with 1 row and 5000 columns. And yes, for *such* a case, I think it will basically be impossible to preserve exact performance, even with a lot of optimizations, compared to storing this as a single, consolidated (1, 5000) array as is done now. And it is for such a case, that I indeed say: I am willing to accept a limited slowdown for this, *if* it at the same time gives us improved memory usage, performance improvements for more common cases, simplified internals making it easier to contribute to and further optimize pandas, etc.
But, I am also quite convinced that, with some optimization effort, we can at least preserve the current performance even for relatively wide dataframes (see eg this notebook <https://gist.github.com/jorisvandenbossche/25f240a221583002720b2edf0886d609> for some quick experiments). And to be clear: doing such optimizations to ensure good performance for a variety of use cases is part of the proposal. Also, I think that having simplified pandas internals should actually also make it easier to further explore ways to specifically optimize the "homogeneous-dtype wide dataframe" use case.
Now, it is always difficult to make such claims in the abstract. So what I personally think would be very valuable, is if you could give some example use cases that you care about (eg a notebook creating some dummy data with similar characteristics as the data you are working with, or using real data if openly available, and a few typical operations you do on those).
Best, Joris
Although 1 x 5000 may sound an edge case, my whole 4 years of research was on 500 x 450000 data. Those usecases are probably more common than we may think.
On Sat, 30 May 2020 at 23:55, Adrin <adrin.jalali@gmail.com> wrote:
Although 1 x 5000 may sound an edge case, my whole 4 years of research was on 500 x 450000 data. Those usecases are probably more common than we may think.
It's still a lower column/rows ratio than 1x5000 ;) (although not that much) (it is this ratio that mostly determines whether the overhead of performing column by column starts to dominate)
But joking aside: yes, that those use cases are more common than I think is quite probable. I never have really used that myself, and therefore again: such feedback is very useful! Also in our user survey from last year, a majority indicated that they occasionally use wide dataframes (although "wide" was described as "100s of columns or more", which is not necessarily that wide).
Now, to reiterate:
- You will still be able to use pandas with wide dataframes, you only might "pay a price" for using a flexible data structure like a dataframe (that allows heterogeneous dtypes, allows inserting columns cheaply, ..) for a use case that might not need that flexibility. And again, with some optimization effort, I think we can keep this "cost" at a minimum.
- It might actually be that a different data model fits your use case better, such as xarray (Adrin, since you are a bit familiar with xarray, would you in hindsight rather have used that for your research?)
- I think that by simplifying the pandas internals, it would actually *become easier* to better support the wide dataframe use case as well. Jeff mentioned it before as the "DataMatrix", also Stephan mentioned it on twitter. If we can simplify the internals, it would become more realistic to have a DataFrame-version that is for example backed by a single ndarray but supports the familiar DataFrame-API (or at least a subset of it without converting to a columnar DataFrame).
On twitter I said "pandas doesn't need to be the best solution for a variety of use cases". But I should probably have said: "pandas *cannot* be the best solution for different use cases *at the same time*". Supporting wide dataframes optimally right now comes at the cost of not supporting heterogeneous dataframes as well as we could. But again, if there appears to be enough interest and there are people who want to contribute to this effort, I think we should investigate how we can actually support both cases (my last point in the above list).
Joris
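What a "DataFrame-version backed by a single ndarray" might look like can be sketched in a few lines. Everything here is hypothetical: `MatrixFrame` is an invented name, it supports only a tiny subset of operations, and it says nothing about how such a class would be integrated into pandas.

```python
import numpy as np


class MatrixFrame:
    """Hypothetical single-ndarray-backed table; the name and API are made up."""

    def __init__(self, values, columns):
        values = np.asarray(values)
        if values.ndim != 2 or values.shape[1] != len(columns):
            raise ValueError("need a 2D array with one column per label")
        self.values = values
        self.columns = list(columns)
        self._pos = {label: i for i, label in enumerate(self.columns)}

    def __getitem__(self, label):
        # Basic indexing, so this is a view on the single backing array.
        return self.values[:, self._pos[label]]

    def row_mean(self):
        # Row-wise reduction is a single call on the 2D array.
        return self.values.mean(axis=1)

    def __add__(self, other):
        if self.columns != other.columns:
            raise ValueError("column labels must match in this sketch")
        return MatrixFrame(self.values + other.values, self.columns)


wide = MatrixFrame(np.arange(8.0).reshape(2, 4), ["w", "x", "y", "z"])
col = wide["y"]
total = wide + wide
row_means = wide.row_mean()
```

The price of this layout is the flip side of the columnar one: a single dtype for all columns, and inserting a column means reallocating the whole array.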
Joris and I accidentally took part of the discussion off-thread. My suggestion boils down to: Let's
1) Identify pieces of this that we want to do regardless of whether we do the rest of it (e.g. consolidate only in internals, view-only indexing on columns).
2) Beef up the asvs to give a closer-to-full measure of the tradeoffs when an eventual proof of concept/PR is made.
On Mon, Jun 1, 2020 at 2:44 AM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
On Sat, 30 May 2020 at 23:55, Adrin <adrin.jalali@gmail.com> wrote:
Although 1 x 5000 may sound an edge case, my whole 4 years of research was on 500 x 450000 data. Those usecases are probably more common than we may think.
It's still a lower column/rows ratio as 1x5000 ;) (although not that much) (it is this ratio that mostly determines whether the overhead of performing column by column starts to dominate)
But joking aside: yes, that those use cases are more common than I think is quite probable. I never have really used that myself, and therefore again: such feedback is very useful! Also in our user survey from last year, a majority indicated that they occasionally use wide dataframes (although "wide" was described as "100s of columns or more", which is not necessarily that wide).
Now, to reiterate:
- You will still be able to use pandas with wide dataframes, you only might "pay a price" for using a flexible data structure like a dataframe (that allows heterogenous dtypes, allows inserting columns cheaply, ..) for a use case that might not need that flexibility. And again, with some optimization effort, I think we can keep this "cost" at a minimum. - It might actually be that a different data model fits your use case better, such as xarray (Adrin, since you are a bit familiar with xarray, would you in hindsight rather have used that for your research?) - I think that by simplifying the pandas internals, it would actually *become easier* to better support the wide dataframe use case as well. Jeff mentioned it before as the "DataMatrix", also Stephan mentioned it on twitter. If we can simplify the internals, it would become more realistic to have a DataFrame-version that is for example backed by a single ndarray but supports the familiar DataFrame-API (or at least a subset of it without converting to a columnar DataFrame).
On twitter I said "pandas doesn't need to be the best solution for a variety of use cases". But I should probably have said: "pandas *cannot* be the best solution for different use case *at the same time*". Supporting wide dataframes optimally right now comes at the cost of not supporting heterogeneous dataframes as good as we could. But again, if there appears to be enough interest and there are people who want to contribute to this effort, I think we should investigate how we can actually support both cases (my last point in the above list).
Joris
On Sat., May 30, 2020, 21:03 Joris Van den Bossche, < jorisvandenbossche@gmail.com> wrote:
Hi Maarten,
Thanks a lot for the feedback!
On Fri, 29 May 2020 at 20:31, Maarten Ballintijn <maartenb@xs4all.nl> wrote:
Hi Joris,
You said:
But I also deliberately chose a dataframe where n_rows >> n_columns, because I personally would be fine if operations on wide dataframes (n_rows < n_columns) show a slowdown. But that is of course something to discuss / agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we care about a performance degradation?).
This is an (the) important use case for us and probably for a lot of use in finance in general. I can easily imagine many other areas that store data for 1000’s of elements (sensors, items, people) on a grid of time scales of minutes or more. (n*1000 x m*1000 data with n, m ~ 10 .. 100)
Why do you think this use case is no longer important?
To be clear up front: I think wide dataframes are still an important use case.
But to put my comment from above in more context: we had a performance regression reported (#24990 <https://github.com/pandas-dev/pandas/issues/24990>, which Brock referenced in his last mail) which was about a DataFrame with 1 row and 5000 columns. And yes, for *such* a case, I think it will basically be impossible to preserve exact performance, even with a lot of optimizations, compared to storing this as a single, consolidated (1, 5000) array as is done now. And it is for such a case, that I indeed say: I am willing to accept a limited slowdown for this, *if* it at the same time gives us improved memory usage, performance improvements for more common cases, simplified internals making it easier to contribute to and further optimize pandas, etc.
But, I am also quite convinced that, with some optimization effort, we can at least preserve the current performance even for relatively wide dataframes (see eg this notebook <https://gist.github.com/jorisvandenbossche/25f240a221583002720b2edf0886d609> for some quick experiments). And to be clear: doing such optimizations to ensure good performance for a variety of use cases is part of the proposal. Also, I think that having simplified pandas internals should actually also make it easier to further explore ways to specifically optimize the "homogeneous-dtype wide dataframe" use case.
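To make the column-by-column overhead discussed above a bit more concrete, here is a minimal NumPy-only sketch. It is not taken from the linked notebook; the shapes and the helper names `sum_block`, `sum_columns` and `compare` are illustrative assumptions. It computes the same per-column sums once on a single 2D array and once on a list of 1D arrays, for a tall and for a very wide shape. Absolute timings depend on the machine, so none are shown here; the point is only that the per-column Python overhead grows with the number of columns.

```python
import timeit

import numpy as np


def sum_block(block):
    """Per-column sums on one consolidated 2D array (a single NumPy call)."""
    return block.sum(axis=0)


def sum_columns(columns):
    """Per-column sums on a list of 1D arrays (one NumPy call per column)."""
    return np.array([col.sum() for col in columns])


def compare(n_rows, n_cols, repeat=20):
    """Return (seconds for 2D block, seconds for 1D columns) for one shape."""
    rng = np.random.default_rng(0)
    block = rng.standard_normal((n_rows, n_cols))
    # Hypothetical "1D block" layout: one contiguous array per column.
    columns = [np.ascontiguousarray(block[:, i]) for i in range(n_cols)]

    # Both layouts must give the same answer before timing means anything.
    assert np.allclose(sum_block(block), sum_columns(columns))

    t_block = timeit.timeit(lambda: sum_block(block), number=repeat)
    t_columns = timeit.timeit(lambda: sum_columns(columns), number=repeat)
    return t_block, t_columns


if __name__ == "__main__":
    for shape in [(100_000, 10), (1, 5000)]:
        t_block, t_columns = compare(*shape)
        print(f"shape={shape}: 2D block {t_block:.4f}s, 1D columns {t_columns:.4f}s")
```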
Now, it is always difficult to make such claims in the abstract. So what I personally think would be very valuable, is if you could give some example use cases that you care about (eg a notebook creating some dummy data with similar characteristics as the data you are working with (or using real data, if openly available), and a few typical operations you do on those).
Best, Joris
We already have to drop into numpy on occasion to make the performance sufficient. I would really prefer for Pandas to improve in this area, not slide back.
Have a great weekend, Maarten
_______________________________________________
Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
+1 on Brock's suggestions here. Currently -1 on moving to add a lazy block manager; I see this as simply increasing complexity.
On Jun 1, 2020, at 2:07 PM, Brock Mendel <jbrockmendel@gmail.com> wrote:
Joris and I accidentally took part of the discussion off-thread. My suggestion boils down to: Let's
1) Identify pieces of this that we want to do regardless of whether we do the rest of it (e.g. consolidate only in internals, view-only indexing on columns).
2) Beef up the asvs to give a closer-to-full measure of the tradeoffs when an eventual proof of concept/PR is made.
On Mon, Jun 1, 2020 at 2:44 AM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
On Sat, 30 May 2020 at 23:55, Adrin <adrin.jalali@gmail.com> wrote: Although 1 x 5000 may sound an edge case, my whole 4 years of research was on 500 x 450000 data. Those usecases are probably more common than we may think.
It's still a lower column/rows ratio than 1x5000 ;) (although not by that much) (it is this ratio that mostly determines whether the overhead of performing operations column by column starts to dominate)
On Mon, 1 Jun 2020 at 20:07, Brock Mendel <jbrockmendel@gmail.com> wrote:
Joris and I accidentally took part of the discussion off-thread. My suggestion boils down to: Let's
1) Identify pieces of this that we want to do regardless of whether we do the rest of it (e.g. consolidate only in internals, view-only indexing on columns).
Personally I am not sure it is worth trying to change consolidation policies (moving to internals is certainly fine of course, but I mean eg delaying) or copy/view semantics for the *current*, consolidated BlockManager.
But there are certainly pieces in the internals that can be changed which are useful regardless. I opened https://github.com/pandas-dev/pandas/issues/34669 to have a more concrete discussion about this on github.
2) Beef up the asvs to give a closer-to-full measure of the tradeoffs when an eventual proof of concept/PR is made.
We probably won't have a "one big PR" that is going to implement a simplified block manager, so it's not really clear to me how ASV will help with making a decision on this? (it will for sure be very useful *along the way* to keep track of where we need to optimize things to preserve performance)
Joris
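One way the ASV suite mentioned above could track the tall-versus-wide tradeoff is sketched below. It follows the usual asv conventions (a class with `params`, `param_names`, `setup` and `time_*` methods) and parametrizes over the frame shape, so that tall and wide frames are measured side by side. The class name, the chosen shapes and the selected operations are illustrative assumptions, not existing pandas benchmarks.

```python
import numpy as np
import pandas as pd


class FrameShapeOps:
    """asv-style benchmark: the same operations on tall and wide frames."""

    # (n_rows, n_cols) pairs, from tall to very wide; values are assumptions.
    params = [[(100_000, 10), (1_000, 1_000), (1, 5_000)]]
    param_names = ["shape"]

    def setup(self, shape):
        n_rows, n_cols = shape
        rng = np.random.default_rng(0)
        self.df = pd.DataFrame(rng.standard_normal((n_rows, n_cols)))
        self.other = self.df.copy()
        self.new_col = np.zeros(n_rows)

    def time_frame_add(self, shape):
        # Frame/frame arithmetic.
        self.df + self.other

    def time_column_sum(self, shape):
        # Column-wise reduction.
        self.df.sum()

    def time_add_column(self, shape):
        # Add a column to a copy, so the state built in setup is not mutated.
        # Note that the copy itself is included in the measured time.
        df = self.df.copy()
        df["new"] = self.new_col
```

Keeping the 1 x 5000 shape in the parameter list makes the edge case from #24990 visible next to the more common tall shapes instead of treating it in isolation.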
We discussed this on the call yesterday ( https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoB... ). I'll attempt a summary for the mailing list, and a proposed course of action.
In general, there was agreement with the goal of simplifying pandas' internals, and making DataFrame a column-store seems to be the best way to achieve that. The primary arguments against were implementation costs and possible performance slowdowns for very short and wide dataframes.
It was generally agreed that the change will need to be toggleable, perhaps by a parameter to the DataFrame constructor and a global option. This will make it easier to implement the new behavior and test it against existing behavior, both for us developers and users.
We are keeping in mind the scikit-learn style usecase of boxing and unboxing a (homogeneous) array in a DataFrame. We're committed to keeping that 0-copy and avoiding creating one Python object per column.
Does this summary accurately capture the discussion?
---
Going forward, there are many pieces that can be done, some in parallel. Let's keep that discussion on concrete details in https://github.com/pandas-dev/pandas/issues/34669.
I do want to highlight one overlapping area though. We have some PRs up (most from Brock) that affect consolidation today. Mostly disabling consolidation in specific places. (e.g. https://github.com/pandas-dev/pandas/pull/34683). My question: do we want to continue pursuing reduced consolidation *in the current block manager*?
IMO, that's a tricky question to answer. The performance implications of consolidation are hard to assess, in part because it's so workload-dependent. Sometimes, it's completely avoided so it's a win. Other times, it's merely delayed until an operation that needs consolidated blocks, and so is a wash. And given
1. The unclear impact changing consolidation has on views vs. copies, and our unclear *policy* on when things are views vs. copies
2. The real possibility of a non-consolidating, all-1D "Block" manager in the next year or two
3. The unclear extent to which non-consolidated data is tested by our unit tests.
Certainly, fixing bugs is a worthy goal on its own. So to the extent that (non)consolidation causes buggy behavior we'll want to fix that. But overall, I think the project's efforts would be better focused elsewhere (ideally on progressing to the all 1-D block manager, but wherever we think is highest-value).
Do others have thoughts on what changes should be made to the "pandas 1.x BlockManager" while we work towards the "2.x BlockManager"?
- Tom
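For readers less familiar with the term, consolidation as discussed above can be pictured with plain NumPy. This is a simplified sketch under the assumption that a "block" is just an array; `ColumnStore` and its methods are hypothetical names, not pandas internals. Appending a column to a list of 1D arrays copies nothing, while consolidating them into one 2D array copies every column, which is the cost that is either avoided or merely postponed.

```python
import numpy as np


class ColumnStore:
    """Toy non-consolidating store: one 1D array per column (hypothetical)."""

    def __init__(self):
        self.columns = []

    def add_column(self, values):
        # No copy for an existing ndarray: only a reference is appended.
        self.columns.append(np.asarray(values))

    def consolidate(self):
        # Builds a new (n_rows, n_cols) array; every column is copied once.
        return np.column_stack(self.columns)


store = ColumnStore()
a = np.arange(3.0)
b = np.arange(3.0) * 10
store.add_column(a)
store.add_column(b)

# The stored column is the very same object that was passed in.
assert store.columns[0] is a

block = store.consolidate()
assert block.shape == (3, 2)
# The consolidated block owns fresh memory, so the copy has been paid here.
assert not np.shares_memory(block, a)
```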
Does this summary accurately capture the discussion?
Not quite.
there was agreement with the goal of simplifying pandas' internals,
Yes.
and making DataFrame a column-store seems to be the best way to achieve that.
No.
We will not know this until we see an implementation. Nor will we know the performance impact. My expectation is that the performance impact will lead to a bunch of workarounds that cut against the simplification.
I strongly object to committing to this before having this information.
---
I have tried to avoid bringing up 2D EAs in this conversation, but the term "best way" requires a discussion of alternatives.
Allowing 2D EAs will allow for a large fraction of the same simplifications (grep for "TODO(EA2D)"), and will _improve_ performance (in eg reshape, arithmetic operations) instead of hurting it. It means removing workarounds rather than adding new ones.
It also allows for an incremental upgrade path: opt-in for 1.X, then if we like it, required for 2.X.
----
Going forward, there are many pieces that can be done, some in parallel
Related to but not identical to consolidation is the views vs copies on column indexing, GH#33780 <https://github.com/pandas-dev/pandas/issues/33780>, discussed on the previous call without a solid conclusion. The FUD largely boiled down to "some users could be relying on the current behavior and there isn't a nice way to deprecate it". On further reflection, this seems like an impossible standard to meet for _any_ change in not-tested/not-documented behavior. We should move to having column indexing be copy-free.
On Thu, Jun 11, 2020 at 10:51 AM Brock Mendel <jbrockmendel@gmail.com> wrote:
Does this summary accurately capture the discussion?
Not quite.
there was agreement with the goal of simplifying pandas' internals,
Yes.
and making DataFrame a column-store seems to be the best way to achieve that.
No.
We will not know this until we see an implementation. Nor will we know the performance impact. My expectation is that the performance impact will lead to a bunch of workarounds that cut against the simplification.
I strongly object to committing to this before having this information.
It'd be good to clarify exactly what you object to committing to. Changing the Block Manager is a large task, made especially difficult by us being an open-source project with many stake-holders and limited funding. I think that we as a project can say "We as a project think that making DataFrame a column store is best", while still acknowledging that it's an uncertain goal that may be abandoned if it turns out to be a bad idea.
So to make sure: You're objecting to a column-store in principle, or you're objecting to the project saying we think it's a good idea, or...?
--- I have tried to avoid bringing up 2D EAs in this conversation, but the term "best way" requires a discussion of alternatives.
Allowing 2D EAs will allow for a large fraction of the same simplifications (grep for "TODO(EA2D)"), and will _improve_ performance (in eg reshape, arithmetic operations) instead of hurting it. It means removing workarounds rather than adding new ones.
It also allows for an incremental upgrade path: opt-in for 1.X, then if we like it, required for 2.X.
Will have thoughts on this later.
----
Going forward, there are many pieces that can be done, some in parallel
Related to but not identical to consolidation is the views vs copies on column indexing, GH#33780 <https://github.com/pandas-dev/pandas/issues/33780>, discussed on the previous call without a solid conclusion. The FUD largely boiled down to "some users could be relying on the current behavior and there isn't a nice way to deprecate it". On further reflection, this seems like an impossible standard to meet for _any_ change in not-tested/not-documented behavior. We should move to having column indexing be copy-free.
I think I disagree with that, at least to a degree. But it's primarily about views vs. copies so I'll take it to https://github.com/pandas-dev/pandas/issues/33780.
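The views-versus-copies distinction can be shown at the NumPy level, independent of what pandas currently does or will decide to do (the variable names are illustrative). Selecting one column of a 2D array with basic indexing returns a view that shares memory with the parent, whereas selecting with a list of positions returns a copy.

```python
import numpy as np

block = np.arange(12.0).reshape(4, 3)

# Basic (slice/integer) indexing: a view, no data is copied.
col_view = block[:, 1]
assert np.shares_memory(col_view, block)

# Writing through the view is visible in the parent array.
col_view[0] = -1.0
assert block[0, 1] == -1.0

# Advanced (list) indexing: a copy, independent of the parent.
col_copy = block[:, [1]]
assert not np.shares_memory(col_copy, block)
col_copy[0, 0] = 99.0
assert block[0, 1] == -1.0

# Splitting a 2D array into per-column views is also copy-free, but it
# creates one Python object per column.
columns = [block[:, i] for i in range(block.shape[1])]
assert all(np.shares_memory(c, block) for c in columns)
```

The last step is the tension behind the scikit-learn style use case: a column store built from views stays zero-copy, but the per-column objects are exactly what a very wide frame would have to pay for.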
So to make sure: You're objecting to a column-store in principle, or you're objecting to the project saying we think it's a good idea, or...?
Not at all. I look forward to seeing an implementation so that we can actually make an informed decision as to whether or not we want to use it. I object to a) declaring ex-ante that we intend to replace the existing BlockManager with it and b) effectively declaring a moratorium on improvements to the existing code. On Thu, Jun 11, 2020 at 9:01 AM Tom Augspurger <tom.augspurger88@gmail.com> wrote:
On Thu, Jun 11, 2020 at 10:51 AM Brock Mendel <jbrockmendel@gmail.com> wrote:
Does this summary accurately capture the discussion?
Not quite.
there was agreement with the goal of simplifying pandas' internals,
Yes.
and making DataFrame a column-store seems to be the best way to achieve that.
No.
We will not know this until we see an implementation. Nor will we know the performance impact. My expectation is that the performance impact will lead to a bunch of workarounds that cut against the simplification.
I strongly object to committing to this before having this information.
It'd be good to clarify exactly what you object to committing to. Changing the Block Manager is a large task, made especially difficult by us being an open-source project with many stake-holders and limited funding. I think that we as a project can say "We as a project think that making DataFrame a column store is best", while still acknowledging that it's an uncertain goal that may be abandoned if it turns out to be a bad idea.
So to make sure: You're objecting to a column-store in principle, or you're objecting to the project saying we think it's a good idea, or...?
--- I have tried to avoid bringing up 2D EAs in this conversation, but the term "best way" requires a discussion of alternatives.
Allowing 2D EAs will allow for a large fraction of the same simplifications (grep for "TODO(EA2D)"), and will _improve_ performance (in eg reshape, arithmetic operations) instead of hurting it. It means removing workarounds rather than adding new ones.
It also allows for an incremental upgrade path: opt-in for 1.X, then if we like it, required for 2.X.
Will have thoughts on this later.
----
Going forward, there are many pieces that can be done, some in parallel
Related to but not identical to consolidation is the views vs copies on column indexing, GH#33780 <https://github.com/pandas-dev/pandas/issues/33780>, discussed on the previous call without a solid conclusion. The FUD largely boiled down to "some users could be relying on the current behavior and there isnt a nice way to deprecate it". On further reflection, this seems like an impossible standard to meet for _any_ change in not-tested/not-documented behavior. We should move to having column indexing being copy-free.
I think I disagree with that, at least to a degree. But it's primarily about views vs. copies so I'll take it to https://github.com/pandas-dev/pandas/issues/33780.
On Thu, Jun 11, 2020 at 7:56 AM Tom Augspurger < tom.augspurger88@gmail.com> wrote:
We discussed this on the call yesterday ( https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoB... ). I'll attempt a summary for the mailing list, and a proposed course of action.
In general, there was agreement with the goal of simplifying pandas' internals, and making DataFrame a column-store seems to be the best way to achieve that. The primary arguments against were implementation costs and possible performance slowdowns for very short and wide dataframes.
It was generally agreed that the change will need to be toggleable, perhaps by a parameter to the DataFrame constructor and a global option. This will make it easier to implement the new behavior and test it against existing behavior, both for us developers and users.
We are keeping in mind the scikit-learn style usecase of boxing and unboxing a (homogenous) array in a DataFrame. We're committed to keeping that 0-copy and avoiding creating one Python object per column.
Does this summary accurately capture the discussion?
---
Going forward, there are many pieces that can be done, some in parallel. Let's keep that discussion on concrete details in https://github.com/pandas-dev/pandas/issues/34669.
I do want to highlight one overlapping area though. We have some PRs up (most from Brock) that affect consolidation today. Mostly disabling consolidation in specific places. (e.g. https://github.com/pandas-dev/pandas/pull/34683). My question: do we want to continue pursuing reduced consolidation *in the current block manager*?
IMO, that's a tricky question to answer. The performance implications of consolidation are hard, in part because it's so workload-dependent. Sometimes, it's completely avoided so it's a win. Other times, it's merely delayed until an operation that needs consolidated blocks, and so is a wash. And given
1. The unclear impact changing consolidation has on views vs. copies, and our unclear *policy* on when things are views vs. copies
2. The real possibility of a non-consolidating, all-1D "Block" manager in the next year or two
3. The unclear extent to which non-consolidated data is tested by our unit tests.
Certainly, fixing bugs is a worthy goal on its own. So to the extent where (non)consolidation causes buggy behavior we'll want to fix that. But overall, I think the project's efforts would be better focused elsewhere (ideally on progressing to the all 1-D block manager, but wherever we think is highest-value).
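The "avoided versus delayed" distinction above can be illustrated with a toy model. This is only a sketch under simplifying assumptions (a single float dtype, consolidation modelled as `np.column_stack`); `ToyManager` and its methods are hypothetical and do not mirror the actual BlockManager.

```python
import numpy as np


class ToyManager:
    """Hypothetical store that consolidates same-dtype columns lazily."""

    def __init__(self):
        self.columns = []          # list of 1D arrays, one per column
        self.n_consolidations = 0  # how many times we paid for a copy
        self._block = None         # cached 2D block, if any

    def add_column(self, values):
        # Cheap: append a reference and invalidate the cached block.
        self.columns.append(np.asarray(values, dtype="float64"))
        self._block = None

    def column_sum(self, i):
        # Column-wise operation: never needs a consolidated block.
        return self.columns[i].sum()

    def row_sums(self):
        # Row-wise operation: consolidate (copy into 2D) on first use.
        if self._block is None:
            self._block = np.column_stack(self.columns)
            self.n_consolidations += 1
        return self._block.sum(axis=1)


mgr = ToyManager()
for k in range(3):
    mgr.add_column([k, k + 1.0])

mgr.column_sum(0)              # consolidation avoided entirely
avoided = mgr.n_consolidations
mgr.row_sums()                 # consolidation merely delayed until here
delayed = mgr.n_consolidations
```

A workload made only of column-wise calls never pays for the copy, while one that ends in a row-wise call pays the same cost later, which is why the net effect depends on the workload.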
Do others have thoughts on what changes should be made to the "pandas 1.x BlockManager" while we work towards the "2.x BlockManager"?
- Tom
On Tue, Jun 9, 2020 at 10:46 AM Joris Van den Bossche < jorisvandenbossche@gmail.com> wrote:
On Mon, 1 Jun 2020 at 20:07, Brock Mendel <jbrockmendel@gmail.com> wrote:
Joris and I accidentally took part of the discussion off-thread. My suggestion boils down to: Let's
1) Identify pieces of this that we want to do regardless of whether we do the rest of it (e.g. consolidate only in internals, view-only indexing on columns).
Personally I am not sure it is worth trying to change consolidation policies (moving to internals is certainly fine of course, but I mean eg delaying) or copy/view semantics for the *current*, consolidated BlockManager.
But there are certainly pieces in the internals that can be changed which are useful regardless. I opened https://github.com/pandas-dev/pandas/issues/34669 to have a more concrete discussion about this on github.
2) Beef up the asvs to give a closer-to-full measure of the tradeoffs when an eventual proof of concept/PR is made.
We probably won't have a "one big PR" that is going to implement a simplified block manager, so it's not really clear to me how ASV will help with making a decision on this? (it will for sure be very useful *along the way* to keep track of where we need to optimize things to preserve performance)
Joris
_______________________________________________ Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
On Thu, 11 Jun 2020 at 18:29, Brock Mendel <jbrockmendel@gmail.com> wrote:
So to make sure: You're objecting to a column-store in principle, or you're objecting to the project saying we think it's a good idea, or...?
Not at all. I look forward to seeing an implementation so that we can actually make an informed decision as to whether or not we want to use it. I object to a) declaring ex-ante that we intend to replace the existing BlockManager with it and b) effectively declaring a moratorium on improvements to the existing code.
We actually *have* prototypes: the prototype of the split-policy discussed in GH-10556 <https://github.com/pandas-dev/pandas/issues/10556#issuecomment-633703160> and for which I made a notebook benchmarking a few common operations as mentioned in my initial post (notebook <https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c>), and the prototype of using all integer extension arrays (and float with my PR). In the linked notebook, I show that, for the given dataframe, a set of common operations are not slower, or if slower, that there is a clear path towards optimizing this. I welcome a critical evaluation of this notebook. And since those are based on the current BlockManager, any version of a BlockManager specifically tailored to storing the columns separately, will only do a better job.
I think that based on those prototypes we already can make an informed decision right now (or with some additional benchmarks based on those prototypes). For sure, if it turns out that we were wrong, we can later again abandon the idea, but I think we can already be *confident* that there is a high probability it will work out.
Also, if performance is in the end the decisive criterion, I repeat my earlier remark in this thread: we need to be clearer about what we want / expect. Because with benchmarks you can prove anything you want, depending on what you choose to benchmark. So: what size of dataframe, which set of operations, .. do we care about?
On Thu, Jun 11, 2020 at 10:51 AM Brock Mendel <jbrockmendel@gmail.com> wrote:
--- I have tried to avoid bringing up 2D EAs in this conversation, but the term "best way" requires a discussion of alternatives.
Allowing 2D EAs will allow for a large fraction of the same simplifications (grep for "TODO(EA2D)"), and will _improve_ performance (in eg reshape, arithmetic operations) instead of hurting it. It means removing workarounds rather than adding new ones.
It also allows for an incremental upgrade path: opt-in for 1.X, then if we like it, required for 2.X.
In my original mail, I explicitly didn't mention 1D vs 2D extension arrays, but rather 1D vs 2D *blocks*. As for me, that is the core of the proposal. It is this column-store that will give additional simplifications by not having to care about 2D blocks (on top of getting rid of 1D/2D mixture, which could in itself also be solved by all 2D blocks), that will make it possible to get clearer copy/view semantics, that will make it easier to look into other improvements (like copy-on-write to avoid many copies in pandas operations, lazy selection filters, ..). So if we decide in the end to keep the consolidating blockmanager with 2D blocks, we certainly should consider 2D extension arrays, I fully agree on that. And indeed, the consolidating block manager is the alternative to consider.
But:
- "It means removing workarounds rather than adding new ones." -> whether we go all 1D or all 2D, we will initially need to keep workarounds for the other option in both cases, anyway, that's no different for all 1D or all 2D. But I am convinced that after we can remove the workarounds (eg in 2.0), the end result will be simpler in the all 1D case.
- "It also allows for an incremental upgrade path: opt-in for 1.X, then if we like it, required for 2.X." -> we can perfectly provide an opt-in, incremental upgrade path for the all 1D case as well, I don't see why that would be different.
Joris
Joris, Thanks very much for your reply. I can’t provide exact data or code, but I’ll try to come up with a sample of simulated data and operations that relatively closely matches our use cases. Cheers, Maarten
On May 30, 2020, at 3:03 PM, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
Hi Maarten,
Thanks a lot for the feedback!
On Fri, 29 May 2020 at 20:31, Maarten Ballintijn <maartenb@xs4all.nl <mailto:maartenb@xs4all.nl>> wrote:
Hi Joris,
You said:
But I also deliberately chose a dataframe where n_rows >> n_columns, because I personally would be fine if operations on wide dataframes (n_rows < n_columns) show a slowdown. But that is of course something to discuss / agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we care about a performance degradation?).
This is an (the) important use case for us and probably for a lot of use in finance in general. I can easily imagine many other areas that store data for 1000’s of elements (sensors, items, people) on a grid of time scales of minutes or more. (n*1000 x m*1000 data with n, m ~ 10 .. 100)
Why do you think this use case is no longer important?
To be clear up front: I think wide dataframes are still an important use case.
But to put my comment from above in more context: we had a performance regression reported (#24990 <https://github.com/pandas-dev/pandas/issues/24990>, which Brock referenced in his last mail) which was about a DataFrame with 1 row and 5000 columns. And yes, for such a case, I think it will basically be impossible to preserve exact performance, even with a lot of optimizations, compared to storing this as a single, consolidated (1, 5000) array as is done now. And it is for such a case, that I indeed say: I am willing to accept a limited slowdown for this, if it at the same time gives us improved memory usage, performance improvements for more common cases, simplified internals making it easier to contribute to and further optimize pandas, etc.
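The cost in the 1-row, 5000-column case comes largely from per-array overhead, which a quick NumPy-only measurement can make visible. This is a sketch, not a pandas benchmark: the shape is taken from the report mentioned above, while the function names and the choice of operation (`+ 1`) are illustrative assumptions, and timings will vary by machine.

```python
import timeit

import numpy as np

n_rows, n_cols = 1, 5000

# Consolidated layout: one (1, 5000) block.
block = np.ones((n_rows, n_cols))
# Column-store layout: 5000 separate 1D arrays of length 1.
columns = [np.ones(n_rows) for _ in range(n_cols)]


def add_one_block():
    # A single vectorized call over the whole block.
    return block + 1


def add_one_columns():
    # One small NumPy call per column.
    return [col + 1 for col in columns]


t_block = timeit.timeit(add_one_block, number=20)
t_columns = timeit.timeit(add_one_columns, number=20)
print(f"block: {t_block:.4f}s  columns: {t_columns:.4f}s")
```

Changing `n_rows` and `n_cols` (for example to many rows and few columns) is an easy way to see how strongly the comparison depends on the shape chosen.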
But, I am also quite convinced that, with some optimization effort, we can at least preserve the current performance even for relatively wide dataframes (see eg this notebook <https://gist.github.com/jorisvandenbossche/25f240a221583002720b2edf0886d609> for some quick experiments). And to be clear: doing such optimizations to ensure good performance for a variety of use cases is part of the proposal. Also, I think that having simplified pandas internals should actually also make it easier to further explore ways to specifically optimize the "homogeneous-dtype wide dataframe" use case.
Now, it is always difficult to make such claims in the abstract. So what I personally think would be very valuable, is if you could give some example use cases that you care about (eg a notebook creating some dummy data with similar characteristics as the data you are working with, or using real data if openly available, and a few typical operations you do on those).
Best, Joris
We already have to drop into numpy on occasion to make the performance sufficient. I would really prefer for Pandas to improve in this area, not slide back.
Have a great weekend, Maarten
participants (8)
- Adrin
- Brock Mendel
- Jeff Reback
- Joris Van den Bossche
- Maarten Ballintijn
- Tom Augspurger
- Uwe L. Korn
- Wes McKinney