[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Tue May 26 15:48:39 EDT 2020

Something to add here (in favor of removing the BM) -- and apologies
if it's already mentioned in a different form:

It is very, very difficult for third party code to construct
heterogeneously-typed DataFrames without triggering a memory doubling.
To give you an example of what I mean, in Apache Arrow, we painstakingly
implemented block consolidation in C++ [1] so that we can construct a
DataFrame that won't suddenly double memory the first time that a user
interacts with it. So the possibility of users having an OOM on their
first interaction with an object they created is not great. If
avoiding it for library developers were easy then perhaps it would be
less of an issue, but avoiding the doubling requires advanced
knowledge of pandas's internals.
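
To make the doubling concrete, here is a rough NumPy-only sketch. It
does not touch pandas internals, and it is not what the Arrow code
does line for line; the variable names and sizes are made up for
illustration. It only shows that stacking separately allocated
same-dtype columns into one 2D block needs a second full copy, while
writing into a preallocated block and handing out views does not.

```python
import numpy as np

n_rows = 1_000  # illustrative size only

# Three same-dtype columns, each allocated on its own, as a third-party
# producer would naturally do.
columns = [np.arange(n_rows, dtype="float64") + i for i in range(3)]
separate_bytes = sum(col.nbytes for col in columns)

# "Consolidation" in this sketch: copy the columns into a single 2D block
# of shape (n_columns, n_rows). np.vstack allocates a fresh array.
block = np.vstack(columns)

# The block shares no memory with the inputs, so while both are alive the
# data is held twice.
shares = [np.shares_memory(block, col) for col in columns]
peak_bytes = separate_bytes + block.nbytes

# Avoiding the copy: allocate the 2D block first and fill it in place,
# then expose each column as a view into the block.
prealloc = np.empty((3, n_rows), dtype="float64")
for i in range(3):
    prealloc[i, :] = np.arange(n_rows) + i
views = [prealloc[i] for i in range(3)]
views_share = all(np.shares_memory(prealloc, v) for v in views)
```

The second half is the part that needs knowledge of the block layout
up front, which is the burden on library developers I'm describing.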

Looking back 9-10 years, the primary motivations I had for creating
the BlockManager in the first place don't persuade me anymore:

* Back then, pandas's success was still very much coupled to vectorized
operations on wide row-major data (e.g. as present in certain sectors
of the financial industry). I don't think this represents the majority
of pandas users now.
* In 2011 I was uncomfortable writing significant compiled code. Many
of the performance issues that the BM tried to ameliorate are
non-issues if you're OK writing non-trivial C/C++ code to deal with
row-level interactions. Even if there were a 50% performance
regression on some of these operations that are faster with 2D blocks
because of row-major vs. column-major memory layout, that still seems
worth it for the vast code simplification and the
memory-use-predictability benefits that others have articulated
already.
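
A small NumPy sketch of the layout point, under the assumption of a
C-ordered 2D array with one observation per row (the sizes and names
are illustrative, and this is not a benchmark): with a 2D block a row
is one contiguous view, while with separate 1D columns the same row
has to be gathered from every column array.

```python
import numpy as np

n_rows, n_cols = 4, 1_000  # "wide" data: many columns per row
table = np.arange(n_rows * n_cols, dtype="float64").reshape(n_rows, n_cols)

# 2D layout (C order): one row is a single contiguous slice, and NumPy
# returns it as a view without copying.
row_2d = table[2]

# 1D-columns layout: each column lives in its own array ...
columns = [table[:, j].copy() for j in range(n_cols)]
# ... so the same row must be gathered element by element into a new array.
row_1d = np.array([col[2] for col in columns])
```

The gather in the last line is the kind of row-level interaction that
is slow in pure Python and that compiled code can handle instead.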

- Wes

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_pandas.cc

On Tue, May 26, 2020 at 2:35 PM Joris Van den Bossche
<jorisvandenbossche at gmail.com> wrote:
>
> On Tue, 26 May 2020 at 13:21, Tom Augspurger <tom.augspurger88 at gmail.com> wrote:
>>
>>
>> On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:
>>>
>>> - We could make the DataFrame construction from a 2D array/matrix kind of "lazy" (or have an option to do it like this): upon construction just store the 2D array as is, and only once you perform an actual operation on it, convert to a columnar store. And that would make it possible to still get the 2D array back with zero-copy, if all you did was passing this DataFrame to the next step of the pipeline.
>>>
>>> I think the first option should be fairly easy to do, and should solve a large part of the concerns for scikit-learn (I think?).
>>
>>
>> I think the first option would solve that use case for scikit-learn. It sounds feasible, but I'm not sure how easy it would be.
>>
>
> A quick, ugly proof-of-concept: https://github.com/pandas-dev/pandas/commit/cf387dced4803b81ec8709eeaf624369abca1188
>
> It allows creating a "DataFrame" from an ndarray without creating a BlockManager, and it allows accessing the original ndarray:
>
> In [1]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3)))
>
> In [2]: df._mgr_data
> Out[2]:
> (array([[ 1.52971972e-01, -5.69204971e-01,  5.54430115e-01],
>         [-1.09916133e+00, -1.16315362e+00, -1.51071081e+00],
>         [ 7.05185110e-01, -1.53009348e-03,  1.54260335e+00],
>         [-4.60590231e-01, -3.85364427e-01,  1.80760103e+00]]),
>  RangeIndex(start=0, stop=4, step=1),
>  RangeIndex(start=0, stop=3, step=1))
>
> And once you do something with the dataframe, such as printing it or calculating something, the BlockManager only gets created at that point:
>
> In [3]: df
> Out[3]: Initializing !!!
>
>           0         1         2
> 0  0.152972 -0.569205  0.554430
> 1 -1.099161 -1.163154 -1.510711
> 2  0.705185 -0.001530  1.542603
> 3 -0.460590 -0.385364  1.807601
>
> In [4]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3)))
>
> In [5]: df.mean()
> Initializing !!!
> Out[5]:
> 0    0.397243
> 1    0.269996
> 2   -0.454929
> dtype: float64
>
> There are of course many things missing (validation of the input to _init_lazy, potentially being able to access df.index/df.columns without initializing the block manager, hooking this up in __array__, what to do about pickling, ...)
> But just to illustrate the idea.
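
The lazy-construction idea quoted above can also be sketched without
any pandas internals. This is not the code from the linked commit:
the class LazyFrame and its attributes are hypothetical names, and a
dict of 1D arrays stands in for the columnar store. The sketch keeps
the 2D array as is, returns it zero-copy as long as nothing has
operated on the frame, and converts to columns on the first operation.

```python
import numpy as np


class LazyFrame:
    """Hypothetical stand-in for a lazily initialized DataFrame (not pandas API)."""

    def __init__(self, values, index, columns):
        # Minimal input validation: the array must match both axes.
        if values.shape != (len(index), len(columns)):
            raise ValueError("shape of values does not match the axes")
        self._raw = values      # original 2D array, stored untouched
        self.index = index      # axes are available without materializing
        self.columns = columns
        self._store = None      # columnar store, built on first operation

    def _materialize(self):
        # Convert to one 1D array per column the first time it is needed.
        if self._store is None:
            self._store = {
                name: self._raw[:, j].copy()
                for j, name in enumerate(self.columns)
            }
            self._raw = None
        return self._store

    def to_numpy(self):
        # Zero-copy as long as no operation has touched the frame.
        if self._raw is not None:
            return self._raw
        return np.column_stack([self._store[c] for c in self.columns])

    def mean(self):
        # Any real operation triggers the conversion to columns.
        store = self._materialize()
        return {name: float(col.mean()) for name, col in store.items()}


arr = np.arange(12.0).reshape(4, 3)
lf = LazyFrame(arr, range(4), range(3))
same_object = lf.to_numpy() is arr   # True before any operation
means = lf.mean()                    # forces the columnar conversion
```

After the call to mean(), to_numpy() has to rebuild a 2D array, so the
zero-copy round trip only holds for frames that are passed along
unmodified, which is the pipeline case described in the quoted text.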