[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Joris Van den Bossche jorisvandenbossche at gmail.com
Sat May 30 15:03:41 EDT 2020


Hi Maarten,

Thanks a lot for the feedback!

On Fri, 29 May 2020 at 20:31, Maarten Ballintijn <maartenb at xs4all.nl> wrote:

>
> Hi Joris,
>
> You said:
>
> But I also deliberately chose a dataframe where n_rows >> n_columns,
> because I personally would be fine if operations on wide dataframes (n_rows
> < n_columns) show a slowdown. But that is of course something to discuss /
> agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we
> care about a performance degradation?).
>
>
> This is an (the) important use case for us, and probably for a lot of
> users in finance in general. I can easily imagine many other areas
> where data for 1000s of elements (sensors, items, people) is stored on
> a grid of time scales of minutes or more
> (n*1000 x m*1000 data with n, m ~ 10 .. 100).
>
> Why do you think this use case is no longer important?
>

To be clear up front: I think wide dataframes are still an important use
case.

But to put my comment from above in more context: we had a performance
regression reported (#24990
<https://github.com/pandas-dev/pandas/issues/24990>, which Brock referenced
in his last mail) which was about a DataFrame with 1 row and 5000 columns.
And yes, for *such* a case, I think it will basically be impossible to
preserve exact performance, even with a lot of optimizations, compared to
storing this as a single, consolidated (1, 5000) array as is done now. And
it is for such a case that I indeed say: I am willing to accept a limited
slowdown for this, *if* it at the same time gives us improved memory usage,
performance improvements for more common cases, simplified internals making
it easier to contribute to and further optimize pandas, etc.
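
(To make that concrete, here is a small illustrative snippet; ._data and
.nblocks are internal attributes, not public API, and I only use them here
to show the layout:)

    import numpy as np
    import pandas as pd

    # the shape from #24990: 1 row, 5000 float64 columns
    df = pd.DataFrame(np.random.randn(1, 5000))

    # with the current consolidating BlockManager this is held as a single
    # 2D (1, 5000) block; with 1D blocks it would become 5000 length-1 arrays
    print(df._data.nblocks)  # -> 1 today, 5000 under the proposal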

But, I am also quite convinced that, with some optimization effort, we can
at least preserve the current performance even for relatively wide
dataframes (see eg this notebook
<https://gist.github.com/jorisvandenbossche/25f240a221583002720b2edf0886d609>
for some quick experiments).
And to be clear: doing such optimizations to ensure good performance for a
variety of use cases is part of the proposal. Also, I think that having
simplified pandas internals should actually also make it easier to further
explore ways to specifically optimize the "homogeneous-dtype wide
dataframe" use case.

Now, it is always difficult to make such claims in the abstract.
So what I personally think would be very valuable is if you could give
some example use cases that you care about (eg a notebook creating some
dummy data with characteristics similar to the data you are working with,
or using real data if openly available, and a few typical operations you
do on those).
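
For example, something as simple as this would already help (a
hypothetical sketch with made-up sizes, just to show what I mean):

    import numpy as np
    import pandas as pd

    # dummy data roughly in the shape you describe: minute-frequency
    # timestamps for a few thousand columns (sensors / items / people)
    idx = pd.date_range("2020-01-01", periods=50_000, freq="T")
    df = pd.DataFrame(np.random.randn(len(idx), 2_000), index=idx)

    # plus a few of the typical operations you run on it, eg
    hourly = df.resample("1H").mean()
    rolling = df.rolling(60).sum()
    diffed = df - df.shift(1)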

Best,
Joris


>
> We already have to drop into numpy on occasion to make the performance
> sufficient. I would really prefer for Pandas to
> improve in this area, not slide back.
>
> Have a great weekend,
> Maarten
>
>
>