<div dir="ltr"><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 26 May 2020 at 00:46, Brock Mendel &lt;<a href="mailto:jbrockmendel@gmail.com">jbrockmendel@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Thanks for writing this up, Joris. Assuming we go down this path, do you have an idea of how we get from here to there incrementally? i.e. presumably this won't just be one massive PR</div></blockquote><div><br></div><div><div>Yes, this is certainly not a one-PR change. I think there are multiple options for working towards this that are worth discussing. <br></div><div><br></div><div>But personally, I would first like to focus on the "assuming we go down this path" part. Let's discuss the pros and cons and trade-offs, and try to turn assumptions into an agreed-upon roadmap. <br></div><div>(and of course, being on our roadmap does not mean that something can't be questioned and discussed again in the future, as we are doing now).<br></div><div><br></div><div>---</div><div><br></div><div>Some thoughts on possible options:</div><div><br></div><div>- We have briefly discussed the idea of using (nullable) extension dtypes for all dtypes by default in pandas 2.0. If we strive towards that, and assuming we keep the current 1D-restriction on ExtensionBlock, then we would "automatically" get a BlockManager with 1D blocks. And we could then focus on optimizing some code paths (e.g. constructing a new block) specifically for the case of 1D ExtensionBlocks. <br></div><div>- A "consolidation policy" option, similar to the one in the branch discussed in <a href="https://github.com/pandas-dev/pandas/issues/10556">https://github.com/pandas-dev/pandas/issues/10556</a>. Right now, that branch still uses 2D blocks (but separate 2D blocks of shape (1, n) per column) and not actually 1D blocks. So we could add 1D versions of our numeric blocks as well. 
But that would probably add a lot of complexity, although temporary, to the Blocks, so maybe not an ideal path forward.</div><div>- Add a version of the ExtensionBlock that can work with numpy arrays instead of extension arrays, or actually use the "PandasArrays" to store them in the existing ExtensionBlock (so as to already start using the existing 1D blocks without requiring all dtypes to be extension dtypes).<br></div><div><br></div><div>Those are all about BlockManager with 1D blocks. Once we only have 1D Blocks, I suppose there are many things we could simplify in the current BlockManager. The intermediate step of the current BlockManager with 1D blocks might not be an optimal situation, but seems the easiest intermediate goal in practice.<br></div><div><br></div><div><div>It probably also depends on how much "backwards compatibility" or "transition period" we want to provide.</div></div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, May 25, 2020 at 2:39 PM Joris Van den Bossche &lt;<a href="mailto:jorisvandenbossche@gmail.com" target="_blank">jorisvandenbossche@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2">Hi list,</font></p><p 
style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2">Rewriting the BlockManager based on a simpler collection of 1D-arrays is actually on our roadmap (see<span> </span><a title="https://pandas.pydata.org/docs/dev/development/roadmap.html#block-manager-rewrite" href="https://pandas.pydata.org/docs/dev/development/roadmap.html#block-manager-rewrite" type="" style="background-color:transparent;color:rgb(80,130,190)" target="_blank">here</a>), and I also touched on it in a mailing list discussion about pandas 2.0 earlier this year (see<span> </span><a title="https://mail.python.org/pipermail/pandas-dev/2020-February/001180.html" href="https://mail.python.org/pipermail/pandas-dev/2020-February/001180.html" type="" style="background-color:transparent;color:rgb(80,130,190)" target="_blank">here</a>).</font></p><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2">But since the topic came up again recently at the last online dev meeting (and Uwe Korn also wrote a nice<span> </span><a title="https://uwekorn.com/2020/05/24/the-one-pandas-internal.html" href="https://uwekorn.com/2020/05/24/the-one-pandas-internal.html" type="" style="background-color:transparent;color:rgb(80,130,190)" target="_blank">blog post</a><span> </span>about this yesterday), I thought I would do a 
write-up of my thoughts on why I think we should actually move towards a simpler, non-consolidating BlockManager with 1D blocks.</font></p><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2"><br></font></p><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2"><b style="font-weight:bolder;color:rgb(0,0,0)">Simplification of the internals</b></font></p><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2">A reason regularly brought up for having 2D ExtensionArrays (EAs) is that right now we have a lot of special cases for 1D EAs in the internals. 
But to be clear: the additional complexity does not come from 1D EAs in themselves, it comes from the fact that we have a mixture of 2D and 1D blocks.<br>Solving this would require a consistent block dimension, and thus removing this added complexity can be done in two ways: have all 1D blocks, or have all 2D blocks.<br>So IMO, this is not an argument in favor of 2D blocks / consolidation.</font></p><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2">Moreover, when going with all 1D blocks, we can not only remove the added complexity from dealing with the mixture of 1D/2D blocks, we will<span> </span><b style="font-weight:bolder;color:rgb(0,0,0)">also</b><span> </span>be able to reduce the complexity of dealing with 2D blocks. 
A BlockManager with 2D blocks is inherently more complex than with 1D blocks, as one needs to deal with proper alignment of the blocks, a more complex "placement" logic of the blocks, etc.</font></p><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2">I think we would be able to simplify the internals a lot by going with a BlockManager as a store of 1D arrays.</font></p><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2"><br></font></p><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2"><b style="font-weight:bolder;color:rgb(0,0,0)">Performance</b></font></p><p 
style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2">Performance is typically given as a reason to have consolidated, 2D blocks. And of course, certain operations (especially row-wise operations, or operations on dataframes with more columns than rows) will always be faster when done on a 2D numpy array under the hood.<br>However, based on recent experimentation with this (e.g. triggered by the<span> </span><a title="https://github.com/pandas-dev/pandas/pull/32779" href="https://github.com/pandas-dev/pandas/pull/32779" type="" style="background-color:transparent;color:rgb(80,130,190)" target="_blank">block-wise frame ops PR</a>, and see also some benchmarks I just posted in<span> </span><a title="https://github.com/pandas-dev/pandas/issues/10556#issuecomment-633703160" href="https://github.com/pandas-dev/pandas/issues/10556#issuecomment-633703160" type="" style="background-color:transparent;color:rgb(80,130,190)" target="_blank">#10556</a><span> </span>/<span> </span><a title="https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c" href="https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c" type="" style="background-color:transparent;color:rgb(80,130,190)" target="_blank">this gist</a>), I also think that for many operations and with decent-sized dataframes, this performance penalty is actually quite OK.</font></p><p 
style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2">Further, there are also operations that will<span> </span><i>benefit</i><span> </span>from 1D blocks. First, operations that now involve aligning/splitting blocks, re-consolidation, etc. will benefit (e.g. a large part of the slowdown when doing frame/frame operations column-wise is currently due to the consolidation at the end). And operations like adding a column, concatenating (with axis=1) or merging dataframes will be much faster when no consolidation is needed.</font></p><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2">Personally, I am convinced that with some effort, we can get on-par or sometimes even better performance with 1D blocks compared to the performance we have now for those cases that 90+% of our users care about:</font></p><ul style="margin-top:0.6em;margin-bottom:0.65em;padding-left:0px;margin-left:1.7em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><li style="margin-bottom:0.4em"><font 
size="2">With limited effort optimizing the column-wise code paths in the internals, we can get a long way.</font></li><li style="margin-bottom:0.4em"><font size="2">After that, if needed, we can still consider whether parts of the internals could be cythonized to further address certain bottlenecks (and actually cythonizing this will also be simpler for a simpler non-consolidating block manager).</font></li></ul><div><font size="2"><br></font></div><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2"><b style="font-weight:bolder;color:rgb(0,0,0)">Possibility to get better copy/view semantics</b></font></p><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2">Pandas is notorious for how much it copies ("you need 10x the memory available as the size of your dataframe"), and having 1D blocks will allow us to address part of those concerns.</font></p><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2"><i>No 
consolidation = less copying.</i><span> </span>Regularly consolidating introduces copies, and thus removing consolidation will mean fewer copies. For example, this would allow you to actually add a single column to a dataframe without having to copy the full dataframe.</font></p><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2"><i>Copy / view semantics.</i><span> </span>Recently there has been discussion again around whether selecting columns should be a copy or a view, and some other issues were opened with questions about views/copies when slicing columns. In the consolidated 2D block layout this will always be inherently messy and unpredictable (meaning: it depends on the actual block layout, which in practice makes it unpredictable for a user unaware of the block layout).<br>Going with a non-consolidated BlockManager should at least allow us to get better / more understandable semantics around this.</font></p><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2"><br></font></p><hr style="box-sizing:content-box;overflow:visible;border-color:currentcolor currentcolor rgb(230,230,230);border-style:none none solid;border-width:medium medium 
2px;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2"><b style="font-weight:bolder;color:rgb(0,0,0)">So what are the reasons to have 2D blocks?</b></font></p><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2">I personally don't directly see reasons to have 2D blocks<span> </span><i>for pandas itself</i><span> </span>(apart from performance in certain row-wise use cases, and except for the fact that we have "always done it like this"). 
But quite likely I am missing reasons, so please bring them up.</font></p><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2">I do think there are certainly use cases where 2D blocks can be useful, but these are typically "external" (though nonetheless important) use cases: conversion to/from numpy, xarray, etc. A typical example that has recently come up is scikit-learn, where they want to have a cheap dataframe &lt;-&gt; numpy array roundtrip for use in their pipelines.<br>However, I personally think there are possible ways that we can still accommodate those use cases, with some effort, while still having 1D Blocks in pandas itself. So IMO this is not sufficient to warrant the complexity of 2D blocks in pandas. <br>(but I will stop here, as this mail is already getting long ...).<br><br></font></p><p style="margin-top:0.6em;margin-bottom:0.65em;color:rgb(34,34,34);font-family:Avenir,Arial,sans-serif;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial"><font size="2">Joris<br></font></p></div>
_______________________________________________<br>
Pandas-dev mailing list<br>
<a href="mailto:Pandas-dev@python.org" target="_blank">Pandas-dev@python.org</a><br>
<a href="https://mail.python.org/mailman/listinfo/pandas-dev" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/pandas-dev</a><br>
</blockquote></div>
</blockquote></div></div>