Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write
*(a.k.a. getting rid of the SettingWithCopyWarning)*

Hi all,

As you are probably aware, it's not always straightforward to understand the copy or view semantics of indexing methods in pandas: when do you get a view and when not, why do you get a SettingWithCopyWarning, and how do you get rid of it? This has also been discussed regularly before (e.g. the discussion and implementation from 2015 started by Nick Eubank at gh-10954 <https://github.com/pandas-dev/pandas/issues/10954>). Last year, we started to discuss this again, which is tracked at https://github.com/pandas-dev/pandas/issues/36195. Based on those discussions, I have a concrete proposal to change the copy/view semantics of pandas.

Short summary of the proposal:

1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way) or any method returning a new DataFrame always *behaves as if it were* a copy in terms of user API.
2. We implement Copy-on-Write. This way, we can actually use views as much as possible under the hood, while ensuring the user API behaves as a copy.

This addresses multiple aspects: 1) a clear and consistent user API (a clear rule: *any* subset or returned series/dataframe is *always* a copy of the original, and thus never modifies the original) and 2) improving performance by avoiding excessive copies (e.g. a chained method workflow would no longer return an actual data copy at each step).

Longer version of this proposal: https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thy...
Proof-of-concept implementation: https://github.com/pandas-dev/pandas/pull/41878
GitHub issue with relevant discussion: https://github.com/pandas-dev/pandas/issues/36195

*Since this would be a change with a large impact on users, I think it is important to get broad feedback on this*. So comments, thoughts, concerns, ideas etc. are very welcome (you can comment on the Google doc, reply to this email, or comment on the GitHub issue).

Best,
Joris
+1 on the approach of the proposal, and also +1 to releasing it in a major version without raising deprecation warnings.

Thanks for working on this, it'll make users' lives much easier.

On Sun, Jul 11, 2021 at 4:58 PM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
[...]
Best, Joris _______________________________________________ Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
I think this is an important initiative, and I indeed wish we had designed around copy-on-write ideas from the very beginning.

As one protection against improper mutation of views, it may be necessary to introduce defensive copies into APIs that expose internal data, e.g. NumPy arrays that are slices of the parent, or that have had slices taken of them.

On Mon, Jul 12, 2021 at 12:42 PM Marc Garcia <garcia.marc@gmail.com> wrote:
[...]
I agree with Wes and Marc. This is an important change for the long-term future of pandas.

On Mon, Jul 12, 2021 at 11:29 AM Wes McKinney <wesmckinn@gmail.com> wrote:
[...]
Thanks for the feedback.

Regarding protection against improper mutation of views via numpy (or arrays in general), that's indeed a risk. Since this is Python, a user will always find some (private) way to incorrectly mutate data without triggering the copy-on-write paths, but there are indeed some ways we could try to prevent that.

Listing the possible ways to get the "array" data from DataFrame/Series objects:

* Series.values / Series.array -> returning a numpy array or pandas ExtensionArray. These currently return the stored data as is (or as a view), as mutable arrays. Mutating such an array wouldn't trigger Copy-on-Write, which is managed at the DataFrame/Series level. To prevent users from doing this, we could return those arrays as "read-only"? (to avoid always doing a defensive copy here)
* Series.to_numpy() -> returning a numpy array. This method has a `copy` keyword that currently defaults to False. We could either make copy=True the default, or, similarly to the above, make the result read-only by default, leaving the copy=True/False options to choose from explicitly.
* DataFrame.to_numpy() / DataFrame.values -> returning a 2D numpy array, which is by definition always a copy (by concatenating multiple 1D arrays), except in the 1-column case, where it could still be a view. For simplicity, I would make this case return a copy as well (a user who wants a view can get the Series). Alternatively, this case could follow the logic of Series.to_numpy above.

On Tue, 13 Jul 2021 at 05:55, Stephan Hoyer <shoyer@gmail.com> wrote:
[...]
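The "read-only" option raised above for `Series.values` / `Series.array` can be illustrated with plain NumPy. This is only a minimal sketch and not what pandas does: the helper `as_read_only_view` and the sample data are hypothetical, and the point is merely that a view can be handed out without a defensive copy while still rejecting writes.

```python
import numpy as np


def as_read_only_view(values: np.ndarray) -> np.ndarray:
    """Hypothetical helper: return a view of `values` that rejects writes."""
    view = values.view()          # new array object sharing the same memory
    view.flags.writeable = False  # only the view is locked; `values` stays writable
    return view


stored = np.array([1.0, 2.0, 3.0])  # stands in for the data held by a Series
exposed = as_read_only_view(stored)

# No data was copied: both arrays point at the same memory.
shares_memory = np.shares_memory(stored, exposed)

try:
    exposed[0] = 99.0
    write_rejected = False
except ValueError:
    # NumPy refuses to assign into a read-only array.
    write_rejected = True

# A user who really wants to mutate takes an explicit copy first.
private = exposed.copy()
private[0] = 99.0
```

The trade-off is the one described in the message: a read-only view avoids the cost of an unconditional copy, and users who need a mutable array pay for the copy explicitly.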
On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche < jorisvandenbossche@gmail.com> wrote:
Short summary of the proposal:
1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way) or any method returning a new DataFrame, always *behaves as if it were* a copy in terms of user API.
To explicitly call out the column-as-Series case (since this is a typical case that right now *always* is a view): "any" indexing operation thus also includes accessing a DataFrame column as a Series (or slicing a Series).
So something like s = df["col"] and then mutating s will no longer update df. Similarly for series_subset = series[1:5], mutating series_subset will no longer update series.
[xposting from https://github.com/pandas-dev/pandas/issues/36195]

I'm glad there is a proof of concept to help clarify what this looks like.

I do not like the fact that nothing can ever be "just a view" with these semantics, including series[::-1], frame[col], frame[:]. Users reasonably expect numpy semantics for these.

We should revisit the alternative "clear/simple rules" approach that is "indexing on columns always gives a view" (https://github.com/pandas-dev/pandas/pull/33597). This is simpler to explain/grok, simpler to implement, and not dependent on BlockManager vs ArrayManager.

On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
[...]
On Fri, Jul 16, 2021 at 11:04 AM Brock Mendel <jbrockmendel@gmail.com> wrote:
[xposting from https://github.com/pandas-dev/pandas/issues/36195]
I'm glad there is a proof of concept to help clarify what this looks like.
I do not like the fact that nothing can ever be "just a view" with these semantics, including series[::-1], frame[col], frame[:]. Users reasonably expect numpy semantics for these.
I wonder if we can validate what users (new and old) *actually* expect? Users coming from R, which IIRC implements Copy on Write for matrices, might be OK with indexing always being (behaving like) a copy. I'm not sure what users coming from NumPy would expect, since I don't know how many NumPy users really understand a.) when a NumPy slice is a view or copy, and b.) how a pandas indexing operation translates to a NumPy slice.
On Fri, Jul 16, 2021 at 9:04 AM Brock Mendel <jbrockmendel@gmail.com> wrote:
I do not like the fact that nothing can ever be "just a view" with these semantics, including series[::-1], frame[col], frame[:]. Users reasonably expect numpy semantics for these.
We should revisit the alternative "clear/simple rules" approach that is "indexing on columns always gives a view" ( https://github.com/pandas-dev/pandas/pull/33597). This is simpler to explain/grok, simpler to implement, and not dependent on BlockManager vs ArrayManager.
I don't know if it is worth the trouble for complex multi-column selections, but I do see the appeal here. A simpler variant would be to make indexing out a single Series from a DataFrame return a view, with everything else doing copy on write. Then the existing pattern df.column_one[:] = ... would still work.
On Fri, 16 Jul 2021 at 18:58, Stephan Hoyer <shoyer@gmail.com> wrote:
On Fri, Jul 16, 2021 at 9:04 AM Brock Mendel <jbrockmendel@gmail.com> wrote:
I do not like the fact that nothing can ever be "just a view" with these semantics, including series[::-1], frame[col], frame[:]. Users reasonably expect numpy semantics for these.
I am personally not sure what "users" in general expect for those (as also mentioned by Tom and Irv already, depending on their background, they might expect different things). For example, for a user that knows basic Python, they could actually expect all those examples to give a copy since `a_list[:]` is a typical way to make a copy of a list.
(It might be interesting to reach out to educators, who might have more experience with the expectations and typical errors of novice users, or to do some kind of experiment on this topic.)

Personally, I cannot remember ever relying on the mutability aspect of e.g. `series[1:3]` or `frame[:]` being a view. I think there are generally 2 reasons for users caring about a view: 1) performance (less copying) and 2) being able to mutate the view with the explicit goal of mutating the parent (and not as an irrelevant side effect). I think the first reason is by far the most common one (but that's my subjective opinion from my experience using pandas, so that can certainly differ), and in the current proposal, all the examples mentioned will be actual views under the hood (and thus cover this first reason).

The only case where I know I explicitly rely on this is chained assignment (e.g. `frame[col][1:3] = ..`). That's certainly a very important use case (and probably the usage pattern most impacted by the current proposal), but it's also a case where 1) there is a clear alternative (don't use chained assignment, but do it in one step, e.g. `frame.loc[1:3, col] = ..`; some corner cases of mixed positional/label-based indexing aside, for which we should find an alternative) and 2) we might be able to detect this and raise an informative error message (specifically for chained assignment).

I think it can be easier to explain "chained assignment never works" than "chained assignment only works if first selecting the column(s)" (depending on the exact rules).
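To make the one-step alternative to chained assignment concrete, here is a small sketch with an invented frame (the column name `col` and the values are assumptions for illustration only). It also shows the label/positional difference: `.loc` slices by label and includes the end label, while `.iloc` matches the positional slice `[1:3]`.

```python
import pandas as pd

df = pd.DataFrame({"col": [10, 20, 30, 40, 50]})

# Chained assignment, df["col"][1:3] = 0, first selects a column and then
# assigns into that intermediate result; under the proposal that would no
# longer update df. The single-step form assigns into df directly:
df.loc[1:3, "col"] = 0  # label-based: rows labelled 1, 2 and 3

# Purely positional equivalent of [1:3] (rows at positions 1 and 2):
df2 = pd.DataFrame({"col": [10, 20, 30, 40, 50]})
df2.iloc[1:3, df2.columns.get_loc("col")] = 0
```

With a default integer index the two forms differ only in whether the end of the slice is included, which is easy to overlook when translating existing chained assignments.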
We should revisit the alternative "clear/simple rules" approach that is
"indexing on columns always gives a view" ( https://github.com/pandas-dev/pandas/pull/33597). This is simpler to explain/grok, simpler to implement
I don't know if it is worth the trouble for complex multi-column selections, but I do see the appeal here.
A simpler variant would be to make indexing out a single Series from a DataFrame return a view, with everything else doing copy on write. Then the existing pattern df.column_one[:] = ... would still work.
I was initially thinking about this as well. In the end, I didn't (yet) try to implement this, because while thinking it through, it seemed that this might give rise to quite some tricky cases. Consider the following example:

    df = pd.DataFrame(..)
    df_subset = df[["col1", "col2"]]
    s1 = df["col1"]
    s1_subset = s1[0:3]
    # modifying s1 should modify df, but not df_subset and s1_subset?
    s1[0] = 0

If we take "only accessing a single Series from a DataFrame is a view, everything else uses copy-on-write", that gives rise to questions like the above, where some parents/children get modified and some do not. This is both harder to explain to users and harder to implement. In the proof-of-concept implementation, the copy-on-write happens "locally" in the series/dataframe that gets modified (meaning: when modifying a given object, its internal array data first gets copied and replaced *if* the object is viewing another object or is being viewed by another object). In the above case, by contrast, modifying a given object would need to trigger a copy in other (potentially many) objects, and not in the object being modified. It's probably possible to implement this, but certainly harder/trickier to do.
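The "local" copy-on-write rule described above can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not the proof-of-concept code: `CowSeries` and `_Buffer` are hypothetical names, data lives in a Python list, and a view shares the whole buffer rather than a slice of it.

```python
import weakref


class _Buffer:
    """Holds the data plus weak references to every object that uses it."""

    def __init__(self, data):
        self.data = list(data)
        self.users = weakref.WeakSet()


class CowSeries:
    """Toy stand-in for a Series (hypothetical, not the pandas implementation)."""

    def __init__(self, data):
        self._attach(_Buffer(data))

    def _attach(self, buf):
        self._buf = buf
        buf.users.add(self)

    def view(self):
        """Return a new object sharing this object's buffer; nothing is copied."""
        other = CowSeries.__new__(CowSeries)
        other._attach(self._buf)
        return other

    def __getitem__(self, i):
        return self._buf.data[i]

    def __setitem__(self, i, value):
        if len(self._buf.users) > 1:
            # The buffer is shared (we view someone, or someone views us):
            # copy it locally and detach before writing.
            old = self._buf
            old.users.discard(self)
            self._attach(_Buffer(old.data))
        self._buf.data[i] = value


parent = CowSeries([1, 2, 3])
child = parent.view()
shared_before_write = child._buf is parent._buf

child[0] = 100           # child copies its own data; parent is untouched
parent_buf = parent._buf
parent[1] = 7            # parent is no longer shared, so no copy happens
```

Only the object being written to ever replaces its buffer, which is what keeps the rule local. The single-column-as-view variant would instead require a write to `s1` to find and detach other objects such as `df_subset`, which this simple scheme cannot express.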
Based on my experience (not sure how biased it is), modifying dataframes with something like `df[col][1:3] = ...` (or the equivalent with `.loc`) is rare, except with boolean arrays. In my experience, when the values of a dataframe column are changed, it is far more common to use `df[col] = df[col].str.upper()`, `df[col] = df[col].fillna(0)`...

While I'm personally happy with Joris's proposal, I see two other options that could complement or replace it:

Option 1) Deprecate assigning to a subset of rows, and only allow assigning to whole columns. Something like `df[col][1:3] = ...` could be replaced by, for example, `df[col] = df[col].mask(slice(1, 3), ...)`. Using `mask` and `where` is already supported for boolean arrays, so support for slices would need to be added, and they'd be the only way to replace a subset of values. I think that makes the problem narrower, and easier to understand for users. The main thing to decide and be clear about is what happens if the dataframe is a subset of another one:

```
df2 = df[cond]
df2[col] = df2[col].str.upper()
```

Option 2) If assigning with the current syntax (`df[col][1:3] = ...` or the `.loc` equivalent) is something we want to keep (I wouldn't if we move in this direction), maybe it could be moved to a `DataFrame` subclass. The main dataframe class would then behave as in option 1, so expectations are much easier to manage. But users who really want to assign with indexing could still do so, knowing that having a mutable dataframe comes at a cost (copies, more complex behavior...). The `MutableDataFrame` could be in pandas, or a third-party extension.

```
df_mutable = df.to_mutable()
df_mutable[col][1:3] = ...
```

On Sat, Jul 17, 2021 at 9:16 AM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
On Fri, 16 Jul 2021 at 18:58, Stephan Hoyer <shoyer@gmail.com> wrote:
On Fri, Jul 16, 2021 at 9:04 AM Brock Mendel <jbrockmendel@gmail.com> wrote:
I do not like the fact that nothing can ever be "just a view" with these semantics, including series[::-1], frame[col], frame[:]. Users reasonably expect numpy semantics for these.
I am personally not sure what "users" in general expect for those (as also mentioned by Tom and Irv already, depending on their background, they might expect different things). For example, for a user that knows basic Python, they could actually expect all those examples to give a copy since `a_list[:]` is a typical way to make a copy of a list.
(it might be interesting to reach out to educators (who might have more experience with expectations/typical errors of novice users) or to do some kind of experiment on this topic)
Personally, I cannot remember that I ever relied on the mutability-aspect of eg `series[1:3]` or `frame[:]` being a view. I think there are generally 2 reasons for users caring about a view: 1) for performance (less copying) and 2) for being able to mutate the view with the explicit goal to mutate the parent (and not as an irrelevant side-effect). I think the first reason is by far the most common one (but that's my subjective opinion from my experience using pandas, so that can certainly depend), and in the current proposal, all those mentioned example will be actual views under the hood (and thus cover this first reason).
The only case where I know I explicitly rely on this is with chained assignment (eg `frame[col][1:3] = ..`). That's certainly a very important use case (and probably the most impacted usage pattern with the current proposal), but it's also a case where 1) there is a clear alternative (don't use chained assignment, but do it in one step (e.g. `frame.loc[1:3, col] = ..`), some corner cases of mixed positional/label-based indexing aside, for which we should find an alternative) and 2) we might be able to detect this and raise an informative error message (specifically for chained assignment).
I think it can be easier to explain "chained assignment never works" than "chained assignment only works if first selecting the column(s)" (depending on the exact rules).
We should revisit the alternative "clear/simple rules" approach that is
"indexing on columns always gives a view" ( https://github.com/pandas-dev/pandas/pull/33597). This is simpler to explain/grok, simpler to implement
I don't know if it is worth the trouble for complex multi-column selections, but I do see the appeal here.
A simpler variant would be to make indexing out a single Series from a DataFrame return a view, with everything else doing copy on write. Then the existing pattern df.column_one[:] = ... would still work.
I was initially thinking about this as well. In the end, I didn't (yet) try to implement this, because while thinking it through, it seemed that this might give quite some tricky cases. Consider the following example:
df = pd.DataFrame(..) df_subset = df[["col1", "col2"]] s1 = df["col1"] s1_subset = s1[0:3] # modifying s1 should modify df, but not df_subset and s1_subset? s1[0] = 0
If we take "only accessing a single Series from a DataFrame is a view, everything else uses copy-on-write", that gives rise to questions like the above where some parents/childs get modified, and some not. This is both harder to explain to users, as harder to implement. For the implementation of the proof-of-concept, the copy-on-write happens "locally" in the series/dataframe that gets modified (meaning: when modifying a given object, its internal array data first gets copied and replaced *if* the object is viewing another or is being viewed by another object). While in the above case, modifying a given object would need to trigger a copy in other (potentially many) objects, and not in the object being modified. It's probably possible to implement this, but certainly harder/trickier to do.
On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche < jorisvandenbossche@gmail.com> wrote:
On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche < jorisvandenbossche@gmail.com> wrote:
Short summary of the proposal:
1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way) or any method returning a new DataFrame, always *behaves as if it were* a copy in terms of user API.
To explicitly call out the column-as-Series case (since this is a typical case that right now *always* is a view): "any" indexing operation thus also includes accessing a DataFrame column as a Series (or slicing a Series).
So something like s = df["col"] and then mutating s will no longer update df. Similarly for series_subset = series[1:5], mutating series_subset will no longer update series.
_______________________________________________
Pandas-dev mailing list
Pandas-dev@python.org
https://mail.python.org/mailman/listinfo/pandas-dev
I guess one question I have is what the memory and time performance implications of the proposed change are. I guess I belong to the group of users who think of a pandas DataFrame more as a numpy array with column names attached, and hence I'd expect very similar semantics when indexing. I think copy-on-write semantics would have a significant impact on our workflows.

On Sat, Jul 17, 2021 at 6:12 PM Marc Garcia <garcia.marc@gmail.com> wrote:
Based on my experience (not sure how biased it is), modifying dataframes with something like `df[col][1:3] = ...` (or the equivalent with `.loc`) is rare, except with boolean arrays. In my experience, when the values of a dataframe column are changed, it is far more common to use `df[col] = df[col].str.upper()`, `df[col] = df[col].fillna(0)`...
While I'm personally happy with Joris's proposal, I see two other options that could complement or replace it:
Option 1) Deprecate assigning to a subset of rows, and only allow assigning to whole columns. Something like `df[col][1:3] = ...` could be replaced by for example `df[col] = df[col].mask(slice(1, 3), ...)`. Using `mask` and `where` is already supported for boolean arrays, so slices should be added, and they'd be the only way to replace a subset of values. I think that makes the problem narrower, and easier to understand for users. The main thing to decide and be clear about is what happens if the dataframe is a subset of another one:
```
df2 = df[cond]
df2[col] = df2[col].str.upper()
```
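To make Option 1 concrete, here is a small sketch of replacing a subset of rows through whole-column assignment. It assumes, as noted above, that `mask` does not yet accept a slice, so a boolean array stands in for `slice(1, 3)`. The frame, the column name and the replacement value are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d"]})
col = "name"

# Boolean stand-in for slice(1, 3): True at row positions 1 and 2.
positions = np.arange(len(df))
in_rows = (positions >= 1) & (positions < 3)

# Whole-column assignment; values are replaced where the mask is True.
df[col] = df[col].mask(in_rows, "X")

print(df[col].tolist())
```

The only write to `df` is the assignment of a complete column, which is the narrower problem Option 1 aims for.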
Option 2) If assigning with the current syntax (`df[col][1:3] = ...` or the `.loc` equivalent) is something we want to keep (I wouldn't if we move in this direction), maybe it could be moved to a `DataFrame` subclass. That way, the main dataframe class behaves like in option 1, so expectations are much easier to manage. But users who really want to assign with indexing can still do so, knowing that having a mutable dataframe comes at a cost (copies, more complex behavior...). The `MutableDataFrame` could be in pandas, or a third-party extension.
```
df_mutable = df.to_mutable()
df_mutable[col][1:3] = ...
```
On Sat, Jul 17, 2021 at 9:16 AM Joris Van den Bossche < jorisvandenbossche@gmail.com> wrote:
On Fri, 16 Jul 2021 at 18:58, Stephan Hoyer <shoyer@gmail.com> wrote:
On Fri, Jul 16, 2021 at 9:04 AM Brock Mendel <jbrockmendel@gmail.com> wrote:
I do not like the fact that nothing can ever be "just a view" with these semantics, including series[::-1], frame[col], frame[:]. Users reasonably expect numpy semantics for these.
I am personally not sure what "users" in general expect for those (as also mentioned by Tom and Irv already, depending on their background, they might expect different things). For example, for a user that knows basic Python, they could actually expect all those examples to give a copy since `a_list[:]` is a typical way to make a copy of a list.
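The two expectations can be shown side by side with the standard library alone. In this sketch a `memoryview` over an `array` stands in for numpy-style view semantics (my substitution, to keep the example free of dependencies).

```python
from array import array

# Basic Python: slicing a list makes a new list.
a_list = [1, 2, 3]
list_copy = a_list[:]
list_copy[0] = 99            # a_list is left untouched

# Buffer-style semantics: slicing a memoryview shares the underlying memory.
buf = array("i", [1, 2, 3])
buf_view = memoryview(buf)[:]
buf_view[0] = 99             # the change is visible through buf as well
```

A user coming from the first model would expect `frame[:]` to be independent of `frame`, while a user coming from the second would expect it to be linked.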
(it might be interesting to reach out to educators (who might have more experience with expectations/typical errors of novice users) or to do some kind of experiment on this topic)
Personally, I cannot remember ever relying on the mutability aspect of e.g. `series[1:3]` or `frame[:]` being a view. I think there are generally 2 reasons for users caring about a view: 1) for performance (less copying) and 2) for being able to mutate the view with the explicit goal of mutating the parent (and not as an irrelevant side-effect). I think the first reason is by far the most common one (but that's my subjective opinion from my experience using pandas, so that can certainly differ), and in the current proposal, all the examples mentioned will be actual views under the hood (and thus cover this first reason).
The only case where I know I explicitly rely on this is with chained assignment (eg `frame[col][1:3] = ..`). That's certainly a very important use case (and probably the most impacted usage pattern with the current proposal), but it's also a case where 1) there is a clear alternative (don't use chained assignment, but do it in one step (e.g. `frame.loc[1:3, col] = ..`), some corner cases of mixed positional/label-based indexing aside, for which we should find an alternative) and 2) we might be able to detect this and raise an informative error message (specifically for chained assignment).
I think it can be easier to explain "chained assignment never works" than "chained assignment only works if first selecting the column(s)" (depending on the exact rules).
We should revisit the alternative "clear/simple rules" approach that is "indexing on columns always gives a view" (https://github.com/pandas-dev/pandas/pull/33597). This is simpler to explain/grok, simpler to implement.
I don't know if it is worth the trouble for complex multi-column selections, but I do see the appeal here.
A simpler variant would be to make indexing out a single Series from a DataFrame return a view, with everything else doing copy on write. Then the existing pattern df.column_one[:] = ... would still work.
I was initially thinking about this as well. In the end, I didn't (yet) try to implement this, because while thinking it through, it seemed that this might give quite some tricky cases. Consider the following example:
    df = pd.DataFrame(..)
    df_subset = df[["col1", "col2"]]
    s1 = df["col1"]
    s1_subset = s1[0:3]
    # modifying s1 should modify df, but not df_subset and s1_subset?
    s1[0] = 0
If we take "only accessing a single Series from a DataFrame is a view, everything else uses copy-on-write", that gives rise to questions like the above, where some parents/children get modified and some do not. This is both harder to explain to users and harder to implement. In the proof-of-concept implementation, the copy-on-write happens "locally" in the series/dataframe that gets modified (meaning: when modifying a given object, its internal array data first gets copied and replaced *if* the object is viewing another or is being viewed by another object). In the above case, by contrast, modifying a given object would need to trigger a copy in other (potentially many) objects, and not in the object being modified. It's probably possible to implement this, but certainly harder/trickier to do.
On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche < jorisvandenbossche@gmail.com> wrote:
On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche < jorisvandenbossche@gmail.com> wrote:
Short summary of the proposal:
1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way) or any method returning a new DataFrame, always *behaves as if it were* a copy in terms of user API.
To explicitly call out the column-as-Series case (since this is a typical case that right now *always* is a view): "any" indexing operation thus also includes accessing a DataFrame column as a Series (or slicing a Series).
So something like s = df["col"] and then mutating s will no longer update df. Similarly for series_subset = series[1:5], mutating series_subset will no longer update series.
On Tue, 20 Jul 2021 at 16:10, Adrin <adrin.jalali@gmail.com> wrote:
I guess one question I have is what are the memory and time performance implications of the proposed change.
Memory implications should be positive (less copying). The performance impact of the additional logic (adding/checking of the weak references) is something I didn't yet check (on my to do list), but I suspect it to not be significant.

Based on your comment of "numpy array with column names", I think the potential change of the ArrayManager is much more relevant for you than the Copy-on-Write. And to be clear, the current proposal is not tied to the ArrayManager (it's only the proof of concept that is implemented for that). So I would prefer to keep the discussion focused on the copy/view semantics, at least for now (it's only later, when discussing practical ways to get this released, that we need to decide whether we want to combine this with an ArrayManager refactor or not).
I guess I belong to the group of users who think of a pandas DataFrame more as a numpy array with column names attached to them, and hence I'd expect very similar semantics when indexing, and I think copy on write semantics would have a significant impact on our workflows.
I assume you are also thinking of scikit-learn-like workflows? Can you give an example of how you think copy-on-write would impact such (or other) workflows? In any case, thanks already for your feedback!
Memory implications should be positive (less copying).
This is accurate _only_ in cases where we currently make copies. In cases where we currently make views, the perf effect goes the other way. On the flip side, Always-Views improves perf in cases where we currently make copies, but if you want a copy then you'll have to make one explicitly which will claw back that gain. (In the long-out-of-date proof of concept https://github.com/pandas-dev/pandas/pull/33597 df[np.random.randint(0, 30, 30)] was ~92% faster than the status quo at the time)
The performance impact of the additional logic (adding/checking of the weak references) is something I didn't yet check (on my to do list), but I suspect it to not be significant.
Agreed the CoW logic itself should be negligible outside of microbenchmarks.

On Fri, Jul 23, 2021 at 12:32 PM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
On Tue, 20 Jul 2021 at 16:10, Adrin <adrin.jalali@gmail.com> wrote:
I guess one question I have is what are the memory and time performance implications of the proposed change.
Memory implications should be positive (less copying). The performance impact of the additional logic (adding/checking of the weak references) is something I didn't yet check (on my to do list), but I suspect it to not be significant.
Based on your comment of "numpy array with column names", I think the potential change of the ArrayManager is much more relevant for you than the Copy-on-Write. And to be clear, the current proposal is not tied to the ArrayManager (it's only the proof of concept that is implemented for that). So I would prefer to keep the discussion focused on the copy/view semantics, at least for now (it's only later, when discussing practical ways to get this released, that we need to decide whether we want to combine this with an ArrayManager refactor or not).
I guess I belong to the group of users who think of a pandas DataFrame more as a numpy array with column names attached to them, and hence I'd expect very similar semantics when indexing, and I think copy on write semantics would have a significant impact on our workflows.
I assume you are also thinking of scikit-learn-like workflows? Can you give an example of how you think copy-on-write would impact such (or other) workflows? In any case, thanks already for your feedback!
There are two cases that I think are relevant here (as opposed to the ArrayManager discussion), but I may be wrong. The two cases I'm thinking of, written in a simple, non-optimized way in pseudocode, are:

    for column in columns_of(data):
        data[:, column] = (data[:, column] - mean(data[:, column])) / std(data[:, column])

And the other one is the same as above, but for rows.

Also, one issue I have is: if we're doing copy-on-write, then what does the above mean? As in, if I do `df["column_A"] = ....`, where is that copy? How do I access the new one as opposed to the old one?

On Fri, Jul 23, 2021 at 10:09 PM Brock Mendel <jbrockmendel@gmail.com> wrote:
Memory implications should be positive (less copying).
This is accurate _only_ in cases where we currently make copies. In cases where we currently make views, the perf effect goes the other way.
On the flip side, Always-Views improves perf in cases where we currently make copies, but if you want a copy then you'll have to make one explicitly which will claw back that gain. (In the long-out-of-date proof of concept https://github.com/pandas-dev/pandas/pull/33597 df[np.random.randint(0, 30, 30)] was ~92% faster than the status quo at the time)
The performance impact of the additional logic (adding/checking of the weak references) is something I didn't yet check (on my to do list), but I suspect it to not be significant.
Agreed the CoW logic itself should be negligible outside of microbenchmarks.
On Fri, Jul 23, 2021 at 12:32 PM Joris Van den Bossche < jorisvandenbossche@gmail.com> wrote:
On Tue, 20 Jul 2021 at 16:10, Adrin <adrin.jalali@gmail.com> wrote:
I guess one question I have is what are the memory and time performance implications of the proposed change.
Memory implications should be positive (less copying). The performance impact of the additional logic (adding/checking of the weak references) is something I didn't yet check (on my to do list), but I suspect it to not be significant.
Based on your comment of "numpy array with column names", I think the potential change of the ArrayManager is much more relevant for you than the Copy-on-Write. And to be clear, the current proposal is not tied to the ArrayManager (it's only the proof of concept that is implemented for that). So I would prefer to keep the discussion focused on the copy/view semantics, at least for now (it's only later, when discussing practical ways to get this released, that we need to decide whether we want to combine this with an ArrayManager refactor or not).
I guess I belong to the group of users who think of a pandas DataFrame more as a numpy array with column names attached to them, and hence I'd expect very similar semantics when indexing, and I think copy on write semantics would have a significant impact on our workflows.
I assuming you are also thinking of scikit-learn like worflows? Can you give an example of what your are thinking about how copy-on-write impacts such (or other) workflows? In any case thanks already for your feedback!
On Sat, Jul 17, 2021 at 6:12 PM Marc Garcia <garcia.marc@gmail.com> wrote:
Based on my experience (not sure how biased it is), modifying dataframes with something like `df[col][1:3] = ...` is rare (or the equivalent with `.loc`) except for boolean arrays. From my experience, when the values of a dataframe column are changed, what I think it's way more common is to use `df[col] = df[col].str.upper()`, `df[col] = df[col].fillna(0)`...
While I'm personally happy with Joris proposal, I see two other options that could complement or replace it:
Option 1) Deprecate assigning to a subset of rows, and only allow assigning to whole columns. Something like `df[col][1:3] = ...` could be replaced by for example `df[col] = df[col].mask(slice(1, 3), ...)`. Using `mask` and `where` is already supported for boolean arrays, so slices should be added, and they'd be the only way to replace a subset of values. I think that makes the problem narrower, and easier to understand for users. The main thing to decide and be clear about is what happens if the dataframe is a subset of another one:
``` df2 = df[cond] df2[col] = df2[col].str.upper() ```
Option 2) If assigning with the current syntax (`df[col][1:3] = ...` or `.loc` equivalent) is something we want to keep (I wouldn't if we move in this direction), maybe it could be moved to a `DataFrame` subclass So, the main dataframe class behaves like in option 1, so expectations are much easier to manage. But users who really want to assign with indexing, can still use it, knowing that having a mutable dataframe comes at a cost (copies, more complex behavior...). The ` MutableDataFrame` could be in pandas, or a third-party extension.
``` df_mutable = df.to_mutable() df_mutable[col][1:3] = ... ```
On Sat, Jul 17, 2021 at 9:16 AM Joris Van den Bossche < jorisvandenbossche@gmail.com> wrote:
On Fri, 16 Jul 2021 at 18:58, Stephan Hoyer <shoyer@gmail.com> wrote:
On Fri, Jul 16, 2021 at 9:04 AM Brock Mendel <jbrockmendel@gmail.com> wrote:
> I do not like the fact that nothing can ever be "just a view" with these semantics, including series[::-1], frame[col], frame[:]. Users reasonably expect numpy semantics for these.

I am personally not sure what "users" in general expect for those (as also mentioned by Tom and Irv already, depending on their background, they might expect different things). For example, for a user that knows basic Python, they could actually expect all those examples to give a copy since `a_list[:]` is a typical way to make a copy of a list.
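To make the two backgrounds concrete, here is a small sketch in plain Python and NumPy (no pandas involved) of how the same `[:]` syntax behaves for a list and for an array. The variable names are only illustrative.

```python
import numpy as np

a_list = [1, 2, 3]
list_copy = a_list[:]      # slicing a list creates a new (shallow) copy
list_copy[0] = 100
print(a_list)              # the original list is untouched

arr = np.array([1, 2, 3])
arr_view = arr[:]          # slicing an array creates a view on the same memory
arr_view[0] = 100
print(arr)                 # the original array sees the change
```

A user coming from lists and a user coming from NumPy can therefore both hold a reasonable, but opposite, expectation for `frame[:]`.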
(it might be interesting to reach out to educators (who might have more experience with expectations/typical errors of novice users) or to do some kind of experiment on this topic)
Personally, I cannot remember ever relying on the mutability aspect of e.g. `series[1:3]` or `frame[:]` being a view. I think there are generally two reasons for users to care about a view: 1) performance (less copying) and 2) being able to mutate the view with the explicit goal of mutating the parent (and not as an irrelevant side effect). I think the first reason is by far the most common one (but that's my subjective opinion from my experience using pandas, so that can certainly differ), and in the current proposal, all of the examples mentioned will be actual views under the hood (and thus cover this first reason).
The only case where I know I explicitly rely on this is with chained assignment (eg `frame[col][1:3] = ..`). That's certainly a very important use case (and probably the most impacted usage pattern with the current proposal), but it's also a case where 1) there is a clear alternative (don't use chained assignment, but do it in one step (e.g. `frame.loc[1:3, col] = ..`), some corner cases of mixed positional/label-based indexing aside, for which we should find an alternative) and 2) we might be able to detect this and raise an informative error message (specifically for chained assignment).
I think it can be easier to explain "chained assignment never works" than "chained assignment only works if first selecting the column(s)" (depending on the exact rules).
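A minimal sketch of the one-step alternative to chained assignment. The frame and its values are made up; the second assignment shows one possible way to handle the mixed positional/label corner case, not a settled recommendation.

```python
import pandas as pd

frame = pd.DataFrame({"col": [1, 2, 3, 4], "other": [5, 6, 7, 8]})

# One-step assignment: rows and column are selected in a single .loc call,
# so the frame itself is the object being modified.
# Note that .loc slices by label and includes both endpoints, unlike [1:3].
frame.loc[1:3, "col"] = 0
print(frame["col"].tolist())

# Positional rows combined with a column label: translate the positions
# to labels first, then assign in one step.
frame.loc[frame.index[0:2], "other"] = -1
print(frame["other"].tolist())
```

Because the assignment targets `frame` directly, this form does not depend on whether an intermediate selection is a view or a copy.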
> We should revisit the alternative "clear/simple rules" approach that is "indexing on columns always gives a view" (https://github.com/pandas-dev/pandas/pull/33597). This is simpler to explain/grok, simpler to implement
I don't know if it is worth the trouble for complex multi-column selections, but I do see the appeal here.
A simpler variant would be to make indexing out a single Series from a DataFrame return a view, with everything else doing copy on write. Then the existing pattern df.column_one[:] = ... would still work.
I was initially thinking about this as well. In the end, I didn't (yet) try to implement this, because while thinking it through, it seemed that this might give quite some tricky cases. Consider the following example:
    df = pd.DataFrame(..)
    df_subset = df[["col1", "col2"]]
    s1 = df["col1"]
    s1_subset = s1[0:3]
    # modifying s1 should modify df, but not df_subset and s1_subset?
    s1[0] = 0
If we take "only accessing a single Series from a DataFrame is a view, everything else uses copy-on-write", that gives rise to questions like the above, where some parents/children get modified and some do not. This is both harder to explain to users and harder to implement. In the proof-of-concept implementation, the copy-on-write happens "locally" in the series/dataframe that gets modified (meaning: when modifying a given object, its internal array data first gets copied and replaced *if* the object is viewing another or is being viewed by another object). In the above case, by contrast, modifying a given object would need to trigger a copy in other (potentially many) objects, and not in the object being modified. It's probably possible to implement this, but certainly harder/trickier to do.
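The "local" copy-on-write idea can be sketched with a toy container. This is not the proof-of-concept code from the PR: the `CowSeries` class, its attributes and the shared `WeakSet` registry are all hypothetical, chosen only to show that the copy happens in the object being written to.

```python
import weakref

import numpy as np


class CowSeries:
    """Toy 1-D container with copy-on-write (hypothetical, not pandas code)."""

    def __init__(self, values, refs=None):
        self._values = values
        # Registry of weak references to every object sharing this buffer.
        self._refs = refs if refs is not None else weakref.WeakSet()
        self._refs.add(self)

    def __getitem__(self, key):
        # Slicing returns a NumPy view and registers the child on the same buffer.
        return CowSeries(self._values[key], self._refs)

    def __setitem__(self, key, value):
        if len(self._refs) > 1:
            # This object views, or is viewed by, another live object:
            # copy its own data first and detach from the shared registry.
            self._values = self._values.copy()
            self._refs.discard(self)
            self._refs = weakref.WeakSet([self])
        self._values[key] = value


parent = CowSeries(np.arange(5))
child = parent[1:4]
child[0] = 100            # the copy happens inside `child` only
print(parent._values, child._values)
```

In this sketch a write only ever changes the object being written to; the single-column-as-view variant would instead need a write to one object to trigger copies in other objects.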
On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
> Short summary of the proposal:
>
> 1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way) or any method returning a new DataFrame, always *behaves as if it were* a copy in terms of user API.

To explicitly call out the column-as-Series case (since this is a typical case that right now *always* is a view): "any" indexing operation thus also includes accessing a DataFrame column as a Series (or slicing a Series).
So something like s = df["col"] and then mutating s will no longer update df. Similarly for series_subset = series[1:5], mutating series_subset will no longer update series.
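A small sketch of the column-as-Series case under the proposed semantics. It assumes a pandas version (1.5 or later) that ships the `mode.copy_on_write` option; the proof of concept discussed in this thread predates that option, so this approximates the proposed behaviour and is not the PR itself.

```python
import warnings

import pandas as pd

# Assumption: pandas >= 1.5, where the "mode.copy_on_write" option exists.
# Newer releases may warn that the option is deprecated, hence the filter.
try:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        pd.set_option("mode.copy_on_write", True)
except Exception:
    pass  # option unavailable; behaviour then depends on the pandas version

df = pd.DataFrame({"col": [1, 2, 3]})

s = df["col"]      # behaves as a copy in terms of user API
s.iloc[0] = 100    # mutating s ...

print(s.tolist())
print(df["col"].tolist())   # ... does not update df
```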
Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
data.iloc[:, c] = (data.iloc[:, c] - data.iloc[:, c].mean()) / data.iloc[:, c].std()
This would not make any copies under any of the scenarios being discussed, including the status quo.
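For reference, a runnable version of this column-wise standardization on a small made-up frame. It only shows the operation itself; it makes no claim about which copies pandas makes internally.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 30.0, 50.0, 90.0]}
)

# Standardize each column in turn: subtract its mean, divide by its
# (sample) standard deviation, and assign the result back by position.
for c in range(data.shape[1]):
    data.iloc[:, c] = (data.iloc[:, c] - data.iloc[:, c].mean()) / data.iloc[:, c].std()

print(data)
```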
> And the other one is the same as above, but for rows.

With ArrayManager, the `data.iloc[r]` will make a copy, but the CoW doesn't affect that. No copies with BlockManager, regardless of CoW.
On Mon, Jul 26, 2021 at 2:51 AM Adrin <adrin.jalali@gmail.com> wrote:
There are two cases that I think are relevant here (as opposed to the ArrayManager discussion), but I may be wrong.
The two cases I'm thinking of, in a simple non-optimized way, are, in pseudocode:
    for column in columns_of(data):
        data[:, column] = (data[:, column] - mean(data[:, column])) / std(data[:, column])
And the other one is the same as above, but for rows.
Also, one issue I have, is that if we're doing copy-on-write, then what does the above mean? As in, if I do `df["column_A"] = ....`, where is that copy? How do I access the new one as opposed to the old one?
On Fri, Jul 23, 2021 at 10:09 PM Brock Mendel <jbrockmendel@gmail.com> wrote:
> Memory implications should be positive (less copying).

This is accurate _only_ in cases where we currently make copies. In cases where we currently make views, the perf effect goes the other way.
On the flip side, Always-Views improves perf in cases where we currently make copies, but if you want a copy then you'll have to make one explicitly which will claw back that gain. (In the long-out-of-date proof of concept https://github.com/pandas-dev/pandas/pull/33597 df[np.random.randint(0, 30, 30)] was ~92% faster than the status quo at the time)

> The performance impact of the additional logic (adding/checking of the weak references) is something I didn't yet check (on my to do list), but I suspect it to not be significant.

Agreed the CoW logic itself should be negligible outside of microbenchmarks.
On Fri, Jul 23, 2021 at 12:32 PM Joris Van den Bossche < jorisvandenbossche@gmail.com> wrote:
On Tue, 20 Jul 2021 at 16:10, Adrin <adrin.jalali@gmail.com> wrote:
> I guess one question I have is what are the memory and time performance implications of the proposed change.

Memory implications should be positive (less copying). The performance impact of the additional logic (adding/checking of the weak references) is something I didn't yet check (on my to do list), but I suspect it to not be significant.
Based on your comment of "numpy array with column names", I think the potential change of the ArrayManager is much more relevant for you than the Copy-on-Write. And to be clear, the current proposal is not tied to the ArrayManager (it's only the proof of concept that is implemented for that). So I would prefer to keep the discussion focused on the copy/view semantics, at least for now (it's only later, when discussing practical ways to get this released, that we need to decide whether we want to combine this with an ArrayManager refactor or not).

> I guess I belong to the group of users who think of a pandas DataFrame more as a numpy array with column names attached to them, and hence I'd expect very similar semantics when indexing, and I think copy on write semantics would have a significant impact on our workflows.

I assume you are also thinking of scikit-learn-like workflows? Can you give an example of how you think copy-on-write would impact such (or other) workflows? In any case, thanks already for your feedback!
On Fri, 23 Jul 2021 at 22:09, Brock Mendel <jbrockmendel@gmail.com> wrote:
>> Memory implications should be positive (less copying).
>
> This is accurate _only_ in cases where we currently make copies. In cases where we currently make views, the perf effect goes the other way.

Yes, but to be clear: only when you mutate an object. As long as you don't do that (which I think is the majority of operations), we will keep making views where we currently do that already.
On Mon, 26 Jul 2021 at 18:38, Brock Mendel <jbrockmendel@gmail.com> wrote:
> data.iloc[:, c] = (data.iloc[:, c] - data.iloc[:, c].mean()) / data.iloc[:, c].std()
>
> This would not make any copies under any of the scenarios being discussed, including the status quo.

One small point: this might depend on whether we keep `[:, col]` as a special case replacing the column altogether (as we currently still do, I think, related to some recent discussions), or if we see it as an in-place mutation of the existing column with a slice (which just happens to be a "full" slice). In the second case, this could actually trigger copy-on-write since the same column is also accessed (only as temporary variable, but Python might not yet have garbage collected it).
On Mon, 26 Jul 2021 at 11:51, Adrin <adrin.jalali@gmail.com> wrote:
> .... Also, one issue I have, is that if we're doing copy-on-write, then what does the above mean? As in, if I do `df["column_A"] = ....`, where is that copy? How do I access the new one as opposed to the old one?

I am not fully sure if I understand your question correctly, but something like `df["column_A"] = ....` still edits the DataFrame in place. So here there is no "new" or "old" version of the DataFrame. That specific example replaces a full column and will not trigger a copy (as it doesn't edit the specific column's data in place), but if you take something like `df.loc[mask, "column_A"] = ...`, the possible copy happens inside df: if "column_A" is a view / being viewed, then the underlying array for this column first gets copied before being mutated. So the copy happens on the level of the array. But the DataFrame df itself is still mutated in place (the array for "column_A" gets replaced with a copy of it), so also here there is no "old"/"new" version of the DataFrame. Does that answer the question, or can you otherwise clarify your question?
Joris
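The point that the copy happens on the level of the array, while `df` stays the same object, can be sketched as follows. This again assumes a pandas version (1.5 or later) with the `mode.copy_on_write` option as a stand-in for the proposal; the names `viewer` and `same_object` are only illustrative.

```python
import warnings

import pandas as pd

# Assumption: pandas >= 1.5, where the "mode.copy_on_write" option exists.
try:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        pd.set_option("mode.copy_on_write", True)
except Exception:
    pass  # option unavailable; behaviour then depends on the pandas version

df = pd.DataFrame({"column_A": [1, 2, 3], "column_B": [4, 5, 6]})
same_object = df                 # a second name for the same DataFrame
viewer = df["column_A"]          # shares column_A's data until someone writes

mask = df["column_B"] > 4
df.loc[mask, "column_A"] = 0     # df copies the column internally, then writes

print(df["column_A"].tolist())   # df itself was modified in place
print(viewer.tolist())           # the earlier Series keeps the old values
print(same_object is df)         # still one DataFrame, no "old"/"new" version
```

The user never has to retrieve a copy: after the assignment, `df` simply holds the modified data.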
I guess as a user, I find it odd that the behavior is different with and without a mask. So does that mean `df.loc[mask, "column_A"] = ...` is not a valid operation? Because I guess I've lost that copy which holds the modified data, right?
Silly question: why not move the other way around, i.e. always modify the original data, unless the user does a `copy()`? Is that not more intuitive to people?
On Mon, Aug 9, 2021 at 6:53 PM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
On Tue, 10 Aug 2021 at 12:52, Adrin <adrin.jalali@gmail.com> wrote:
> Silly question: why not move the other way around, i.e. always modify the original data, unless the user does a `copy()`? Is that not more intuitive to people?

That's certainly not a silly question :) That's an option as well, and somewhat related to the "indexing on columns always gives a view" mentioned by Brock above. The alternatives section in the google doc <https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thy...> also mentions a few reasons to prefer copy-on-write IMO. Some points on this:
1) First, we can't "always modify the original data", since that is only possible when we have a view of the original data. That might be obvious for someone (like you and me) familiar with numpy, but if you don't have this background, that's not necessarily the case (I am not sure numpy's copy/view rules are necessarily intuitive, unless you are familiar with memory layout). So we still need some rules. The selection of columns can always be a view, as proposed by Brock. But someone should then make a more complete proposal for how to handle row selection: always copy, or follow numpy rules? (i.e. basically a slice is a view, otherwise a copy)
You also get things like `df.iloc[[0, 1, 2], :]` being a copy and `df.iloc[:, [0, 1, 2]]` being a view. Of course that's explainable (i.e. since the storage is columnar, different copy/view rules apply to selecting rows vs columns), but IMO not necessarily simpler than the proposal where both cases act as a copy. Or that `df[0:5]['col'] = ..` works but `df[mask]['col'] = ...` doesn't work.
2) For indexing it's certainly an open question what is most intuitive, but I think for *methods* that return a new DataFrame, people generally expect that the result and the original don't modify each other. And for me, this is one of the main reasons for this proposal: I want to improve the efficiency of methods so that they don't have to copy the dataframe by default (methods like rename, (re)set_index, drop columns, etc). In my mind, the most logical way to do this is copy-on-write. Of course, wanting copy-on-write for methods doesn't mean that we can't do something different for indexing (although what about methods that basically are equivalent to an indexing operation .. ?). But, from an implementation point of view, I am not sure it would actually be technically possible to sometimes do copy-on-write and sometimes not (probably possible in theory, but a lot more complicated; see also one of my previous answers (https://mail.python.org/pipermail/pandas-dev/2021-July/001368.html) on having a single column as view).
3) Personally, I don't think that I ever (at least not often) had the use case where I intentionally wanted to modify a parent dataframe by modifying a subsetted child dataframe (explicit chained indexing aside). So also from that point of view, I find the "always (if possible) modify the original data" less interesting than the potential performance benefits / the IMO simpler rule of never modifying.
Joris
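To illustrate point 2, a sketch of a chained-method workflow. It assumes a pandas version (1.5 or later) that has the `mode.copy_on_write` option; whether the intermediate results actually share memory is an implementation detail that differs between versions, so that is only printed, not relied upon.

```python
import warnings

import numpy as np
import pandas as pd

# Assumption: pandas >= 1.5, where the "mode.copy_on_write" option exists.
try:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        pd.set_option("mode.copy_on_write", True)
except Exception:
    pass  # option unavailable; behaviour then depends on the pandas version

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Each method in the chain returns a new DataFrame object.
result = df.rename(columns={"a": "A"}).reset_index(drop=True)

# May or may not be True depending on the pandas version.
print(np.shares_memory(df["a"].to_numpy(), result["A"].to_numpy()))

# Modifying the result never propagates back to the original.
result.loc[0, "A"] = 100
print(df["a"].tolist())
print(result["A"].tolist())
```

With copy-on-write, the methods can skip the data copy up front and defer it until one of the objects is actually modified.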
A couple of thoughts from the discussion on today's call:
1) A lot of the discussion about the indexing behavior revolved around "users expect X". I fundamentally do *not* want to be in the business of speculating about this.
2) I find the case for CoW more compelling for the chained methods usage `frame.rename(...).reset_index(...).set_index(...)`. If we had a viable way to implement CoW for these independently of the indexing, that would be a slam dunk. Alternatively, we could get a lot of the benefits from a `copy` keyword in the pertinent methods (explicit, better than implicit).
On Tue, Aug 10, 2021 at 3:14 PM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
Another follow-up of the discussion we had yesterday: we talked about when objects get modified and when not (in this proposal), and basically the rule would be: *"the only way to modify an object (DataFrame or Series) is to modify the object itself directly"*, or stated in another way: you can never modify an object by modifying a different object (modifications are never propagated, as you would have with numpy views). In Python, we need to take into account "object identity" then (because you can still have multiple variables/names pointing to the same object), and I added a section trying to explain that with an example in the google doc: https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thy...
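A short sketch of the object-identity point. This is not the example from the google doc; it is a made-up one, and it assumes a pandas version (1.5 or later) with the `mode.copy_on_write` option to approximate the proposed behaviour.

```python
import warnings

import pandas as pd

# Assumption: pandas >= 1.5, where the "mode.copy_on_write" option exists.
try:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        pd.set_option("mode.copy_on_write", True)
except Exception:
    pass  # option unavailable; behaviour then depends on the pandas version

df = pd.DataFrame({"a": [1, 2, 3]})

alias = df          # a second name for the *same* object
subset = df[:]      # a different object (a new DataFrame)
print(alias is df, subset is df)

alias.loc[0, "a"] = 100    # modifies the one object both names refer to
subset.loc[1, "a"] = -1    # modifies only `subset`

print(df["a"].tolist())
print(subset["a"].tolist())
```

Writing through `alias` changes `df` because they are one object, not because a modification was propagated; `subset` and `df` never affect each other.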
(trying to revive this discussion)
Some assorted comments on the last emails in this thread / comments on the google doc (and I will follow up with a separate email about the single-Series-from-DataFrame-as-view issue).
- A small note about "users' expectations": I am not going to say this is easy (on the contrary, this is one of the hardest parts of being a library author, IMO), but we are creating tools to be used by users. So while designing those tools, I think it is essential to think about how users will use the library / how they think something works / what they need / what they find intuitive / etc (thus, related to their expectations). And because this is a hard problem (and subjective), it would be good to get some more feedback from others on the proposed semantics from the usage point of view. I think the current proposal will be simpler to grasp and reason about, especially for new users, but I certainly don't hold the truth on this aspect (and there are different options that are all simpler than the current situation).
- On the google doc, Adrin made an interesting comment, quoting a part of that:
> I understand a slice and a mask are fundamentally different, but I don't think from the perspective of a user they're different. The user is selecting a subset of the original data. ... Reading through this document I understand why users (and I occasionally) would get the pandas warnings telling us we're modifying something which is not the original object, but it always puzzled me since I didn't expect a slice or a mask to create a copy.
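For readers less familiar with the NumPy behaviour behind this comment, a minimal sketch of the slice-versus-mask difference on a plain array (the array contents are arbitrary):

```python
import numpy as np

arr = np.arange(6)

by_slice = arr[1:4]        # basic slicing: a view on arr's memory
by_mask = arr[arr > 2]     # boolean-mask indexing: a new, independent array

print(np.shares_memory(arr, by_slice))
print(np.shares_memory(arr, by_mask))

by_slice[0] = 100          # writes through to arr
by_mask[0] = -1            # does not touch arr
print(arr)
```

Both expressions select a subset of the data, yet only one of them can be used to modify the original.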
This is an interesting point, and I think one of the crucial aspects that the proposal tries to address. In short: while using a slice and using a mask are both ways to select a subset of your original data, when it comes to copy/view semantics they *are* fundamentally different for numpy arrays (a slice gives a view, a mask gives a copy). Currently, those numpy rules "leak" through to pandas, although not in exactly the same way and not fully consistently. So we expect a pandas user to know those numpy concepts (views / fancy indexing), and to know how the pandas rules differ from them. If we want pandas users not to have to know this, I think the most sensible option is to make both behave as a copy (which is what the copy-on-write proposal does). I added a new section about this (relation with numpy views and differences) in the google doc: https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thy...
On Thu, 12 Aug 2021 at 01:45, Brock Mendel <jbrockmendel@gmail.com> wrote:
> 2) I find the case for CoW more compelling for the chained methods usage `frame.rename(...).reset_index(...).set_index(...)`. If we had a viable way to implement CoW for these independently of the indexing, that would be a slam dunk. Alternatively, we could get a lot of the benefits from a `copy` keyword in the pertinent methods (explicit, better than implicit).
Based on my intuition from implementing the POC, I don't think it would be feasible to have CoW in some cases and normal views (eg when selecting columns from a DataFrame) in other cases (but you are certainly welcome to experiment with it as well). Personally, I think adding keywords alone would not be a sufficient/satisfying solution, as I would like those methods not to copy by default, while keeping the behaviour of returning a new object (one that doesn't modify the parent if mutated). In addition, there are also methods that do indexing-like operations (reindex on columns, filter), and I think it would be surprising if those behaved differently from the indexing operations (getitem).
On Thu, 12 Aug 2021 at 01:45, Brock Mendel <jbrockmendel@gmail.com> wrote:
> A couple of thoughts from the discussion on today's call:
> 1) A lot of the discussion about the indexing behavior revolved around "users expect X". I fundamentally do *not* want to be in the business of speculating about this.
> 2) I find the case for CoW more compelling for the chained methods usage `frame.rename(...).reset_index(...).set_index(...)`. If we had a viable way to implement CoW for these independently of the indexing, that would be a slam dunk. Alternatively, we could get a lot of the benefits from a `copy` keyword in the pertinent methods (explicit, better than implicit).
I would like to highlight a comment that Stephan made earlier in this thread about accessing a DataFrame column as a Series:
> A simpler variant would be to make indexing out a single Series from a DataFrame return a view, with everything else doing copy on write. Then the existing pattern df.column_one[:] = ... would still work.
In the old issue about this, Stephan also mentioned this option (see eg https://github.com/pandas-dev/pandas/issues/10954#issuecomment-136521398 and https://github.com/pandas-dev/pandas/issues/10954#issuecomment-136816312 ). For me, this is one of the aspects of the proposal I am least sure about. On the one hand, it would certainly help the transition ("df[col][..] = .." is a case we currently don't warn about and that would stop working with pure CoW, but would keep working with this modification). It also fits the idea of seeing a DataFrame as a "dict of Series" objects. On the other hand, it adds complication, because it inherently adds a special case to the rules. It might also result in some confusing corner cases (see eg the example I gave earlier in this thread at https://mail.python.org/pipermail/pandas-dev/2021-July/001368.html). What are people's thoughts on this aspect? This would also complicate the implementation, but I now think it might be possible to do, if we preferred this behaviour (eg by turning a SingleBlockManager into a wrapper around the parent DataFrame's BlockManager, so that it directly references the original DataFrame's data instead of an independent array).
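A minimal sketch of the transition concern, with an example frame and column names that are assumptions for illustration only. The chained form is left as a comment because whether it updates `df` depends on the pandas version and on which variant of this proposal is adopted. The single-step `.loc` form is standard pandas API and modifies `df` directly, so it does not depend on that outcome; presenting it as the replacement spelling is my reading and is not stated in this thread.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Chained form discussed above (not executed here: it first selects a
# Series from df and then assigns into that Series, so its effect on df
# depends on the copy/view rules in force):
#     df["a"][0] = 10

# Single-step form: one setitem call on df itself.
df.loc[0, "a"] = 10
df.loc[df["b"] > 4, "b"] = 0

assert df["a"].tolist() == [10, 2, 3]
assert df["b"].tolist() == [4, 0, 0]
```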
participants (7)
- Adrin
- Brock Mendel
- Joris Van den Bossche
- Marc Garcia
- Stephan Hoyer
- Tom Augspurger
- Wes McKinney