Re: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write
Joris: I finally had some time to study our conversation from July, reread the Google docs proposal, and I tried out the PR as well.

What I'm struggling with is how we document where behavior will change. As an example, the following sequence will give different results:

Current behavior:

    df = pd.DataFrame({"a":[10,11,12,13,14], "b": [100,101,102,103,104]})
    df["a"].loc[2] = 112
    df
         a    b
    0   10  100
    1   11  101
    2  112  102
    3   13  103
    4   14  104

New behavior: (from the PR):

    df = pd.DataFrame({"a":[10,11,12,13,14], "b": [100,101,102,103,104]})
    df["a"].loc[2] = 112
    df
        a    b
    0  10  100
    1  11  101
    2  12  102
    3  13  103
    4  14  104

But in both cases, the following works:

    df.loc[3,"b"] = 999
    df
        a    b
    0  10  100
    1  11  101
    2  12  102
    3  13  999
    4  14  104
So my concern is that if you had existing code that used the pattern df["a"].loc[2] = 112 , you'd get no warning that the behavior had changed. What I don't know is how much of code in the wild assumes the current behavior.

So my questions are now:
1. How will we document, in a clean and concise way, the new behavior for people with existing pandas code?
2. How can people find pandas code where the behavior will change? Can we list all patterns that would produce different results? Can we detect chained indexing with setitem calls?
3. I'm guessing there is lots of code where people use DataFrame.copy() to avoid the SettingWithCopy warning. Can they just remove those copies now and their code will work?

I agree that for new users, this new way of doing things makes sense. I'm worried about how we make the transition easier for people with large code bases that use pandas.

-Irv

On Sat, 17 Jul 2021 at 20:51, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
On Fri, 16 Jul 2021 at 20:50, Irv Lustig <irv@princeton.com> wrote:
Tom Augspurger <tom.augspurger88@gmail.com> wrote:
I wonder if we can validate what users (new and old) *actually* expect?
Users coming from R, which IIRC implements Copy on Write for matrices, might be OK with indexing always being (behaving like) a copy. I'm not sure what users coming from NumPy would expect, since I don't know how many NumPy users really understand *a.)* when a NumPy slice is a view or copy, and *b.)* how a pandas indexing operation translates to a NumPy slice.
IMHO, we should concentrate on the "new" users. For my team, there is no numpy or R background. They learn pandas, and what pandas does needs to be really clear in behavior and documentation. I would also hazard a guess that most pandas users are like that - pandas is the first tool they see, not numpy or R.
The places where I think confusion could happen are things like this with a DataFrame df :

    s = df["a"]
    s.iloc[3:5] = [1, 2, 3]
    df["a"].iloc[3:5] = [1, 2, 3]
    df["b"] = df["a"]
    df["b"].iloc[3:5] = [4, 5, 6]
    s2 = df["b"]
    df["c"] = s2
    s2.iloc[3:5] = [7, 8, 9]

As I understand it (please correct me if I'm wrong), these lines would be interpreted as follows with the current proposal:
It's a bit different (to reiterate, with the *current* proposal, *any* indexing operation (including series selection) behaves as a copy; and also to be clear, this is one possible proposal, there are certainly other possibilities). Answering case by case:
1. s = df["a"] Creates a view into the DataFrame df. No copying is done at all
Indeed a view (but that's an implementation detail)
2. s.iloc[3:5] = [1, 2, 3] Modifies the series s and the underlying DataFrame df. (copy-on-write)
Due to copy-on-write, it does *not* modify the DataFrame df.
Copy-on-write means that only when s is being written to, its data get copied (so at that point breaking the view-relation with the parent df)
3. df["a"].iloc[3:5] = [1, 2, 3] Modifies the dataframe
This is an example of chained assignment, which in the current proposal never works (see the example in the google doc). This is because chained assignment can always be written as:

    temp = df["a"]
    temp.iloc[3:5] = [1, 2, 3]

and `temp` uses copy-on-write (and then it is the same example as the one above in 2.).
(what you describe is the current behaviour of pandas)
4. df["b"] = df["a"] Copies the series from "a" to "b"
It would indeed behave as a copy, but under the hood we can actually
keep this as a view (delay the copy thanks to copy-on-write).
5. df["b"].iloc[3:5] = [4, 5, 6] Modifies "b" in the DataFrame, but not "a"
Also doesn't modify "b" (see example 3. above), but indeed does not
modify "a"
6. s2 = df["b"] Create a view into the DataFrame df. No copying is done at all.
Same as 1.
7. df["c"] = s2 Copies the series from "b" to "c"
Same as 4.
8. s2.iloc[3:5] = [7, 8, 9] Modifies s2, which modifies "b", but NOT "c"
Doesn't modify "b" and "c". Similar to 3.
I think the challenge is explaining the sequence 6,7,8 above in comparison to the other sequences.
So with the current proposal, the sequence 6, 7, 8 actually doesn't behave differently. But it is mainly 2 and 3 that would be quite different compared to the current pandas behaviour.
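
The case-by-case answers above can be mimicked with a toy model. This is only a sketch of the copy-on-write idea in plain Python, not pandas's actual implementation: `ToyFrame`, `CowColumn`, `set_value` and `shares_buffer_with` are hypothetical names made up for the illustration. Since the slice 3:5 covers two positions, two values are assigned.

```python
class CowColumn:
    """Toy column that shares its parent's buffer until the first write."""

    def __init__(self, buffer):
        self._buffer = buffer        # shared with the parent at first
        self._owns_data = False

    def __setitem__(self, index, value):
        if not self._owns_data:
            # The "copy" in copy-on-write: detach from the parent now.
            self._buffer = list(self._buffer)
            self._owns_data = True
        self._buffer[index] = value

    def __getitem__(self, index):
        return self._buffer[index]

    def shares_buffer_with(self, buffer):
        return self._buffer is buffer


class ToyFrame:
    """Toy stand-in for a DataFrame: a dict of named lists."""

    def __init__(self, columns):
        self._columns = {name: list(vals) for name, vals in columns.items()}

    def __getitem__(self, name):
        # Selecting a column hands out a lazy copy-on-write view.
        return CowColumn(self._columns[name])

    def set_value(self, name, index, value):
        # Single-step write on the frame itself (the df.loc[...] analogue).
        self._columns[name][index] = value

    def values(self, name):
        return list(self._columns[name])


df = ToyFrame({"a": [10, 11, 12, 13, 14]})

s = df["a"]                                   # case 1: nothing copied yet
shared_before = s.shares_buffer_with(df._columns["a"])
s[3:5] = [1, 2]                               # case 2: s copies, then writes
shared_after = s.shares_buffer_with(df._columns["a"])

df["a"][3:5] = [1, 2]                         # case 3: writes to a temporary
df.set_value("a", 4, 99)                      # direct write reaches the frame

print(shared_before, shared_after)            # True False
print(s[3:5], df.values("a"))                 # [1, 2] [10, 11, 12, 13, 99]
```

In this model the chained form in case 3 is just case 2 applied to an unnamed temporary, which is why neither reaches the parent.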
-Irv
Thanks for testing the branch and the feedback, Irv!

Related to your concern about how users will know or get notified about behaviour that will change: the branch you tested is a proof-of-concept for the *final* behaviour, and so I didn't (yet) add warnings for such cases. So that's the simple reason why a case like df["a"].loc[2] = 112 didn't trigger a warning.

But I agree that this is important, and it's certainly the idea that we will have a pandas release (before actually changing the behaviour) where the cases like above that will change behaviour trigger a deprecation warning about this. We will need to see a bit how to implement this, though, and it might become quite complex. But if we are convinced that the final behaviour is better, I think this is certainly worth it (and only temporary).

On Wed, 15 Dec 2021 at 15:54, Irv Lustig <irv@princeton.com> wrote:
Joris: I finally had some time to study our conversation from July, reread the Google docs proposal, and I tried out the PR as well.
What I'm struggling with is how we document where behavior will change. As an example, the following sequence will give different results:
Current behavior:

    df = pd.DataFrame({"a":[10,11,12,13,14], "b": [100,101,102,103,104]})
    df["a"].loc[2] = 112
    df
         a    b
    0   10  100
    1   11  101
    2  112  102
    3   13  103
    4   14  104

New behavior: (from the PR):

    df = pd.DataFrame({"a":[10,11,12,13,14], "b": [100,101,102,103,104]})
    df["a"].loc[2] = 112
    df
        a    b
    0  10  100
    1  11  101
    2  12  102
    3  13  103
    4  14  104

But in both cases, the following works:

    df.loc[3,"b"] = 999
    df
        a    b
    0  10  100
    1  11  101
    2  12  102
    3  13  999
    4  14  104

So my concern is that if you had existing code that used the pattern df["a"].loc[2] = 112 , you'd get no warning that the behavior had changed. What I don't know is how much of code in the wild assumes the current behavior.
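
For existing code that relies on the chained pattern, one possible rewrite is a single `.loc` call on the DataFrame, which is the form reported above as working in both cases. The sketch below assumes only a working pandas install. Because the chained form gives different results depending on the pandas version and on whether copy-on-write is active, it reports that outcome instead of asserting it. `make_df` is a helper introduced here for the example.

```python
import warnings

import pandas as pd


def make_df():
    # Same data as in the example above.
    return pd.DataFrame({"a": [10, 11, 12, 13, 14],
                         "b": [100, 101, 102, 103, 104]})


# Chained form: two indexing steps, the second on an intermediate Series.
chained = make_df()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # some pandas versions warn here
    chained["a"].loc[2] = 112
print("chained form updated df:", bool(chained.loc[2, "a"] == 112))

# Single-step form: one setitem call on the DataFrame itself.
direct = make_df()
direct.loc[2, "a"] = 112
print("single .loc call updated df:", bool(direct.loc[2, "a"] == 112))
```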
So my questions are now: 1. How will we document, in a clean and concise way, the new behavior for people with existing pandas code?
Given that the new behaviour makes more sense than the current behaviour (in my opinion, and I think yours as well based on your email), it should actually be easier to properly document it :) But joking aside, yes, we will certainly need to put effort in creating a very good set of documentation on this topic (the google doc could be a starting point).
2. How can people find pandas code where the behavior will change? Can we list all patterns that would produce different results? Can we detect chained indexing with setitem calls?
The documentation can certainly list lots of patterns, but is of course always based on examples. As mentioned above, I think we should be able to catch most / all cases in setitem where behaviour will change, and trigger a warning about this. This will be quite some work (probably even more than the actual implementation that I currently did), but I am convinced this is possible and worth it.
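
Separately from runtime warnings inside pandas, chained indexing in an assignment target has a recognisable syntactic shape, so a rough static scan is possible. The following is a heuristic sketch using only the standard-library `ast` module, not the detection mechanism discussed for pandas itself. `find_chained_setitem` and `INDEXERS` are names invented for this example. It cannot see the two-statement `temp = df["a"]; temp.iloc[...] = ...` form, and it will also flag nested writes on non-pandas objects such as `d["x"]["y"] = 1`.

```python
import ast

# Indexer attributes that may sit between the two subscripts.
INDEXERS = {"loc", "iloc", "at", "iat"}


def _is_chained_target(target):
    """True for targets shaped like x[...][...] or x[...].loc[...]."""
    if not isinstance(target, ast.Subscript):
        return False
    inner = target.value
    if isinstance(inner, ast.Attribute) and inner.attr in INDEXERS:
        inner = inner.value
    return isinstance(inner, ast.Subscript)


def find_chained_setitem(source):
    """Return line numbers of assignments with a chained-indexing target."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AugAssign):
            targets = [node.target]
        else:
            continue
        if any(_is_chained_target(t) for t in targets):
            hits.append(node.lineno)
    return sorted(hits)


example = '''
df["a"].loc[2] = 112
df.loc[3, "b"] = 999
df[df["a"] > 11]["b"] = 0
temp = df["a"]
temp.iloc[3:5] = [1, 2]
'''
print(find_chained_setitem(example))  # [2, 4]
```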
3. I'm guessing there is lots of code where people use DataFrame.copy() to avoid the SettingWithCopy warning. Can they just remove those copies now and their code will work?
Yes, I think so. Especially if you did "copy" for avoiding the warning, you were never modifying the original parent dataframe, which will become the default/automatic behaviour with the proposal.
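
A small example of the defensive-copy pattern in question, assuming only a working pandas install. The subset and column names are chosen for illustration. With the explicit `.copy()`, the parent is untouched on any pandas version.

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 11, 12, 13, 14],
                   "b": [100, 101, 102, 103, 104]})

# Defensive copy, commonly added to silence SettingWithCopyWarning.
subset = df[df["a"] > 11].copy()
subset["b"] = 0

print(subset["b"].tolist())  # [0, 0, 0]
print(df["b"].tolist())      # [100, 101, 102, 103, 104]
```

Under the proposal as described in this thread, the expectation is that dropping the `.copy()` call would give the same result, since the subset would behave as a copy anyway.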
I agree that for new users, this new way of doing things makes sense. I'm worried about how we make the transition easier for people with large code bases that use pandas.
It's indeed a big change, that will impact quite some people, and can be a big task to update for large code bases. So I think we need to take care about this and really put effort in this aspect: ensuring we have good deprecation warnings, a very good migration guide, reach out to (big) users to check how the migration goes so we can improve this migration path, etc. This is a lot of work of course, but I think a necessity if we want this to be a success, and we also have some funding from the CZI grant specifically for this aspect of the larger roadmap items.

Joris
-Irv
Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
Participants (2): Irv Lustig, Joris Van den Bossche