Joris: I finally had some time to study our conversation from July, reread the Google docs proposal, and I tried out the PR as well. What I'm struggling with is how we document where behavior will change. As an example, the following sequence will give different results:

Current behavior:

    import pandas as pd

    df = pd.DataFrame({"a": [10, 11, 12, 13, 14], "b": [100, 101, 102, 103, 104]})
    df["a"].loc[2] = 112
    df

         a    b
    0   10  100
    1   11  101
    2  112  102
    3   13  103
    4   14  104
New behavior (from the PR):

    df = pd.DataFrame({"a": [10, 11, 12, 13, 14], "b": [100, 101, 102, 103, 104]})
    df["a"].loc[2] = 112
    df

        a    b
    0  10  100
    1  11  101
    2  12  102
    3  13  103
    4  14  104
But in both cases, the following works:

    df.loc[3, "b"] = 999
    df

        a    b
    0  10  100
    1  11  101
    2  12  102
    3  13  999
    4  14  104
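To check which of the two results a given pandas installation produces, a small script along these lines may help. It is only a sketch and not part of the PR: the name `chained_write_reached_df` is made up here, and the warning filter is there only to keep the output short.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": [10, 11, 12, 13, 14], "b": [100, 101, 102, 103, 104]})

# Chained assignment: df["a"] is evaluated first, then .loc[2] = 112 is applied
# to that intermediate Series. Whether df sees the write depends on the pandas
# version and on whether copy-on-write is active.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide chained-assignment warnings for this demo
    df["a"].loc[2] = 112
chained_write_reached_df = bool(df.loc[2, "a"] == 112)
print("chained write reached df:", chained_write_reached_df)

# Single-step assignment: one .loc call on df itself, so there is no
# intermediate object and the write lands in df in both behaviors.
df.loc[2, "a"] = 112
df.loc[3, "b"] = 999
print(df)
```

The first print is the part that differs between the current and the proposed behavior; the final frame should be the same either way.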
So my concern is that if you had existing code that used the pattern df["a"].loc[2] = 112, you'd get no warning that the behavior had changed. What I don't know is how much code in the wild assumes the current behavior.

So my questions are now:

1. How will we document, in a clean and concise way, the new behavior for people with existing pandas code?
2. How can people find pandas code where the behavior will change? Can we list all patterns that would produce different results? Can we detect chained indexing with setitem calls?
3. I'm guessing there is lots of code where people use DataFrame.copy() to avoid the SettingWithCopy warning. Can they just remove those copies now and their code will work?

I agree that for new users, this new way of doing things makes sense. I'm worried about how we make the transition easier for people with large code bases that use pandas.

-Irv

On Sat, 17 Jul 2021 at 20:51, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
On Fri, 16 Jul 2021 at 20:50, Irv Lustig <irv@princeton.com> wrote:
Tom Augspurger <tom.augspurger88@gmail.com> wrote:

I wonder if we can validate what users (new and old) *actually* expect? Users coming from R, which IIRC implements Copy on Write for matrices, might be OK with indexing always being (behaving like) a copy. I'm not sure what users coming from NumPy would expect, since I don't know how many NumPy users really understand *a.)* when a NumPy slice is a view or copy, and *b.)* how a pandas indexing operation translates to a NumPy slice.

IMHO, we should concentrate on the "new" users. For my team, there is no numpy or R background. They learn pandas, and what pandas does needs to be really clear in behavior and documentation. I would also hazard a guess that most pandas users are like that - pandas is the first tool they see, not numpy or R.

The places where I think confusion could happen are things like this with a DataFrame df:

    s = df["a"]
    s.iloc[3:5] = [1, 2, 3]
    df["a"].iloc[3:5] = [1, 2, 3]
    df["b"] = df["a"]
    df["b"].iloc[3:5] = [4, 5, 6]
    s2 = df["b"]
    df["c"] = s2
    s2.iloc[3:5] = [7, 8, 9]

As I understand it (please correct me if I'm wrong), these lines would be interpreted as follows with the current proposal:

It's a bit different (to reiterate, with the *current* proposal, *any* indexing operation (including series selection) behaves as a copy; and also to be clear, this is one possible proposal, there are certainly other possibilities). Answering case by case:
1. s = df["a"] Creates a view into the DataFrame df. No copying is done at all.
Indeed a view (but that's an implementation detail)
2. s.iloc[3:5] = [1, 2, 3] Modifies the series s and the underlying DataFrame df. (copy-on-write)
Due to copy-on-write, it does *not* modify the DataFrame df.
Copy-on-write means that only when s is being written to, its data get copied (so at that point breaking the view-relation with the parent df)
3. df["a"].iloc[3:5] = [1, 2, 3] Modifies the dataframe
This is an example of chained assignment, which in the current proposal never works (see the example in the google doc). This is because chained assignment can always be written as:

    temp = df["a"]
    temp.iloc[3:5] = [1, 2, 3]

and `temp` uses copy-on-write (and then it is the same example as the one above in 2.).
(what you describe is the current behaviour of pandas)
4. df["b"] = df["a"] Copies the series from "a" to "b"
It would indeed behave as a copy, but under the hood we can actually
keep this as a view (delay the copy thanks to copy-on-write).
5. df["b"].iloc[3:5] = [4, 5, 6] Modifies "b" in the DataFrame, but not "a"
Also doesn't modify "b" (see example 3. above), but indeed does not
modify "a"
6. s2 = df["b"] Create a view into the DataFrame df. No copying is done at all.
Same as 1.
7. df["c"] = s2 Copies the series from "b" to "c"
Same as 4.
8. s2.iloc[3:5] = [7, 8, 9] Modifies s2, which modifies "b", but NOT "c"
Doesn't modify "b" and "c". Similar to 3.
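The eight answers above all follow from one rule: selection hands out a view, and the first write to that view copies its data. A toy model without pandas may make that easier to follow. Everything in it is an assumption made for illustration: `CowColumn` and `Frame` are hypothetical names, not pandas classes, and real pandas tracks references differently. The writes use two values because `3:5` covers two rows.

```python
class CowColumn:
    """Toy column that shares its buffer until the first write."""

    def __init__(self, data, shared=False):
        self._data = data
        self._shared = shared  # True while another object may hold the same list

    def view(self):
        # Selecting hands out a new object over the same buffer (no copy yet).
        self._shared = True
        return CowColumn(self._data, shared=True)

    def set_slice(self, start, stop, values):
        if self._shared:
            # First write: copy the buffer, breaking the link to other holders.
            self._data = list(self._data)
            self._shared = False
        self._data[start:stop] = values

    def to_list(self):
        return list(self._data)


class Frame:
    """Toy frame: a dict of columns where getitem and setitem never copy eagerly."""

    def __init__(self, columns):
        self._cols = {name: CowColumn(list(vals)) for name, vals in columns.items()}

    def __getitem__(self, name):
        return self._cols[name].view()

    def __setitem__(self, name, column):
        self._cols[name] = column.view()


df = Frame({"a": [10, 11, 12, 13, 14]})
s = df["a"]                         # 1. view, nothing copied
s.set_slice(3, 5, [1, 2])           # 2. s copies on write, df untouched
df["a"].set_slice(3, 5, [1, 2])     # 3. the temporary copies, df untouched
df["b"] = df["a"]                   # 4. behaves like a copy, shares data for now
df["b"].set_slice(3, 5, [4, 5])     # 5. the temporary copies, "a" and "b" untouched
s2 = df["b"]                        # 6. same as 1.
df["c"] = s2                        # 7. same as 4.
s2.set_slice(3, 5, [7, 8])          # 8. s2 copies, "b" and "c" untouched

print(s.to_list(), s2.to_list())
print(df["a"].to_list(), df["b"].to_list(), df["c"].to_list())
```

In this model only `s` and `s2` end up changed; every column stored in the frame keeps its original values, which matches the case-by-case answers.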
I think the challenge is explaining the sequence 6,7,8 above in
comparison to the other sequences.
So with the current proposal, the sequence 6, 7, 8 actually doesn't behave differently. But it is mainly 2 and 3 that would be quite different compared to the current pandas behaviour.
-Irv
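On the question raised at the top of the thread, whether chained indexing with setitem calls can be detected in existing code: one possible starting point is a purely syntactic scan with Python's `ast` module. The sketch below is a heuristic, not a pandas feature. The function names are made up for illustration, it needs Python 3.9 or later for `ast.unparse`, and because it cannot know types it will also flag nested dict or list writes such as `d["x"]["y"] = 1`.

```python
import ast


def _has_inner_subscript(expr):
    # Walk down attribute access (.loc, .iloc, ...) looking for a second [...].
    while isinstance(expr, ast.Attribute):
        expr = expr.value
    return isinstance(expr, ast.Subscript)


def find_chained_setitem(source):
    """Return (line, target) pairs for assignments like df["a"].loc[2] = 112."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AugAssign):
            targets = [node.target]
        else:
            continue
        for target in targets:
            # A chained write is a subscript whose base already contains a subscript.
            if isinstance(target, ast.Subscript) and _has_inner_subscript(target.value):
                hits.append((node.lineno, ast.unparse(target)))
    return hits


sample = '''
df["a"].loc[2] = 112
df.loc[3, "b"] = 999
df["b"].iloc[3:5] = [4, 5]
'''
for line, target in find_chained_setitem(sample):
    print(f"line {line}: chained assignment to {target}")
```

This flags the first and third statements and leaves the single `df.loc[...]` call alone. It does not catch the two-step form (`temp = df["a"]` followed by `temp.iloc[...] = ...`), which would need data-flow analysis rather than a syntax check.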