Re: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write
On Fri, 16 Jul 2021 at 20:50, Irv Lustig <irv@princeton.com> wrote:
Tom Augspurger <tom.augspurger88@gmail.com> wrote:
I wonder if we can validate what users (new and old) *actually* expect?
Users coming from R, which IIRC implements Copy on Write for matrices, might be OK with indexing always being (behaving like) a copy. I'm not sure what users coming from NumPy would expect, since I don't know how many NumPy users really understand a.) when a NumPy slice is a view or copy, and b.) how a pandas indexing operation translates to a NumPy slice.
IMHO, we should concentrate on the "new" users. For my team, there is no numpy or R background. They learn pandas, and what pandas does needs to be really clear in behavior and documentation. I would also hazard a guess that most pandas users are like that - pandas is the first tool they see, not numpy or R.
The places where I think confusion could happen are things like this with a DataFrame df:
1. s = df["a"]
2. s.iloc[3:5] = [1, 2, 3]
3. df["a"].iloc[3:5] = [1, 2, 3]
4. df["b"] = df["a"]
5. df["b"].iloc[3:5] = [4, 5, 6]
6. s2 = df["b"]
7. df["c"] = s2
8. s2.iloc[3:5] = [7, 8, 9]
As I understand it (please correct me if I'm wrong), these lines would be interpreted as follows with the current proposal:
It's a bit different (to reiterate, with the *current* proposal, *any* indexing operation (including series selection) behaves as a copy; and also to be clear, this is one possible proposal, there are certainly other possibilities). Answering case by case:
1. s = df["a"] Creates a view into the DataFrame df. No copying is done at all
Indeed a view (but that's an implementation detail)
2. s.iloc[3:5] = [1, 2, 3] Modifies the series s and the underlying DataFrame df. (copy-on-write)
Due to copy-on-write, it does *not* modify the DataFrame df. Copy-on-write means that only when s is being written to, its data get copied (so at that point breaking the view-relation with the parent df)
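To make case 2 concrete, here is a minimal sketch. It assumes a pandas release that ships the opt-in `mode.copy_on_write` option (which postdates this thread), so treat it as an illustration of the proposed semantics rather than of the proposal's exact implementation. Note that `iloc[3:5]` selects two rows, so the sketch assigns two values instead of the three used in the list above.

```python
import numpy as np
import pandas as pd

# Assumption: a pandas release with the opt-in Copy-on-Write mode.
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [0, 0, 0, 0, 0, 0]})

s = df["a"]  # behaves as a copy; may still share memory with df for now
# Whether memory is shared at this point is an implementation detail.
print(np.shares_memory(s.to_numpy(), df["a"].to_numpy()))

s.iloc[3:5] = [1, 2]  # the write is what triggers the actual copy of s's data

print(s.tolist())        # [0, 0, 0, 1, 2, 0]
print(df["a"].tolist())  # [0, 0, 0, 0, 0, 0]
```

The point is that `s` changes while `df` stays as it was, regardless of whether the two shared memory before the write.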
3. df["a"].iloc[3:5] = [1, 2, 3] Modifies the dataframe
This is an example of chained assignment, which in the current proposal never works (see the example in the google doc <https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thy...>). This is because chained assignment can always be written as:

    temp = df["a"]
    temp.iloc[3:5] = [1, 2, 3]

and `temp` uses copy-on-write (and then it is the same example as the one above in 2.). (What you describe is the current behaviour of pandas.)
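A small sketch of the chained-assignment case, again assuming a pandas release with the opt-in `mode.copy_on_write` option. The final single-step `df.iloc[...]` write is my addition for contrast and is not part of the list above; the values are shortened to two to match the two-row slice.

```python
import warnings

import pandas as pd

# Assumption: a pandas release with the opt-in Copy-on-Write mode.
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [0, 0, 0, 0, 0, 0]})

with warnings.catch_warnings():
    # pandas may warn about chained assignment here; silenced for the demo.
    warnings.simplefilter("ignore")
    df["a"].iloc[3:5] = [1, 2]  # writes to a temporary Series, not to df
after_chained = df["a"].tolist()

# The same operation spelled out with an explicit temporary.
temp = df["a"]
temp.iloc[3:5] = [1, 2]
after_temp = df["a"].tolist()

# A single indexing step on df itself does update df.
df.iloc[3:5, 0] = [1, 2]
after_direct = df["a"].tolist()

print(after_chained)  # [0, 0, 0, 0, 0, 0]
print(after_temp)     # [0, 0, 0, 0, 0, 0]
print(after_direct)   # [0, 0, 0, 1, 2, 0]
```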
4. df["b"] = df["a"] Copies the series from "a" to "b"
It would indeed behave as a copy, but under the hood we can actually keep this as a view (delay the copy thanks to copy-on-write).
5. df["b"].iloc[3:5] = [4, 5, 6] Modifies "b" in the DataFrame, but not "a"
Also doesn't modify "b" (see example 3. above), but indeed does not modify "a"
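Cases 4 and 5 can be sketched the same way, under the same assumption of a pandas release with the opt-in `mode.copy_on_write` option. The direct write to "b" at the end is an extra step I added to show that "a" and "b" are independent even if they share memory under the hood.

```python
import warnings

import pandas as pd

# Assumption: a pandas release with the opt-in Copy-on-Write mode.
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [0, 0, 0, 0, 0, 0]})
df["b"] = df["a"]  # behaves as a copy; the real copy can be delayed

with warnings.catch_warnings():
    # pandas may warn about chained assignment here; silenced for the demo.
    warnings.simplefilter("ignore")
    df["b"].iloc[3:5] = [4, 5]  # chained: modifies neither "a" nor "b"
b_after_chained = df["b"].tolist()

df.iloc[3:5, 1] = [4, 5]  # column 1 is "b"; a direct write changes only "b"

print(b_after_chained)   # [0, 0, 0, 0, 0, 0]
print(df["a"].tolist())  # [0, 0, 0, 0, 0, 0]
print(df["b"].tolist())  # [0, 0, 0, 4, 5, 0]
```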
6. s2 = df["b"] Create a view into the DataFrame df. No copying is done at all.
Same as 1.
7. df["c"] = s2 Copies the series from "b" to "c"
Same as 4.
8. s2.iloc[3:5] = [7, 8, 9] Modifies s2, which modifies "b", but NOT "c"
Doesn't modify "b" or "c". Similar to 3.
I think the challenge is explaining the sequence 6,7,8 above in comparison to the other sequences.
So with the current proposal, the sequence 6, 7, 8 actually doesn't behave differently. But it is mainly 2 and 3 that would be quite different compared to the current pandas behaviour.
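For completeness, a sketch of the sequence 6, 7, 8, assuming a pandas release with the opt-in `mode.copy_on_write` option. The starting DataFrame and its values are made up for illustration, and the assigned list is shortened to two values to match the two-row slice.

```python
import pandas as pd

# Assumption: a pandas release with the opt-in Copy-on-Write mode.
pd.set_option("mode.copy_on_write", True)

# Hypothetical starting data with an existing column "b".
df = pd.DataFrame({"a": [0, 0, 0, 0, 0, 0], "b": [0, 0, 0, 0, 0, 0]})

s2 = df["b"]           # 6. behaves as a copy of column "b"
df["c"] = s2           # 7. "c" behaves as a copy of s2
s2.iloc[3:5] = [7, 8]  # 8. only s2 changes

print(s2.tolist())       # [0, 0, 0, 7, 8, 0]
print(df["b"].tolist())  # [0, 0, 0, 0, 0, 0]
print(df["c"].tolist())  # [0, 0, 0, 0, 0, 0]
```

So writing to `s2` follows the same rule as writing to `s` in case 2: the Series is updated and the DataFrame is left alone.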
-Irv
participants (1)
-
Joris Van den Bossche