On Fri, 16 Jul 2021 at 20:50, Irv Lustig <irv@princeton.com> wrote:
 
Tom Augspurger <tom.augspurger88@gmail.com> wrote:

I wonder if we can validate what users (new and old) *actually* expect?
Users coming from R, which IIRC implements Copy on Write for matrices,
might be OK with indexing always being (behaving like) a copy.
I'm not sure what users coming from NumPy would expect, since I don't know
how many NumPy users really understand *a.)* when a NumPy slice is a view
or copy, and *b.)* how a pandas indexing operation translates to a NumPy
slice.


IMHO, we should concentrate on the "new" users.  For my team, there is no numpy or R background.  They learn pandas, and what pandas does needs to be really clear in behavior and documentation.  I would also hazard a guess that most pandas users are like that - pandas is the first tool they see, not numpy or R.

The places where I think confusion could happen are things like this with a DataFrame df:
  1. s = df["a"]
  2. s.iloc[3:5] = [1, 2, 3]
  3. df["a"].iloc[3:5] = [1, 2, 3]
  4. df["b"] = df["a"]
  5. df["b"].iloc[3:5] = [4, 5, 6]
  6. s2 = df["b"]
  7. df["c"] = s2
  8. s2.iloc[3:5] = [7, 8, 9]
As I understand it (please correct me if I'm wrong), these lines would be interpreted as follows with the current proposal:

It's a bit different. To reiterate: with the *current* proposal, *any* indexing operation (including series selection) behaves as a copy. To be clear, this is one possible proposal; there are certainly other possibilities. Answering case by case:
 
1. s = df["a"]
Creates a view into the DataFrame df.  No copying is done at all

Indeed a view (but that's an implementation detail)

2. s.iloc[3:5] = [1, 2, 3]
Modifies the series s and the underlying DataFrame df.  (copy-on-write)

Due to copy-on-write, it does *not* modify the DataFrame df. Copy-on-write means that the data of s is copied only when s is written to, and that copy breaks the view relation with the parent df.
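
To make the mechanism concrete, here is a minimal pure-Python sketch of copy-on-write for cases 1 and 2. It is a toy model, not how pandas implements it: `CowColumn`, `view` and `set_slice` are hypothetical names, and a plain list stands in for the column data. Since the slice 3:5 covers two positions, the sketch assigns two values.

```python
class CowColumn:
    """Hypothetical toy column: shares its buffer until the first write."""

    def __init__(self, values, shared=False):
        self._values = values    # a plain list stands in for the data buffer
        self._shared = shared    # True while another object may use the same buffer

    def view(self):
        # Hand out a new column on the same buffer; no data is copied here.
        self._shared = True
        return CowColumn(self._values, shared=True)

    def set_slice(self, start, stop, new_values):
        if self._shared:
            # First write: copy the buffer, which breaks the link to the parent.
            self._values = list(self._values)
            self._shared = False
        self._values[start:stop] = new_values

    def to_list(self):
        return list(self._values)


parent = CowColumn([10, 20, 30, 40, 50, 60])   # plays the role of column "a" in df
s = parent.view()                               # like s = df["a"]
shares_before = s._values is parent._values     # same buffer before the write

s.set_slice(3, 5, [1, 2])                       # like s.iloc[3:5] = [1, 2]
shares_after = s._values is parent._values      # s now holds a private copy
```

After the write, `s.to_list()` contains the new values while `parent.to_list()` is unchanged. The toy keeps the parent flagged as shared even after the child has copied, so the parent may copy once more than strictly needed; a real implementation would presumably track references more precisely.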
 
3. df["a"].iloc[3:5] = [1, 2, 3]
Modifies the dataframe

This is an example of chained assignment, which in the current proposal never works (see the example in the Google doc). This is because chained assignment can always be rewritten as:

temp = df["a"]
temp.iloc[3:5] = [1, 2, 3]

and `temp` uses copy-on-write (and then it is the same example as the one above in 2.).

(what you describe is the current behaviour of pandas)
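
The rewrite with `temp` follows from how Python evaluates the statement: the left-hand side first calls `__getitem__` on df, and the assignment then calls `__setitem__` on the object that was returned, never on df itself. A small sketch with a hypothetical `Recorder` class (it only logs calls, and drops `.iloc` for brevity) shows the order:

```python
class Recorder:
    """Hypothetical stand-in that logs indexing calls instead of storing data."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __getitem__(self, key):
        self.log.append((self.name, "__getitem__", key))
        # The result is a new, temporary object, like `temp` in the rewrite.
        return Recorder(f"{self.name}[{key!r}]", self.log)

    def __setitem__(self, key, value):
        self.log.append((self.name, "__setitem__", key))


log = []
df = Recorder("df", log)
df["a"][3:5] = [1, 2]   # chained assignment

# Names of the objects whose __setitem__ was called.
setitem_targets = [name for name, method, _ in log if method == "__setitem__"]
```

The log has two entries: a `__getitem__` on `df`, then a `__setitem__` on the temporary `df['a']`. If that temporary behaves as a copy, the write cannot reach df.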
 
4. df["b"] = df["a"]
Copies the series from "a" to "b"

It would indeed behave as a copy, but under the hood we can actually keep this as a view (delay the copy thanks to copy-on-write).
 
5. df["b"].iloc[3:5] = [4, 5, 6]
Modifies "b" in the DataFrame, but not "a"

This also doesn't modify "b" (see example 3 above), and indeed does not modify "a".
 
6. s2 = df["b"]
Create a view into the DataFrame df.  No copying is done at all.

Same as 1.
 
7. df["c"] = s2
Copies the series from "b" to "c"

Same as 4.
 
8. s2.iloc[3:5] = [7, 8, 9]
Modifies s2, which modifies "b", but NOT "c"

Doesn't modify "b" or "c". Similar to 3.

I think the challenge is explaining the sequence 6,7,8 above in comparison to the other sequences.

So with the current proposal, the sequence 6, 7, 8 actually doesn't behave differently. But it is mainly 2 and 3 that would be quite different compared to the current pandas behaviour.
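
As a rough check of the eight cases, the following pure-Python sketch replays them on a toy model of the proposal. `Frame`, `Column` and `set_slice` are hypothetical names, lists stand in for column data, and this is an illustration of the described semantics rather than pandas code. The slice 3:5 covers two positions, so each step assigns two values.

```python
class Column:
    """Hypothetical toy series with copy-on-write semantics."""

    def __init__(self, buffer):
        self._buffer = buffer
        self._owns_buffer = False   # a freshly selected column may share its buffer

    def set_slice(self, start, stop, values):
        if not self._owns_buffer:
            self._buffer = list(self._buffer)   # copy on first write
            self._owns_buffer = True
        self._buffer[start:stop] = values

    def to_list(self):
        return list(self._buffer)


class Frame:
    """Hypothetical toy DataFrame: a dict of column buffers."""

    def __init__(self, data):
        self._buffers = {name: list(vals) for name, vals in data.items()}

    def __getitem__(self, name):
        # Selection shares the buffer; the copy is delayed until a write.
        return Column(self._buffers[name])

    def __setitem__(self, name, column):
        # Behaves as a copy, but shares the buffer under the hood.
        self._buffers[name] = column._buffer
        column._owns_buffer = False   # the column must copy before writing again


df = Frame({"a": [0, 0, 0, 0, 0, 0]})
s = df["a"]                       # 1: view, no copy
s.set_slice(3, 5, [1, 2])         # 2: s copies, df untouched
df["a"].set_slice(3, 5, [1, 2])   # 3: writes to a temporary, df untouched
df["b"] = df["a"]                 # 4: "b" shares the buffer of "a"
df["b"].set_slice(3, 5, [4, 5])   # 5: writes to a temporary, df untouched
s2 = df["b"]                      # 6: view, no copy
df["c"] = s2                      # 7: "c" shares the buffer of "b"
s2.set_slice(3, 5, [7, 8])        # 8: s2 copies, "b" and "c" untouched

zeros = [0, 0, 0, 0, 0, 0]
frame_unchanged = all(df[name].to_list() == zeros for name in ("a", "b", "c"))
```

In this model only s and s2 end up with new values; the three columns of df keep their original data throughout and still share a single buffer.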
 

-Irv




 
_______________________________________________
Pandas-dev mailing list
Pandas-dev@python.org
https://mail.python.org/mailman/listinfo/pandas-dev