Hi there,

I'd like to bring your attention to a proposal being discussed among pandas developers, regarding copy-on-write semantics. A very short summary of the proposal, according to the document <https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thy...>, is:

*- The result of any indexing operation (subsetting a DataFrame or Series in any way, i.e. including accessing a DataFrame column as a Series) or any method returning a new DataFrame or Series, always behaves as if it were a copy in terms of user API.*
*- We implement Copy-on-Write (as implementation detail). This way, we can actually use views as much as possible under the hood, while ensuring the user API behaves as a copy.*
*- As a consequence, if you want to modify an object (DataFrame or Series), the only way to do this is to modify that object itself directly.*

*This addresses multiple aspects: 1) a clear and consistent user API (a clear rule: any subset or returned series/dataframe always behaves as a copy of the original, and thus never modifies the original) and 2) improving performance by avoiding excessive copies (eg a chained method workflow would no longer return an actual data copy at each step). Because every single indexing step behaves as a copy, this also means that with this proposal, “chained assignment” (with multiple setitem steps) will never work.*

You can also read the related discussion on the pandas mailing list here <https://mail.python.org/pipermail/pandas-dev/2021-July/001358.html>.

It would be nice for us to think about the implications of this proposal on our work related to supporting pandas dataframes.

Cheers,
Adrin
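
For readers who want the rule in the summary made concrete, here is a toy sketch of copy-on-write in plain Python. It is not pandas code and does not reflect how pandas implements the proposal; the names `CowSeries`, `_Buffer`, `view` and `shares_memory_with` are made up for illustration. The idea shown is only the one stated above: a derived object shares data with its parent until one of them is written to, at which point the writer silently gets its own copy.

```python
class _Buffer:
    """Holds the data plus a count of objects currently sharing it."""

    def __init__(self, data):
        self.data = data
        self.refs = 1


class CowSeries:
    """Toy 1-D container with copy-on-write semantics (illustration only)."""

    def __init__(self, values):
        self._buf = _Buffer(list(values))

    def view(self):
        """Return a new object sharing the same buffer; nothing is copied yet."""
        other = CowSeries.__new__(CowSeries)
        other._buf = self._buf
        self._buf.refs += 1
        return other

    def shares_memory_with(self, other):
        return self._buf is other._buf

    def __getitem__(self, i):
        return self._buf.data[i]

    def __setitem__(self, i, value):
        if self._buf.refs > 1:
            # Another object still sees this buffer: detach onto a private copy
            # before writing, so the other object is never modified.
            self._buf.refs -= 1
            self._buf = _Buffer(list(self._buf.data))
        self._buf.data[i] = value

    def tolist(self):
        return list(self._buf.data)


original = CowSeries([1, 2, 3])
subset = original.view()           # cheap: shares the buffer
shared_before = subset.shares_memory_with(original)

subset[0] = 100                    # first write triggers the actual copy
shared_after = subset.shares_memory_with(original)
```

After the write, `original` still holds `[1, 2, 3]` while `subset` holds `[100, 2, 3]`, which is the "behaves as a copy" guarantee from the user's point of view; the copy itself was deferred until it was needed.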
Thanks for the heads up! This is interesting. We rarely update dataframe values in-place in scikit-learn, but it is good to know that we could leverage this for more efficient pandas-in pandas-out support, for instance for missing value imputation.
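
As a rough illustration of the imputation case, the sketch below uses a hypothetical helper, `impute_mean`, that takes a DataFrame and returns a new one with missing values replaced by column means. It is not scikit-learn code. The input is left untouched whether or not copy-on-write is enabled; whether the columns without missing values end up sharing memory with the input depends on the pandas version and on how the proposal is implemented, so nothing here relies on that.

```python
import numpy as np
import pandas as pd


def impute_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill missing values with per-column means.

    Returns a new DataFrame and does not modify ``df``.
    """
    return df.fillna(df.mean(numeric_only=True))


X = pd.DataFrame(
    {
        "age": [20.0, np.nan, 40.0],   # one missing value
        "height": [1.6, 1.7, 1.8],     # no missing values
    }
)
X_imputed = impute_mean(X)
```

Here only the `age` column actually changes; under the proposal, the unchanged `height` column is the kind of data that would not need to be physically copied.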
participants (2)
- Adrin
- Olivier Grisel