memory-efficient `np.cov` for `n_samples >> n_features`
Proposed new feature or change: `np.cov` centers its input by allocating `m - mean(m)`, a copy the size of `m`. When observations vastly outnumber variables, that copy dominates peak memory and wall time; the covariance matrix itself is tiny. I and potentially scikit-learn (cc. @ogrisel) would like to add an opt-in keyword that uses the algebraically equivalent post-hoc formula: cov = (m @ m.conj().T) / fact - (N / fact) * avg @ avg.conj().T It is already implemented here on scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/94af1dff678e8a7a6fc627b85a... I think this optimization should be put here, and we have many historical examples of this type of optimization on numpy. From the historical perspective: 1. In the PR #6403 (merged 2015), already halved `np.cov` memory; the ' m-mean` copy is what's left. 2. #13199 tracks the same class of problem for `np.var` and `np.std` (open). 3. #6231 proposes Welford's algorithm; that's orthogonal, because post-hoc pairs with BLAS `gemm`/`syrk` and Welford does not. #1696, #9631, #4694 cover the stability/memory tradeoff for variance-like reductions. Scikit-learn reimplements this locally and already shows strong benchmark results with this. `PCA(svd_solver="covariance_eigh")` does `C = X.T @ X - N * mean mean.T; C /= (N - 1)` ([source][sklearn-pca]) to avoid the copy. [scikit-learn#33573][] ([review comment][grisel-comment]) asks for this to move upstream. The small trade-off is that the post-hoc formula loses precision when `|avg|^2 / var(m)` is large; the reference that discusses this is Chan, T. F., Golub, G. H., & LeVeque, R. J. (1983). So the default stays safe. Callers who know their data is well-conditioned opt in. ## Proposal np.cov(m, ..., *, method="pre-center") # or "post-hoc" Weighted (`fweights`, `aweights`) and complex cases generalize without structural change. `corrcoef` could follow, but not in the first PR. Chan, T. F., Golub, G. H., & LeVeque, R. J. (1983). Algorithms for computing the sample variance: Analysis and recommendations. The American Statistician, 37(3), 242-247. [#6403]: https://github.com/numpy/numpy/pull/6403 [#13199]: https://github.com/numpy/numpy/issues/13199 [#6231]: https://github.com/numpy/numpy/issues/6231 [#1696]: https://github.com/numpy/numpy/issues/1696 [#9631]: https://github.com/numpy/numpy/issues/9631 [#4694]: https://github.com/numpy/numpy/issues/4694 [sklearn-pca]: https://github.com/scikit-learn/scikit-learn/blob/94af1dff678e8a7a6fc627b85a... [scikit-learn#33573]: https://github.com/scikit-learn/scikit-learn/pull/33573 [grisel-comment]: https://github.com/scikit-learn/scikit-learn/pull/33573#discussion_r31012455... cc. @thomasjpfan, @ogrisel, @qbarthelemy, @betatim, @rgommers, @lucascolley. and FYI: @seberg and @eric-wieser
participants (1)
-
Bru