Guido van Rossum writes:

> I was going to say that such a matrix module would be better off in PyPI, but then I recalled how the statistics module got created, and I think that the same reasoning from PEP 450 applies here too (https://www.python.org/dev/peps/pep-0450/#rationale).
> So I'd say go for it!

I disagree that that rationale applies. Let's consider where the statistics module stopped, and why I think that's the right place to stop. Simple statistics on *single* variables, as proposed by PEP 450, are useful in many contexts to summarize data sets. You see them frequently in newspaper and Wikipedia articles, serious blog posts, and even on Twitter.

PEP 450 mentions, but does not propose to provide, linear regression (presumably ordinary least squares -- as with median and mode, there are many ways to compute a regression line). The PEP mentions covariance and correlation coefficients once each, and remarks that the API design is unclear. I think that omission was the better part of valor.

Even on Twitter, it's hard to abuse the combination of mean and standard deviation (without outright lies about the values, of course). But most uses of correlation and regression are more or less abusive. That's true even in more serious venues (Murray & Herrnstein's "The Bell Curve" comes immediately to mind). Almost all uses of multiple regression outside of highly technical academic publications are abusive.

I don't mean to say "keep power tools out of the reach of the #ToddlerInChief and his minions".[1] Rather, I mean to say that most serious languages and packages for these applications (such as R, and closer to home numpy and pandas) provide substantial documentation suggesting *appropriate* use and pointing to the texts on algorithms and caveats. Steven doesn't do that with statistics -- and correctly so, he doesn't need to. None of the calculations he implements are in any way controversial as calculations. To the extent that different users might want slightly different definitions of mode or median, the ones provided are good enough for stdlib purposes. Nor are the interpretations of the results of the various calculations at all controversial.[2]

But even a two-dimensional regression of y on x is fraught.
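To make "fraught" concrete, here is a minimal sketch using the Anscombe[3] data. It fits an ordinary least squares line to each of the four data sets by hand. The names `QUARTET`, `ols_fit`, and `fits` are mine, chosen for illustration, and the data values are typed in from the R `anscombe` data set, so treat this as an illustration under those assumptions rather than a reference implementation.

```python
from statistics import mean

# Anscombe's quartet, as shipped in R's `anscombe` data set.
# Sets I-III share the same x values; set IV has one outlying x.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def ols_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


fits = {name: ols_fit(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, (slope, intercept) in fits.items():
    # To two decimals, every set gives the same line: y = 3.00 + 0.50 x
    print(f"{name:>3}: y = {intercept:.2f} + {slope:.2f} x")
```

All four fits agree to two decimal places, yet a scatter plot shows four very different relationships (roughly linear, curved, linear with one outlier, and a single point determining the whole slope). The numbers alone give no hint of that.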
Should we include the Anscombe[3] data and require a plotting function so users can see what they're getting into? I think Steven would say that's *way* beyond the scope of his package -- and I agree. Let's not go there. At all. Let users who need that stuff use packages that encourage them and help them do it right.

I don't find the "teaching high school linear/matrix algebra" use case persuasive. I taught "MBA Statistics for Liberal Arts Majors" for a decade. Writing simple matrix classes was an assignment, and then they used their own classes to implement covariance, correlation, and OLS. I don't think having a "canned" matrix class would have been of benefit to them -- a substantial fraction (10% < x < 50%) did get some idea of what was going on "inside" those calculations by programming them themselves plus a bit of printf debugging, which neither the linear algebra equations nor the sum operator I wrote on the whiteboard did. I will say I wish I had had Steven's implementation of sum() at hand back then to show them, to give them some idea of the care that numerical accuracy demands.

I cannot speak to engineering uses of matrix computations. If someone produces use cases that fit into the list of operations proposed (addition, negation, multiplication, inverse, transposition, scalar multiplication, solving linear equation systems, and determinants), I will concede it's useful and fall back to +/- 0. However, I think the comparison to multivariate statistics is enlightening. You see many two-dimensional tables in public discussions (even on Twitter!) but they are not treated as matrices. Now, it's true that *every* basic matrix calculation (except multiplication by a scalar) requires the same kind of care that statistics.sum exerts, but having provided that, what have you bought? Not a lot, as far as I can see -- applications of matrix algebra are extremely diverse, and many require just as much attention to detail as the basic operations do.
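A minimal sketch of the kind of care meant here, using a dot product (the inner loop of matrix multiplication). This is not Steven's code: `dot_naive` and `dot_exact` are hypothetical names, and exact rational arithmetic via `fractions.Fraction` is just one possible remedy, assumed here for illustration.

```python
from fractions import Fraction


def dot_naive(xs, ys):
    """Accumulate the products left to right in floating point."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


def dot_exact(xs, ys):
    """Accumulate the products as exact rationals, rounding once at the end."""
    total = Fraction(0)
    for x, y in zip(xs, ys):
        # Fraction(float) converts the float's binary value exactly.
        total += Fraction(x) * Fraction(y)
    return float(total)


row = [1e30, 1.0, -1e30]
col = [1.0, 1.0, 1.0]

naive = dot_naive(row, col)  # the 1.0 is absorbed by 1e30, then cancelled away
exact = dot_exact(row, col)  # the 1.0 survives the cancellation

print(naive, exact)  # 0.0 1.0
```

The explicit loop in `dot_naive` is deliberate: it keeps the example independent of whatever summation strategy the built-in sum() uses in a given Python version. The exact version is much slower, which is part of the trade-off any stdlib matrix type would have to settle.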
In sum, I suspect that simply providing numerically stable algorithms for those computations isn't enough for useful engineering work -- as with multivariate statistics, you're not even halfway to useful and accurate computations, and the diversity is explosive. How to choose?

Footnotes:
[1]  Any fans of cubic regression models for epidemiology? No? OK, then.
[2]  They can be abused. I did so myself just this morning, to tease a statistically literate friend. But it takes a bit of effort.
[3]  https://stat.ethz.ch/R-manual/R-patched/library/datasets/html/anscombe.html