That's food for thought. I have to admit that I have forgotten almost everything about linear algebra that I was ever taught -- and I was never taught numerical accuracy concerns in this context, since we were allowed to use only pencil and paper (in high school as well as college), so the problems were always constructed to ensure that correct answers contained nothing more complicated than 0, 0.5, 1 or sqrt(2). At this point I have a hard time reproducing multiplication for two 2x2 matrices (the only thing I remember is that it was the first example of something where AxB != BxA).

What gives me hope, though, is that Steven has been thinking about this somewhat seriously already, and given that he successfully chose what to include or exclude for the statistics module, I trust him to know how much to put into a Matrix class as well. Certainly I trust him to come up with a reasonable strawman whose tires we can all kick.

My own strawman would be to limit a Matrix to 2-dimensionality -- I believe that even my college linear algebra introduction (for math majors!) didn't touch upon higher dimensionality, and I doubt that what I learned in high school about the topic went beyond 3x3 (it might not even have treated non-square matrices).

In terms of numerical care (that topic to which I never warmed), which operations from the OP's list need more than statistics._sum() when limited to NxM matrices for single-digit N and M? (He named "matrix multiplication, transposition, addition, linear problem solving, determinant.")

On Thu, Aug 13, 2020 at 9:04 PM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Guido van Rossum writes:
I was going to say that such a matrix module would be better off on PyPI, but then I recalled how the statistics module got created, and I think that the same reasoning from PEP 450 applies here too (https://www.python.org/dev/peps/pep-0450/#rationale).
So I'd say go for it!
I disagree that that rationale applies. Let's consider where the statistics module stopped, and why I think that's the right place to stop. Simple statistics on *single* variables, as proposed by PEP 450, are useful in many contexts to summarize data sets. You see them frequently in newspaper and Wikipedia articles, serious blog posts, and even on Twitter.
PEP 450 mentions, but does not propose to provide, linear regression (presumably ordinary least squares -- as with median and mode, there are many ways to compute a regression line). The PEP mentions covariance and correlation coefficients once each, and remarks that the API design is unclear. I think that omission was the better part of valor. Even on Twitter, it's hard to abuse the combination of mean and standard deviation (without outright lies about the values, of course). But most uses of correlation and regression are more or less abusive. That's true even in more serious venues (Herrnstein & Murray's "The Bell Curve" comes immediately to mind). Almost all uses of multiple regression outside of highly technical academic publications are abusive.
I don't mean to say "keep power tools out of the reach of the #ToddlerInChief and his minions".[1] Rather, I mean to say that most serious languages and packages for these applications (such as R, and closer to home numpy and pandas) provide substantial documentation suggesting *appropriate* use and pointing to the texts on algorithms and caveats. Steven doesn't do that with statistics -- and correctly so, he doesn't need to. None of the calculations he implements are in any way controversial as calculations. To the extent that different users might want slightly different definitions of mode or median, the ones provided are good enough for stdlib purposes. Nor are the interpretations of the results of the various calculations at all controversial.[2]
But even a two-dimensional regression y on x is fraught. Should we include the Anscombe[3] data and require a plotting function so users can see what they're getting into? I think Steven would say that's *way* beyond the scope of his package -- and I agree. Let's not go there. At all. Let users who need that stuff use packages that encourage them and help them do it right.
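(For anyone who hasn't met the Anscombe data: the point can be checked with the statistics module as it stands. The values below are the first two of the four published sets; their one-variable summaries agree to two decimal places even though one scatterplot is a noisy line and the other is a clean parabola.)

```python
import statistics

# First two Anscombe sets: same x values, very different-looking y values.
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# The summaries are indistinguishable at two decimal places.
print(round(statistics.mean(y1), 2), round(statistics.mean(y2), 2))          # 7.5 7.5
print(round(statistics.variance(y1), 2), round(statistics.variance(y2), 2))  # 4.13 4.13
```

Without a plot, nothing in those numbers hints that the two data sets have completely different shapes -- which is exactly the argument for staying out of regression territory.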
I don't find the "teaching high school linear/matrix algebra" use case persuasive. I taught "MBA Statistics for Liberal Arts Majors" for a decade. Writing simple matrix classes was an assignment, and then they used their own classes to implement covariance, correlation, and OLS. I don't think having a "canned" matrix class would have been of benefit to them -- a substantial fraction (10% < x < 50%) did get some idea of what was going on "inside" those calculations by programming them themselves plus a bit of printf debugging, something that neither the linear algebra equations nor the sum operator I wrote on the whiteboard accomplished. I will say I wish I'd had Steven's implementation of sum() at hand back then, to give them some idea of the care that numerical accuracy demands.
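(For those following along, the classic two-line demonstration of why a careful sum matters -- this uses the stdlib's math.fsum rather than the statistics module's private _sum, but the idea is the same.)

```python
import math

values = [0.1] * 10  # ten copies of a value that isn't exactly representable in binary

print(sum(values))        # 0.9999999999999999 -- naive left-to-right accumulation drifts
print(math.fsum(values))  # 1.0 -- Shewchuk's algorithm tracks the rounding error exactly
```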
I cannot speak to engineering uses of matrix computations. If someone produces use cases that fit into the list of operations proposed (addition, negation, multiplication, inverse, transposition, scalar multiplication, solving linear equation systems, and determinants), I will concede it's useful and fall back to +/- 0.
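As a yardstick for how small these operations are at the sizes under discussion, 2x2 multiplication fits in a few lines (matmul2 below is a hypothetical helper, not a proposed API), and it shows the non-commutativity as a bonus:

```python
def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists (row-major)."""
    return [[a[0][0] * b[0][0] + a[0][1] * b[1][0],
             a[0][0] * b[0][1] + a[0][1] * b[1][1]],
            [a[1][0] * b[0][0] + a[1][1] * b[1][0],
             a[1][0] * b[0][1] + a[1][1] * b[1][1]]]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
print(matmul2(A, B))  # [[2, 1], [4, 3]] -- B on the right swaps A's columns
print(matmul2(B, A))  # [[3, 4], [1, 2]] -- B on the left swaps A's rows
```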
However, I think the comparison to multivariate statistics is enlightening. You see many two-dimensional tables in public discussions (even on Twitter!) but they are not treated as matrices. Now, it's true that *every* basic matrix calculation (except multiplication by a scalar) requires the same kind of care that statistics.sum exercises, but having provided that, what have you bought? Not a lot, as far as I can see -- applications of matrix algebra are extremely diverse, and many require just as much attention to detail as the basic operations do.
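A concrete instance of that "attention to detail": even 2x2 Gaussian elimination goes badly wrong without partial pivoting. The sketch below (solve2 is illustrative, not a proposed API) solves eps*x + y = 1 and x + y = 2, whose true solution is x ≈ 1, y ≈ 1:

```python
def solve2(A, b, pivot):
    """Solve a 2x2 linear system by Gaussian elimination.

    With pivot=True, rows are swapped so the larger leading entry is the pivot.
    """
    (a11, a12), (a21, a22) = A
    b1, b2 = b
    if pivot and abs(a21) > abs(a11):
        (a11, a12, b1), (a21, a22, b2) = (a21, a22, b2), (a11, a12, b1)
    m = a21 / a11                          # eliminate x from the second equation
    y = (b2 - m * b1) / (a22 - m * a12)
    x = (b1 - a12 * y) / a11               # back-substitute
    return x, y

eps = 1e-20
A, b = [[eps, 1.0], [1.0, 1.0]], [1.0, 2.0]
print(solve2(A, b, pivot=False))  # (0.0, 1.0) -- catastrophically wrong x
print(solve2(A, b, pivot=True))   # (1.0, 1.0) -- correct
```

Dividing by the tiny pivot eps manufactures a huge multiplier that wipes out the information in the second row; swapping rows first avoids it. That's the kind of care every basic operation in such a class would need.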
In sum, I suspect that simply providing numerically stable algorithms for those computations isn't enough for useful engineering work -- as with multivariate statistics, you're not even halfway to useful and accurate computations, and the diversity is explosive. How to choose?
Footnotes: [1] Any fans of cubic regression models for epidemiology? No? OK, then.
[2] They can be abused. I did so myself just this morning, to tease a statistically literate friend. But it takes a bit of effort.
[3] https://stat.ethz.ch/R-manual/R-patched/library/datasets/html/anscombe.html
--
--Guido van Rossum (python.org/~guido)
Pronouns: he/him (why is my pronoun here?)
http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...