Guido van Rossum writes:
> I was going to say that such a matrix module would be better off in
> PyPI, but then I recalled how the statistics module got created,
> and I think that the same reasoning from PEP 450 applies here too
> (https://www.python.org/dev/peps/pep-0450/#rationale).
>
> So I'd say go for it!
I disagree that that rationale applies. Let's consider where the
statistics module stopped, and why I think that's the right place to
stop. Simple statistics on *single* variables, as proposed by PEP
450, are useful in many contexts to summarize data sets. You see them
frequently in newspaper and Wikipedia articles, serious blog posts, and
even on Twitter.
PEP 450 mentions, but does not propose to provide, linear regression
(presumably ordinary least squares -- as with median and mode, there
are many ways to compute a regression line). The PEP mentions
covariance and correlation coefficients once each, and remarks that
the API design is unclear. I think that omission was the better part
of valor. Even on Twitter, it's hard to abuse the combination of mean
and standard deviation (without outright lies about the values, of
course). But most uses of correlation and regression are more or less
abusive. That's true even in more serious venues (Herrnstein &
Murray's "The Bell Curve" comes immediately to mind). Almost all
uses of multiple regression outside of highly technical academic
publications are abusive.
I don't mean to say "keep power tools out of the reach of the
#ToddlerInChief and his minions".[1] Rather, I mean to say that most
serious languages and packages for these applications (such as R, and
closer to home numpy and pandas) provide substantial documentation
suggesting *appropriate* use and pointing to the texts on algorithms
and caveats. Steven doesn't do that with statistics -- and correctly
so; he doesn't need to. None of the calculations he implements are in
any way controversial as calculations. To the extent that different
users might want slightly different definitions of mode or median, the
ones provided are good enough for stdlib purposes. Nor are the
interpretations of the results of the various calculations at all
controversial.[2]
But even a two-dimensional regression of y on x is fraught. Should we
include the Anscombe[3] data and require a plotting function so users
can see what they're getting into? I think Steven would say that's
*way* beyond the scope of his package -- and I agree. Let's not go
there. At all. Let users who need that stuff use packages that
encourage them and help them do it right.
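To make the Anscombe point concrete, here's a stdlib-only sketch. The
data are the shared x values and the y values of Anscombe's first and
third sets, copied from the page in [3]; ols() is my throwaway helper,
not anything in statistics. The third set is a near-perfect line
spoiled by one outlier, the first is ordinary scatter, yet OLS hands
back essentially the same line for both. Only a plot tells you which
situation you're in.

    from statistics import mean

    x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
    y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84,
          4.82, 5.68]
    y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15,
          6.42, 5.73]

    def ols(xs, ys):
        # Textbook ordinary least squares: slope = cov(x, y) / var(x).
        mx, my = mean(xs), mean(ys)
        sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        sxx = sum((a - mx) ** 2 for a in xs)
        slope = sxy / sxx
        return slope, my - slope * mx

    print(ols(x, y1))   # slope ~0.500, intercept ~3.00
    print(ols(x, y3))   # slope ~0.500, intercept ~3.00
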
I don't find the "teaching high school linear/matrix algebra" use case
persuasive. I taught "MBA Statistics for Liberal Arts Majors" for a
decade. Writing simple matrix classes was an assignment, and then
they used their own classes to implement covariance, correlation, and
OLS. I don't think having a "canned" matrix class would have been of
benefit to them -- a substantial fraction (10% < x < 50%) did get some
idea of what was going on "inside" those calculations by programming
them themselves plus a bit of printf debugging -- an understanding
that neither the linear algebra equations nor the sum operator I wrote
on the whiteboard conveyed. I will say I wish I had had Steven's
implementation of sum() at hand back then to show them, to give them
some idea of the care that numerical accuracy demands.
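That care is easy to demonstrate even without Steven's code (his
version lives as a private helper inside the statistics module);
math.fsum makes the same point:

    import math

    # 0.1 has no exact binary representation, so a naive
    # left-to-right sum accumulates rounding error.
    data = [0.1] * 10**6
    print(sum(data))        # 100000.00000133288 -- the drift shows
    print(math.fsum(data))  # 100000.0 -- correctly rounded

That the obvious one-liner is quietly wrong in the trailing digits is
exactly the lesson I wanted them to absorb.
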
I cannot speak to engineering uses of matrix computations. If someone
produces use cases that fit into the list of operations proposed
(addition, negation, multiplication, inverse, transposition, scalar
multiplication, solving linear equation systems, and determinants --
sketched below), I will concede it's useful and fall back to +/- 0.
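Half of that list is a screenful of unexciting Python. Here's a
deliberately naive sketch to fix ideas (the class and method names are
my invention, not the proposal's). Note what I left out: inverse,
determinant, and solving linear systems -- precisely the operations
where the numerical care lives.

    class Matrix:
        def __init__(self, rows):
            self.rows = [list(r) for r in rows]

        def __add__(self, other):           # addition
            return Matrix([[a + b for a, b in zip(r, s)]
                           for r, s in zip(self.rows, other.rows)])

        def __neg__(self):                  # negation
            return Matrix([[-a for a in r] for r in self.rows])

        def __rmul__(self, k):              # scalar multiplication
            return Matrix([[k * a for a in r] for r in self.rows])

        def __matmul__(self, other):        # matrix multiplication
            cols = list(zip(*other.rows))
            return Matrix([[sum(a * b for a, b in zip(r, c))
                            for c in cols] for r in self.rows])

        def transpose(self):                # transposition
            return Matrix(zip(*self.rows))

    a = Matrix([[2.0, 1.0], [1.0, 3.0]])
    print((a @ a).rows)    # [[5.0, 5.0], [5.0, 10.0]]
    print((2.5 * a).rows)  # [[5.0, 2.5], [2.5, 7.5]]
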
However, I think the comparison to multivariate statistics is
enlightening. You see many two-dimensional tables in public
discussions (even on Twitter!) but they are not treated as matrices.
Now, it's true that *every* basic matrix calculation (except
multiplication by a scalar) requires the same kind of care that
statistics.sum exercises (see the elimination sketch below), but
having provided that, what have you bought? Not a lot, as far as I can
see -- applications of matrix
algebra are extremely diverse, and many require just as much attention
to detail as the basic operations do.
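Here's the kind of care I mean, in the smallest example I know:
Gaussian elimination without pivoting, on a perfectly well-conditioned
2x2 system, worked inline (the numbers are the classic textbook ones):

    # Solve:  eps*x + y = 1
    #             x + y = 2    (exact solution: x and y both ~1)
    eps = 1e-20

    # Naive elimination: subtract (1/eps) * row 1 from row 2.
    m = 1.0 / eps
    y = (2.0 - m * 1.0) / (1.0 - m * 1.0)  # both collapse to -1e20
    x = (1.0 - y) / eps
    print(x, y)    # 0.0 1.0 -- x is complete garbage

    # Partial pivoting: swap the rows first, then eliminate.
    m = eps / 1.0
    y = (1.0 - m * 2.0) / (1.0 - m * 1.0)
    x = 2.0 - y
    print(x, y)    # 1.0 1.0

A stdlib solve() would of course pivot, but inverse and determinant
have their own versions of this trap, and none of it helps with
ill-conditioned inputs, which no amount of pivoting fixes.
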
In sum, I suspect that simply providing numerically stable algorithms
for those computations isn't enough for serious engineering work -- as
with multivariate statistics, you're not even halfway to useful and
accurate computations, and the diversity is explosive. How to choose?
Footnotes:
[1] Any fans of cubic regression models for epidemiology? No? OK, then.
[2] They can be abused. I did so myself just this morning, to tease
a statistically literate friend. But it takes a bit of effort.
[3] https://stat.ethz.ch/R-manual/R-patched/library/datasets/html/anscombe.html