That's food for thought. I have to admit that I have forgotten almost everything about linear algebra that I was ever taught -- and I was never taught numerical accuracy concerns in this context, since we were allowed to use only pencil and paper (in high school as well as college), so the problems were always constructed to ensure that correct answers contained nothing more complicated than 0, 0.5, 1 or sqrt(2). At this point I have a hard time reproducing multiplication for two 2x2 matrices (the only thing I remember is that it was the first example of something where AxB != BxA).

What gives me hope, though, is that Steven has been thinking about this somewhat seriously already, and given that he successfully chose what to include or exclude for the statistics module, I trust him to know how much to put into a Matrix class as well. Certainly I trust him to come up with a reasonable strawman whose tires we can all kick.

My own strawman would be to limit a Matrix to 2-dimensionality -- I believe that even my college linear algebra introduction (for math majors!) didn't touch upon higher dimensionality, and I doubt that what I learned in high school about the topic went beyond 3x3 (it might not even have treated non-square matrices).

In terms of numerical care (that topic to which I never warmed), which operations from the OP's list need more than statistics._sum() when limited to NxM matrices for single-digit N and M? (He named "matrix multiplication, transposition, addition, linear problem solving, determinant.")

On Thu, Aug 13, 2020 at 9:04 PM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Guido van Rossum writes:
I was going to say that such a matrix module would be better off on PyPI, but then I recalled how the statistics module got created, and I think that the same reasoning from PEP 450 applies here too (https://www.python.org/dev/peps/pep-0450/#rationale).
So I'd say go for it!
I disagree that that rationale applies. Let's consider where the statistics module stopped, and why I think that's the right place to stop. Simple statistics on *single* variables, as proposed by PEP 450, are useful in many contexts to summarize data sets. You see them frequently in newspaper and Wikipedia articles, serious blog posts, and even on Twitter.
PEP 450 mentions, but does not propose to provide, linear regression (presumably ordinary least squares -- as with median and mode, there are many ways to compute a regression line). The PEP mentions covariance and correlation coefficients once each, and remarks that the API design is unclear. I think that omission was the better part of valor. Even on Twitter, it's hard to abuse the combination of mean and standard deviation (without outright lies about the values, of course). But most uses of correlation and regression are more or less abusive. That's true even in more serious venues (Herrnstein & Murray's "The Bell Curve" comes immediately to mind). Almost all uses of multiple regression outside of highly technical academic publications are abusive.
I don't mean to say "keep power tools out of the reach of the #ToddlerInChief and his minions".[1] Rather, I mean to say that most serious languages and packages for these applications (such as R, and closer to home numpy and pandas) provide substantial documentation suggesting *appropriate* use and pointing to the texts on algorithms and caveats. Steven doesn't do that with statistics -- and correctly so, he doesn't need to. None of the calculations he implements are in any way controversial as calculations. To the extent that different users might want slightly different definitions of mode or median, the ones provided are good enough for stdlib purposes. Nor are the interpretations of the results of the various calculations at all controversial.[2]
But even a two-dimensional regression y on x is fraught. Should we include the Anscombe[3] data and require a plotting function so users can see what they're getting into? I think Steven would say that's *way* beyond the scope of his package -- and I agree. Let's not go there. At all. Let users who need that stuff use packages that encourage them and help them do it right.
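(For anyone who hasn't met the Anscombe data: the point can be checked with the statistics module as it stands. The values below are the first two of the four published sets; their one-variable summaries agree to two decimal places even though one scatterplot is a noisy line and the other is a clean parabola.)

```python
import statistics

# First two Anscombe sets: same x values, very different-looking y values.
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# The summaries are indistinguishable at two decimal places.
print(round(statistics.mean(y1), 2), round(statistics.mean(y2), 2))          # 7.5 7.5
print(round(statistics.variance(y1), 2), round(statistics.variance(y2), 2))  # 4.13 4.13
```

Without a plot, nothing in those numbers hints that the two data sets have completely different shapes -- which is exactly the argument for staying out of regression territory.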
I don't find the "teaching high school linear/matrix algebra" use case persuasive. I taught "MBA Statistics for Liberal Arts Majors" for a decade. Writing simple matrix classes was an assignment, and then they used their own classes to implement covariance, correlation, and OLS. I don't think having a "canned" matrix class would have been of benefit to them -- a substantial fraction (10% < x < 50%) did get some idea of what was going on "inside" those calculations by programming them themselves plus a bit of printf debugging, something that neither the linear algebra equations nor the sum operator I wrote on the whiteboard accomplished. I will say I wish I'd had Steven's implementation of sum() at hand back then, to give them some idea of the care that numerical accuracy demands.
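(For those following along, the classic two-line demonstration of why a careful sum matters -- this uses the stdlib's math.fsum rather than the statistics module's private _sum, but the idea is the same.)

```python
import math

values = [0.1] * 10  # ten copies of a value that isn't exactly representable in binary

print(sum(values))        # 0.9999999999999999 -- naive left-to-right accumulation drifts
print(math.fsum(values))  # 1.0 -- Shewchuk's algorithm tracks the rounding error exactly
```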
I cannot speak to engineering uses of matrix computations. If someone produces use cases that fit into the list of operations proposed (addition, negation, multiplication, inverse, transposition, scalar multiplication, solving linear equation systems, and determinants), I will concede it's useful and fall back to +/- 0.
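As a yardstick for how small these operations are at the sizes under discussion, 2x2 multiplication fits in a few lines (matmul2 below is a hypothetical helper, not a proposed API), and it shows the non-commutativity as a bonus:

```python
def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists (row-major)."""
    return [[a[0][0] * b[0][0] + a[0][1] * b[1][0],
             a[0][0] * b[0][1] + a[0][1] * b[1][1]],
            [a[1][0] * b[0][0] + a[1][1] * b[1][0],
             a[1][0] * b[0][1] + a[1][1] * b[1][1]]]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
print(matmul2(A, B))  # [[2, 1], [4, 3]] -- B on the right swaps A's columns
print(matmul2(B, A))  # [[3, 4], [1, 2]] -- B on the left swaps A's rows
```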
However, I think the comparison to multivariate statistics is enlightening. You see many two-dimensional tables in public discussions (even on Twitter!) but they are not treated as matrices. Now, it's true that *every* basic matrix calculation (except multiplication by a scalar) requires the same kind of care that statistics.sum exercises, but having provided that, what have you bought? Not a lot, as far as I can see -- applications of matrix algebra are extremely diverse, and many require just as much attention to detail as the basic operations do.
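A concrete instance of that "attention to detail": even 2x2 Gaussian elimination goes badly wrong without partial pivoting. The sketch below (solve2 is illustrative, not a proposed API) solves eps*x + y = 1 and x + y = 2, whose true solution is x ≈ 1, y ≈ 1:

```python
def solve2(A, b, pivot):
    """Solve a 2x2 linear system by Gaussian elimination.

    With pivot=True, rows are swapped so the larger leading entry is the pivot.
    """
    (a11, a12), (a21, a22) = A
    b1, b2 = b
    if pivot and abs(a21) > abs(a11):
        (a11, a12, b1), (a21, a22, b2) = (a21, a22, b2), (a11, a12, b1)
    m = a21 / a11                          # eliminate x from the second equation
    y = (b2 - m * b1) / (a22 - m * a12)
    x = (b1 - a12 * y) / a11               # back-substitute
    return x, y

eps = 1e-20
A, b = [[eps, 1.0], [1.0, 1.0]], [1.0, 2.0]
print(solve2(A, b, pivot=False))  # (0.0, 1.0) -- catastrophically wrong x
print(solve2(A, b, pivot=True))   # (1.0, 1.0) -- correct
```

Dividing by the tiny pivot eps manufactures a huge multiplier that wipes out the information in the second row; swapping rows first avoids it. That's the kind of care every basic operation in such a class would need.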
In sum, I suspect that simply providing numerically stable algorithms for those computations isn't enough for useful engineering work -- as with multivariate statistics, you're not even halfway to useful and accurate computations, and the diversity is explosive. How to choose?
Footnotes: [1] Any fans of cubic regression models for epidemiology? No? OK, then.
[2] They can be abused. I did so myself just this morning, to tease a statistically literate friend. But it takes a bit of effort.
[3] https://stat.ethz.ch/R-manual/R-patched/library/datasets/html/anscombe.html
--
--Guido van Rossum (python.org/~guido)
Pronouns: he/him (why is my pronoun here?)
http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...