[Numpy-discussion] deterministic, reproducible matmul / __matmult_

Mon Jul 18 17:15:54 EDT 2016

On Mon, Jul 11, 2016 at 3:27 PM, Pauli Virtanen <pav at iki.fi> wrote:
> Mon, 11 Jul 2016 13:01:49 -0400, Jason Newton kirjoitti:
>> Does the ML have any ideas on how one could get a matmul that will not
>> allow any funny business on the evaluation of the products?  Funny
>> business here is something like changing the evaluation order additions
>> of terms. I want strict IEEE 754 compliance - no 80 bit registers, and
>> perhaps control of the rounding mode, no unsafe math optimizations.
>
> If you link Numpy with a BLAS and LAPACK libraries that have been
> compiled for this purpose, and turn on the compiler flags that enforce
> strict IEEE (and disable SSE) when compiling Numpy, you probably will get
> reproducible builds. Numpy itself just offloads the dot computations to
> BLAS, so if your BLAS is reproducible, things should mainly be OK.

The matrix multiplication of the reference blas hard to follow, so
that's good.  Generalized Inverse is a little more difficult.  I've
actually had problems building numpy w/ ref blas under windows...,
anyone know where and how to take a ready made cblas dll and get an
existing numpy installation (e.g. anaconda) to use it?

>
> You may also need to turn off the SSE optimizations in Numpy, because
> these can make results depend on memory alignment --- not in dot
> products, but in other computations.
>
> Out of curiosity, what is the application where this is necessary?
> Maybe there is a numerically stable formulation?

This can come up in recursive processes or anywhere where parallelism
(vectorization or other kinds) and the need for reproducable code
exists (there are many reasons you'd want this, see slides below).  As
I mentioned, I desire to have reproducable calculations when involving
alternative implementations with things like c modules, MPI, FPGAs,
GPUs coming into the picture.  You  can usually figure out a strategy
to do this when you write everything yourself, but I'd love to write
things in numpy and have it just choose simple / straight forward
implementations that are usually what everything boils down to onto
these other devices, at least in meaningful peaces.  There may be
other times where it gets more complicated than what i mention here,
but it is still very useful as a building block for those cases (which
are often multiple accumulators/partitioning, tree like reduction
ordering).

I looked into reproblas further and also I asked the authors of
repoblas to add a cblas wrapper in the hopes I might sometime have
numpy working on top of it.  I read alot of their research recently
and it's pretty cool - they get better accuracy than most
implementations and the performance is pretty good.  Check out slides
here http://people.eecs.berkeley.edu/~hdnguyen/public/talks/ARITH21_Fast_Sum.pdf
(skip to slide 24 for accuracy) - the takeaway here is you actually do
quite a bit better in precision/accuracy with their sum method, than
the naive and alternative implementations.  The cpu performance is
worth the trade and really not bad at all for most operations and
purposes - especially since they hand scalability very well.

They also just posted a monster of a technical report here
https://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-121.html -
this details recent performance using avx and sse, if anyone is
interested.

I'd love to have a library like this that I could just tell numpy to
switch out to at runtime - at the start of a script.

-Jason