[Numpy-discussion] RFC: comments to BLAS committee from numpy/scipy devs

Tue Jan 9 03:24:17 EST 2018

Hi all,

As mentioned earlier [1][2], there's work underway to revise and
update the BLAS standard -- e.g. we might get support for strided
arrays and lose xerbla! There's a draft at [3]. They're interested in
feedback from users, so I've written up a first draft of comments
about what we would like as NumPy/SciPy developers. This is very much
a first attempt -- I know we have lots of people who are more expert
on BLAS than me on these lists :-). Please let me know what you think.

-n

[1] https://mail.python.org/pipermail/numpy-discussion/2017-November/077420.html
[2] https://mail.python.org/pipermail/scipy-dev/2017-November/022267.html
[3] https://docs.google.com/document/d/1DY4ImZT1coqri2382GusXgBTTTVdBDvtD5I14QHp9OE/edit

-----

# Comments from NumPy / SciPy developers on "A Proposal for a
Next-Generation BLAS"

These are comments on [A Proposal for a Next-Generation
BLAS](https://docs.google.com/document/d/1DY4ImZT1coqri2382GusXgBTTTVdBDvtD5I14QHp9OE/edit#)
(version as of 2017-12-13), from the perspective of the developers of
the NumPy and SciPy libraries. We hope this feedback is useful, and
welcome further discussion.

## Who are we?

NumPy and SciPy are the two foundational libraries of the Python
numerical ecosystem, and one of their duties is to wrap BLAS and
expose it for the use of other Python libraries. (NumPy primarily
provides a GEMM wrapper, while SciPy exposes more specialized
operations.) It's unclear how many users we have exactly, but we
certainly ship multiple million copies of BLAS every month, and
provide one of the most popular numerical toolkits for both novice and
expert users.

Looking at the original BLAS and LAPACK interfaces, it often seems
that their imagined user is something like a classic supercomputer
consumer, who writes code directly in Fortran or C against the BLAS
API, and where the person writing the code and running the code are
the same. NumPy/SciPy are coming from a very different perspective:
our users generally know nothing about the details of the underlying
BLAS; they just want to describe their problem in some high-level way,
and the library is responsible for making it happen as efficiently as
possible, and is often integrated into some larger system (e.g. a
real-time analytics platform embedded in a web server).

When it comes to our BLAS usage, we mostly use only a small subset of
the routines. However, as "consumer software" used by a wide variety
of users with differing degress of technical expertise, we're expected
to Just Work on a wide variety of systems, and with as many different
vendor BLAS libraries as possible. On the other hand, the fact that
we're working with Python means we don't tend to worry about small
inefficiencies that will be lost in the noise in any case, and are
willing to sacrifice some performance to get more reliable operation
across our diverse userbase.

## Comments on specific aspects of the proposal

### Data Layout

We are **strongly in favor** of the proposal to support arbitrary
strided data layouts. Ideally, this would support strides *specified
in bytes* (allowing for unaligned data layouts), and allow for truly
arbitrary strides, including *zero or negative* values. However, we
think it's fine if some of the weirder cases suffer a performance
penalty.

Rationale: NumPy – and thus, most of the scientific Python ecosystem –
only has one way of representing an array: the `numpy.ndarray` type,
which is an arbitrary dimensional tensor with arbitrary strides. It is
common to encounter matrices with non-trivial strides. For example::

    # Make a 3-dimensional tensor, 10 x 9 x 8
    t = np.zeros((10, 9, 8))
    # Considering this as a stack of eight 10x9 matrices, extract the first:
    mat = t[:, :, 0]

Now `mat` has non-trivial strides on both axes. (If running this in a
Python interpreter, you can see this by looking at the value of
`mat.strides`.) Another case where interesting strides arise is when
performing ["broadcasting"](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html),
which is the name for NumPy's rules for stretching arrays to make
their shapes match. For example, in an expression like::

    np.array([1, 2, 3]) + 1

the scalar `1` is "broadcast" to create a vector `[1, 1, 1]`. This is
accomplished without allocating memory, by creating a vector with
settings length = 3, strides = 0 – so all the elements share a single
location in memory. Similarly, by using negative strides we can
reverse an array without allocating memory::

    a = np.array([1, 2, 3])
    a_flipped = a[::-1]

Now `a_flipped` has the value `[3, 2, 1]`, while sharing storage with
the array `a = [1, 2, 3]`. Misaligned data is also possible (e.g. an
array of 8-byte doubles with a 9-byte stride), though it arises more
rarely. (An example of when it might occurs is in an on-disk data
format that alternates between storing a double value and then a
single byte value, which is then memory-mapped.)

While this array representation is very flexible and useful, it makes
interfacing with BLAS a challenge: how do you perform a GEMM operation
on two arrays with arbitrary strides? Currently, NumPy attempts to
detect a number of special cases: if the strides in both arrays imply
a column-major layout, then call BLAS directly; if one of them has
strides corresponding to a row-major layout, then set the
corresponding `transA`/`transB` argument, etc. – and if all else
fails, either copy the data into a contiguous buffer, or else fall
back on a naive triple-nested-loop GEMM implementation. (There's also
a check where if we can determine through examining the arrays' data
pointers and strides that they're actually transposes of each other,
then we instead dispatch to SYRK.)

So for us, native stride support in the BLAS has two major advantages:

- It allows a wider variety of operations to be transparently handled
using high-speed kernels.

- It reduces the complexity of the NumPy/BLAS interface code.

Any kind of stride support produces both of these advantages to some
extent. The ideal would be if BLAS supports strides specified in bytes
(not elements), and allowing truly arbitrary values, because in this
case, we can simply pass through our arrays directly to the library
without having to do any fiddly checking at all. Note that for these
purposes, it's OK if "weird" arrays pay a speed penalty: if the BLAS
didn't support them, we'd just have to fall back on some even worse
strategy for handling them ourselves. And this approach would mean
that if some BLAS vendors do at some point decide to optimize these
cases, then our users will immediately see the advantages.

### Error-reporting

We are **strongly in favor** of the draft's proposal to replace XERBLA
with proper error codes. XERBLA is fine for one-off programs in
Fortran, but awful for wrapper libraries like NumPy/SciPy.

### Handling of exceptional values

A recurring issue in low-level linear-algebra libraries is that they
may crash, freeze, or produce unexpected results when receiving
non-finite inputs. Of course as a wrapper library, NumPy/SciPy can't
control what kind of strange data our users might pass in, and we have
to produce sensible results regardless, so we sometimes accumulate
hacks like having to validate all incoming data for exceptional values

We support the proposal's recommendation of "NaN interpretation (4)".

We also note that contrary to the note under "Interpretation (1)"; in
the R statistical environment NaN does *not* represent missing data. R
maintains a clear distinction between "NA" (missing) values and "NaN"
(invalid) values. This is important for various statistical procedures
such as imputation, where you might want to estimate the true value of
an unmeasured variable, but you don't want to estimate the true value
of 0/0. For efficiency, they do perform some calculations on NA values
by storing them as NaNs with a specific payload; in the R context,
therefore, it's particularly valuable if operations **correctly
propagate NaN payloads**. NumPy does not yet provide first-class
missing value support, but we plan to add it within the next 1-2
years, and when we do it seems likely that we'll have similar needs to
R.

### BLAS G2

NumPy has a rich type system, including 16-, 32-, and 64-bit floats
(plus extended precision on some platforms), a similar set of complex
values, and is further extensible with new types. We have a similarly
rich dispatch system for operations that attempts to find the best
matching optimized-loop for an arbitrary set of input types. We're
thus quite interested in the proposed mixed-type operations, and if
implemented we'll probably start transparently taking advantage of
them (i.e., in cases where we're currently upcasting to perform a
requested operation, we may switch to using the native precision
operation without requiring any changes in user code).

Two comments, though:

1. The various tricks to make names less redundant (e.g., specifying
just one type if they're all the same) are convenient for those
calling these routines by hand, but rather inconvenient for those of
us who will be generating a table mapping different type combinations
to different functions. When designing a compression scheme for names,
please remember that the scheme will have to be unambiguously
implemented in code by multiple projects. Another option to consider:
ask vendors to implement the full "uncompressed" names, and then
provide a shim library mapping short "convenience" names to their full
form.

2. Regarding the suggestion that the naming scheme will allow for a
wide variety of differently typed routines, but that only some subset
will be implemented by any particular vendor: we are worried about
this subset business. For a library like NumPy that wants to wrap
"all" the provided routines, while supporting "all" the different BLAS
libraries ... how will this work? **How do we know which routines any
particular library provides?** What if new routines are added to the
list later?

### Reproducible BLAS

This is interesting, but we don't have any particularly useful comments.

### Batch BLAS

NumPy/SciPy actually provide batched operations in many cases,
currently implemented using a `for` loop around individual calls to
BLAS/LAPACK. However, our interface is relatively restricted compared
to some of the proposals considered here: generally all we need is the
ability to perform an operation on two "stacks" of size-homogenous
matrices, with the offset between matrices determined by an arbitrary
(potentially zero) stride.

### Fixed-point BLAS

This is interesting, but we don't have any particularly useful comments.

-- 
Nathaniel J. Smith -- https://vorpus.org