[SciPy-Dev] RFC: comments to BLAS committee from numpy/scipy devs

Tue Jan 9 06:40:02 EST 2018

I couldn't find an item to place this but I think ilaenv and also calling
the function twice (one with lwork=-1 and reading the optimal block size
and the call the function again properly with lwork=<result>) in LAPACK
needs to be gotten rid of.

That's a major annoyance during the wrapping of LAPACK routines for SciPy.

I don't know if this is realistic but the values ilaenv needed can be
computed once (or again if hardware is changed) at the install and can be
read off by the routines.

On Jan 9, 2018 09:25, "Nathaniel Smith" <njs at pobox.com> wrote:

> Hi all,
>
> As mentioned earlier [1][2], there's work underway to revise and
> update the BLAS standard -- e.g. we might get support for strided
> arrays and lose xerbla! There's a draft at [3]. They're interested in
> feedback from users, so I've written up a first draft of comments
> about what we would like as NumPy/SciPy developers. This is very much
> a first attempt -- I know we have lots of people who are more expert
> on BLAS than me on these lists :-). Please let me know what you think.
>
> -n
>
> [1] https://mail.python.org/pipermail/numpy-discussion/
> 2017-November/077420.html
> [2] https://mail.python.org/pipermail/scipy-dev/2017-November/022267.html
> [3] https://docs.google.com/document/d/1DY4ImZT1coqri2382GusXgBTTTVdB
> DvtD5I14QHp9OE/edit
>
> -----
>
> # Comments from NumPy / SciPy developers on "A Proposal for a
> Next-Generation BLAS"
>
> These are comments on [A Proposal for a Next-Generation
> BLAS](https://docs.google.com/document/d/1DY4ImZT1coqri2382GusXgBTTTVdB
> DvtD5I14QHp9OE/edit#)
> (version as of 2017-12-13), from the perspective of the developers of
> the NumPy and SciPy libraries. We hope this feedback is useful, and
> welcome further discussion.
>
> ## Who are we?
>
> NumPy and SciPy are the two foundational libraries of the Python
> numerical ecosystem, and one of their duties is to wrap BLAS and
> expose it for the use of other Python libraries. (NumPy primarily
> provides a GEMM wrapper, while SciPy exposes more specialized
> operations.) It's unclear how many users we have exactly, but we
> certainly ship multiple million copies of BLAS every month, and
> provide one of the most popular numerical toolkits for both novice and
> expert users.
>
> Looking at the original BLAS and LAPACK interfaces, it often seems
> that their imagined user is something like a classic supercomputer
> consumer, who writes code directly in Fortran or C against the BLAS
> API, and where the person writing the code and running the code are
> the same. NumPy/SciPy are coming from a very different perspective:
> our users generally know nothing about the details of the underlying
> BLAS; they just want to describe their problem in some high-level way,
> and the library is responsible for making it happen as efficiently as
> possible, and is often integrated into some larger system (e.g. a
> real-time analytics platform embedded in a web server).
>
> When it comes to our BLAS usage, we mostly use only a small subset of
> the routines. However, as "consumer software" used by a wide variety
> of users with differing degress of technical expertise, we're expected
> to Just Work on a wide variety of systems, and with as many different
> vendor BLAS libraries as possible. On the other hand, the fact that
> we're working with Python means we don't tend to worry about small
> inefficiencies that will be lost in the noise in any case, and are
> willing to sacrifice some performance to get more reliable operation
> across our diverse userbase.
>
> ## Comments on specific aspects of the proposal
>
> ### Data Layout
>
> We are **strongly in favor** of the proposal to support arbitrary
> strided data layouts. Ideally, this would support strides *specified
> in bytes* (allowing for unaligned data layouts), and allow for truly
> arbitrary strides, including *zero or negative* values. However, we
> think it's fine if some of the weirder cases suffer a performance
> penalty.
>
> Rationale: NumPy – and thus, most of the scientific Python ecosystem –
> only has one way of representing an array: the `numpy.ndarray` type,
> which is an arbitrary dimensional tensor with arbitrary strides. It is
> common to encounter matrices with non-trivial strides. For example::
>
>     # Make a 3-dimensional tensor, 10 x 9 x 8
>     t = np.zeros((10, 9, 8))
>     # Considering this as a stack of eight 10x9 matrices, extract the
> first:
>     mat = t[:, :, 0]
>
> Now `mat` has non-trivial strides on both axes. (If running this in a
> Python interpreter, you can see this by looking at the value of
> `mat.strides`.) Another case where interesting strides arise is when
> performing ["broadcasting"](https://docs.scipy.org/doc/numpy-1.13.0/
> user/basics.broadcasting.html),
> which is the name for NumPy's rules for stretching arrays to make
> their shapes match. For example, in an expression like::
>
>     np.array([1, 2, 3]) + 1
>
> the scalar `1` is "broadcast" to create a vector `[1, 1, 1]`. This is
> accomplished without allocating memory, by creating a vector with
> settings length = 3, strides = 0 – so all the elements share a single
> location in memory. Similarly, by using negative strides we can
> reverse an array without allocating memory::
>
>     a = np.array([1, 2, 3])
>     a_flipped = a[::-1]
>
> Now `a_flipped` has the value `[3, 2, 1]`, while sharing storage with
> the array `a = [1, 2, 3]`. Misaligned data is also possible (e.g. an
> array of 8-byte doubles with a 9-byte stride), though it arises more
> rarely. (An example of when it might occurs is in an on-disk data
> format that alternates between storing a double value and then a
> single byte value, which is then memory-mapped.)
>
> While this array representation is very flexible and useful, it makes
> interfacing with BLAS a challenge: how do you perform a GEMM operation
> on two arrays with arbitrary strides? Currently, NumPy attempts to
> detect a number of special cases: if the strides in both arrays imply
> a column-major layout, then call BLAS directly; if one of them has
> strides corresponding to a row-major layout, then set the
> corresponding `transA`/`transB` argument, etc. – and if all else
> fails, either copy the data into a contiguous buffer, or else fall
> back on a naive triple-nested-loop GEMM implementation. (There's also
> a check where if we can determine through examining the arrays' data
> pointers and strides that they're actually transposes of each other,
> then we instead dispatch to SYRK.)
>
> So for us, native stride support in the BLAS has two major advantages:
>
> - It allows a wider variety of operations to be transparently handled
> using high-speed kernels.
>
> - It reduces the complexity of the NumPy/BLAS interface code.
>
> Any kind of stride support produces both of these advantages to some
> extent. The ideal would be if BLAS supports strides specified in bytes
> (not elements), and allowing truly arbitrary values, because in this
> case, we can simply pass through our arrays directly to the library
> without having to do any fiddly checking at all. Note that for these
> purposes, it's OK if "weird" arrays pay a speed penalty: if the BLAS
> didn't support them, we'd just have to fall back on some even worse
> strategy for handling them ourselves. And this approach would mean
> that if some BLAS vendors do at some point decide to optimize these
> cases, then our users will immediately see the advantages.
>
> ### Error-reporting
>
> We are **strongly in favor** of the draft's proposal to replace XERBLA
> with proper error codes. XERBLA is fine for one-off programs in
> Fortran, but awful for wrapper libraries like NumPy/SciPy.
>
> ### Handling of exceptional values
>
> A recurring issue in low-level linear-algebra libraries is that they
> may crash, freeze, or produce unexpected results when receiving
> non-finite inputs. Of course as a wrapper library, NumPy/SciPy can't
> control what kind of strange data our users might pass in, and we have
> to produce sensible results regardless, so we sometimes accumulate
> hacks like having to validate all incoming data for exceptional values
>
> We support the proposal's recommendation of "NaN interpretation (4)".
>
> We also note that contrary to the note under "Interpretation (1)"; in
> the R statistical environment NaN does *not* represent missing data. R
> maintains a clear distinction between "NA" (missing) values and "NaN"
> (invalid) values. This is important for various statistical procedures
> such as imputation, where you might want to estimate the true value of
> an unmeasured variable, but you don't want to estimate the true value
> of 0/0. For efficiency, they do perform some calculations on NA values
> by storing them as NaNs with a specific payload; in the R context,
> therefore, it's particularly valuable if operations **correctly
> propagate NaN payloads**. NumPy does not yet provide first-class
> missing value support, but we plan to add it within the next 1-2
> years, and when we do it seems likely that we'll have similar needs to
> R.
>
> ### BLAS G2
>
> NumPy has a rich type system, including 16-, 32-, and 64-bit floats
> (plus extended precision on some platforms), a similar set of complex
> values, and is further extensible with new types. We have a similarly
> rich dispatch system for operations that attempts to find the best
> matching optimized-loop for an arbitrary set of input types. We're
> thus quite interested in the proposed mixed-type operations, and if
> implemented we'll probably start transparently taking advantage of
> them (i.e., in cases where we're currently upcasting to perform a
> requested operation, we may switch to using the native precision
> operation without requiring any changes in user code).
>
> Two comments, though:
>
> 1. The various tricks to make names less redundant (e.g., specifying
> just one type if they're all the same) are convenient for those
> calling these routines by hand, but rather inconvenient for those of
> us who will be generating a table mapping different type combinations
> to different functions. When designing a compression scheme for names,
> please remember that the scheme will have to be unambiguously
> implemented in code by multiple projects. Another option to consider:
> ask vendors to implement the full "uncompressed" names, and then
> provide a shim library mapping short "convenience" names to their full
> form.
>
> 2. Regarding the suggestion that the naming scheme will allow for a
> wide variety of differently typed routines, but that only some subset
> will be implemented by any particular vendor: we are worried about
> this subset business. For a library like NumPy that wants to wrap
> "all" the provided routines, while supporting "all" the different BLAS
> libraries ... how will this work? **How do we know which routines any
> particular library provides?** What if new routines are added to the
> list later?
>
> ### Reproducible BLAS
>
> This is interesting, but we don't have any particularly useful comments.
>
> ### Batch BLAS
>
> NumPy/SciPy actually provide batched operations in many cases,
> currently implemented using a `for` loop around individual calls to
> BLAS/LAPACK. However, our interface is relatively restricted compared
> to some of the proposals considered here: generally all we need is the
> ability to perform an operation on two "stacks" of size-homogenous
> matrices, with the offset between matrices determined by an arbitrary
> (potentially zero) stride.
>
> ### Fixed-point BLAS
>
> This is interesting, but we don't have any particularly useful comments.
>
>
> --
> Nathaniel J. Smith -- https://vorpus.org
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at python.org
> https://mail.python.org/mailman/listinfo/scipy-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20180109/0e48d2a0/attachment-0001.html>