[Numpy-discussion] Tensor Contraction (HPTT) and Tensor Transposition (TCL)

Wed Aug 16 12:08:39 EDT 2017

(NB all: the thread title seems to interchange the acronyms for the Tensor
Contraction Library (TCL) and the High-Performance Tensor Transpose (HPTT)
packages. I'm not fixing it so as not to break threading.)

On Wed, Aug 16, 2017 at 11:40 AM Paul Springer <pavdev at gmx.de> wrote:

> If you want to get it into Numpy, it would be worth checking if the
> existing functions can be improved before adding new ones.
>
> Note that Numpy transposition method just rearranges the indices, so the
> advantage of actual transposition is to have better cache performance or
> allow direct use of CBLAS. I assume TCL uses some tricks to do
> transposition in a way that is more cache friendly?
>
> HPTT is a sophisticated library for tensor transpositions, as such it
> blocks the tensors such that (1) spatial locality can be exploited.
> Moreover, (2) it uses explicit vectorization to take advantage of the CPU's
> vector units.
>

I think this library provides functionality that isn't readily accessible
from within numpy at the moment. The only functions I know of to rearrange
the memory layout of data are things like ascontiguousarray and
asfortranarray, as well as assignment (e.g. a[...] = b). The general
strategy within numpy is to assume that all functions work equally well on
arrays with arbitrary memory layouts, so that users often don't even know
the memory layouts of their data. The striding functionality means data
usually doesn't actually get transposed until absolutely necessary.
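
To illustrate the point about striding, here is a minimal sketch (the array
shapes and values are arbitrary choices for the example). It shows that a
transpose is only a view with swapped strides, and that data moves only when
something like ascontiguousarray asks for a specific layout:

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)  # C-ordered, shape (3, 4)
t = a.T                                            # shape (4, 3), no data copied

# The transpose is a view onto the same buffer with the strides swapped.
shares = np.shares_memory(a, t)
strides_a, strides_t = a.strides, t.strides

# ascontiguousarray has to physically rearrange the data into C order here,
# because t is not C-contiguous.
c = np.ascontiguousarray(t)
copied = not np.shares_memory(t, c)
```

The copy made in the last step is the kind of operation a dedicated
transposition library would be expected to speed up.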

Of course, few if any numpy functions work equally well on different memory
layouts; unary ufuncs contain code to try to carry out their iteration in
the fastest way, but it's not clear how well that works or whether they
have the freedom to choose the layouts of their output arrays.

If you wanted to integrate HPTT into numpy, I think the best approach might
be to wire it into the assignment machinery, so that when users do things
like a[::2,:] = b[:,::3].T HPTT springs into action behind the scenes and
makes this assignment as efficient as possible (how well does it handle
arrays with spaces between elements?). Then ascontiguousarray and
asfortranarray and the like could simply use assignment to an
appropriately-ordered destination when they actually needed to do anything.
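
As a rough sketch of what I mean (plain numpy only; no HPTT call appears here,
and the helper name as_c_contiguous_via_assignment is made up for this
example), the strided assignment and a layout conversion built on top of
assignment could look like this:

```python
import numpy as np

b = np.arange(4 * 9, dtype=np.float64).reshape(4, 9)
a = np.zeros((6, 4))

# Source: every third column of b, transposed -> shape (3, 4), non-contiguous.
# Destination: every second row of a -> shape (3, 4), with gaps between rows.
a[::2, :] = b[:, ::3].T


def as_c_contiguous_via_assignment(x):
    """Hypothetical helper: copy x into a freshly allocated C-ordered array
    using nothing but assignment."""
    out = np.empty(x.shape, dtype=x.dtype, order="C")
    out[...] = x
    return out


c = as_c_contiguous_via_assignment(b[:, ::3].T)
```

If the assignment machinery were accelerated, both lines that move data here
would benefit without any change to user code.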

> TCL uses the Transpose-Transpose-GEMM-Transpose approach where all tensors
> are flattened into matrices (via HPTT) and then contracted via GEMM; the
> final result is eventually folded (via HPTT) into the desired output tensor.
>

This is a pretty direct replacement of einsum, but I think einsum may well
already do pretty much this, apart from not using HPTT to do the
transposes. So the way to get this functionality would be to make the
matrix-rearrangement primitives use HPTT, as above.
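
For concreteness, here is a small sketch of the Transpose-Transpose-GEMM-Transpose
idea written with ordinary numpy operations. The index names, shapes and the
particular contraction are assumptions chosen for illustration; this is not
TCL's code, and I am not claiming it is what einsum does internally:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5, 4))  # indices (a, k, b)
B = rng.standard_normal((6, 5))     # indices (n, k)

# Transpose 1: move the contracted index k last in A, then flatten to a matrix.
A_mat = np.ascontiguousarray(A.transpose(0, 2, 1)).reshape(3 * 4, 5)  # (a*b, k)
# Transpose 2: move k first in B.
B_mat = np.ascontiguousarray(B.T)                                     # (k, n)

# GEMM: a single matrix-matrix product does all the arithmetic.
C_mat = A_mat @ B_mat                                                 # (a*b, n)

# Final transpose: fold the result into the desired (a, n, b) layout.
C = np.ascontiguousarray(C_mat.reshape(3, 4, 6).transpose(0, 2, 1))

# Reference result for the same contraction.
C_ref = np.einsum("akb,nk->anb", A, B)
```

The three ascontiguousarray calls are the transposition steps; those are the
places where a faster rearrangement primitive would pay off.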

> Would it be possible to expose HPTT and TCL as optional packages within
> NumPy? This way I don't have to redo the work that I've already put into
> those libraries.
>

I think numpy should be regarded as a basically-complete package for
manipulating strided in-memory data, to which we are reluctant to add new
user-visible functionality. Tools that can act under the hood to make
existing code faster, or to reduce the work users must do to make their
code run fast enough, are valuable.

> Might check the license if your work uses code from a publication.
>
> As far as licenses are concerned that should not be a problem since I
> wrote the code myself and it doesn't use code from publications other than
> mine.
>

Would some of your techniques help numpy to more rapidly evaluate things
like C[...] = A+B, when A, B and C are arbitrarily strided and there are no
ordering constraints on the result? Or just A+B where numpy is free to
choose the memory layout for the result?
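
A minimal sketch of the two cases I have in mind, with arbitrarily chosen
shapes and strides (the names C_buf and D are just for this example):

```python
import numpy as np

# Three (4, 3) operands, each with a different non-contiguous layout.
A = np.arange(24, dtype=np.float64).reshape(4, 6)[:, ::2]  # every other column
B = np.arange(12, dtype=np.float64).reshape(3, 4).T        # transposed view
C_buf = np.zeros((8, 3))
C = C_buf[::2]                                             # every other row

# Case 1: the destination layout is fixed by C. Passing out=C lets the ufunc
# write directly into the strided destination; C[...] = A + B would first
# build a temporary for A + B and then copy it.
np.add(A, B, out=C)

# Case 2: no destination is given, so numpy allocates the result itself and
# is free to pick its memory layout.
D = A + B
```

In both cases the question is how good the iteration order chosen for the
strided reads and writes is.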

Anne