[Numpy-discussion] Tensor Contraction (HPTT) and Tensor Transposition (TCL)
sebastian at sipsolutions.net
Thu Aug 17 03:55:57 EDT 2017
On Thu, 2017-08-17 at 00:33 +0200, Paul Springer wrote:
> Am 8/16/17 um 6:08 PM schrieb Anne Archibald:
> > If you wanted to integrate HPTT into numpy, I think the best
> > approach might be to wire it into the assignment machinery, so that
> > when users do things like a[::2,:] = b[:,::3].T HPTT springs into
> > action behind the scenes and makes this assignment as efficient as
> > possible (how well does it handle arrays with spaces between
> > elements?). Then ascontiguousarray and asfortranarray and the like
> > could simply use assignment to an appropriately-ordered destination
> > when they actually needed to do anything.
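For concreteness, the kind of strided, transposed assignment Anne describes can be written out in plain NumPy (array shapes here are illustrative):

```python
import numpy as np

# Today NumPy's general iterator copies such assignments element by
# element; the idea is that an optimized transpose kernel (such as
# HPTT's) could take over behind the scenes for cases like this.
b = np.arange(24.0).reshape(4, 6)
a = np.zeros((4, 4))

# Destination is strided (every other row); source is a strided transpose.
a[::2, :] = b[:, ::3].T

# ascontiguousarray is then just an assignment into a C-ordered destination.
c = np.ascontiguousarray(b.T)
assert c.flags['C_CONTIGUOUS']
```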
> HPTT offers support for subtensors (via the outerSize parameter,
> which is similar to the leading dimension in BLAS); thus, HPTT can
> also deal with arbitrarily strided transpositions.
> However, a non-unit stride for the fastest-varying index is
> devastating for performance, since it prohibits vectorization and
> the exploitation of spatial locality.
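The subtensor support mentioned above can be pictured in NumPy terms, where the outer sizes play the role of BLAS leading dimensions (this sketch uses plain NumPy views, not HPTT's actual C++ API):

```python
import numpy as np

# A subtensor in the HPTT sense: a logically smaller tensor embedded in
# a larger allocation. The outer shape (8, 8) plays the role of the BLAS
# leading dimension; the view below therefore has non-contiguous strides.
outer = np.zeros((8, 8))
sub = outer[:4, :5]            # 4x5 subtensor, row stride of 8 elements

src = np.arange(20.0).reshape(5, 4)
sub[...] = src.T               # strided transposition into the subtensor

# A non-unit stride on the fastest-varying index, by contrast, defeats
# vectorization: consecutive logical elements are no longer adjacent
# in memory.
slow_view = outer[:, ::2]
```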
> What would the integration of HPTT into NumPy look like?
> Which steps would need to be taken?
> Would HPTT have to be distributed in source form alongside NumPy
> (at which point I might have to change the license for HPTT), or
> would it be fine to add a git dependency? That way, users who build
> NumPy from source could fetch HPTT and set a flag during NumPy's
> build process indicating that HPTT is available.
> What would the process look like if NumPy is distributed as a
> precompiled binary?
Well, numpy is BSD, and the official binaries will be BSD; someone else
could of course ship less free binaries. I doubt we can have a hard
dependency unless it is part of the numpy source (some trick like this
existed at one point for fftw, but...). I doubt including the source
itself is going to happen quickly, since we would first have to decide
to actually use a modern C++ compiler (I have no idea whether that is
problematic or not).
Having a blocked/fancier (I assume) iterator jump in at least for
simple operations such as transposed+copy as Anne suggested sounds very
cool though. It could be nice for simple ufuncs at least as well. I
have no idea how difficult that may be though or how much complexity it
would add to maintenance. My guess is it might require quite a lot of
work to integrate such optimizations into the Iterator itself (even
though it would be awesome), compared to just trying to plug it into
some selected fast paths as Anne suggested.
One thing that might be very simple and also pretty nice is just trying
to keep the documentation (or wiki page or so linked from the
documentation) up to date with suggestions for people interested in
speed improvements listing things such as (not sure if we have that):
* Use pyfftw for speeding up ffts
* numexpr can be nice and gives a way to quickly use multiple cores
* numba can automagically compile some python functions to be fast
* Use TCL if you need faster einsum-like operations
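As a point of reference, the kind of einsum call that TCL targets is an arbitrary pairwise tensor contraction (shapes and index labels here are illustrative):

```python
import numpy as np

# A pairwise tensor contraction over two shared indices (c and d):
#   C[a,b,e,f] = sum_{c,d} A[a,b,c,d] * B[c,d,e,f]
# numpy.einsum handles this today; TCL aims at the same class of
# operations with a faster backend.
A = np.random.rand(2, 3, 4, 5)
B = np.random.rand(4, 5, 6, 7)
C = np.einsum('abcd,cdef->abef', A, B)
assert C.shape == (2, 3, 6, 7)
```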
Just a few thoughts; I did not really think about details. But yes, it
sounds reasonable to me to re-add support for optional dependencies
such as fftw or your TCL. But packagers have to make use of that, or I
fear it will actually be less available than a standalone python module.
> The same questions apply with respect to TCL.
> > > TCL uses the Transpose-Transpose-GEMM-Transpose approach where
> > > all tensors are flattened into matrices (via HPTT) and then
> > > contracted via GEMM; the final result is eventually folded (via
> > > HPTT) into the desired output tensor.
> > >
> > This is a pretty direct replacement of einsum, but I think einsum
> > may well already do pretty much this, apart from not using HPTT to
> > do the transposes. So the way to get this functionality would be to
> > make the matrix-rearrangement primitives use HPTT, as above.
> That would certainly be one approach; however, TCL also explores
> several different strategies/candidates and picks the one that
> minimizes the data movement required by the transpositions.
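The Transpose-Transpose-GEMM-Transpose scheme discussed above can be sketched in plain NumPy (shapes and index labels are illustrative; TCL does the transposes with HPTT and picks the cheapest candidate layout):

```python
import numpy as np

# TTGT for the contraction C[a,b,i,j] = sum_k A[a,k,i] * B[k,b,j]:
A = np.random.rand(3, 5, 4)   # indices (a, k, i)
B = np.random.rand(5, 3, 6)   # indices (k, b, j)

# 1) Transpose so each operand flattens into a matrix with the
#    contracted index k on the shared dimension.
A_mat = A.transpose(0, 2, 1).reshape(3 * 4, 5)   # (a*i, k)
B_mat = B.reshape(5, 3 * 6)                      # (k, b*j)

# 2) A single GEMM performs all the arithmetic.
C_mat = A_mat @ B_mat                            # (a*i, b*j)

# 3) Fold the result back into the desired output layout.
C = C_mat.reshape(3, 4, 3, 6).transpose(0, 2, 1, 3)   # (a, b, i, j)

assert np.allclose(C, np.einsum('aki,kbj->abij', A, B))
```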
> > > Would it be possible to expose HPTT and TCL as optional packages
> > > within NumPY? This way I don't have to redo the work that I've
> > > already put into those libraries.
> > >
> > I think numpy should be regarded as a basically-complete package
> > for manipulating strided in-memory data, to which we are reluctant
> > to add new user-visible functionality. Tools that can act under the
> > hood to make existing code faster, or to reduce the work users must
> > do to make their code run fast enough, are valuable.
> It seems to me that TCL is such a candidate, since it can replace a
> significant portion of the functionality offered by numpy.einsum(),
> yielding significantly higher performance.
> I imagine something of the form:
>
> def einsum(...):
>     if tclApplicable and tclAvailable:
>         ...  # dispatch to TCL; otherwise fall back to numpy's own path
> > Would some of your techniques help numpy to more rapidly evaluate
> > things like C[...] = A+B, when A B and C are arbitrarily strided
> > and there are no ordering constraints on the result? Or just A+B
> > where numpy is free to choose the memory layout for the result?
> Actually, HPTT is only concerned with operations of the form
> B[perm(i0,i1,...)] = alpha * A[i0,i1,...] + beta * B[perm(i0,i1,...)]
> (where alpha and beta are scalars). Summing over multiple transposed
> tensors can be quite challenging (https://arxiv.org/abs/1705.06661)
> and is not covered by HPTT. Does this answer your question?
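For semantics only, the operation above can be written as a plain-NumPy reference (the function name and index convention are illustrative, not HPTT's API; HPTT's value lies in performing this with blocking, vectorization, and threads):

```python
import numpy as np

def transpose_update(A, perm, alpha=1.0, beta=0.0, B=None):
    """Reference for B[perm(i0,i1,...)] = alpha*A[i0,i1,...] + beta*B[...]."""
    if B is None:
        B = np.zeros(tuple(A.shape[p] for p in perm))
    B *= beta                          # scale the existing output
    B += alpha * np.transpose(A, perm) # add the scaled, permuted input
    return B

A = np.arange(24.0).reshape(2, 3, 4)
B = transpose_update(A, (2, 0, 1), alpha=2.0)
assert B.shape == (4, 2, 3)
```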