<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
On 8/16/17 at 6:08 PM, Anne Archibald wrote:<br>
<blockquote
cite="mid:CANm_+ZognufzFdpF430O_GV7wqOkduKAdz22u6j5D3DDkqqDgw@mail.gmail.com"
type="cite">
<div dir="ltr">(NB all: the thread title seems to interchange the
acronyms for the Tensor Contraction Library (TCL) and the
High-Performance Tensor Transpose (HPTT) packages. I'm not fixing
it so as not to break threading.) <br>
<br>
<div class="gmail_quote">
<div dir="ltr">On Wed, Aug 16, 2017 at 11:40 AM Paul Springer
<<a moz-do-not-send="true" href="mailto:pavdev@gmx.de">pavdev@gmx.de</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>If you want to get it into Numpy, it would be
worth checking if the existing functions can be
improved before adding new ones.<br>
<br>
</div>
<div>Note that Numpy's transposition method just
rearranges the indices, so the advantage of
actual transposition is to have better cache
performance or allow direct use of CBLAS. I
assume TCL uses some tricks to do transposition
in a way that is more cache friendly?<br>
</div>
</div>
</div>
</div>
</blockquote>
</div>
<div bgcolor="#FFFFFF" text="#000000"> HPTT is a
sophisticated library for tensor transpositions, as such
it blocks the tensors such that (1) spatial locality can
be exploited. Moreover, (2) it uses explicit vectorization
to take advantage of the CPU's vector units.<br>
</div>
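<div><br>
(For intuition, here is a rough pure-NumPy sketch of the blocking idea; this is
only an illustration, not HPTT's actual implementation:)
<pre>
import numpy as np

def blocked_transpose(A, block=64):
    """Cache-blocked 2D transpose: a toy version of the blocking idea."""
    n, m = A.shape
    B = np.empty((m, n), dtype=A.dtype)
    for i in range(0, n, block):
        for j in range(0, m, block):
            # each small tile fits in cache, so the reads from A and the
            # writes to B both enjoy spatial locality
            B[j:j + block, i:i + block] = A[i:i + block, j:j + block].T
    return B
</pre>
</div>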
</blockquote>
<div><br>
</div>
<div>I think this library provides functionality that isn't
readily accessible from within numpy at the moment. The only
functions I know of to rearrange the memory layout of data
are things like ascontiguousarray and asfortranarray, as
well as assignment (e.g. a[...] = b). The general strategy
within numpy is to assume that all functions work equally
well on arrays with arbitrary memory layouts, so that users
often don't even know the memory layouts of their data. The
striding functionality means data usually doesn't actually
get transposed until absolutely necessary.</div>
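<div><br>
For example, transposing just swaps the shape and strides; the data is not
moved until something like ascontiguousarray (or an assignment) forces a copy:
<pre>
import numpy as np

a = np.arange(6).reshape(2, 3)
at = a.T                          # no data movement: only shape/strides change
print(at.base is a)               # True  -> a view into a's buffer
print(at.flags['C_CONTIGUOUS'])   # False -> the view is now Fortran-ordered

b = np.ascontiguousarray(at)      # only here is the data actually rearranged
print(b.flags['C_CONTIGUOUS'])    # True
</pre>
</div>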
<div><br>
</div>
<div>Of course, few if any numpy functions work equally well
on different memory layouts; unary ufuncs contain code to
try to carry out their iteration in the fastest way, but
it's not clear how well that works or whether they have the
freedom to choose the layouts of their output arrays. </div>
<div><br>
</div>
<div>If you wanted to integrate HPTT into numpy, I think the
best approach might be to wire it into the assignment
machinery, so that when users do things like a[::2,:] =
b[:,::3].T HPTT springs into action behind the scenes and
makes this assignment as efficient as possible (how well
does it handle arrays with spaces between elements?). Then
ascontiguousarray and asfortranarray and the like could
simply use assignment to an appropriately-ordered
destination when they actually needed to do anything.</div>
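<div><br>
For instance (the shapes here are just hypothetical), this is the kind of
strided, transposing assignment such a hook would have to handle efficiently:
<pre>
import numpy as np

a = np.zeros((1000, 500))
b = np.random.rand(500, 1500)

# Today NumPy's generic strided copy loop performs this assignment; the idea
# would be to dispatch it to an optimized transpose kernel behind the scenes.
a[::2, :] = b[:, ::3].T
</pre>
</div>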
</div>
</div>
</blockquote>
HPTT offers support for subtensors (via the outerSize parameter,
which is similar to the leading dimension in BLAS); thus, HPTT can
also deal with arbitrarily strided transpositions.<br>
However, a non-unit stride for the fastest-varying index is
devastating for performance, since it prohibits vectorization and
the exploitation of spatial locality.<br>
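<br>
To make that concrete in NumPy terms (illustration only; in HPTT the outer
sizes are passed explicitly rather than inferred from a view):<br>
<pre>
import numpy as np

big = np.arange(8 * 10 * 12).reshape(8, 10, 12)

# A "subtensor": a view into the larger buffer; its logical sizes are
# (8, 10, 6) while its strides are still those of the full (8, 10, 12) tensor.
sub = big[:, :, :6]
print(sub.strides[-1] // sub.itemsize)   # 1 -> unit innermost stride, fine

# The problematic case: a non-unit stride for the fastest-varying index,
# which defeats vectorized loads/stores and spatial locality.
bad = big[:, :, ::2]
print(bad.strides[-1] // bad.itemsize)   # 2
</pre>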
<br>
What would the integration of HPTT into NumPy look like? <br>
Which steps would need to be taken?<br>
Would it be required that HPTT be distributed in source code
alongside NumPy (at that point I might have to change the license
for HPTT), or would it be fine to add a git dependency? That way,
users who build NumPy from source could fetch HPTT and set a flag
during the build process of NumPy, indicating that HPTT is
available.<br>
What would the process look like if NumPy is distributed as a
precompiled binary?<br>
<br>
The same questions apply with respect to TCL.<br>
<blockquote
cite="mid:CANm_+ZognufzFdpF430O_GV7wqOkduKAdz22u6j5D3DDkqqDgw@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> TCL uses the
Transpose-Transpose-GEMM-Transpose approach where all
tensors are flattened into matrices (via HPTT) and then
contracted via GEMM; the final result is eventually folded
(via HPTT) into the desired output tensor.<br>
</div>
</blockquote>
<div><br>
</div>
<div>This is a pretty direct replacement of einsum, but I
think einsum may well already do pretty much this, apart
from not using HPTT to do the transposes. So the way to get
this functionality would be to make the matrix-rearrangement
primitives use HPTT, as above.</div>
</div>
</div>
</blockquote>
That would certainly be one approach; however, TCL also explores
several different strategies/candidates and picks the one that
minimizes the data movement required by the transpositions.<br>
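<br>
To make the Transpose-Transpose-GEMM-Transpose idea concrete, here is a toy
pure-NumPy version of one such candidate (TCL/HPTT performs the transposes far
more efficiently and weighs this flattening against alternative ones):<br>
<pre>
import numpy as np

# Toy TTGT for C[a,b,i,j] = sum_k A[a,k,b] * B[j,k,i]  (illustration only)
A = np.random.rand(4, 7, 5)      # indices a, k, b
B = np.random.rand(6, 7, 3)      # indices j, k, i

# 1) transpose + reshape both operands into matrices
Am = np.ascontiguousarray(A.transpose(0, 2, 1)).reshape(4 * 5, 7)   # (a*b, k)
Bm = np.ascontiguousarray(B.transpose(1, 0, 2)).reshape(7, 6 * 3)   # (k, j*i)

# 2) a single GEMM performs the actual contraction
Cm = Am @ Bm                                                        # (a*b, j*i)

# 3) fold the matrix result back into the desired output tensor
C = Cm.reshape(4, 5, 6, 3).transpose(0, 1, 3, 2)                    # a, b, i, j

assert np.allclose(C, np.einsum('akb,jki->abij', A, B))
</pre>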
<blockquote
cite="mid:CANm_+ZognufzFdpF430O_GV7wqOkduKAdz22u6j5D3DDkqqDgw@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> Would it be possible
to expose HPTT and TCL as optional packages within NumPy?
This way I don't have to redo the work that I've already
put into those libraries.</div>
</blockquote>
<div><br>
</div>
<div>I think numpy should be regarded as a basically-complete
package for manipulating strided in-memory data, to which we
are reluctant to add new user-visible functionality. Tools
that can act under the hood to make existing code faster, or
to reduce the work users must do to make their code run fast
enough, are valuable.</div>
</div>
</div>
</blockquote>
It seems to me that TCL is such a candidate, since it can replace a
significant portion of the functionality offered by numpy.einsum()
while yielding substantially higher performance.<br>
<br>
I imagine something of the form:<br>
<br>
<pre>
def einsum(...):
    if tclApplicable and tclAvailable:
        return tcl.tensorMult(...)
</pre>
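<br>
Spelled out a bit more (the module name tcl, the applicability check, and the
tcl.tensorMult signature are all hypothetical here):<br>
<pre>
import numpy as np

try:
    import tcl                      # hypothetical optional package
    _HAVE_TCL = True
except ImportError:
    _HAVE_TCL = False

def _tcl_applicable(subscripts, operands):
    # placeholder predicate: e.g. only plain two-operand contractions
    return len(operands) == 2 and '->' in subscripts

def einsum(subscripts, *operands, **kwargs):
    # use TCL when it is installed and the call fits what it supports,
    # otherwise fall back to NumPy's own implementation
    if _HAVE_TCL and not kwargs and _tcl_applicable(subscripts, operands):
        return tcl.tensorMult(subscripts, *operands)    # hypothetical API
    return np.einsum(subscripts, *operands, **kwargs)
</pre>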
<blockquote
cite="mid:CANm_+ZognufzFdpF430O_GV7wqOkduKAdz22u6j5D3DDkqqDgw@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div> </div>
<div>Would some of your techniques help numpy to more rapidly
evaluate things like C[...] = A+B, when A, B, and C are
arbitrarily strided and there are no ordering constraints on
the result? Or just A+B where numpy is free to choose the
memory layout for the result?</div>
</div>
</div>
</blockquote>
Actually, HPTT is only concerned with operations of the form
B[perm(i0,i1,...)] = alpha * A[i0,i1,...] + beta *
B[perm(i0,i1,...)] (where alpha and beta are scalars). Summing over
multiple transposed tensors can be quite challenging
(<a class="moz-txt-link-freetext" href="https://arxiv.org/abs/1705.06661">https://arxiv.org/abs/1705.06661</a>) and is not covered by HPTT. Does
this answer your question?<br>
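<br>
(For reference, in plain NumPy that operation corresponds to something like the
following; the shapes are just an example:)<br>
<pre>
import numpy as np

alpha, beta = 2.0, 0.5
perm = (2, 0, 1)
A = np.random.rand(3, 4, 5)
B = np.random.rand(5, 3, 4)       # A's shape permuted by perm

# the single fused operation HPTT performs: permute A, scale, and update B
B[...] = alpha * np.transpose(A, perm) + beta * B
</pre>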
</body>
</html>