<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
On 8/16/17 at 6:08 PM, Anne Archibald wrote:<br>
<blockquote
cite="mid:CANm_+ZognufzFdpF430O_GV7wqOkduKAdz22u6j5D3DDkqqDgw@mail.gmail.com"
type="cite">
<div dir="ltr">(NB all: the thread title seems to interchange the
acronyms for the Tensor Contraction Library (TCL) and the
High-Performance Tensor Transpose (HPTT) packages. I'm not fixing
it so as not to break threading.) <br>
<br>
<div class="gmail_quote">
<div dir="ltr">On Wed, Aug 16, 2017 at 11:40 AM Paul Springer
<<a moz-do-not-send="true" href="mailto:pavdev@gmx.de">pavdev@gmx.de</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>If you want to get it into Numpy, it would be
worth checking if the existing functions can be
improved before adding new ones.<br>
<br>
</div>
<div>Note that Numpy's transposition method just
rearranges the indices, so the advantage of
actual transposition is to have better cache
performance or allow direct use of CBLAS. I
assume TCL uses some tricks to do transposition
in a way that is more cache friendly?<br>
</div>
</div>
</div>
</div>
</blockquote>
</div>
<div bgcolor="#FFFFFF" text="#000000"> HPTT is a
sophisticated library for tensor transpositions, as such
it blocks the tensors such that (1) spatial locality can
be exploited. Moreover, (2) it uses explicit vectorization
to take advantage of the CPU's vector units.<br>
</div>
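<div><br>
(For intuition, here is a rough pure-NumPy sketch of the blocking idea; this is
only an illustration, not HPTT's actual implementation:)
<pre>
import numpy as np

def blocked_transpose(A, block=64):
    """Cache-blocked 2D transpose: a toy version of the blocking idea."""
    n, m = A.shape
    B = np.empty((m, n), dtype=A.dtype)
    for i in range(0, n, block):
        for j in range(0, m, block):
            # each small tile fits in cache, so the reads from A and the
            # writes to B both enjoy spatial locality
            B[j:j + block, i:i + block] = A[i:i + block, j:j + block].T
    return B
</pre>
</div>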
</blockquote>
<div><br>
</div>
<div>I think this library provides functionality that isn't
readily accessible from within numpy at the moment. The only
functions I know of to rearrange the memory layout of data
are things like ascontiguousarray and asfortranarray, as
well as assignment (e.g. a[...] = b). The general strategy
within numpy is to assume that all functions work equally
well on arrays with arbitrary memory layouts, so that users
often don't even know the memory layouts of their data. The
striding functionality means data usually doesn't actually
get transposed until absolutely necessary.</div>
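<div><br>
For example, transposing just swaps the shape and strides; the data is not
moved until something like ascontiguousarray (or an assignment) forces a copy:
<pre>
import numpy as np

a = np.arange(6).reshape(2, 3)
at = a.T                          # no data movement: only shape/strides change
print(at.base is a)               # True  -> a view into a's buffer
print(at.flags['C_CONTIGUOUS'])   # False -> the view is now Fortran-ordered

b = np.ascontiguousarray(at)      # only here is the data actually rearranged
print(b.flags['C_CONTIGUOUS'])    # True
</pre>
</div>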
<div><br>
</div>
<div>Of course, few if any numpy functions work equally well
on different memory layouts; unary ufuncs contain code to
try to carry out their iteration in the fastest way, but
it's not clear how well that works or whether they have the
freedom to choose the layouts of their output arrays. </div>
<div><br>
</div>
<div>If you wanted to integrate HPTT into numpy, I think the
best approach might be to wire it into the assignment
machinery, so that when users do things like a[::2,:] =
b[:,::3].T HPTT springs into action behind the scenes and
makes this assignment as efficient as possible (how well
does it handle arrays with spaces between elements?). Then
ascontiguousarray and asfortranarray and the like could
simply use assignment to an appropriately-ordered
destination when they actually needed to do anything.</div>
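<div><br>
For instance (the shapes here are just hypothetical), this is the kind of
strided, transposing assignment such a hook would have to handle efficiently:
<pre>
import numpy as np

a = np.zeros((1000, 500))
b = np.random.rand(500, 1500)

# Today NumPy's generic strided copy loop performs this assignment; the idea
# would be to dispatch it to an optimized transpose kernel behind the scenes.
a[::2, :] = b[:, ::3].T
</pre>
</div>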
</div>
</div>
</blockquote>
HPTT offers support for subtensors (via the outerSize parameter,
which is similar to the leading dimension in BLAS); thus, HPTT can
also deal with arbitrarily strided transpositions.<br>
However, a non-unit stride for the fastest-varying index is
devastating for performance, since it prohibits vectorization and
the exploitation of spatial locality.<br>
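<br>
To make that concrete in NumPy terms (illustration only; in HPTT the outer
sizes are passed explicitly rather than inferred from a view):<br>
<pre>
import numpy as np

big = np.arange(8 * 10 * 12).reshape(8, 10, 12)

# A "subtensor": a view into the larger buffer; its logical sizes are
# (8, 10, 6) while its strides are still those of the full (8, 10, 12) tensor.
sub = big[:, :, :6]
print(sub.strides[-1] // sub.itemsize)   # 1 -> unit innermost stride, fine

# The problematic case: a non-unit stride for the fastest-varying index,
# which defeats vectorized loads/stores and spatial locality.
bad = big[:, :, ::2]
print(bad.strides[-1] // bad.itemsize)   # 2
</pre>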
<br>
What would the integration of HPTT into NumPy look like? <br>
Which steps would need to be taken?<br>
Would it be required that HPTT be distributed in source code
alongside NumPy (at that point I might have to change the license
for HPTT), or would it be fine to add a git dependency? That way,
users who build NumPy from source could fetch HPTT and set a flag
during the build process of NumPy, indicating that HPTT is
available.<br>
What would the process look like if NumPy is distributed as a
precompiled binary?<br>
<br>
The same questions apply with respect to TCL.<br>
<blockquote
cite="mid:CANm_+ZognufzFdpF430O_GV7wqOkduKAdz22u6j5D3DDkqqDgw@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> TCL uses the
Transpose-Transpose-GEMM-Transpose approach where all
tensors are flattened into matrices (via HPTT) and then
contracted via GEMM; the final result is eventually folded
(via HPTT) into the desired output tensor.<br>
</div>
</blockquote>
<div><br>
</div>
<div>This is a pretty direct replacement of einsum, but I
think einsum may well already do pretty much this, apart
from not using HPTT to do the transposes. So the way to get
this functionality would be to make the matrix-rearrangement
primitives use HPTT, as above.</div>
</div>
</div>
</blockquote>
That would certainly be one approach; however, TCL also explores
several different strategies/candidates and picks the one that
minimizes the data movement required by the transpositions.<br>
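<br>
To make the Transpose-Transpose-GEMM-Transpose idea concrete, here is a toy
pure-NumPy version of one such candidate (TCL/HPTT performs the transposes far
more efficiently and weighs this flattening against alternative ones):<br>
<pre>
import numpy as np

# Toy TTGT for C[a,b,i,j] = sum_k A[a,k,b] * B[j,k,i]  (illustration only)
A = np.random.rand(4, 7, 5)      # indices a, k, b
B = np.random.rand(6, 7, 3)      # indices j, k, i

# 1) transpose + reshape both operands into matrices
Am = np.ascontiguousarray(A.transpose(0, 2, 1)).reshape(4 * 5, 7)   # (a*b, k)
Bm = np.ascontiguousarray(B.transpose(1, 0, 2)).reshape(7, 6 * 3)   # (k, j*i)

# 2) a single GEMM performs the actual contraction
Cm = Am @ Bm                                                        # (a*b, j*i)

# 3) fold the matrix result back into the desired output tensor
C = Cm.reshape(4, 5, 6, 3).transpose(0, 1, 3, 2)                    # a, b, i, j

assert np.allclose(C, np.einsum('akb,jki->abij', A, B))
</pre>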
<blockquote
cite="mid:CANm_+ZognufzFdpF430O_GV7wqOkduKAdz22u6j5D3DDkqqDgw@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> Would it be possible
to expose HPTT and TCL as optional packages within NumPy?
This way I don't have to redo the work that I've already
put into those libraries.</div>
</blockquote>
<div><br>
</div>
<div>I think numpy should be regarded as a basically-complete
package for manipulating strided in-memory data, to which we
are reluctant to add new user-visible functionality. Tools
that can act under the hood to make existing code faster, or
to reduce the work users must do to make their code run fast
enough, are valuable.</div>
</div>
</div>
</blockquote>
It seems to me that TCL is such a candidate, since it can replace a
significant portion of the functionality offered by numpy.einsum()
while yielding substantially higher performance.<br>
<br>
I imagine something of the form:<br>
<br>
<pre>
def einsum(...):
    if tclApplicable and tclAvailable:
        return tcl.tensorMult(...)
</pre>
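<br>
Spelled out a bit more (the module name tcl, the applicability check, and the
tcl.tensorMult signature are all hypothetical here):<br>
<pre>
import numpy as np

try:
    import tcl                      # hypothetical optional package
    _HAVE_TCL = True
except ImportError:
    _HAVE_TCL = False

def _tcl_applicable(subscripts, operands):
    # placeholder predicate: e.g. only plain two-operand contractions
    return len(operands) == 2 and '->' in subscripts

def einsum(subscripts, *operands, **kwargs):
    # use TCL when it is installed and the call fits what it supports,
    # otherwise fall back to NumPy's own implementation
    if _HAVE_TCL and not kwargs and _tcl_applicable(subscripts, operands):
        return tcl.tensorMult(subscripts, *operands)    # hypothetical API
    return np.einsum(subscripts, *operands, **kwargs)
</pre>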
<blockquote
cite="mid:CANm_+ZognufzFdpF430O_GV7wqOkduKAdz22u6j5D3DDkqqDgw@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div> </div>
<div>Would some of your techniques help numpy to more rapidly
evaluate things like C[...] = A+B, when A, B, and C are
arbitrarily strided and there are no ordering constraints on
the result? Or just A+B where numpy is free to choose the
memory layout for the result?</div>
</div>
</div>
</blockquote>
Actually, HPTT is only concerned with operations of the form
B[perm(i0,i1,...)] = alpha * A[i0,i1,...] + beta *
B[perm(i0,i1,...)] (where alpha and beta are scalars). Summing over
multiple transposed tensors can be quite challenging
(<a class="moz-txt-link-freetext" href="https://arxiv.org/abs/1705.06661">https://arxiv.org/abs/1705.06661</a>) and is not covered by HPTT. Does
this answer your question?<br>
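<br>
(For reference, in plain NumPy that operation corresponds to something like the
following; the shapes are just an example:)<br>
<pre>
import numpy as np

alpha, beta = 2.0, 0.5
perm = (2, 0, 1)
A = np.random.rand(3, 4, 5)
B = np.random.rand(5, 3, 4)       # A's shape permuted by perm

# the single fused operation HPTT performs: permute A, scale, and update B
B[...] = alpha * np.transpose(A, perm) + beta * B
</pre>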
</body>
</html>