<div dir="ltr">(NB all: the thread title seems to interchange the acronyms for the Thread Contraction Library (TCL) and the High-Perormance Tensor Transpose (HPTT) packages. I'm not fixing it so as not to break threading.) <br><br><div class="gmail_quote"><div dir="ltr">On Wed, Aug 16, 2017 at 11:40 AM Paul Springer <<a href="mailto:pavdev@gmx.de">pavdev@gmx.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><blockquote type="cite"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>If you want to get it into Numpy, it would be worth

              checking if the existing functions can be improved before

              adding new ones.<br>

              <br>

            </div>

            <div>Note that Numpy transposition method just rearranges

              the indices, so the advantage of actual transposition is

              to have better cache performance or allow direct use of

              CBLAS. I assume TCL uses some tricks to do transposition

              in a way that is more cache friendly?<br>

            </div>

          </div>

        </div>

      </div>

    </blockquote></div><div bgcolor="#FFFFFF" text="#000000">

    HPTT is a sophisticated library for tensor transpositions, as such

    it blocks the tensors such that (1) spatial locality can be

    exploited. Moreover, (2) it uses explicit vectorization to take

    advantage of the CPU's vector units.<br></div></blockquote><div><br></div><div>I think this library provides functionality that isn't readily accessible from within numpy at the moment. The only functions I know of to rearrange the memory layout of data are things like ascontiguousarray and asfortranarray, as well as assignment (e.g. a[...] = b). The general strategy within numpy is to assume that all functions work equally well on arrays with arbitrary memory layouts, so that users often don't even know the memory layouts of their data. The striding functionality means data usually doesn't actually get transposed until absolutely necessary.</div><div><br></div><div>Of course, few if any numpy functions work equally well on different memory layouts; unary ufuncs contain code to try to carry out their iteration in the fastest way, but it's not clear how well that works or whether they have the freedom to choose the layouts of their output arrays. </div><div><br></div><div>If you wanted to integrate HPTT into numpy, I think the best approach might be to wire it into the assignment machinery, so that when users do things like a[::2,:] = b[:,::3].T HPTT springs into action behind the scenes and makes this assignment as efficient as possible (how well does it handle arrays with spaces between elements?). Then ascontiguousarray and asfortranarray and the like could simply use assignment to an appropriately-ordered destination when they actually needed to do anything.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">

    TCL uses the Transpose-Transpose-GEMM-Transpose approach where all

    tensors are flattened into matrices (via HPTT) and then contracted

    via GEMM; the final result is eventually folded (via HPTT) into the

    desired output tensor.<br></div></blockquote><div><br></div><div>This is a pretty direct replacement of einsum, but I think einsum may well already do pretty much this, apart from not using HPTT to do the transposes. So the way to get this functionality would be to make the matrix-rearrangement primitives use HPTT, as above.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">

    Would it be possible to expose HPTT and TCL as optional packages

    within NumPY? This way I don't have to redo the work that I've

    already put into those libraries.</div></blockquote><div><br></div><div>I think numpy should be regarded as a basically-complete package for manipulating strided in-memory data, to which we are reluctant to add new user-visible functionality. Tools that can act under the hood to make existing code faster, or to reduce the work users must to to make their code run fast enough, are valuable.</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">

    <blockquote type="cite">

      <div dir="ltr">

        <div class="gmail_extra">

          <div class="gmail_quote">

            <div>Might check the license if your work uses code from a

              publication.<br>

            </div>

          </div>

        </div>

      </div>

    </blockquote></div><div bgcolor="#FFFFFF" text="#000000">

    As far as licenses are concerned that should not be a problem since

    I wrote to code myself and it doesn't use code from publications

    other than mine.<br></div></blockquote><div> </div><div>Would some of your techniques help numpy to more rapidly evaluate things like C[...] = A+B, when A B and C are arbitrarily strided and there are no ordering constraints on the result? Or just A+B where numpy is free to choose the memory layout for the result?</div><div><br></div><div>Anne</div></div></div>