[Numpy-discussion] Optimizing numpy's einsum expression (again)

Eelco Hoogendoorn hoogendoorn.eelco at gmail.com
Fri Jan 16 04:54:32 EST 2015


Thanks for taking the time to think about this; good work.

Personally, I don't think a factor 5 memory overhead is much to sweat over.
The most complex einsum I have ever needed in a production environment was
5 or 6 terms, and for what this anecdote is worth, speed was a far
bigger concern to me than memory.

On Fri, Jan 16, 2015 at 12:30 AM, Daniel Smith <dgasmith at icloud.com> wrote:

> Hello everyone,
> I originally brought an optimized einsum routine forward a few weeks back
> that attempts to contract numpy arrays together in an optimal way. This can
> greatly reduce the scaling and overall cost of the einsum expression for
> the cost of a few intermediate arrays. The current version (and more
> details) can be found here:
> https://github.com/dgasmith/opt_einsum
>
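As a quick illustration of the idea (the shapes below are arbitrary and
chosen only to keep the example small), breaking a single einsum expression
into pairwise contractions trades a small intermediate array for much better
scaling:

    import numpy as np

    n = 60
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    c = np.random.rand(n, n)

    # One call: plain np.einsum evaluates the whole three-term expression
    # as a single nested loop, which scales like n**4 for this contraction.
    d_naive = np.einsum('ij,jk,kl->il', a, b, c)

    # Hand-built two-step path: each pairwise contraction is an ordinary
    # matrix product (n**3 work), at the cost of one n**2 intermediate.
    tmp = np.einsum('ij,jk->ik', a, b)
    d_path = np.einsum('ik,kl->il', tmp, c)

    print(np.allclose(d_naive, d_path))   # True: same result, better scaling
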
> I think this routine is close to a finalized version, but there are two
> problems which I would like the community to weigh in on:
>
> Memory usage-
> Currently the routine only considers the maximum size of intermediates
> created rather than cumulative memory usage. If we only use np.einsum it is
> straightforward to calculate cumulative memory usage as einsum does not
> require any copies of inputs to be made; however, if we attempt to use a
> vendor BLAS the memory usage can become quite complex. For example, the
> np.tensordot routine always forces a copy for ndarrays because it uses
> ndarray.transpose(…).reshape(…). A more memory-conscious way to do this is
> to first try ndarray.reshape(…).T, which does not copy the data, so numpy
> can simply pass a transpose flag to the vendor BLAS. The caveat here is
> that the summed indices must be in the correct order; if not, a copy is
> required. The maximum array size is usually close to the total memory
> overhead of the opt_einsum function, but the total overhead can
> occasionally be 2-5 times the maximum array size. I see the
> following ways forward:
>
>    - Ignore cumulative memory and stick with maximum array size as the
>    limiting memory factor.
>    - Implement logic to figure out if the input arrays need to be copied
>    to use BLAS, compute the extra memory required, and add an extra dimension
>    to the optimization algorithms (use BLAS or do not use BLAS at each step).
>    Some of this is already done, but may get fairly complex.
>    - Build an in-place nd-transpose algorithm.
>    - Cut out BLAS entirely, keeping in mind that a vendor BLAS can be
>    orders of magnitude faster than a pure einsum implementation, especially
>    if BLAS threading is used.
>
>
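To make the copy issue concrete, here is a minimal numpy snippet (the shapes
are made up purely for illustration) contrasting the transpose-then-reshape
pattern with the reshape-then-transpose alternative described above:

    import numpy as np

    a = np.random.rand(30, 40, 50)

    # tensordot-style pattern: transpose first, then flatten to 2-D.  The
    # transposed view is no longer contiguous, so this reshape must copy.
    copied = a.transpose(0, 2, 1).reshape(30 * 50, 40)
    print(np.may_share_memory(copied, a))   # False -> the data were copied

    # Memory-conscious alternative: reshape the contiguous original (a view),
    # then take .T (also a view).  The transposition can then be handed to
    # the BLAS as a flag instead of being materialized in memory.
    view = a.reshape(30 * 40, 50).T
    print(np.may_share_memory(view, a))     # True -> no copy was made
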
> Path algorithms-
> There are two path algorithms: “optimal” (a brute-force algorithm that
> scales like N!) and “opportunistic” (a greedy algorithm that scales like
> N^3). The optimal
> path can take seconds to minutes to calculate for a 7-10 term expression
> while the opportunistic path takes microseconds even for 20+ term
> expressions. The opportunistic algorithm works surprisingly well and
> appears to obtain the correct scaling in all test cases that I can think
> of. Both algorithms use the maximum array size as a sieve, which is very
> beneficial in several respects. A problem occurs when a much-needed
> intermediate cannot be created because of this limit; on occasion, not
> forming this intermediate can cause slowdowns of orders of magnitude
> even for small
> systems. This leads to certain (and sometimes unexpected) performance
> walls. Possible solutions:
>
>    - Warn the user if the ratio between an unlimited memory solution and
>    a limited memory solution becomes large.
>    - Do not worry about it.
>
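For readers curious what a greedy, size-sieved path search can look like,
here is a toy sketch (written only for illustration here, not the actual
code in opt_einsum; the cost model is deliberately crude):

    from itertools import combinations
    from functools import reduce
    from operator import mul

    def greedy_path(inputs, output, dims, max_size):
        """Choose pairwise contractions greedily, skipping any whose
        intermediate would exceed max_size (the memory sieve)."""
        inputs = list(inputs)
        path = []
        while len(inputs) > 1:
            best = None
            for i, j in combinations(range(len(inputs)), 2):
                others = [s for k, s in enumerate(inputs) if k not in (i, j)]
                # Keep an index only if it still appears in the output or
                # in some other remaining operand.
                needed = set(output).union(*others)
                keep = ''.join(sorted(set(inputs[i] + inputs[j]) & needed))
                size = reduce(mul, (dims[c] for c in keep), 1)
                # Crude flop estimate: product of all distinct index
                # dimensions touched by this pairwise contraction.
                cost = reduce(mul, (dims[c] for c in set(inputs[i] + inputs[j])), 1)
                if size <= max_size and (best is None or cost < best[0]):
                    best = (cost, i, j, keep)
            if best is None:        # the sieve blocked every remaining pair
                break
            _, i, j, keep = best
            path.append((inputs[i], inputs[j], keep))
            inputs = [s for k, s in enumerate(inputs) if k not in (i, j)] + [keep]
        return path

    # Example: the matrix chain ij,jk,kl->il with a generous memory limit.
    print(greedy_path(['ij', 'jk', 'kl'], 'il', {c: 100 for c in 'ijkl'}, 10**6))
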
>
> Thank you for your time,
> -Daniel Smith
>
>