Hello everyone,
A few weeks back I proposed an optimized einsum routine that attempts to contract numpy arrays together in an optimal order. This can greatly reduce the scaling and overall cost of an einsum expression, at the cost of building a few intermediate arrays. The current version (and more details) can be found here:
https://github.com/dgasmith/opt_einsum
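To make the scaling argument concrete, here is a small pure-numpy sketch (the sizes and the three-term chain are just an example, not taken from the package): evaluating the whole expression in one einsum call loops over all four indices at once (~N^4 work), while contracting pairwise builds one N x N intermediate but only costs two ~N^3 matrix products.

import numpy as np

N = 60
A = np.random.rand(N, N)
B = np.random.rand(N, N)
C = np.random.rand(N, N)

# Naive: a single four-index loop, roughly N**4 operations.
naive = np.einsum('ij,jk,kl->il', A, B, C)

# Factored: one N*N intermediate, two roughly N**3 matrix products.
tmp = np.einsum('ij,jk->ik', A, B)
paired = np.einsum('ik,kl->il', tmp, C)

print(np.allclose(naive, paired))  # True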

I think this routine is close to its final form, but there are two open problems I would like the community to weigh in on:

Memory usage- 
Currently the routine only considers the maximum size of the intermediates created, rather than cumulative memory usage. If we only use np.einsum it is straightforward to calculate cumulative memory usage, as einsum does not require any copies of the inputs to be made; however, if we attempt to use a vendor BLAS the memory usage can become quite complex. For example, the np.tensordot routine always forces a copy for ndarrays because it uses ndarray.transpose(…).reshape(…). A more memory-conscious approach is to first try ndarray.reshape(…).T, which does not copy the data; numpy can then simply pass a transpose flag to the vendor BLAS (a small sketch of this distinction is given below). The caveat is that the summed indices must already be in the correct order; if not, a copy is required. The maximum array size is usually close to the total memory overhead of the opt_einsum function, but the overhead can occasionally be 2-5 times this size. I see the following ways forward:
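As an aside, here is a minimal sketch of the copy/no-copy distinction above, using an arbitrary (2, 3, 4) operand. In the first case the summed axis sits in the middle, which is the situation where the transpose-then-reshape step has to copy:

import numpy as np

a = np.random.rand(2, 3, 4)

# tensordot-style preparation when summing over the middle axis:
# move the summed axis to the end, then reshape to a matrix. The
# transposed view is no longer compatible with a no-copy reshape,
# so the data gets copied.
mat = a.transpose(0, 2, 1).reshape(8, 3)
print(np.may_share_memory(a, mat))    # False -> a copy was made

# If the summed index already sits last, reshape followed by .T stays
# a view; the result is simply Fortran-ordered, and a vendor BLAS can
# be told to transpose it via a flag instead of copying.
mat2 = a.reshape(6, 4).T
print(np.may_share_memory(a, mat2))   # True  -> no copy
print(mat2.flags['F_CONTIGUOUS'])     # True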

Path algorithms-
There are two path algorithms: “optimal” (a brute-force search that scales like N!) and “opportunistic” (a greedy algorithm that scales like N^3). The optimal path can take seconds to minutes to compute for a 7-10 term expression, while the opportunistic path takes microseconds even for 20+ term expressions. The opportunistic algorithm works surprisingly well and appears to find the correct scaling in every test case I can think of. Both algorithms use the maximum array size as a sieve, which is beneficial in several respects (a rough sketch of the greedy idea is given below). The problem occurs when a much-needed intermediate cannot be created because of this limit; on occasion, skipping that intermediate slows the contraction down by orders of magnitude, even for small systems. This leads to certain (and sometimes unexpected) performance walls. Possible solutions:
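For anyone curious what “greedy” means here, below is a very rough sketch of the idea; it is not the actual opt_einsum code, and greedy_path, dim_sizes and memory_limit are made-up names for the example. At each step, pick the pairwise contraction whose intermediate is smallest, and reject any pair whose intermediate would exceed the sieve.

def greedy_path(terms, output, dim_sizes, memory_limit):
    # terms: list of index strings, e.g. ['ij', 'jk', 'kl']
    # dim_sizes: dict mapping each index letter to its dimension size
    terms = list(terms)
    path = []
    while len(terms) > 1:
        best = None
        for i in range(len(terms)):
            for j in range(i + 1, len(terms)):
                # Keep any index that still appears in the output or in
                # another remaining term; the rest are summed away.
                others = set(output).union(
                    *(terms[k] for k in range(len(terms)) if k not in (i, j)))
                kept = (set(terms[i]) | set(terms[j])) & others
                size = 1
                for idx in kept:
                    size *= dim_sizes[idx]
                if size > memory_limit:
                    continue  # the sieve: reject oversized intermediates
                if best is None or size < best[0]:
                    best = (size, i, j, ''.join(sorted(kept)))
        if best is None:
            break  # nothing fits under the limit; stop here
        size, i, j, new_term = best
        path.append((i, j))
        terms = [t for k, t in enumerate(terms) if k not in (i, j)]
        terms.append(new_term)
    return path

dims = dict(i=30, j=30, k=30, l=30)
print(greedy_path(['ij', 'jk', 'kl'], 'il', dims, memory_limit=10**8))
# -> [(0, 1), (0, 1)]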

Thank you for your time,
-Daniel Smith