Hello everyone,
A few weeks back I proposed an optimized einsum routine that attempts to contract numpy arrays together in an optimal order. This can greatly reduce the scaling and overall cost of an einsum expression, at the cost of building a few intermediate arrays. The current version (and more details) can be found here:
https://github.com/dgasmith/opt_einsum
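To make the scaling argument concrete, here is a small pure-numpy sketch (the sizes and the three-term chain are just an example, not taken from the package): evaluating the whole expression in one einsum call loops over all four indices at once (~N^4 work), while contracting pairwise builds one N x N intermediate but only costs two ~N^3 matrix products.

import numpy as np

N = 60
A = np.random.rand(N, N)
B = np.random.rand(N, N)
C = np.random.rand(N, N)

# Naive: a single four-index loop, roughly N**4 operations.
naive = np.einsum('ij,jk,kl->il', A, B, C)

# Factored: one N*N intermediate, two roughly N**3 matrix products.
tmp = np.einsum('ij,jk->ik', A, B)
paired = np.einsum('ik,kl->il', tmp, C)

print(np.allclose(naive, paired))  # True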

I think this routine is close to its final form, but there are two open problems I would like the community to weigh in on:

Memory usage- 
Currently the routine only considers the maximum size of the intermediates created, rather than cumulative memory usage. If we only use np.einsum it is straightforward to calculate cumulative memory usage, as einsum does not require any copies of the inputs to be made; however, if we attempt to use a vendor BLAS the memory usage can become quite complex. For example, the np.tensordot routine always forces a copy for ndarrays because it uses ndarray.transpose(…).reshape(…). A more memory-conscious approach is to first try ndarray.reshape(…).T, which does not copy the data; numpy can then simply pass a transpose flag to the vendor BLAS (a small sketch of this distinction is given below). The caveat is that the summed indices must already be in the correct order; if not, a copy is required. The maximum array size is usually close to the total memory overhead of the opt_einsum function, but the overhead can occasionally be 2-5 times this size. I see the following ways forward:
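As an aside, here is a minimal sketch of the copy/no-copy distinction above, using an arbitrary (2, 3, 4) operand. In the first case the summed axis sits in the middle, which is the situation where the transpose-then-reshape step has to copy:

import numpy as np

a = np.random.rand(2, 3, 4)

# tensordot-style preparation when summing over the middle axis:
# move the summed axis to the end, then reshape to a matrix. The
# transposed view is no longer compatible with a no-copy reshape,
# so the data gets copied.
mat = a.transpose(0, 2, 1).reshape(8, 3)
print(np.may_share_memory(a, mat))    # False -> a copy was made

# If the summed index already sits last, reshape followed by .T stays
# a view; the result is simply Fortran-ordered, and a vendor BLAS can
# be told to transpose it via a flag instead of copying.
mat2 = a.reshape(6, 4).T
print(np.may_share_memory(a, mat2))   # True  -> no copy
print(mat2.flags['F_CONTIGUOUS'])     # True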

Path algorithms-
There are two path algorithms: “optimal” (a brute-force search that scales like N!) and “opportunistic” (a greedy algorithm that scales like N^3). The optimal path can take seconds to minutes to compute for a 7-10 term expression, while the opportunistic path takes microseconds even for 20+ term expressions. The opportunistic algorithm works surprisingly well and appears to find the correct scaling in every test case I can think of. Both algorithms use the maximum array size as a sieve, which is beneficial in several respects (a rough sketch of the greedy idea is given below). The problem occurs when a much-needed intermediate cannot be created because of this limit; on occasion, skipping that intermediate slows the contraction down by orders of magnitude, even for small systems. This leads to certain (and sometimes unexpected) performance walls. Possible solutions:
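For anyone curious what “greedy” means here, below is a very rough sketch of the idea; it is not the actual opt_einsum code, and greedy_path, dim_sizes and memory_limit are made-up names for the example. At each step, pick the pairwise contraction whose intermediate is smallest, and reject any pair whose intermediate would exceed the sieve.

def greedy_path(terms, output, dim_sizes, memory_limit):
    # terms: list of index strings, e.g. ['ij', 'jk', 'kl']
    # dim_sizes: dict mapping each index letter to its dimension size
    terms = list(terms)
    path = []
    while len(terms) > 1:
        best = None
        for i in range(len(terms)):
            for j in range(i + 1, len(terms)):
                # Keep any index that still appears in the output or in
                # another remaining term; the rest are summed away.
                others = set(output).union(
                    *(terms[k] for k in range(len(terms)) if k not in (i, j)))
                kept = (set(terms[i]) | set(terms[j])) & others
                size = 1
                for idx in kept:
                    size *= dim_sizes[idx]
                if size > memory_limit:
                    continue  # the sieve: reject oversized intermediates
                if best is None or size < best[0]:
                    best = (size, i, j, ''.join(sorted(kept)))
        if best is None:
            break  # nothing fits under the limit; stop here
        size, i, j, new_term = best
        path.append((i, j))
        terms = [t for k, t in enumerate(terms) if k not in (i, j)]
        terms.append(new_term)
    return path

dims = dict(i=30, j=30, k=30, l=30)
print(greedy_path(['ij', 'jk', 'kl'], 'il', dims, memory_limit=10**8))
# -> [(0, 1), (0, 1)]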

Thank you for your time,
-Daniel Smith