Optimizing numpy's einsum expression (again)
Hello everyone,

I originally brought an optimized einsum routine forward a few weeks back that attempts to contract numpy arrays together in an optimal order. This can greatly reduce the scaling and overall cost of the einsum expression, at the cost of a few intermediate arrays. The current version (and more details) can be found here: https://github.com/dgasmith/opt_einsum

I think this routine is close to a finalized version, but there are two problems on which I would like the community to weigh in:

Memory usage - Currently the routine only considers the maximum size of the intermediates created, rather than cumulative memory usage. If we only use np.einsum, it is straightforward to calculate cumulative memory usage, as einsum does not require any copies of the inputs to be made; however, if we attempt to use a vendor BLAS, the memory usage can become quite complex. For example, the np.tensordot routine always forces a copy for ndarrays because it uses ndarray.transpose(…).reshape(…). A more memory-conscious approach is to first try ndarray.reshape(…).T, which does not copy the data, so numpy can simply pass a transpose flag to the vendor BLAS. The caveat is that the summed indices must already be in the correct order; if not, a copy is still required. The maximum array size is usually close to the total memory overhead of the opt_einsum function, but the true overhead can occasionally be 2-5 times this size. I see the following ways forward (a short illustration of the copy-versus-view point follows this message):

- Ignore cumulative memory and stick with maximum array size as the limiting memory factor.
- Implement logic to figure out whether the input arrays need to be copied to use BLAS, compute the extra memory required, and add an extra dimension to the optimization algorithms (use BLAS or do not use BLAS at each step). Some of this is already done, but it may get fairly complex.
- Build an in-place nd-transpose algorithm.
- Cut out BLAS entirely. Keep in mind that a vendor BLAS can be orders of magnitude faster than a pure einsum implementation, especially if BLAS threading is used.

Path algorithms - There are two algorithms: "optimal" (a brute-force algorithm that scales like N!) and "opportunistic" (a greedy algorithm that scales like N^3). The optimal path can take seconds to minutes to calculate for a 7-10 term expression, while the opportunistic path takes microseconds even for 20+ term expressions. The opportunistic algorithm works surprisingly well and appears to obtain the correct scaling in every test case I can think of. Both algorithms use the maximum array size as a sieve, which is beneficial in several respects. The problem occurs when a much-needed intermediate cannot be created because of this limit; on occasion, not building that intermediate causes slowdowns of orders of magnitude, even for small systems. This leads to certain (and sometimes unexpected) performance walls. Possible solutions:

- Warn the user if the ratio between an unlimited-memory solution and a limited-memory solution becomes large.
- Do not worry about it.

Thank you for your time,
-Daniel Smith
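The copy-versus-view point above can be checked directly in numpy. The following is a minimal sketch, not code from opt_einsum itself: the shapes and axis order are made up, and np.shares_memory (available in recent numpy releases) is used only to detect whether a new buffer was allocated.

    import numpy as np

    # Illustrative sketch: do the two axis-preparation strategies copy the data?
    a = np.arange(24.0).reshape(2, 3, 4)          # indices (i, j, k)

    # tensordot-style preparation: move the summed axes, then reshape.
    # The transpose makes the array non-contiguous, so the reshape must copy.
    tensordot_style = a.transpose(1, 2, 0).reshape(3 * 4, 2)
    print(np.shares_memory(a, tensordot_style))   # False -> a new buffer

    # If the summed indices are already contiguous and in order, reshape first
    # and then take .T: both steps are views, and the transposed layout can be
    # handed to BLAS as a transpose flag instead of a physical rearrangement.
    blas_friendly = a.reshape(2, 3 * 4).T
    print(np.shares_memory(a, blas_friendly))     # True -> no copy was made

In the second case numpy can pass the un-copied buffer to the vendor BLAS with a transpose flag, which is the memory saving discussed above; the copy is only forced when the summed indices are out of order.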
Thanks for taking the time to think about this; good work. Personally, I don't think a factor of 5 memory overhead is much to sweat over. The most complex einsum I have ever needed in a production environment was 5 or 6 terms, and for what this anecdote is worth, speed was a far bigger concern to me than memory.

On Fri, Jan 16, 2015 at 12:30 AM, Daniel Smith <dgasmith@icloud.com> wrote:
Daniel Smith <dgasmith <at> icloud.com> writes:
Hello everyone, I originally brought an optimized einsum routine
forward a few weeks back that attempts to contract numpy arrays together in an optimal way. This can greatly reduce the scaling and overall cost of the einsum expression for the cost of a few intermediate arrays. The current version (and more details) can be found here:
https://github.com/dgasmith/opt_einsum
I think this routine is close to a finalized version, but there are two problems which I would like the community to weigh in on:
Thank you for your time,
-Daniel Smith
I wasn't aware of this work, but it's very interesting to me as a user of einsum whose principal reason for using it is speed. Even though I use it on largish arrays, I'm only concerned with performance, as I'm on x64 with plenty of memory, even if it were to require temporaries 5x the original size. I don't use einsum that much, because I've noticed the performance can be very problem-dependent, so I've always profiled it to check (the sketch after this message shows one way to do that). Hopefully this work will make the performance more consistent, allowing it to be used more generally throughout my code.

Thanks,
Dave

* An anecdotal example from a user only.
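A sketch of the kind of profiling check Dave describes, written against today's numpy: it assumes a version recent enough to include np.einsum_path and the optimize= keyword of np.einsum (the form in which this opt_einsum work was later merged into numpy), and the expression and shapes are invented purely for illustration.

    import numpy as np
    from timeit import timeit

    # Invented example expression and shapes, for illustration only.
    chi = np.random.rand(10, 10, 10, 10)
    c = np.random.rand(10, 10)
    expr = 'pi,qj,ijkl,rk,sl->pqrs'
    operands = (c, c, chi, c, c)

    # Inspect the contraction path and its estimated cost before relying on it.
    path, report = np.einsum_path(expr, *operands, optimize='greedy')
    print(report)   # naive vs. optimized FLOP count and largest intermediate

    # Profile the single-shot contraction against the optimized path.
    t_naive = timeit(lambda: np.einsum(expr, *operands, optimize=False), number=1)
    t_path = timeit(lambda: np.einsum(expr, *operands, optimize=path), number=1)
    print(f"naive: {t_naive:.3f}s  optimized path: {t_path:.3f}s")

For an expression like this, the greedy path typically beats the naive single-shot contraction by orders of magnitude, which is the kind of consistency improvement Dave is hoping for.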
participants (3)
- Daniel Smith
- Dave Hirschfeld
- Eelco Hoogendoorn