Optimizing multi-tensor contractions in numpy
Hello everyone,

I have been working on a chunk of code that provides a single function that can take an arbitrary einsum expression and compute it efficiently. While np.einsum can compute arbitrary expressions, there are two drawbacks to using pure einsum: einsum does not consider building intermediate arrays for possible reductions in overall rank, and it is not currently capable of using a vendor BLAS. I have been working on a project that aims to solve both issues simultaneously:

https://github.com/dgasmith/opt_einsum

This program first builds the optimal way to contract the tensors together, or, using my own nomenclature, a “path.” This path is then iterated over, using tensordot when possible and einsum for everything else. In test cases the worst-case scenario adds a 20 microsecond overhead penalty and, in the best-case scenario, it can reduce the overall rank of the tensor. The primary (if somewhat exaggerated) example is a 5-term N^8 index transformation that can be reduced to N^5; even when N is very small (N=10) there is a 2,000-fold speed increase over pure einsum or, if using tensordot, a 2,400-fold speed increase. This is somewhat similar to the new np.linalg.multi_dot function.

If you are interested in this function, please head over to the GitHub repo and check it out. I believe the README is starting to become self-explanatory, but feel free to email me with any questions.

This originally started because I was looking into using numpy to rapidly prototype quantum chemistry codes; the results can be found here: https://github.com/dgasmith/psi4numpy

As such, I am very interested in implementing this in numpy. While I think opt_einsum is in a pretty good place, there is still quite a bit to do (see the outstanding issues in the README). Even if this is not something that would fit into numpy, I would still be very interested in your comments.

Thank you for your time,
-Daniel Smith
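To illustrate the kind of rewrite being described, here is a minimal sketch (not code from opt_einsum itself; array names and sizes are made up for illustration) of the 5-term transformation done once with pure einsum and once as pairwise tensordot calls:

```python
import numpy as np

N = 10
C = np.random.rand(N, N)          # transformation matrix
I = np.random.rand(N, N, N, N)    # rank-4 input tensor

# Pure einsum contracts all five terms in one shot, scaling as N^8.
naive = np.einsum('pi,qj,ijkl,rk,sl->pqrs', C, C, I, C, C)

# The same result via pairwise contractions: each tensordot is a
# BLAS-backed matrix multiply and no intermediate exceeds rank 4,
# so the overall cost drops to O(N^5).
t = np.tensordot(C, I, axes=(1, 0))   # contract i -> t[p, j, k, l]
t = np.tensordot(C, t, axes=(1, 1))   # contract j -> t[q, p, k, l]
t = np.tensordot(C, t, axes=(1, 2))   # contract k -> t[r, q, p, l]
t = np.tensordot(C, t, axes=(1, 3))   # contract l -> t[s, r, q, p]
fast = t.transpose(3, 2, 1, 0)        # reorder axes back to p, q, r, s

assert np.allclose(naive, fast)
```

Finding a good pairwise ordering automatically, for arbitrary expressions, is what the "path" machinery is for.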
On Fri, Jan 2, 2015 at 7:45 PM, Daniel Smith <dgasmith@icloud.com> wrote:
Sounds interesting. I wouldn't be opposed to including an optimized einsum in numpy; there has even been some mention of using BLAS. Note that cblas is used in multiarray in the current development branch, so that might be useful.

I also looked into using einsum to implement the '@' operator coming in Python 3.5, but there was a rather large fixed overhead involved in parsing the input string, and the multiplications were much slower than the numpy dot operator. If those times could be reduced, einsum might become a real possibility.

Chuck
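The fixed overhead being described can be seen with a rough microbenchmark (a sketch; the exact timings depend on the machine and numpy build):

```python
import timeit
import numpy as np

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)

# Both calls compute the same matrix product...
assert np.allclose(np.einsum('ij,jk->ik', A, B), np.dot(A, B))

# ...but for tiny arrays the runtime is dominated by per-call fixed
# costs, and einsum pays extra to parse its subscript string on
# every invocation, while dot goes straight to BLAS.
t_einsum = timeit.timeit(lambda: np.einsum('ij,jk->ik', A, B), number=10000)
t_dot = timeit.timeit(lambda: np.dot(A, B), number=10000)
print(f'einsum: {t_einsum:.4f}s  dot: {t_dot:.4f}s')
```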
On 3 Jan 2015 02:46, "Daniel Smith" <dgasmith@icloud.com> wrote:
> Hello everyone,
>
> I have been working on a chunk of code that basically sets out to provide a single function that can take an arbitrary einsum expression and computes it in the most optimal way. While np.einsum can compute arbitrary expressions, there are two drawbacks to using pure einsum: einsum does not consider building intermediate arrays for possible reductions in overall rank and is not currently capable of using a vendor BLAS. I have been working on a project that aims to solve both issues simultaneously:
>
> https://github.com/dgasmith/opt_einsum
>
> This program first builds the optimal way to contract the tensors together, or using my own nomenclature a “path.” This path is then iterated over and uses tensordot when possible and einsum for everything else. In test cases the worst case scenario adds a 20 microsecond overhead penalty and, in the best case scenario, it can reduce the overall rank of the tensor. The primary (if somewhat exaggerated) example is a 5-term N^8 index transformation that can be reduced to N^5; even when N is very small (N=10) there is a 2,000 fold speed increase over pure einsum or, if using tensordot, a 2,400 fold speed increase. This is somewhat similar to the new np.linalg.multi_dot function.

This sounds super awesome. Who *wouldn't* want a free 2,000x speed increase? And I especially like your test suite.

I'd also be interested in hearing more about the memory requirements of this approach. How does the temporary memory required typically scale with the size of the input arrays? Is there an upper bound on the temporary memory required?

> As such, I am very interested in implementing this into numpy. While I think opt_einsum is in a pretty good place, there is still quite a bit to do (see outstanding issues in the README). Even if this is not something that would fit into numpy I would still be very interested in your comments.

We would definitely be interested in integrating this functionality into numpy. After all, half the point of having an interface like einsum is that it provides a clean boundary where we can swap in complicated, sophisticated machinery without users having to care. No one wants to curate their own pile of custom optimized libraries. :-)

-n
Hello Nathaniel,
> I'd also be interested in hearing more about the memory requirements of this approach. How does the temporary memory required typically scale with the size of the input arrays? Is there an upper bound on the temporary memory required?
Currently the algorithm will not create an array larger than the largest input or output array (maximum_array_size). This gives a maximum upper bound of (number_of_terms/2 + 1) * maximum_array_size. In practice this rarely goes beyond maximum_array_size, as building large outer products is not usually helpful. The views are also dereferenced after they are used, which I believe should delete the arrays correctly; however, this is one thing I am not sure is being handled in the best way, and it could use further testing.

Figuring out cumulative memory should also be possible for the brute-force path algorithm, but I am not sure if this is possible for the faster greedy path algorithm without large changes.

Overall this sounds great. If anyone has a suggestion of where this should go, I can start working on a PR and we can work out the remaining issues there?

-Daniel Smith
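The stated bound can be written down as a small helper (the function name is hypothetical and not part of opt_einsum's API; it just restates the formula above, in array elements):

```python
def temp_memory_upper_bound(number_of_terms, maximum_array_size):
    """Worst-case temporary memory, in elements, for one contraction.

    Under the constraint that no intermediate may exceed the largest
    input/output array, at most (number_of_terms // 2 + 1) arrays of
    size maximum_array_size need to be alive at the same time.
    """
    return (number_of_terms // 2 + 1) * maximum_array_size

# Example: a 5-term contraction whose largest array holds 10**4
# elements is bounded by 3 * 10**4 temporary elements.
print(temp_memory_upper_bound(5, 10**4))  # -> 30000
```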
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
participants (3)
- Charles R Harris
- Daniel Smith
- Nathaniel Smith