Introducing Arm Optimized Routines
Hello,

Here at Arm, we've been investigating how we can improve performance on AArch64. One way we can do this is by integrating some existing optimized routines (https://github.com/ARM-software/optimized-routines), similar to the SVML methods for AVX512 that are currently included as a git submodule. Our intent is to include the optimized-routines repository as an additional submodule, which we can then use to provide routines on AArch64 for ASIMD, SVE and beyond.

Currently, we're targeting 4-ULP accuracy, as this aligns with libmvec (https://sourceware.org/glibc/wiki/libmvec) and the SVML integration (https://github.com/numpy/numpy/pull/19478). Alongside this, we plan to add sufficient error handling to pass the NumPy test suite, meeting the test requirements highlighted in the SVML integration (https://github.com/numpy/numpy/pull/19478#issuecomment-893001722).

We've already started curating the necessary functions; let us know if you have any feedback.

Cheers,
Chris
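To make the 4-ULP target concrete: it means a result may differ from the infinitely precise value by at most four units in the last place of the result's floating-point format. A minimal, self-contained sketch of how such an error can be measured is below; sinf_under_test is a placeholder (here just libm's sinf) standing in for one lane of a routine from the optimized-routines library, and the measurement itself is deliberately rough.

    /* Minimal sketch of a ULP-error measurement for a float32 routine.
     * Compile with: cc ulp_check.c -lm */
    #include <math.h>
    #include <stdio.h>

    /* Placeholder for the routine under test (one lane of a vector sinf). */
    static float sinf_under_test(float x) { return sinf(x); }

    /* Error of `got` relative to the double-precision reference `want`,
     * expressed in float32 units-in-the-last-place at the reference value.
     * Rough: ignores subtleties at exponent boundaries. */
    static double ulp_error(float got, double want)
    {
        float w = (float)want;                        /* reference rounded to float32 */
        double one_ulp = nextafterf(w, INFINITY) - w; /* float32 spacing at w */
        return fabs((double)got - want) / one_ulp;
    }

    int main(void)
    {
        double max_err = 0.0;
        for (double x = -100.0; x < 100.0; x += 1e-3) {
            float xf = (float)x;
            double err = ulp_error(sinf_under_test(xf), sin((double)xf));
            if (err > max_err)
                max_err = err;
        }
        printf("max error: %.2f ULP (target: <= 4 ULP)\n", max_err);
        return 0;
    }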
Thanks, this seems like it would improve performance on AArch64. Would the routines also work with Apple silicon (arm64)? If these are new routines, it would be better to implement them in terms of the NumPy universal intrinsics rather than adding a new submodule.

Matti
Hi Matti,

Thanks for your questions :-)

> This seems like it would improve performance on aarch64. Would the routines also work with the Apple silicon?

Yip, I can't see a reason why that wouldn't be the case.

> If these are new routines, it would be better to implement them in terms of the numpy universal intrinsics rather than adding a new submodule.

These would be the same routines as seen in SVML (integrated here: https://github.com/numpy/numpy/blob/main/numpy/core/src/umath/loops_umath_fp...), which use the universal intrinsics before using the SVML library. The actual surface area is minimal, so I'd propose we follow a similar path with our existing routines and then aim to apply universal intrinsics in the future, if that's possible. Does that sound like a good approach?

Cheers,
Chris
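For context on that "similar path": in the SVML case the inner loop simply hands contiguous chunks to the library's vector functions and handles the remainder with scalar code. A heavily simplified sketch of what an AArch64 equivalent might look like is below; the symbol name _ZGVnN4v_sinf (the AArch64 vector-function-ABI name for a 4-lane float32 sine) is an assumption about what the optimized-routines build exports, and the loop omits NumPy's stride handling and CPU-feature dispatch.

    /* Hedged sketch of an SVML-style hook for AArch64; not the code from
     * the eventual PR. AArch64-only (uses NEON types). */
    #include <arm_neon.h>
    #include <math.h>
    #include <stddef.h>

    /* Assumed export from the optimized-routines library: 4 x float32 sine
     * following the AArch64 vector-function ABI naming. */
    float32x4_t _ZGVnN4v_sinf(float32x4_t x);

    /* Simplified stand-in for a NumPy unary inner loop over contiguous data. */
    static void sinf_contig_loop(const float *src, float *dst, size_t n)
    {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            vst1q_f32(dst + i, _ZGVnN4v_sinf(vld1q_f32(src + i)));
        }
        for (; i < n; i++) {          /* scalar tail */
            dst[i] = sinf(src[i]);
        }
    }

In the real loops_umath_fp dispatch this kind of loop sits behind the CPU-feature and stride checks, which is why the surface area stays small.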
Yes, if the routines already exist then an additional submodule of code would seem to be the best path forward, as long as the license is compatible.

Matti
Hello again :-)

Just as an update for the list, the first PR has now been raised to integrate Optimized Routines, demonstrating the performance improvements (sometimes 2x faster): https://github.com/numpy/numpy/pull/23171

Once we've achieved the initial milestone of getting these routines integrated and the performance improved, it would be interesting to understand what's required to translate them into universal intrinsics. I notice that SVE support (https://github.com/numpy/numpy/pull/22265) isn't quite ready for universal intrinsics, which leads me to believe we would need to use the library there either way.

Cheers,
Chris
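On what "translate them into universal intrinsics" would involve: the kernels would need to be rewritten against NumPy's target-agnostic npyv_* API, so that one source builds for NEON, AVX, VSX and so on. A minimal sketch of that shape is below; it only compiles inside the NumPy source tree (simd/simd.h is an internal header), and the kernel body is a trivial 2*x + 1 placeholder where a real routine would do range reduction and polynomial evaluation with the same primitives. As noted above, there is no SVE target in that framework yet, so an SVE path would still have to call the library directly.

    /* Hedged sketch of a loop written with NumPy's universal intrinsics.
     * Only builds inside the NumPy source tree; demo_kernel and its
     * 2*x + 1 body are placeholders, not a real math routine. */
    #include <stddef.h>
    #include "simd/simd.h"   /* NumPy's internal universal-intrinsics header */

    #if NPY_SIMD             /* compiled out on targets with no SIMD support */
    static void demo_kernel(const float *src, float *dst, size_t n)
    {
        const int vstep = npyv_nlanes_f32;           /* lanes per vector on this target */
        const npyv_f32 two = npyv_setall_f32(2.0f);
        const npyv_f32 one = npyv_setall_f32(1.0f);
        size_t i = 0;
        for (; i + vstep <= n; i += vstep) {
            npyv_f32 x = npyv_load_f32(src + i);
            npyv_f32 r = npyv_add_f32(npyv_mul_f32(x, two), one);
            npyv_store_f32(dst + i, r);
        }
        for (; i < n; i++) {                         /* scalar tail */
            dst[i] = 2.0f * src[i] + 1.0f;
        }
        npyv_cleanup();
    }
    #endif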
participants (3)
- Chris Sidebottom
- Chris Sidebottom
- Matti Picus