
Hi Matti,

Thanks for your questions :-)
This seems like it would improve performance on aarch64. Would the routines also work on Apple silicon?
Yip, I can't see a reason why that wouldn't be the case.
If these are new routines, it would be better to implement them in terms of the numpy universal intrinsics rather than adding a new submodule.
These would be the same routines as seen in SVML (integrated here: https://github.com/numpy/numpy/blob/main/numpy/core/src/umath/loops_umath_fp...), which use the universal intrinsics before using the SVML library. The actual surface area is minimal, so I'd propose we follow a similar path with our existing routines and then aim to apply universal intrinsics if that's possible in the future. Does that sound like a good approach?

Cheers,
Chris