Hello,
Here at Arm, we've been investigating how we can improve NumPy performance on AArch64. One way to do this is to integrate some existing optimized routines (https://github.com/ARM-software/optimized-routines), similar to the SVML methods for AVX512 that are currently included as a git submodule. Our intent is to include the optimized routines repository as an additional submodule, which we can then use to provide routines on AArch64 for ASIMD, SVE and beyond.
We're currently targeting a maximum error of 4 ULP, as this aligns with libmvec (https://sourceware.org/glibc/wiki/libmvec) and the SVML integration (https://github.com/numpy/numpy/pull/19478). Alongside this, we're adding sufficient error handling to pass the NumPy test suite, meeting the test requirements highlighted in the SVML integration (https://github.com/numpy/numpy/pull/19478#issuecomment-893001722).
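For anyone less familiar with the unit, here is a rough sketch of what a 4-ULP bound means in practice. It is illustrative only and not taken from the optimized routines or the NumPy test suite: the helper names (`ulp_distance`, `within_ulp`) are hypothetical, it assumes float64, and it measures the distance between two representable values, whereas library error bounds are normally stated against the exact mathematical result.

```python
import math
import struct


def _ordered_bits(x: float) -> int:
    """Map a float64 to an integer whose ordering matches float ordering."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative floats have the sign bit set, so the signed view is negative.
    # Reflect them so the integers increase monotonically with the value,
    # and so that -0.0 and +0.0 both map to 0.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a: float, b: float) -> int:
    """Number of representable float64 steps between a and b."""
    if math.isnan(a) or math.isnan(b):
        raise ValueError("ULP distance is undefined for NaN")
    return abs(_ordered_bits(a) - _ordered_bits(b))


def within_ulp(result: float, reference: float, max_ulp: int = 4) -> bool:
    """True if result is at most max_ulp steps away from reference."""
    return ulp_distance(result, reference) <= max_ulp
```

A check along the lines of `within_ulp(candidate_exp(x), math.exp(x))`, where `candidate_exp` stands in for whichever routine is under test, would then accept results that land within four representable values of the reference.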
We've already started curating the necessary functions. Please let us know if you have any feedback.
Cheers,
Chris