I would like to understand how to extend the SIMD framework to add support for POWER10. Specifically, I would like to add the `lxvp` and `stxvp` instructions, which load and store 256 bits into/from a pair of vector registers. I believe this could give a decent performance boost on POWER machines, since it can halve the number of loads/stores issued.
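To make the load-count argument concrete, here is a rough, portable C sketch. It does not use real POWER10 intrinsics: the types `vec128` and `vec256_pair` and the two copy functions are hypothetical stand-ins that only model a 128-bit access versus a 256-bit paired access, and it assumes the buffer length is a multiple of 32 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for a 128-bit vector register and a
 * 256-bit register pair. These are NOT the real VSX types. */
typedef struct { uint8_t b[16]; } vec128;
typedef struct { vec128 lo, hi; } vec256_pair;

/* Copy nbytes using one modeled 128-bit load/store per iteration.
 * Returns the number of load operations issued. */
size_t copy_with_single_loads(uint8_t *dst, const uint8_t *src, size_t nbytes)
{
    size_t ops = 0;
    for (size_t i = 0; i + 16 <= nbytes; i += 16) {
        vec128 v;
        memcpy(v.b, src + i, 16);   /* models one 128-bit load  */
        memcpy(dst + i, v.b, 16);   /* models one 128-bit store */
        ops++;
    }
    return ops;
}

/* Copy nbytes using one modeled 256-bit paired load/store per
 * iteration (the role lxvp/stxvp would play). Returns the number
 * of load operations issued. */
size_t copy_with_paired_loads(uint8_t *dst, const uint8_t *src, size_t nbytes)
{
    size_t ops = 0;
    for (size_t i = 0; i + 32 <= nbytes; i += 32) {
        vec256_pair v;
        memcpy(v.lo.b, src + i, 16);        /* both halves count as */
        memcpy(v.hi.b, src + i + 16, 16);   /* one paired load      */
        memcpy(dst + i, v.lo.b, 16);
        memcpy(dst + i + 16, v.hi.b, 16);
        ops++;
    }
    return ops;
}
```

For a 64-byte buffer, the first function issues four modeled loads and the second issues two, while both produce the same copy. Whether that translates into a real speedup would of course depend on the kernel and the hardware.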
Additionally, matrix engines (2-D SIMD instructions) are becoming quite popular due to the performance improvements they offer for deep learning and scientific computing. Would it be beneficial to add these new advanced SIMD instructions to the framework, or should they be left to libraries such as OpenBLAS and MKL?
Thank you,
Nicholai Tukanov