[RFC]  numpy/SVML appears to be poorly optimized
The numpy SVML library: https://github.com/numpy/SVML appears to be poorly optimized. Since it's just the raw assembly dump, this also makes it quite difficult to improve (with either a better compiler or by hand).
Some of the glaring issues are:
1. Register allocation / spilling.
2. rodata layouts / const-propagation of the values.
3. Very odd use of internal functions that really ought to be inlined.
Are these functions meant to be heavily optimized? If so, are people open to patches that optimize them (either with new C implementations or in the current assembly implementations)?
They are meant to be optimized. Any contribution to improve them further is more than welcome.
Raghuveer
Original Message
From: Noah Goldstein <goldstein.w.n@gmail.com>
Sent: Thursday, November 4, 2021 10:46 AM
To: numpydiscussion@python.org
Subject: [Numpydiscussion] [RFC]  numpy/SVML appears to be poorly optimized
_______________________________________________
NumPyDiscussion mailing list  numpydiscussion@python.org
To unsubscribe send an email to numpydiscussionleave@python.org
https://mail.python.org/mailman3/lists/numpydiscussion.python.org/
Member address: raghuveer.devulapalli@intel.com
On Fri, Nov 5, 2021 at 1:38 PM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:
They are meant to be optimized. Any contribution to improve them further is more than welcome.
Fantastic. I don't see any tests for any of the functions in there. Does anyone know where I can find them?
On Sat, Nov 6, 2021 at 1:18 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
Fantastic. I don't see any tests for any of the functions in there. Does anyone know where I can find them?
Use the main NumPy test suite by updating the svml submodule to the commit with your changes, then run the test suite the regular way (e.g. `python runtests.py`).
Cheers,
Ralf
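The workflow described above could be scripted roughly as follows. This is a minimal sketch, not part of NumPy's tooling: the submodule path `numpy/core/src/umath/svml`, the helper names `svml_test_commands` and `run`, and the placeholder commit are assumptions made for illustration, so check `.gitmodules` in your NumPy checkout for the actual path. It also assumes the commit with your changes already exists in the submodule's local clone. By default the script only prints the commands instead of running them.

```python
import shlex
import subprocess

# Assumed location of the SVML submodule inside a NumPy checkout;
# verify it against .gitmodules before relying on it.
SVML_SUBMODULE = "numpy/core/src/umath/svml"


def svml_test_commands(commit, submodule=SVML_SUBMODULE):
    """Return the commands that point the submodule at `commit` and run the tests."""
    return [
        # Make sure the submodule is checked out at all.
        ["git", "submodule", "update", "--init", submodule],
        # Switch the submodule to the commit containing your SVML changes
        # (assumed to be present in the submodule's local clone already).
        ["git", "-C", submodule, "checkout", commit],
        # Run the NumPy test suite the regular way.
        ["python", "runtests.py"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "<your-commit>" is a placeholder, not a real revision.
    run(svml_test_commands("<your-commit>"))
```

Run from the root of a NumPy checkout, calling `run(..., dry_run=False)` would execute the three steps in order and stop at the first failure.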
appears to be poorly optimized.
It should perform well; it is neither poorly nor heavily optimized.
this also makes it quite difficult to improve (with either a better compiler or by hand).
We can blame Intel for not sharing their source code, but honestly, it seems we had no option other than to accept what they provide.
Some of the glaring issues are: 1. register allocation / spilling 2. rodata layouts / const-propagation of the values. 3. Very odd use of internal functions that really ought to be inlined.
Let me add another two points to your list:
- It only works on Linux.
- It only works with AVX512.
If so, are people open to patches that optimize them (either with new C implementations or in the current assembly implementations).
Hopefully, we will be able to convert them to universal intrinsics (NEP 38) one day. As a member of the team, I will try to put more time into it.
Thanks,
Sayed.
On 6/11/21 6:56 pm, Sayed Adel wrote:
Hopefully, we will be able to convert them to universal intrinsics (NEP 38) one day. As a member of the team, I will try to put more time into it.
Thanks, Sayed.
Note the benchmarks on Sayed's PR [0] to move tanh to universal intrinsics. It not only supplies the routines for all platforms supported by universal intrinsics, it also slightly increases performance on AVX512 (the usual disclaimers about the dangers of comparing benchmarks apply).
Matti
[0] https://github.com/numpy/numpy/pull/20363
participants (5)

Devulapalli, Raghuveer

Matti Picus

Noah Goldstein

Ralf Gommers

Sayed Adel