On Wed, Feb 12, 2020 at 1:37 PM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:
>> I hope there will not be a demand to use many non-universal intrinsics in ufuncs, we will need to work this out on a case-by-case basis in each ufunc. Does that sound reasonable? Are there intrinsics you have already used that have no parallel on other platforms?

I think that is reasonable. It's hard to anticipate the future need for and benefit of specialized intrinsics, but I tried to make a list of some of the specialized intrinsics currently used in NumPy that I don't believe exist on other platforms (most of these actually don't exist on AVX2 either). I am not an expert in ARM or VSX architecture, so please correct me if I am wrong.

a. _mm512_mask_i32gather_ps
b. _mm512_mask_i32scatter_ps/_mm512_mask_i32scatter_pd
c. _mm512_maskz_loadu_pd/_mm512_maskz_loadu_ps
d. _mm512_getexp_ps
e. _mm512_getmant_ps
f. _mm512_scalef_ps
g. _mm512_permutex2var_ps, _mm512_permutex2var_pd
h. _mm512_maskz_div_ps, _mm512_maskz_div_pd
i. _mm512_permute_ps/_mm512_permute_pd
j. _mm512_sqrt_ps/_mm512_sqrt_pd (I could be wrong on this one, but from the little searching I did, it seems like the POWER ISA doesn't have a vectorized sqrt instruction)

Software implementations of these instructions are definitely possible. But some of them are not trivial to implement and are surely not going to be one-line macros either. I am also unsure of what implications this has on performance, but we will hopefully find out once we convert these to universal intrinsics and then benchmark.

For these it seems like we don't want software implementations of the universal intrinsics: if there's no equivalent on PPC/ARM and there's enough value (performance gain justifying the additional code complexity) in the extra AVX-512 instructions, then we should simply use the AVX-512 intrinsics directly.

Ralf


Raghuveer

-----Original Message-----
From: NumPy-Discussion <numpy-discussion-bounces+raghuveer.devulapalli=intel.com@python.org> On Behalf Of Matti Picus
Sent: Tuesday, February 11, 2020 11:19 PM
To: numpy-discussion@python.org
Subject: Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics

On 11/2/20 8:02 pm, Devulapalli, Raghuveer wrote:
>
> On top of that the performance implications aren’t clear. Software
> implementations of hardware instructions might perform worse and might
> not even produce the same result.
>

The proposal for universal intrinsics does not enable replacing an intrinsic on one platform with a software emulation on another: the intrinsics are meant to be compile-time defines that overlay the universal intrinsic with a platform-specific one. In order to use a new intrinsic, it must have parallel intrinsics on the other platforms, or it cannot be used there: "NPY_CPU_HAVE(FEATURE_NAME)" will always return false, so the compiler will not even build a loop for that platform. I will try to clarify that intention in the NEP.


I hope there will not be a demand to use many non-universal intrinsics in ufuncs; we will need to work this out on a case-by-case basis in each ufunc. Does that sound reasonable? Are there intrinsics you have already used that have no parallel on other platforms?


Matti

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion