[Numpy-discussion] NEP 38 - Universal SIMD intrinsics

Ralf Gommers ralf.gommers at gmail.com
Tue Feb 11 00:16:44 EST 2020


On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi <einstein.edison at gmail.com>
wrote:

> —snip—
>
> > 1) Once NumPy adds the framework and initial set of Universal Intrinsic,
> if contributors want to leverage a new architecture specific SIMD
> instruction, will they be expected to add software implementation of this
> instruction for all other architectures too?
>
> In my opinion, if the instruction also exists at lower levels, then yes:
> one cannot add AVX-512 without, for example, also adding AVX-256 and
> AVX-128 and SSE*. However, I would not expect one person or team to be an
> expert in all assemblies, so intrinsics for one architecture can be
> developed independently of another.
>

I think this doesn't quite answer the question. If I understand correctly,
it's about a single instruction (e.g. one needs "VEXP2PD" and it's missing
from the supported AVX512 instructions in master). I think the answer is
yes, it needs to be added for other architectures as well. Otherwise, if
universal intrinsics are added ad-hoc and there's no guarantee that a
universal instruction is available for all main supported platforms, then
over time there won't be much that's "universal" about the framework.

This is a different question though from adding a new ufunc implementation.
I would expect accelerating ufuncs via intrinsics that are already
supported to be much more common than having to add new intrinsics. Does
that sound right?


> > 2) On whom does the burden lie to ensure that new implementations are
> benchmarked and shows benefits on every architecture? What happens if
> optimizing a ufunc leads to improving performance on one architecture and
> worsens performance on another?
>

This is slightly hard to provide a recipe for. I suspect it may take a
while before this becomes an issue, since we don't have much SIMD code to
begin with. So adding new code with benchmarks will likely show
improvements on all architectures (we should ensure benchmarks can be run
via CI, otherwise it's too onerous). And if not and it's not easily
fixable, the problematic platform could be skipped so performance there is
unchanged.

Only once there are existing universal intrinsics and they start being
tweaked will we have to be much more careful, I'd think.

Cheers,
Ralf



>
> I would look at this from a maintainability point of view. If we are
> increasing the code size by 20% for a certain ufunc, there must be a
> demonstrable 20% increase in performance on any CPU. That is to say,
> micro-optimisation will be unwelcome, and code readability will be
> preferable. Usually we ask the submitter of the PR to test the PR with a
> machine they have on hand, and I would be inclined to keep this trend of
> self-reporting. Of course, if someone else came along and reported a
> performance regression of, say, 10%, then we have increased code by 20%,
> with only a net 5% average gain in performance, and the PR will have to be
> reverted.
>
> —snip—
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>