Together with Sayed Adel (cc) and Ralf, I am pleased to put the draft version of NEP 38 [0] up for discussion. As per NEP 0, this is the next step in the community accepting the approach laid out in the NEP. The NEP PR [1] has already garnered a fair amount of discussion about the viability of Universal SIMD Intrinsics, so I will try to capture some of that here as well.

Abstract

While compilers are getting better at using hardware-specific routines to optimize code, they sometimes do not produce optimal results. Also, we would like to be able to copy binary optimized C-extension modules from one machine to another with the same base architecture (x86, ARM, PowerPC) but with different capabilities, without recompiling. We have a mechanism in the ufunc machinery to build alternative loops indexed by CPU feature name. At import (in InitOperators), the loop function that matches the run-time CPU info is chosen from the candidates. This NEP proposes a mechanism to build on that for many more features and architectures. The steps proposed are to:

- Establish a set of well-defined, architecture-agnostic, universal intrinsics which capture features available across architectures.
- Capture these universal intrinsics in a set of C macros and use the macros to build code paths for sets of features, from the baseline up to the maximum set of features available on that architecture. Offer these as a limited number of compiled alternative code paths.
- At runtime, discover which CPU features are available, and choose from among the possible code paths accordingly.

Motivation and Scope

Traditionally NumPy has counted on compilers to generate optimal code specifically for the target architecture. However, few users today compile NumPy locally for their machines. Most use binary packages, which must provide run-time support for the lowest-common-denominator CPU architecture. Thus NumPy cannot take advantage of more advanced CPU features, since they may not be available on all users' systems. The ufunc machinery already has a loop-selection protocol based on dtypes, so it is easy to extend this to also select an optimal loop for the CPU features available at runtime.

Traditionally, these features have been exposed through intrinsics, which are compiler-specific instructions that map directly to assembly instructions. Recently there were discussions about the effectiveness of adding more intrinsics (e.g., gh-11113 for AVX optimizations for floats). In the past, architecture-specific code was added to NumPy for fast avx512 routines in various ufuncs, using the mechanism described above to choose the best loop for the architecture. However, that code is not generic and does not generalize to other architectures. Recently, OpenCV moved to using universal intrinsics in its Hardware Abstraction Layer (HAL), which provides a nice abstraction for common shared Single Instruction Multiple Data (SIMD) constructs. This NEP proposes a similar mechanism for NumPy. There are three stages to using the mechanism:

- Infrastructure is provided in the code for abstract intrinsics. The ufunc machinery will be extended using sets of these abstract intrinsics, so that a single ufunc will be expressed as a set of loops, going from a minimal to a maximal set of possibly available intrinsics.
- At compile time, compiler macros and CPU detection are used to turn the abstract intrinsics into concrete intrinsic calls. Any intrinsic not available on the platform, either because the CPU does not support it (and so it cannot be tested) or because the abstract intrinsic has no parallel concrete intrinsic on the platform, will not cause an error; rather, the corresponding loop will simply not be produced and added to the set of possibilities.
- At runtime, the CPU detection code will further limit the set of loops available, and the optimal one will be chosen for the ufunc.

The current NEP proposes to use the runtime feature detection and optimal loop selection mechanism only for ufuncs. Future NEPs may propose other uses for the proposed solution.

Usage and Impact

The end user will be able to get a list of intrinsics available for their platform and compiler. Optionally, the user may be able to specify which of the loops available at runtime will be used, perhaps via an environment variable, to enable benchmarking the impact of the different loops. There should be no direct impact on naive end users: the results of all the loops should be identical to within a small number (1-3?) of ULPs. On the other hand, users with more powerful machines should notice a significant performance boost.

Binary releases - wheels on PyPI and conda packages

The binaries released by this process will be larger, since they include all possible loops for the architecture. Some packagers may prefer to limit the number of loops in order to limit the size of the binaries; we would hope they would still support a wide range of architecture families. Note that this problem already exists in the Intel MKL offering, where the binary package includes an extensive set of alternative shared objects (DLLs) for various CPU alternatives.

Source builds

See "Detailed Description" below. A source build where the packager knows the details of the target machine could theoretically produce a smaller binary by choosing, via command line arguments, to compile only the loops needed by the target.

How to run benchmarks to assess performance benefits

Adding more code which uses intrinsics will make the code harder to maintain. Therefore, such code should only be added if it yields a significant performance benefit. Assessing this performance benefit can be nontrivial. To aid with this, the implementation for this NEP will add a way to select which instruction sets can be used at runtime via environment variables (name TBD). This ability is critical for CI code verification.

Diagnostics

A new dictionary __cpu_features__ will be available to Python. The keys are the available features; each value is a boolean indicating whether the feature is available or not. Various new private C functions will be used internally to query available features. These might be exposed via specific c-extension modules for testing.

Workflow for adding a new CPU architecture-specific optimization

NumPy will always have a baseline C implementation for any code that may be a candidate for SIMD vectorization. If a contributor wants to add SIMD support for some architecture (typically the one of most interest to them), this is the proposed workflow: TODO (see https://github.com/numpy/numpy/pull/13516#issuecomment-558859638, needs to be worked out more)

Reuse by other projects

It would be nice if the universal intrinsics were available to other libraries like SciPy or Astropy that also build ufuncs, but that is not an explicit goal of the first implementation of this NEP.
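To make the mechanism concrete, here is a minimal, self-contained sketch of the dispatch idea: several loop variants are compiled, and the best one the running CPU supports is picked once at startup. All names here are illustrative, not the NEP's actual API; the CPU query uses a gcc/clang builtin as a stand-in for the NEP's detection code.

    #include <stdio.h>

    typedef void (*loop_fn)(const float *src, float *dst, int n);

    /* Baseline loop: plain C, always available. */
    static void loop_baseline(const float *src, float *dst, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] + src[i];
    }

    #if defined(__AVX2__)
    #include <immintrin.h>
    /* AVX2 variant: 8 floats per iteration, scalar tail. */
    static void loop_avx2(const float *src, float *dst, int n) {
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 v = _mm256_loadu_ps(src + i);
            _mm256_storeu_ps(dst + i, _mm256_add_ps(v, v));
        }
        for (; i < n; i++)
            dst[i] = src[i] + src[i];
    }
    #endif

    int main(void) {
        loop_fn loop = loop_baseline;          /* minimal candidate */
    #if defined(__AVX2__)
        if (__builtin_cpu_supports("avx2"))    /* runtime CPU detection */
            loop = loop_avx2;                  /* best candidate this CPU supports */
    #endif
        float src[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, dst[10];
        loop(src, dst, 10);
        printf("%g ... %g\n", dst[0], dst[9]);
        return 0;
    }

This mirrors the existing InitOperators behaviour described in the abstract, extended to many more feature sets and architectures.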
-----------------------------------------------------------------------------------

My biased summary of select comments from the PR:

(Raghuveer): A very similar SIMD library has been proposed for C++. Here are the links to the details:

1. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r8.pdf
2. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/n4808.pdf

There is good discussion there on the minimal/common set of instructions across architectures (which narrows down to loads, stores, arithmetic, compare, bitwise and shuffle instructions). Based on my developer experience so far, these instructions aren't by themselves enough to implement and optimize NumPy ufuncs. As I pointed out earlier, I think I would find it useful to learn the workflow of how to use instructions that don't fit in the Universal Intrinsic framework.

(Raghuveer) gave a well laid out table of the currently proposed universal intrinsics by use: load/store, reorder, operators, conversions, arithmetic and misc [2], which led to a long response from Sayed [3] with some sample code demonstrating how more complex operations can be built up from the primitives.

(catree) mentioned the Simd Library [4] and Halide [5] and asked about maintainability.

(Ralf) responded [6] with concerns about competent developer bandwidth for code review. He also mentioned that our CI system currently supports all the architectures we are targeting (x86, aarch64, s390x, ppc64le), although some of these machines may not have the most advanced hardware to support the latest intrinsics.

I apologize if my summary is not accurate; please correct any mistakes or misconceptions.

----------------------------------------------------------------------------------------

Barring complete rejection of the idea here, we will be pushing forward with PRs to implement this. Comments either on the mailing list or in those PRs are welcome.

Matti

[0] https://numpy.org/neps/nep-0038-SIMD-optimizations.html
[1] https://github.com/numpy/numpy/pull/15228
[2] https://github.com/numpy/numpy/pull/15228#issuecomment-580479336
[3] https://github.com/numpy/numpy/pull/15228#issuecomment-580605718
[4] https://github.com/ermig1979/Simd
[5] https://halide-lang.org
[6] https://github.com/numpy/numpy/pull/15228#issuecomment-581029991
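As a small illustration of the composition point from [3] (the sample code there is far more extensive): an operation with no dedicated instruction, such as absolute value, can often be built from a bitwise primitive. The sketch below uses concrete SSE intrinsics, since the universal spellings are not yet settled.

    #include <stdio.h>
    #include <emmintrin.h>   /* SSE2 */

    /* |x| for 4 packed floats: clear the IEEE sign bit with a bitwise and.
     * A universal-intrinsics version would spell the load/and/store with
     * portable macro names instead of _mm_* calls. */
    static __m128 abs_f32x4(__m128 x) {
        const __m128 sign_mask = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
        return _mm_and_ps(x, sign_mask);
    }

    int main(void) {
        float in[4] = {-1.5f, 2.0f, -0.0f, 3.25f}, out[4];
        _mm_storeu_ps(out, abs_f32x4(_mm_loadu_ps(in)));
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;
    }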
On 04-02-2020 08:08, Matti Picus wrote:
Together with Sayed Adel (cc) and Ralf, I am pleased to put the draft version of NEP 38 [0] up for discussion. As per NEP 0, this is the next step in the community accepting the approach laid out in the NEP. The NEP PR [1] has already garnered a fair amount of discussion about the viability of Universal SIMD Intrinsics, so I will try to capture some of that here as well.
Hello,

more interesting prior art may be found in VOLK https://www.libvolk.org. VOLK is developed mainly to be used in GNU Radio, and this is reflected in the available kernels and in the supported data types. I think the approach used there may be of interest.

Cheers,
Dan
Hi everyone,

I know I had raised these questions in the PR, but wanted to post them on the mailing list as well.

1) Once NumPy adds the framework and initial set of Universal Intrinsics, if contributors want to leverage a new architecture-specific SIMD instruction, will they be expected to add a software implementation of this instruction for all other architectures too?

2) On whom does the burden lie to ensure that new implementations are benchmarked and show benefits on every architecture? What happens if optimizing a ufunc improves performance on one architecture and worsens it on another?

Thanks,
Raghuveer

-----Original Message-----
From: NumPy-Discussion <numpy-discussion-bounces+raghuveer.devulapalli=intel.com@python.org> On Behalf Of Daniele Nicolodi
Sent: Tuesday, February 4, 2020 10:01 AM
To: numpy-discussion@python.org
Subject: Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics

On 04-02-2020 08:08, Matti Picus wrote:
Together with Sayed Adel (cc) and Ralf, I am pleased to put the draft version of NEP 38 [0] up for discussion. As per NEP 0, this is the next step in the community accepting the approach laid out in the NEP. The NEP PR [1] has already garnered a fair amount of discussion about the viability of Universal SIMD Intrinsics, so I will try to capture some of that here as well.
Hello, more interesting prior art may be found in VOLK https://www.libvolk.org. VOLK is developed mainly to be used in GNU Radio, and this is reflected in the available kernels and in the supported data types. I think the approach used there may be of interest. Cheers, Dan
—snip—
1) Once NumPy adds the framework and initial set of Universal Intrinsics, if contributors want to leverage a new architecture-specific SIMD instruction, will they be expected to add a software implementation of this instruction for all other architectures too?
In my opinion, if the instruction sets are lower in the hierarchy, then yes. For example, one cannot add AVX-512 without also adding, for example, AVX-256 and AVX-128 and SSE*. However, I would not expect one person or team to be an expert in all assemblies, so intrinsics for one architecture can be developed independently of another.
2) On whom does the burden lie to ensure that new implementations are benchmarked and show benefits on every architecture? What happens if optimizing a ufunc improves performance on one architecture and worsens it on another?
I would look at this from a maintainability point of view. If we are increasing the code size by 20% for a certain ufunc, there must be a demonstrable 20% increase in performance on any CPU. That is to say, micro-optimisation will be unwelcome, and code readability will be preferable. Usually we ask the submitter of the PR to test it on a machine they have on hand, and I would be inclined to keep this trend of self-reporting. Of course, if someone else came along and reported a performance regression of, say, 10%, then we would have increased code size by 20% with only a net 5% gain in performance, and the PR would have to be reverted. —snip—
On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi <einstein.edison@gmail.com> wrote:
—snip—
1) Once NumPy adds the framework and initial set of Universal Intrinsics, if contributors want to leverage a new architecture-specific SIMD instruction, will they be expected to add a software implementation of this instruction for all other architectures too?
In my opinion, if the instruction sets are lower in the hierarchy, then yes. For example, one cannot add AVX-512 without also adding, for example, AVX-256 and AVX-128 and SSE*. However, I would not expect one person or team to be an expert in all assemblies, so intrinsics for one architecture can be developed independently of another.
I think this doesn't quite answer the question. If I understand correctly, it's about a single instruction (e.g. one needs "VEXP2PD" and it's missing from the supported AVX512 instructions in master). I think the answer is yes, it needs to be added for other architectures as well. Otherwise, if universal intrinsics are added ad-hoc and there's no guarantee that a universal instruction is available for all main supported platforms, then over time there won't be much that's "universal" about the framework. This is a different question though from adding a new ufunc implementation. I would expect accelerating ufuncs via intrinsics that are already supported to be much more common than having to add new intrinsics. Does that sound right?
2) On whom does the burden lie to ensure that new implementations are benchmarked and show benefits on every architecture? What happens if optimizing a ufunc improves performance on one architecture and worsens it on another?
This is slightly hard to provide a recipe for. I suspect it may take a while before this becomes an issue, since we don't have much SIMD code to begin with. So adding new code with benchmarks will likely show improvements on all architectures (we should ensure benchmarks can be run via CI, otherwise it's too onerous). And if not, and it's not easily fixable, the problematic platform could be skipped so performance there is unchanged. Only once there are existing universal intrinsics and they're tweaked will we have to be much more careful, I'd think.

Cheers,
Ralf
I would look at this from a maintainability point of view. If we are increasing the code size by 20% for a certain ufunc, there must be a demonstrable 20% increase in performance on any CPU. That is to say, micro-optimisation will be unwelcome, and code readability will be preferable. Usually we ask the submitter of the PR to test it on a machine they have on hand, and I would be inclined to keep this trend of self-reporting. Of course, if someone else came along and reported a performance regression of, say, 10%, then we would have increased code size by 20% with only a net 5% gain in performance, and the PR would have to be reverted.
—snip—
On 11/2/20 7:16 am, Ralf Gommers wrote:
On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi <einstein.edison@gmail.com <mailto:einstein.edison@gmail.com>> wrote:
—snip—
> 1) Once NumPy adds the framework and initial set of Universal Intrinsics, if contributors want to leverage a new architecture-specific SIMD instruction, will they be expected to add a software implementation of this instruction for all other architectures too?
In my opinion, if the instruction sets are lower in the hierarchy, then yes. For example, one cannot add AVX-512 without also adding, for example, AVX-256 and AVX-128 and SSE*. However, I would not expect one person or team to be an expert in all assemblies, so intrinsics for one architecture can be developed independently of another.
I think this doesn't quite answer the question. If I understand correctly, it's about a single instruction (e.g. one needs "VEXP2PD" and it's missing from the supported AVX512 instructions in master). I think the answer is yes, it needs to be added for other architectures as well. Otherwise, if universal intrinsics are added ad-hoc and there's no guarantee that a universal instruction is available for all main supported platforms, then over time there won't be much that's "universal" about the framework.

This is a different question though from adding a new ufunc implementation. I would expect accelerating ufuncs via intrinsics that are already supported to be much more common than having to add new intrinsics. Does that sound right?
Yes. Universal intrinsics are cross-platform. However, the NEP is open to the possibility that certain architectures may have SIMD intrinsics that cannot be expressed in terms of intrinsics for other platforms, and so there may be a use case for architecture-specific loops. This is explicitly stated in the latest PR to the NEP: "If the regression is not minimal, we may choose to keep the X86-specific code for that platform and use the universal intrinsic code for other platforms."
> 2) On whom does the burden lie to ensure that new implementations are benchmarked and show benefits on every architecture? What happens if optimizing a ufunc improves performance on one architecture and worsens it on another?
This is slightly hard to provide a recipe for. I suspect it may take a while before this becomes an issue, since we don't have much SIMD code to begin with. So adding new code with benchmarks will likely show improvements on all architectures (we should ensure benchmarks can be run via CI, otherwise it's too onerous). And if not, and it's not easily fixable, the problematic platform could be skipped so performance there is unchanged.
On HEAD, out of the 89 ufuncs in numpy.core.code_generators.generate_umath.defdict, 34 have X86-specific simd loops:
[x for x in defdict.keys() if any([td.simd for td in defdict[x].type_descriptions])] ['add', 'subtract', 'multiply', 'conjugate', 'square', 'reciprocal', 'absolute', 'negative', 'greater', 'greater_equal', 'less', 'less_equal', 'equal', 'not_equal', 'logical_and', 'logical_not', 'logical_or', 'maximum', 'minimum', 'bitwise_and', 'bitwise_or', 'bitwise_xor', 'invert', 'left_shift', 'right_shift', 'cos', 'sin', 'exp', 'log', 'sqrt', 'ceil', 'trunc', 'floor', 'rint']
They would be the first targets for universal intrinsics. Of these, I estimate that the ones with more than one loop for at least one dtype signature would be the most difficult, since these have different optimizations for avx2, fma, and/or avx512f: ['square', 'reciprocal', 'absolute', 'cos', 'sin', 'exp', 'log', 'sqrt', 'ceil', 'trunc', 'floor', 'rint']

The other 55 ufuncs, for completeness, are ['floor_divide', 'true_divide', 'fmod', '_ones_like', 'power', 'float_power', '_arg', 'positive', 'sign', 'logical_xor', 'clip', 'fmax', 'fmin', 'logaddexp', 'logaddexp2', 'heaviside', 'degrees', 'rad2deg', 'radians', 'deg2rad', 'arccos', 'arccosh', 'arcsin', 'arcsinh', 'arctan', 'arctanh', 'tan', 'cosh', 'sinh', 'tanh', 'exp2', 'expm1', 'log2', 'log10', 'log1p', 'cbrt', 'fabs', 'arctan2', 'remainder', 'divmod', 'hypot', 'isnan', 'isnat', 'isinf', 'isfinite', 'signbit', 'copysign', 'nextafter', 'spacing', 'modf', 'ldexp', 'frexp', 'gcd', 'lcm', 'matmul']

As for testing accuracy: we recently added a framework for testing ULP variation of ufuncs against "golden results" in numpy/core/tests/test_umath_accuracy. So far float32 is tested for exp, log, cos, sin. Others may be tested elsewhere by specific tests; for instance, numpy/core/tests/test_half.py has test_half_ufuncs.

It is difficult to do benchmarking in CI: the machines that run CI vary too much. We would need to set aside a machine for this and carefully set it up to keep CPU speed and temperature constant. We do have benchmarks for ufuncs (they could always be improved). I think Pauli runs the benchmarks carefully on X86, and may even make the results public, but that resource is not really on PR reviewers' radar. We could run benchmarks on the gcc build farm machines for other architectures. Those machines are shared but not heavily utilized.
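For readers unfamiliar with ULP comparisons, the core of such a test is small; the sketch below is only an illustration of the idea, not the code in test_umath_accuracy. For finite floats of the same sign, reinterpreting the bits as integers makes adjacent representable values differ by exactly one.

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* ULP distance between two finite floats of the same sign. */
    static int32_t ulp_diff(float a, float b) {
        int32_t ia, ib;
        memcpy(&ia, &a, sizeof ia);   /* type-pun safely via memcpy */
        memcpy(&ib, &b, sizeof ib);
        return ia > ib ? ia - ib : ib - ia;
    }

    int main(void) {
        float golden = expf(1.5f);                       /* "golden" reference */
        float simd = nextafterf(golden, 2.0f * golden);  /* pretend SIMD result, 1 ULP off */
        printf("ULP error: %d\n", (int)ulp_diff(golden, simd));
        return 0;
    }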
Only once there are existing universal intrinsics and they're tweaked will we have to be much more careful, I'd think.
I would look at this from a maintainability point of view. If we are increasing the code size by 20% for a certain ufunc, there must be a demonstrable 20% increase in performance on any CPU. That is to say, micro-optimisation will be unwelcome, and code readability will be preferable. Usually we ask the submitter of the PR to test it on a machine they have on hand, and I would be inclined to keep this trend of self-reporting. Of course, if someone else came along and reported a performance regression of, say, 10%, then we would have increased code size by 20% with only a net 5% gain in performance, and the PR would have to be reverted.
—snip—
I think we should be careful not to increase the reviewer burden, and try to automate as much as possible. It would be nice if we could at some point set up a set of bots that could be triggered to run benchmarks for us and report the results in the PR. Matti
I think this doesn't quite answer the question. If I understand correctly, it's about a single instruction (e.g. one needs "VEXP2PD" and it's missing from the supported AVX512 instructions in master). I think the answer is yes, it needs to be added for other architectures as well.
That adds a lot of overhead to writing SIMD-based optimizations, which can discourage contributors. It’s also an unreasonable expectation that a developer be familiar with the SIMD of all the architectures. On top of that, the performance implications aren’t clear. Software implementations of hardware instructions might perform worse and might not even produce the same result.

From: NumPy-Discussion <numpy-discussion-bounces+raghuveer.devulapalli=intel.com@python.org> On Behalf Of Ralf Gommers
Sent: Monday, February 10, 2020 9:17 PM
To: Discussion of Numerical Python <numpy-discussion@python.org>
Subject: Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics

On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi <einstein.edison@gmail.com<mailto:einstein.edison@gmail.com>> wrote:
—snip—
1) Once NumPy adds the framework and initial set of Universal Intrinsics, if contributors want to leverage a new architecture-specific SIMD instruction, will they be expected to add a software implementation of this instruction for all other architectures too?
In my opinion, if the instruction sets are lower in the hierarchy, then yes. For example, one cannot add AVX-512 without also adding, for example, AVX-256 and AVX-128 and SSE*. However, I would not expect one person or team to be an expert in all assemblies, so intrinsics for one architecture can be developed independently of another.

I think this doesn't quite answer the question. If I understand correctly, it's about a single instruction (e.g. one needs "VEXP2PD" and it's missing from the supported AVX512 instructions in master). I think the answer is yes, it needs to be added for other architectures as well. Otherwise, if universal intrinsics are added ad-hoc and there's no guarantee that a universal instruction is available for all main supported platforms, then over time there won't be much that's "universal" about the framework. This is a different question though from adding a new ufunc implementation. I would expect accelerating ufuncs via intrinsics that are already supported to be much more common than having to add new intrinsics. Does that sound right?
2) On whom does the burden lie to ensure that new implementations are benchmarked and show benefits on every architecture? What happens if optimizing a ufunc improves performance on one architecture and worsens it on another?
This is slightly hard to provide a recipe for. I suspect it may take a while before this becomes an issue, since we don't have much SIMD code to begin with. So adding new code with benchmarks will likely show improvements on all architectures (we should ensure benchmarks can be run via CI, otherwise it's too onerous). And if not, and it's not easily fixable, the problematic platform could be skipped so performance there is unchanged. Only once there are existing universal intrinsics and they're tweaked will we have to be much more careful, I'd think.

Cheers,
Ralf

I would look at this from a maintainability point of view. If we are increasing the code size by 20% for a certain ufunc, there must be a demonstrable 20% increase in performance on any CPU. That is to say, micro-optimisation will be unwelcome, and code readability will be preferable. Usually we ask the submitter of the PR to test it on a machine they have on hand, and I would be inclined to keep this trend of self-reporting. Of course, if someone else came along and reported a performance regression of, say, 10%, then we would have increased code size by 20% with only a net 5% gain in performance, and the PR would have to be reverted.

—snip—
On Tue, Feb 11, 2020 at 12:03 PM Devulapalli, Raghuveer < raghuveer.devulapalli@intel.com> wrote:
I think this doesn't quite answer the question. If I understand correctly, it's about a single instruction (e.g. one needs "VEXP2PD" and it's missing from the supported AVX512 instructions in master). I think the answer is yes, it needs to be added for other architectures as well.
That adds a lot of overhead to writing SIMD-based optimizations, which can discourage contributors.
Keep in mind that a new universal intrinsic is just a bunch of defines. That is way less work than writing a ufunc that uses that instruction. We can also ping a platform expert in case it's not obvious what the corresponding arch-specific instruction is - that's a bit of a chicken-and-egg problem; once we get going we hopefully get more interested people who can help each other out.
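"A bunch of defines" might look roughly like this; the npyv_* spellings are hypothetical placeholders, not settled names. The point is that each portable name is overlaid onto the native intrinsic for whichever platform is being compiled, so a loop written once against the portable names needs no per-architecture source.

    /* Hypothetical universal-intrinsic definitions for one type (4 packed
     * floats).  Each portable name resolves to the platform's native call. */
    #if defined(__SSE2__)
        #include <emmintrin.h>
        typedef __m128 npyv_f32;
        #define npyv_load_f32(p)     _mm_loadu_ps(p)
        #define npyv_add_f32(a, b)   _mm_add_ps((a), (b))
        #define npyv_store_f32(p, v) _mm_storeu_ps((p), (v))
    #elif defined(__ARM_NEON)
        #include <arm_neon.h>
        typedef float32x4_t npyv_f32;
        #define npyv_load_f32(p)     vld1q_f32(p)
        #define npyv_add_f32(a, b)   vaddq_f32((a), (b))
        #define npyv_store_f32(p, v) vst1q_f32((p), (v))
    #endif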
It’s also an unreasonable expectation that a developer be familiar with the SIMD of all the architectures. On top of that, the performance implications aren’t clear. Software implementations of hardware instructions might perform worse and might not even produce the same result.
I think you are worrying about writing ufuncs here, not about adding an instruction. If the same result is not produced, we have CI that should fail - and if it does, we can deal with that (if it's not easy to figure out) by making that platform fall back to the generic non-SIMD version of the ufunc.

Cheers,
Ralf
From: NumPy-Discussion <numpy-discussion-bounces+raghuveer.devulapalli=intel.com@python.org> On Behalf Of Ralf Gommers
Sent: Monday, February 10, 2020 9:17 PM
To: Discussion of Numerical Python <numpy-discussion@python.org>
Subject: Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics
On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi <einstein.edison@gmail.com> wrote:
—snip—
1) Once NumPy adds the framework and initial set of Universal Intrinsics, if contributors want to leverage a new architecture-specific SIMD instruction, will they be expected to add a software implementation of this instruction for all other architectures too?
In my opinion, if the instruction sets are lower in the hierarchy, then yes. For example, one cannot add AVX-512 without also adding, for example, AVX-256 and AVX-128 and SSE*. However, I would not expect one person or team to be an expert in all assemblies, so intrinsics for one architecture can be developed independently of another.
I think this doesn't quite answer the question. If I understand correctly, it's about a single instruction (e.g. one needs "VEXP2PD" and it's missing from the supported AVX512 instructions in master). I think the answer is yes, it needs to be added for other architectures as well. Otherwise, if universal intrinsics are added ad-hoc and there's no guarantee that a universal instruction is available for all main supported platforms, then over time there won't be much that's "universal" about the framework.
This is a different question though from adding a new ufunc implementation. I would expect accelerating ufuncs via intrinsics that are already supported to be much more common than having to add new intrinsics. Does that sound right?
2) On whom does the burden lie to ensure that new implementations are benchmarked and show benefits on every architecture? What happens if optimizing a ufunc improves performance on one architecture and worsens it on another?
This is slightly hard to provide a recipe for. I suspect it may take a while before this becomes an issue, since we don't have much SIMD code to begin with. So adding new code with benchmarks will likely show improvements on all architectures (we should ensure benchmarks can be run via CI, otherwise it's too onerous). And if not, and it's not easily fixable, the problematic platform could be skipped so performance there is unchanged.
Only once there are existing universal intrinsics and they're tweaked will we have to be much more careful, I'd think.
Cheers,
Ralf
I would look at this from a maintainability point of view. If we are increasing the code size by 20% for a certain ufunc, there must be a demonstrable 20% increase in performance on any CPU. That is to say, micro-optimisation will be unwelcome, and code readability will be preferable. Usually we ask the submitter of the PR to test it on a machine they have on hand, and I would be inclined to keep this trend of self-reporting. Of course, if someone else came along and reported a performance regression of, say, 10%, then we would have increased code size by 20% with only a net 5% gain in performance, and the PR would have to be reverted.
—snip—
On 11/2/20 8:02 pm, Devulapalli, Raghuveer wrote:
On top of that, the performance implications aren’t clear. Software implementations of hardware instructions might perform worse and might not even produce the same result.
The proposal for universal intrinsics does not enable replacing an intrinsic on one platform with a software emulation on another: the intrinsics are meant to be compile-time defines that overlay the universal intrinsic with a platform-specific one. In order to use a new intrinsic, it must have parallel intrinsics on the other platforms, or it cannot be used there: "NPY_CPU_HAVE(FEATURE_NAME)" will always return false, so the compiler will not even build a loop for that platform. I will try to clarify that intention in the NEP.

I hope there will not be a demand to use many non-universal intrinsics in ufuncs; we will need to work this out on a case-by-case basis in each ufunc. Does that sound reasonable? Are there intrinsics you have already used that have no parallel on other platforms?

Matti
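A guess at the shape of that guard (a sketch only: NPY_CPU_HAVE is the name from the message above, everything else here is hypothetical):

    #include <stdio.h>

    /* The build system defines NPY_CPU_HAVE_<FEATURE> to 1 only when the
     * feature exists for the target; otherwise the test is constant false
     * and the guarded loop is never compiled. */
    #ifndef NPY_CPU_HAVE_AVX512F
        #define NPY_CPU_HAVE_AVX512F 0
    #endif
    #define NPY_CPU_HAVE(FEATURE) NPY_CPU_HAVE_##FEATURE

    #if NPY_CPU_HAVE(AVX512F)
    static const char *exp_loop = "AVX-512F exp loop compiled in";
    #else
    static const char *exp_loop = "baseline exp loop only";
    #endif

    int main(void) { puts(exp_loop); return 0; }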
On Wed, Feb 12, 2020 at 12:19 AM Matti Picus <matti.picus@gmail.com> wrote:
On 11/2/20 8:02 pm, Devulapalli, Raghuveer wrote:
On top of that, the performance implications aren’t clear. Software implementations of hardware instructions might perform worse and might not even produce the same result.
The proposal for universal intrinsics does not enable replacing an intrinsic on one platform with a software emulation on another: the intrinsics are meant to be compile-time defines that overlay the universal intrinsic with a platform-specific one. In order to use a new intrinsic, it must have parallel intrinsics on the other platforms, or it cannot be used there: "NPY_CPU_HAVE(FEATURE_NAME)" will always return false, so the compiler will not even build a loop for that platform. I will try to clarify that intention in the NEP.
I hope there will not be a demand to use many non-universal intrinsics in ufuncs; we will need to work this out on a case-by-case basis in each ufunc. Does that sound reasonable? Are there intrinsics you have already used that have no parallel on other platforms?
Intrinsics are not an irreversible change; they are, after all, private. The question is whether they are sufficiently useful to justify the time spent on them. I don't think we will know that until we attempt actual implementations. There will probably be some changes as a result of experience, but that is normal. Chuck
I hope there will not be a demand to use many non-universal intrinsics in ufuncs; we will need to work this out on a case-by-case basis in each ufunc. Does that sound reasonable? Are there intrinsics you have already used that have no parallel on other platforms?
I think that is reasonable. It's hard to anticipate the future need for and benefit of specialized intrinsics, but I tried to make a list of some of the specialized intrinsics that are currently in use in NumPy that I don't believe exist on other platforms (most of these actually don't exist on AVX2 either). I am not an expert in ARM or VSX architecture, so please correct me if I am wrong.

a. _mm512_mask_i32gather_ps
b. _mm512_mask_i32scatter_ps/_mm512_mask_i32scatter_pd
c. _mm512_maskz_loadu_pd/_mm512_maskz_loadu_ps
d. _mm512_getexp_ps
e. _mm512_getmant_ps
f. _mm512_scalef_ps
g. _mm512_permutex2var_ps, _mm512_permutex2var_pd
h. _mm512_maskz_div_ps, _mm512_maskz_div_pd
i. _mm512_permute_ps/_mm512_permute_pd
j. _mm512_sqrt_ps/pd (I could be wrong on this one, but from the little Google search I did, it seems like the Power ISA doesn't have a vectorized sqrt instruction)

Software implementations of these instructions are definitely possible, but some of them are not trivial to implement and are surely not going to be one-line macros either. I am also unsure of what implications this has on performance, but we will hopefully find out once we convert these to universal intrinsics and then benchmark.

Raghuveer

-----Original Message-----
From: NumPy-Discussion <numpy-discussion-bounces+raghuveer.devulapalli=intel.com@python.org> On Behalf Of Matti Picus
Sent: Tuesday, February 11, 2020 11:19 PM
To: numpy-discussion@python.org
Subject: Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics

On 11/2/20 8:02 pm, Devulapalli, Raghuveer wrote:
On top of that, the performance implications aren’t clear. Software implementations of hardware instructions might perform worse and might not even produce the same result.
The proposal for universal intrinsics does not enable replacing an intrinsic on one platform with a software emulation on another: the intrinsics are meant to be compile-time defines that overlay the universal intrinsic with a platform-specific one. In order to use a new intrinsic, it must have parallel intrinsics on the other platforms, or it cannot be used there: "NPY_CPU_HAVE(FEATURE_NAME)" will always return false, so the compiler will not even build a loop for that platform. I will try to clarify that intention in the NEP. I hope there will not be a demand to use many non-universal intrinsics in ufuncs; we will need to work this out on a case-by-case basis in each ufunc. Does that sound reasonable? Are there intrinsics you have already used that have no parallel on other platforms? Matti
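To illustrate why, for example, item (c) in Raghuveer's list above is not a one-line macro elsewhere: _mm512_maskz_loadu_ps loads 16 floats while zeroing the lanes whose mask bit is clear, which lets a loop consume its tail without a scalar cleanup pass. A portable stand-in (hypothetical name, sketch only) falls back to lane-by-lane work:

    #include <stdio.h>

    /* Scalar emulation of a 16-lane masked, zeroing load: lane i gets
     * src[i] if bit i of mask is set, else 0.0f.  Masked-off lanes are
     * never read, matching the instruction's no-fault behaviour, but
     * this costs a branch per element instead of one wide load. */
    static void maskz_loadu_f32x16(float dst[16], const float *src, unsigned mask) {
        for (int i = 0; i < 16; i++)
            dst[i] = ((mask >> i) & 1u) ? src[i] : 0.0f;
    }

    int main(void) {
        float src[16], dst[16];
        for (int i = 0; i < 16; i++)
            src[i] = (float)(i + 1);
        maskz_loadu_f32x16(dst, src, 0x0007u);   /* keep only the first 3 lanes */
        printf("%g %g %g %g\n", dst[0], dst[2], dst[3], dst[15]);
        return 0;
    }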
On Wed, 12 Feb 2020 19:36:10 +0000 "Devulapalli, Raghuveer" <raghuveer.devulapalli@intel.com> wrote:
j. _mm512_sqrt_ps/pd (I could be wrong on this one, but from the little Google search I did, it seems like the Power ISA doesn't have a vectorized sqrt instruction)
Hi, starting with Power7 (we are at Power9), sqrt is available both in single and double precision: https://www.ibm.com/support/knowledgecenter/SSGH2K_12.1.0/com.ibm.xlc121.aix...

Cheers,

--
Jérôme Kieffer
tel +33 476 882 445
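For completeness, a sketch of how that is reached from C with gcc or clang on POWER (compile with -mvsx; the wrapper name here is mine):

    #include <altivec.h>   /* AltiVec/VSX built-ins */

    /* Two doubles per vector; vec_sqrt maps to the VSX xvsqrtdp
     * instruction available since Power7. */
    vector double vsqrt_f64x2(vector double x) {
        return vec_sqrt(x);
    }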
On Wed, Feb 12, 2020 at 1:37 PM Devulapalli, Raghuveer < raghuveer.devulapalli@intel.com> wrote:
I hope there will not be a demand to use many non-universal intrinsics in ufuncs; we will need to work this out on a case-by-case basis in each ufunc. Does that sound reasonable? Are there intrinsics you have already used that have no parallel on other platforms?
I think that is reasonable. It's hard to anticipate the future need for and benefit of specialized intrinsics, but I tried to make a list of some of the specialized intrinsics that are currently in use in NumPy that I don't believe exist on other platforms (most of these actually don't exist on AVX2 either). I am not an expert in ARM or VSX architecture, so please correct me if I am wrong.
a. _mm512_mask_i32gather_ps
b. _mm512_mask_i32scatter_ps/_mm512_mask_i32scatter_pd
c. _mm512_maskz_loadu_pd/_mm512_maskz_loadu_ps
d. _mm512_getexp_ps
e. _mm512_getmant_ps
f. _mm512_scalef_ps
g. _mm512_permutex2var_ps, _mm512_permutex2var_pd
h. _mm512_maskz_div_ps, _mm512_maskz_div_pd
i. _mm512_permute_ps/_mm512_permute_pd
j. _mm512_sqrt_ps/pd (I could be wrong on this one, but from the little Google search I did, it seems like the Power ISA doesn't have a vectorized sqrt instruction)
Software implementations of these instructions are definitely possible, but some of them are not trivial to implement and are surely not going to be one-line macros either. I am also unsure of what implications this has on performance, but we will hopefully find out once we convert these to universal intrinsics and then benchmark.
For these it seems like we don't want software implementations of the universal intrinsics - if there's no equivalent on PPC/ARM and there's enough value (performance gain given additional code complexity) in the additional AVX instructions, then we should still simply use AVX instructions directly. Ralf
Raghuveer
-----Original Message-----
From: NumPy-Discussion <numpy-discussion-bounces+raghuveer.devulapalli=intel.com@python.org> On Behalf Of Matti Picus
Sent: Tuesday, February 11, 2020 11:19 PM
To: numpy-discussion@python.org
Subject: Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics
On 11/2/20 8:02 pm, Devulapalli, Raghuveer wrote:
On top of that, the performance implications aren’t clear. Software implementations of hardware instructions might perform worse and might not even produce the same result.
The proposal for universal intrinsics does not enable replacing an intrinsic on one platform with a software emulation on another: the intrinsics are meant to be compile-time defines that overlay the universal intrinsic with a platform-specific one. In order to use a new intrinsic, it must have parallel intrinsics on the other platforms, or it cannot be used there: "NPY_CPU_HAVE(FEATURE_NAME)" will always return false, so the compiler will not even build a loop for that platform. I will try to clarify that intention in the NEP.
I hope there will not be a demand to use many non-universal intrinsics in ufuncs; we will need to work this out on a case-by-case basis in each ufunc. Does that sound reasonable? Are there intrinsics you have already used that have no parallel on other platforms?
Matti
participants (7)
- Charles R Harris
- Daniele Nicolodi
- Devulapalli, Raghuveer
- Hameer Abbasi
- Jerome Kieffer
- Matti Picus
- Ralf Gommers