Hi all,
there is a proposal to add some Intel specific fast math routine to NumPy:
https://github.com/numpy/numpy/pull/19478
part of numerical algorithms is that there is always a speed vs. precision trade-off, giving a more precise result is slower.
So there is a question what the general precision expectation should be in NumPy. And how much is it acceptable to diverge in the precision/speed trade-off depending on CPU/system?
I doubt we can formulate very clear rules here, but any input on what precision you would expect or trade-offs seem acceptable would be appreciated!
Some more details -----------------
This is mainly interesting e.g. for functions like logarithms, trigonometric functions, or cubic roots.
Some basic functions (multiplication, addition) are correct as per IEEE standard and give the best possible result, but these are typically only correct within very small numerical errors.
This is typically measured as "ULP":
https://en.wikipedia.org/wiki/Unit_in_the_last_place
where 0.5 ULP would be the best possible result.
Merging the PR may mean relaxing the current precision slightly in some places. In general Intel advertises 4 ULP of precision (although the actual precision for most functions seems better).
Here are two tables, one from glibc and one for the Intel functions:
https://www.gnu.org/software/libc/manual/html_node/Errors-in-Math-Functions.... (Mainly the LA column) https://software.intel.com/content/www/us/en/develop/documentation/onemkl-vm...
Different implementation give different accuracy, but formulating some guidelines/expectation (or referencing them) would be useful guidance.
For basic
Am 28.07.2021 um 01:50 schrieb Sebastian Berg sebastian@sipsolutions.net:
Hi all,
there is a proposal to add some Intel specific fast math routine to NumPy:
Many years ago I wrote a package https://github.com/geggo/uvml that makes the VML, a fast implementation of transcendetal math functions, available for numpy. Don’t know if it still compiles. It uses Intel VML, designed for processing arrays, not the SVML intrinsics. By this it is less machine dependent (optimized implementations are selected automatically depending on the availability of, e.g., SSE, AVX, or AVX512), just link to a library. It compiles as an external module, can be activated at runtime.
Different precision models can be selected at runtime (globally). I thinks Intel advocates to use the LA (low accuracy) mode as a good compromise between performance and accuracy. Different people have strongly diverging opinions about what to expect.
The speedups possibly gained by these approaches often vaporize in non-benchmark applications, as for those functions performance is often limited by memory bandwidth, unless all your data stays in CPU cache. By default I would go for high accuracy mode, with option to switch to low accuracy if one urgently needs the better performance. But then one should use different approaches for speeding up numpy.
Gregor
part of numerical algorithms is that there is always a speed vs. precision trade-off, giving a more precise result is slower.
So there is a question what the general precision expectation should be in NumPy. And how much is it acceptable to diverge in the precision/speed trade-off depending on CPU/system?
I doubt we can formulate very clear rules here, but any input on what precision you would expect or trade-offs seem acceptable would be appreciated!
Some more details
This is mainly interesting e.g. for functions like logarithms, trigonometric functions, or cubic roots.
Some basic functions (multiplication, addition) are correct as per IEEE standard and give the best possible result, but these are typically only correct within very small numerical errors.
This is typically measured as "ULP":
https://en.wikipedia.org/wiki/Unit_in_the_last_place
where 0.5 ULP would be the best possible result.
Merging the PR may mean relaxing the current precision slightly in some places. In general Intel advertises 4 ULP of precision (although the actual precision for most functions seems better).
Here are two tables, one from glibc and one for the Intel functions:
https://www.gnu.org/software/libc/manual/html_node/Errors-in-Math-Functions.... (Mainly the LA column) https://software.intel.com/content/www/us/en/develop/documentation/onemkl-vm...
Different implementation give different accuracy, but formulating some guidelines/expectation (or referencing them) would be useful guidance.
For basic
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
On Tue, Jul 27, 2021 at 4:55 PM Sebastian Berg sebastian@sipsolutions.net wrote:
Hi all,
there is a proposal to add some Intel specific fast math routine to NumPy:
https://github.com/numpy/numpy/pull/19478
part of numerical algorithms is that there is always a speed vs. precision trade-off, giving a more precise result is slower.
So there is a question what the general precision expectation should be in NumPy. And how much is it acceptable to diverge in the precision/speed trade-off depending on CPU/system?
I doubt we can formulate very clear rules here, but any input on what precision you would expect or trade-offs seem acceptable would be appreciated!
Some more details
This is mainly interesting e.g. for functions like logarithms, trigonometric functions, or cubic roots.
Some basic functions (multiplication, addition) are correct as per IEEE standard and give the best possible result, but these are typically only correct within very small numerical errors.
This is typically measured as "ULP":
https://en.wikipedia.org/wiki/Unit_in_the_last_place
where 0.5 ULP would be the best possible result.
Merging the PR may mean relaxing the current precision slightly in some places. In general Intel advertises 4 ULP of precision (although the actual precision for most functions seems better).
Here are two tables, one from glibc and one for the Intel functions:
https://www.gnu.org/software/libc/manual/html_node/Errors-in-Math-Functions.... (Mainly the LA column) https://software.intel.com/content/www/us/en/develop/documentation/onemkl-vm...
Different implementation give different accuracy, but formulating some guidelines/expectation (or referencing them) would be useful guidance.
"Close enough" depends on the application but non-linear models can get the "butterfly effect" where the results diverge if they aren't identical.
For a certain class of scientific programming applications, reproducibility is paramount.
Development teams may use a variety of development laptops, workstations, scientific computing clusters, and cloud computing platforms. If the tests pass on your machine but fail in CI, you have a debugging problem.
If your published scientific article links to source code that replicates your computation, scientists will expect to be able to run that code, now or in a couple decades, and replicate the same outputs. They'll be using different OS releases and maybe different CPU + accelerator architectures.
Reproducible Science is good. Replicated Science is better. http://rescience.github.io/
Clearly there are other applications where it's easy to trade reproducibility and some precision for speed.
On Fri, 2021-07-30 at 11:04 -0700, Jerry Morrison wrote:
On Tue, Jul 27, 2021 at 4:55 PM Sebastian Berg < sebastian@sipsolutions.net> wrote:
Hi all,
there is a proposal to add some Intel specific fast math routine to NumPy:
https://github.com/numpy/numpy/pull/19478
part of numerical algorithms is that there is always a speed vs. precision trade-off, giving a more precise result is slower.
<snip>
I have to make a correction: I linked the SVML, which is distinct from VML (which the PR proposes), the actual table for precision is here:
https://github.com/numpy/numpy/pull/19485#issuecomment-887995864
"Close enough" depends on the application but non-linear models can get the "butterfly effect" where the results diverge if they aren't identical.
Right, so my hope was to gauge what the general expectation is. I take it you expect a high accuracy.
The error for the computations itself is seems low on first sight, but of course they can explode quickly in non-linear settings... (In the chaotic systems I worked with, the shadowing theorem would usually alleviate such worries. And testing the integration would be more important. But I am sure for certain questions things may be far more tricky.)
For a certain class of scientific programming applications, reproducibility is paramount.
Development teams may use a variety of development laptops, workstations, scientific computing clusters, and cloud computing platforms. If the tests pass on your machine but fail in CI, you have a debugging problem.
If your published scientific article links to source code that replicates your computation, scientists will expect to be able to run that code, now or in a couple decades, and replicate the same outputs. They'll be using different OS releases and maybe different CPU + accelerator architectures.
Reproducible Science is good. Replicated Science is better. http://rescience.github.io/
Clearly there are other applications where it's easy to trade reproducibility and some precision for speed.
Agreed, although there are so many factors, often out of our control, that I am not sure that true replicability is achievable without containers :(.
It would be amazing if NumPy could have a "replicable" mode, but I am not sure how that could be done, or if the "ground work" in the math and linear algebra libraries even exists.
However, even if it is practically impossible to make things replicable, there is an argument for improving reproducibility and replicability, e.g. by choosing the high-accuracy version here. Even if it is impossible to actually ensure.
Cheers,
Sebastian
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
I strongly agree with you Gregor: * Best precision should remain the default. I lost months in finding the compiler option (in ICC) which switched to LA mode and broke all my calculations. * I wonder how those SVML behaves on non-intel plateform ? Sleef provides the same approach but it works also on Power and ARM platforms (and is designed to be extended...).
Cheers,
Jerome
On Wed, 28 Jul 2021 12:13:44 +0200 Gregor Thalhammer gregor.thalhammer@gmail.com wrote:
Am 28.07.2021 um 01:50 schrieb Sebastian Berg sebastian@sipsolutions.net:
Hi all,
there is a proposal to add some Intel specific fast math routine to NumPy:
Many years ago I wrote a package https://github.com/geggo/uvml that makes the VML, a fast implementation of transcendetal math functions, available for numpy. Don’t know if it still compiles. It uses Intel VML, designed for processing arrays, not the SVML intrinsics. By this it is less machine dependent (optimized implementations are selected automatically depending on the availability of, e.g., SSE, AVX, or AVX512), just link to a library. It compiles as an external module, can be activated at runtime.
Different precision models can be selected at runtime (globally). I thinks Intel advocates to use the LA (low accuracy) mode as a good compromise between performance and accuracy. Different people have strongly diverging opinions about what to expect.
The speedups possibly gained by these approaches often vaporize in non-benchmark applications, as for those functions performance is often limited by memory bandwidth, unless all your data stays in CPU cache. By default I would go for high accuracy mode, with option to switch to low accuracy if one urgently needs the better performance. But then one should use different approaches for speeding up numpy.
Gregor
part of numerical algorithms is that there is always a speed vs. precision trade-off, giving a more precise result is slower.
So there is a question what the general precision expectation should be in NumPy. And how much is it acceptable to diverge in the precision/speed trade-off depending on CPU/system?
I doubt we can formulate very clear rules here, but any input on what precision you would expect or trade-offs seem acceptable would be appreciated!
Some more details
This is mainly interesting e.g. for functions like logarithms, trigonometric functions, or cubic roots.
Some basic functions (multiplication, addition) are correct as per IEEE standard and give the best possible result, but these are typically only correct within very small numerical errors.
This is typically measured as "ULP":
https://en.wikipedia.org/wiki/Unit_in_the_last_place
where 0.5 ULP would be the best possible result.
Merging the PR may mean relaxing the current precision slightly in some places. In general Intel advertises 4 ULP of precision (although the actual precision for most functions seems better).
Here are two tables, one from glibc and one for the Intel functions:
https://www.gnu.org/software/libc/manual/html_node/Errors-in-Math-Functions.... (Mainly the LA column) https://software.intel.com/content/www/us/en/develop/documentation/onemkl-vm...
Different implementation give different accuracy, but formulating some guidelines/expectation (or referencing them) would be useful guidance.
For basic
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
On Fri, Jul 30, 2021 at 12:22 PM Sebastian Berg sebastian@sipsolutions.net wrote:
On Fri, 2021-07-30 at 11:04 -0700, Jerry Morrison wrote:
On Tue, Jul 27, 2021 at 4:55 PM Sebastian Berg < sebastian@sipsolutions.net> wrote:
Hi all,
there is a proposal to add some Intel specific fast math routine to NumPy:
https://github.com/numpy/numpy/pull/19478
part of numerical algorithms is that there is always a speed vs. precision trade-off, giving a more precise result is slower.
<snip>
"Close enough" depends on the application but non-linear models can get the "butterfly effect" where the results diverge if they aren't identical.
Right, so my hope was to gauge what the general expectation is. I take it you expect a high accuracy.
The error for the computations itself is seems low on first sight, but of course they can explode quickly in non-linear settings... (In the chaotic systems I worked with, the shadowing theorem would usually alleviate such worries. And testing the integration would be more important. But I am sure for certain questions things may be far more tricky.)
I'll put forth an expectation that after installing a specific set of libraries, the floating point results would be identical across platforms and into the future. Ideally developers could install library updates (for hardware compatibility, security fixes, or other reasons) and still get identical results.
That expectation is for reproducibility, not high accuracy. So it'd be fine to install different libraries [or maybe use those pip package options in brackets, whatever they do?] to trade accuracy for speed. Could any particular choice of accuracy still provide reproducible results across platforms and time?
For a certain class of scientific programming applications, reproducibility is paramount.
Development teams may use a variety of development laptops, workstations, scientific computing clusters, and cloud computing platforms. If the tests pass on your machine but fail in CI, you have a debugging problem.
If your published scientific article links to source code that replicates your computation, scientists will expect to be able to run that code, now or in a couple decades, and replicate the same outputs. They'll be using different OS releases and maybe different CPU + accelerator architectures.
Reproducible Science is good. Replicated Science is better. http://rescience.github.io/
Clearly there are other applications where it's easy to trade reproducibility and some precision for speed.
Agreed, although there are so many factors, often out of our control, that I am not sure that true replicability is achievable without containers :(.
It would be amazing if NumPy could have a "replicable" mode, but I am not sure how that could be done, or if the "ground work" in the math and linear algebra libraries even exists.
However, even if it is practically impossible to make things replicable, there is an argument for improving reproducibility and replicability, e.g. by choosing the high-accuracy version here. Even if it is impossible to actually ensure.
Yes! Let's at least have reproducibility in mind and work on improving it, e.g. by removing failure modes. (Ditto for security :-)
On Thu, Aug 19, 2021 at 2:13 AM Jerry Morrison < jerry.morrison+numpy@gmail.com> wrote:
I'll put forth an expectation that after installing a specific set of libraries, the floating point results would be identical across platforms and into the future. Ideally developers could install library updates (for hardware compatibility, security fixes, or other reasons) and still get identical results.
That expectation is for reproducibility, not high accuracy. So it'd be fine to install different libraries [or maybe use those pip package options in brackets, whatever they do?] to trade accuracy for speed. Could any particular choice of accuracy still provide reproducible results across platforms and time?
While this would be nice, in practice bit-identical results for floating point NumPy functions across different operating systems and future time is going to be impractical to achieve. IEEE-754 helps by specifying the result of basic floating point operations, but once you move into special math functions (like cos()) or other algorithms that can be implemented in several "mathematically equivalent" ways, bit-level stability basically becomes impossible without snapshotting your entire software stack. Many of these special math functions are provided by the operating system, which generally do not make such guarantees.
Quick example: Suppose you want to implement sum() on a floating point array. If you start at the beginning of the array and iterate to the end, adding each element to an accumulator, you will get one answer. If you do mathematically equivalent pairwise summations (using a temporary array for storage), you will get a different, and probably more accurate answer. Neither answer will (in general) be the same as summing those numbers together with infinite precision, then rounding to the closest floating point number at the end. We could decide to make the specification for sum() also specify the algorithm for computing sum() to ensure we make the same round-off errors every time. However, this kind of detailed specification might be harder to write for other functions, or might even lock the library into accuracy bugs that can't be fixed in the future.
I think the most pragmatic thing you can hope for is:
- Bit-identical results with containers that snapshot everything, including the system math library. - Libraries that specify their accuracy levels when possible, and disclose when algorithm changes will affect the bit-identicalness of results.
On a meta-level, if analysis conclusions depend on getting bit-identical results from floating point operations, then you really want to use a higher precision float and/or an algorithm less sensitive to round-off error. Floating point numbers are not real numbers. :)