# [Python-checkins] bpo-41513: Expand comments and add references for a better understanding (GH-22123)

Raymond Hettinger webhook-mailer at python.org
Sun Sep 6 18:10:17 EDT 2020

```https://github.com/python/cpython/commit/67c998de24985b36498b0fdac68d1beceb43aba7
commit: 67c998de24985b36498b0fdac68d1beceb43aba7
branch: master
author: Raymond Hettinger <rhettinger at users.noreply.github.com>
committer: GitHub <noreply at github.com>
date: 2020-09-06T15:10:07-07:00
summary:

bpo-41513: Expand comments and add references for a better understanding (GH-22123)

files:
M Modules/mathmodule.c

diff --git a/Modules/mathmodule.c b/Modules/mathmodule.c
index d227a5d15dca2..29137ae91a22f 100644
--- a/Modules/mathmodule.c
+++ b/Modules/mathmodule.c
@@ -2419,9 +2419,9 @@ To avoid overflow/underflow and to achieve high accuracy giving results
that are almost always correctly rounded, four techniques are used:

* lossless scaling using a power-of-two scaling factor
-* accurate squaring using Veltkamp-Dekker splitting
-* compensated summation using a variant of the Neumaier algorithm
-* differential correction of the square root
+* accurate squaring using Veltkamp-Dekker splitting 
+* compensated summation using a variant of the Neumaier algorithm 
+* differential correction of the square root 

The usual presentation of the Neumaier summation algorithm has an
expensive branch depending on which operand has the larger
@@ -2456,7 +2456,11 @@ Given that csum >= 1.0, we have:
Since lo**2 is less than 1/2 ulp(csum), we have csum+lo*lo == csum.

To minimize loss of information during the accumulation of fractional
-values, each term has a separate accumulator.
+values, each term has a separate accumulator.  This also breaks up
+sequential dependencies in the inner loop so the CPU can maximize
+floating point throughput.  On a 2.6 GHz Haswell, adding one
+dimension has an incremental cost of only 5ns -- for example when
+moving from hypot(x,y) to hypot(x,y,z).

The square root differential correction is needed because a
correctly rounded square root of a correctly rounded sum of
@@ -2466,7 +2470,7 @@ The differential correction starts with a value *x* that is
the difference between the square of *h*, the possibly inaccurately
rounded square root, and the accurately computed sum of squares.
The correction is the first order term of the Maclaurin series
-expansion of sqrt(h**2 + x) == h + x/(2*h) + O(x**2).
+expansion of sqrt(h**2 + x) == h + x/(2*h) + O(x**2). 

Essentially, this differential correction is equivalent to one
refinement step in Newton's divide-and-average square root
@@ -2474,12 +2478,24 @@ algorithm, effectively doubling the number of accurate bits.
This technique is used in Dekker's SQRT2 algorithm and again in
Borges' ALGORITHM 4 and 5.

+Without proof for all cases, hypot() cannot claim to be always
+correctly rounded.  However for n <= 1000, prior to the final addition
+that rounds the overall result, the internal accuracy of "h" together
+with its correction of "x / (2.0 * h)" is at least 100 bits. 
+Also, hypot() was tested against a Decimal implementation with
+prec=300.  After 100 million trials, no incorrectly rounded examples
+were found.  In addition, perfect commutativity (all permutations are
+exactly equal) was verified for 1 billion random inputs with n=5. 
+
References:

1. Veltkamp-Dekker splitting: http://csclub.uwaterloo.ca/~pbarfuss/dekker1971.pdf
2. Compensated summation:  http://www.ti3.tu-harburg.de/paper/rump/Ru08b.pdf
3. Square root differential correction:  https://arxiv.org/pdf/1904.09481.pdf
4. https://www.wolframalpha.com/input/?i=Maclaurin+series+sqrt%28h**2+%2B+x%29+at+x%3D0
+5. https://bugs.python.org/file49439/hypot.png
+6. https://bugs.python.org/file49435/best_frac.py
+7. https://bugs.python.org/file49448/test_hypot_commutativity.py

*/

```