np.dot yields a different result when computed in two pieces
Hello everyone,

I asked this question https://stackoverflow.com/questions/76707696/np-dot-yields-a-different-resul... on Stack Overflow a few days ago but still haven't received an answer. Long story short, I discovered (in certain common settings, such as Google Colab) that the method NumPy uses to compute the dot product of a 4D vector with a 4 x m matrix depends on m, so if you cut a 4 x 2m matrix into two 4 x m matrices and compute the dot products with the vector individually, you get a slightly different result. More precisely, I found that:

- For m = 1, the order ((0+1)+2)+3 is used.
- For m = 2 and 3, the order (0+1)+(2+3) is used for all entries.
- For 4 ≤ m ≤ 7, an unknown method is used for the first 4 entries, and the order ((0+1)+2)+3 is used for the rest.
- For 8 ≤ m ≤ 11, an unknown method is used for the first 8 entries, and the order ((0+1)+2)+3 is used for the rest.
- For 12 ≤ m ≤ 15, an unknown method is used for the first 12 entries, and the order ((0+1)+2)+3 is used for the rest.
- The pattern probably continues.

If we instead take the dot product of an m x 4 matrix with a 4D vector (rowMode = True in my code):

- For m = 1, the order ((0+1)+2)+3 is used.
- For m ≥ 2, the order (0+2)+(1+3) is used for all entries (verified up to m = 17).

Moreover, if you replace 4 by an arbitrary n, the difference between the results grows: super-linearly in n in the vec-dot-mat case, but sub-linearly in the mat-dot-vec case.

I think this behavior is highly non-intuitive and probably unnecessary: we are simply repeating the same operation m times (possibly in parallel), so why should the method chosen depend on how many times the operation is repeated? Therefore, I'd like to propose an enhancement to make the result independent of m. I realize this may need to be done upstream, so as a first step we may need to find out which BLAS library functions are called to compute dot products and which are responsible for this behavior.
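A minimal starting point for that investigation (not a full trace of which kernels run, just the build information NumPy itself exposes):

```python
import numpy as np

# Report the BLAS/LAPACK libraries this NumPy build links against
# (on Colab this is typically OpenBLAS). The hand-tuned kernels in
# that library are what choose the summation order for np.dot.
np.show_config()
```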
Notebook for figuring out the order of addition: https://colab.research.google.com/drive/1O4NtTWMiZbSxKFVg-lfBJDVypIBdVJ22?us...
Notebook for the growth with n: https://colab.research.google.com/drive/10ecRWvcMEXY0RchVlaLyPUFHJBA4KwA6?us...

Thank you in advance for your thoughts on this.

Best regards,
Junyan
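The core probe from the notebooks can be condensed into a short script (a sketch; whether the halves match bit-for-bit depends on the platform and BLAS build):

```python
import numpy as np

# Compare v . M over the full 4 x 2m matrix against the result assembled
# from the two 4 x m halves. On BLAS builds that switch kernels with the
# column count, the two can disagree in the last bits, so allclose holds
# even when bit-for-bit equality fails.
rng = np.random.default_rng(0)
m = 8
v = rng.standard_normal(4)
M = rng.standard_normal((4, 2 * m))

whole = v @ M
halves = np.concatenate([v @ M[:, :m], v @ M[:, m:]])

print("bit-identical:", np.array_equal(whole, halves))    # platform dependent
print("max |difference|:", np.abs(whole - halves).max())  # at most a few ulps
```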
Accelerated BLAS operations are accelerated precisely by taking these opportunities to rearrange the computations, not just (or primarily) by parallelism. They are very finely tuned kernels that use the fine details of the CPU/FPU to pipeline instructions (which might be SIMD) and optimize memory movement and register use. In particular, SIMD instructions might be used, on your machine, to deal with rows in batches of 4, giving the pattern that you are seeing. I, personally, am not willing to give up those optimizations for a slightly more predictable order of additions.

On Tue, Jul 25, 2023 at 8:04 AM Junyan Xu <junyanxumath@gmail.com> wrote:
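To illustrate the point with a toy model (this is not NumPy's or any BLAS's actual kernel, only a sketch of lane-wise accumulation): keeping one running partial sum per SIMD-style lane and then reducing the partials reproduces exactly the kind of regrouping described above. With 2 lanes over 4 terms the grouping is (0+2)+(1+3), the order observed in the mat-dot-vec case, and because floating-point addition is not associative, the regrouping can change the result:

```python
import numpy as np

def lane_sum(x, lanes):
    """Toy model of SIMD-style accumulation: one running partial per lane,
    then a left-to-right reduction of the lane partials. Only an
    illustration of how vector width changes the grouping of additions."""
    x = np.asarray(x, dtype=np.float64)
    assert len(x) % lanes == 0
    # Each column below is one lane; summing along axis 0 accumulates
    # every lane independently.
    partials = x.reshape(-1, lanes).sum(axis=0)
    total = 0.0
    for p in partials:
        total += p
    return total

def left_to_right_sum(x):
    """Strict sequential ((x0 + x1) + x2) + ... order, for comparison."""
    total = 0.0
    for v in x:
        total += v
    return total

# With 2 lanes over 4 terms the grouping is (x0 + x2) + (x1 + x3) -- the
# same order reported above for the m x 4 mat-dot-vec case.
x = [0.1, 0.2, 0.3, 0.4]
print(lane_sum(x, 2) == (x[0] + x[2]) + (x[1] + x[3]))  # True

# The regrouping really can change the result:
y = [1e16, 1.0, 1.0, 1.0]
print(lane_sum(y, 2))        # 1.0000000000000002e+16
print(left_to_right_sum(y))  # 1e+16
```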
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
-- Robert Kern
Participants (2): Junyan Xu, Robert Kern