Hello there, I'm here again.
With the help of the previous emails, I finally found out the real C implementation of np.sum(x), where x is a Numpy array.
I found that there is a "template" function named @TYPE@_pairwise_sum in numpy/core/src/umath/loops_utils.h.src which is the underlying C code of np.sum(). I can understand the code but there is still something that I didn't figure out.
Here is the code link: https://github.com/numpy/numpy/blob/main/numpy/core/src/umath/loops_utils.h....
The control logic is quite simple, if the size of the array which is going to be summed up is smaller than 8, then we directly add them up; If the size is greater than 8 but smaller than PW_BLOCKSIZE, then we sum a block with 8 accumulators which will utilize the speed-up of AVX. Then if the size is greater than PW_BLOCKSIZE, we divide the array into two halves and recursively call function @TYPE@_pairwise_sum to calculate the two halves respectively.
My confusion lies here: why should we divide the array into two halves and recursively calculate this? Since we won't assign two threads to compute the result in parallel, we can't get any profits from "divide and conquer". Actually we waste some time and memory in calling functions and operations in heap.
There must be something that I didn't take into consideration. It will be appreciated if you can point it out! Thanks in advance~