NumPy uses pairwise summation along the fast axis if that axis contains no more than 8192 elements. How was 8192 chosen?

Doubling the limit to 16384 would, as far as I can tell, add a lot more function call overhead because of the recursion. Is the limit there for speed, for memory, or for something else entirely?
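
To make the recursion I am asking about concrete, here is a minimal pure-Python sketch of pairwise summation. It is not NumPy's actual C implementation: the names `pairwise_sum` and `count_calls`, the `stats` dictionary, and the base-case size `block=128` are all assumptions chosen for illustration.

```python
def pairwise_sum(values, block=128, stats=None):
    """Sum a sequence by recursive halving.

    `block` is a hypothetical base-case size: at or below it, the
    values are added in a plain loop instead of being split again.
    `stats`, if given, is a dict used to count function calls.
    """
    if stats is not None:
        stats["calls"] = stats.get("calls", 0) + 1

    n = len(values)
    if n <= block:
        # Base case: straightforward left-to-right accumulation.
        total = 0.0
        for v in values:
            total += v
        return total

    # Recursive case: sum each half separately, then combine.
    half = n // 2
    left = pairwise_sum(values[:half], block, stats)
    right = pairwise_sum(values[half:], block, stats)
    return left + right


def count_calls(n, block=128):
    """Return how many calls pairwise_sum makes for n elements."""
    stats = {}
    pairwise_sum([1.0] * n, block, stats)
    return stats["calls"]


if __name__ == "__main__":
    for n in (8192, 16384):
        print(n, count_calls(n))
```

Counting calls this way gives a direct comparison of the recursion cost for 8192 and 16384 elements under the assumed base-case size.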