Hi all, I hope this is the right place to ask my questions. Recently I ran some simple benchmarks of the numpy.conjugate routine on a 128 MB array, let's call it 'A'. It turned out that on my test machines there is a speedup of around 2.5x if, instead of simply calling A.conj(),
one loops over sub-arrays that fit into the L1 cache of my CPU, like: for i in range(0, A.shape[0], size): A[i:i+size].conj()
such that each A[i:i+size] fits into L1 cache. I posted example code and some graphs over at Stack Overflow (https://stackoverflow.com/questions/73209565/strange-behaviour-during-multip...). I quickly checked and found similar behavior for numpy.square.
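For reference, here is a minimal sketch of the comparison. It is not the code from the Stack Overflow post: the helper names conj_blocked and time_conj are made up for illustration, and the default block of 1024 complex128 elements (16 KiB of input plus 16 KiB of output) assumes a 32 KiB L1 data cache, which may not match your CPU. Note that the loop as written above discards each block's result, so it never writes a full-size output array; the sketch writes into a preallocated array so that both variants do the same work.

```python
import timeit

import numpy as np


def conj_blocked(a, block, out=None):
    """Conjugate `a` in chunks of `block` entries along axis 0."""
    if out is None:
        out = np.empty_like(a)
    for i in range(0, a.shape[0], block):
        # Slices are views, so the ufunc writes straight into `out`.
        np.conjugate(a[i:i + block], out=out[i:i + block])
    return out


def time_conj(n_elements=2**23, block=1024, repeats=5):
    """Return (whole_array_seconds, blocked_seconds), best of `repeats`.

    2**23 complex128 values occupy 128 MiB. `block` is an assumed
    value, chosen so one input chunk plus one output chunk is 32 KiB.
    """
    rng = np.random.default_rng(0)
    a = rng.standard_normal(n_elements) + 1j * rng.standard_normal(n_elements)
    out = np.empty_like(a)
    t_whole = min(timeit.repeat(
        lambda: np.conjugate(a, out=out), number=1, repeat=repeats))
    t_blocked = min(timeit.repeat(
        lambda: conj_blocked(a, block, out=out), number=1, repeat=repeats))
    return t_whole, t_blocked
```

Calling time_conj() with the defaults needs roughly 256 MiB of memory for the input and output arrays. Comparing the two returned times with and without the preallocated output may help separate cache effects from the cost of allocating a large result array.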
Now for my questions:
1. Is this known/expected behavior of NumPy?
2. Would it be possible/sensible to make simple numerical operations like numpy.conjugate and numpy.square cache-aware?