On Thursday 18 March 2010 16:26:09, Anne Archibald wrote:
Speak for your own CPUs :).
But seriously, congratulations on the wide publication of the article; it's an important issue we often don't think enough about. I'm just a little snarky because this exact issue came up for us recently - a visiting astro speaker put it as "flops are free" - and so I did some tests and found that even without optimizing for memory access, our tasks are already CPU-bound: http://lighthouseinthesky.blogspot.com/2010/03/flops.html
Well, I thought that my introduction was enough to convince anybody about the problem, but I forgot that you scientists always try to demonstrate things experimentally :-/ Seriously, your example is a clear illustration of what I'm recommending in the article, i.e. always try to use libraries that already leverage the blocking technique (that is, that take advantage of both temporal and spatial locality). I don't know about FFTW (never used it, sorry), but after having a look at its home page, I'm pretty convinced that its authors are very conscious of these techniques. That said, it seems that, in addition, you are applying the blocking technique yourself: get the data in bunches (256 floating-point elements, which fit perfectly well in modern L1 caches), apply your computation (in this case, FFTW) and put the result back into memory. A perfect example of what I wanted to show the readers, so congratulations! You made it without needing to read my article (so perhaps the article was not so necessary after all :-)
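(For readers following along, here is a minimal sketch of the blocking idea being described: walk a large array in chunks small enough to stay in L1 cache while the computation runs. The 256-element block size comes from Anne's description; the `np.fft.fft` call and the function name are just placeholders standing in for her actual FFTW-based computation.)

```python
import numpy as np

def process_blocked(data, block_size=256):
    """Apply a transform to `data` in cache-sized bunches.

    Each bunch of `block_size` elements is small enough to fit in a
    modern L1 data cache, so the computation works on data that is
    already resident (temporal + spatial locality).
    """
    n = len(data) // block_size * block_size
    out = np.empty(n, dtype=complex)
    for start in range(0, n, block_size):
        block = data[start:start + block_size]              # fetch one bunch
        out[start:start + block_size] = np.fft.fft(block)   # compute on it in cache
    return out

# Example usage on one million samples
x = np.random.rand(1_000_000)
y = process_blocked(x)
```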
In terms of specifics, I was a little surprised you didn't mention FFTW among your software tools that optimize memory access. FFTW's planning scheme seems ideal for ensuring memory locality, as much as possible, during large FFTs. (And in fact I also found that for really large FFTs, reducing padding - memory size - at the cost of a non-power-of-two size was also worth it.)
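(A small illustration of the planning idea for readers who have not used FFTW from Python: the sketch below uses the pyfftw wrapper, which is not mentioned in the thread and is only an assumption here. The FFTW_MEASURE flag asks the planner to time candidate strategies for this exact size and alignment, which is where FFTW's memory-locality tuning happens; the non-power-of-two size echoes Anne's padding remark.)

```python
import numpy as np
import pyfftw  # assumed wrapper around FFTW; not part of the original discussion

n = 3 * 2**18  # a non-power-of-two size, as in Anne's comment
a = pyfftw.empty_aligned(n, dtype='complex128')  # aligned input buffer
b = pyfftw.empty_aligned(n, dtype='complex128')  # aligned output buffer

# Planning: FFTW measures several execution strategies for this problem
# and keeps the fastest one, reusable for every later transform of this shape.
plan = pyfftw.FFTW(a, b, flags=('FFTW_MEASURE',))

a[:] = np.random.rand(n) + 0j
plan()  # execute the planned transform; the result lands in b
```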
I must say that I'm quite naïve about many of the great existing tools for scientific computing. What I know is that when I need to do something, I always look for good existing tools first. So this is basically why I spoke about numexpr and BLAS/LAPACK: I know them well.
Heh. Indeed numexpr is a good tool for this sort of thing; it's an unfortunate fact that simple use of numpy tends to do operations in the pessimal order...
Well, to honor the truth, NumPy does not have control over the order of the operations in expressions or over how temporaries are managed: it is Python that decides that. NumPy can only do what Python asks it to do, and do it as well as possible. And NumPy plays its role reasonably well here, but of course this is not enough to provide good performance. In fact, this problem probably affects all interpreted languages out there, unless they implement a JIT compiler optimised for evaluating expressions -- and this is basically what numexpr is. Anyway, thanks for the constructive criticism, I really appreciate it! -- Francesc Alted
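(To make the temporaries point concrete, here is a minimal sketch; the array names and sizes are made up for illustration. With plain NumPy, Python evaluates the expression one operator at a time, so each intermediate result is materialised as a full-size temporary that travels through main memory, while numexpr compiles the whole expression and evaluates it block by block, keeping the working set in cache.)

```python
import numpy as np
import numexpr as ne

n = 10_000_000
a = np.random.rand(n)
b = np.random.rand(n)
c = np.random.rand(n)

# Plain NumPy: 2*a, 2*a + b and (2*a + b)*c each allocate a full temporary array.
r1 = (2*a + b) * c

# numexpr: the expression is compiled once and evaluated in small blocks,
# so intermediates stay in cache instead of round-tripping through RAM.
r2 = ne.evaluate("(2*a + b) * c")

assert np.allclose(r1, r2)
```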