On Thursday 18 March 2010 16:26:09, Anne Archibald wrote:
Speak for your own CPUs :).
But seriously, congratulations on the wide publication of the article; it's an important issue we often don't think enough about. I'm just a little snarky because this exact issue came up for us recently - a visiting astro speaker put it as "flops are free" - and so I did some tests and found that even without optimizing for memory access, our tasks are already CPU-bound: http://lighthouseinthesky.blogspot.com/2010/03/flops.html
Well, I thought that my introduction was enough to convince anybody about the problem, but I forgot that you scientists always try to demonstrate things experimentally :-/ Seriously, your example is a clear illustration of what I'm recommending in the article, i.e. always try to use libraries that already leverage the blocking technique (that is, that take advantage of both temporal and spatial locality). I don't know about FFTW (never used it, sorry), but after having a look at its home page, I'm pretty convinced that its authors are very conscious of these techniques. That said, it seems that, in addition, you are applying the blocking technique yourself: get the data in bunches (256 floating-point elements, which fit perfectly well in modern L1 caches), apply your computation (in this case, FFTW) and put the result back into memory. A perfect example of what I wanted to show the readers, so congratulations! You made it without needing to read my article (so perhaps the article was not so necessary after all :-)
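(For readers following along, here is a minimal sketch of the blocking idea being described: walk a large array in chunks small enough to stay in L1 cache while the computation runs. The 256-element block size comes from Anne's description; the `np.fft.fft` call and the function name are just placeholders standing in for her actual FFTW-based computation.)

```python
import numpy as np

def process_blocked(data, block_size=256):
    """Apply a transform to `data` in cache-sized bunches.

    Each bunch of `block_size` elements is small enough to fit in a
    modern L1 data cache, so the computation works on data that is
    already resident (temporal + spatial locality).
    """
    n = len(data) // block_size * block_size
    out = np.empty(n, dtype=complex)
    for start in range(0, n, block_size):
        block = data[start:start + block_size]              # fetch one bunch
        out[start:start + block_size] = np.fft.fft(block)   # compute on it in cache
    return out

# Example usage on one million samples
x = np.random.rand(1_000_000)
y = process_blocked(x)
```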
In terms of specifics, I was a little surprised you didn't mention FFTW among your software tools that optimize memory access. FFTW's planning scheme seems ideal for ensuring memory locality, as much as possible, during large FFTs. (And in fact I also found that for really large FFTs, reducing padding - memory size - at the cost of a non-power-of-two size was also worth it.)
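(A small illustration of the planning idea for readers who have not used FFTW from Python: the sketch below uses the pyfftw wrapper, which is not mentioned in the thread and is only an assumption here. The FFTW_MEASURE flag asks the planner to time candidate strategies for this exact size and alignment, which is where FFTW's memory-locality tuning happens; the non-power-of-two size echoes Anne's padding remark.)

```python
import numpy as np
import pyfftw  # assumed wrapper around FFTW; not part of the original discussion

n = 3 * 2**18  # a non-power-of-two size, as in Anne's comment
a = pyfftw.empty_aligned(n, dtype='complex128')  # aligned input buffer
b = pyfftw.empty_aligned(n, dtype='complex128')  # aligned output buffer

# Planning: FFTW measures several execution strategies for this problem
# and keeps the fastest one, reusable for every later transform of this shape.
plan = pyfftw.FFTW(a, b, flags=('FFTW_MEASURE',))

a[:] = np.random.rand(n) + 0j
plan()  # execute the planned transform; the result lands in b
```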
I must say that I'm quite naïve about many of the great existing tools for scientific computing. What I know is that when I need to do something, I always look for good existing tools first. So this is basically why I spoke about numexpr and BLAS/LAPACK: I know them well.
Heh. Indeed numexpr is a good tool for this sort of thing; it's an unfortunate fact that simple use of numpy tends to do operations in the pessimal order...
Well, to honor the truth, NumPy does not have control over the order of the operations in expressions or over how temporaries are managed: it is Python that decides that. NumPy can only do what Python asks it to do, and do it as well as possible. And NumPy plays its role reasonably well here, but of course this is not enough to provide good performance. In fact, this problem probably affects all interpreted languages out there, unless they implement a JIT compiler optimised for evaluating expressions -- and this is basically what numexpr is. Anyway, thanks for the constructive criticism, I really appreciate it! -- Francesc Alted
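(To make the temporaries point concrete, here is a minimal sketch; the array names and sizes are made up for illustration. With plain NumPy, Python evaluates the expression one operator at a time, so each intermediate result is materialised as a full-size temporary that travels through main memory, while numexpr compiles the whole expression and evaluates it block by block, keeping the working set in cache.)

```python
import numpy as np
import numexpr as ne

n = 10_000_000
a = np.random.rand(n)
b = np.random.rand(n)
c = np.random.rand(n)

# Plain NumPy: 2*a, 2*a + b and (2*a + b)*c each allocate a full temporary array.
r1 = (2*a + b) * c

# numexpr: the expression is compiled once and evaluated in small blocks,
# so intermediates stay in cache instead of round-tripping through RAM.
r2 = ne.evaluate("(2*a + b) * c")

assert np.allclose(r1, r2)
```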