Re: [SciPy-dev] FFTW performances in scipy and numpy
Steven G. Johnson wrote:
On Wed, 1 Aug 2007, David Cournapeau wrote:
- making numpy data 16 bytes aligned. This one is a bit tricky. I won't bother you with the details, but generally, numpy data may not be even "word aligned". Since some archs require some kind of alignement, there are some mechanisms to get aligned buffers from unaligned buffers in numpy API; I was thinking about an additional flag "SIMD alignement", since this could be quite useful for many optimized libraries using SIMD. But maybe this does not make sense, I have not yet thought enough about it to propose anything concrete to the main numpy developers.
One possibility might be to allocate *all* arrays over a certain size using posix_memalign, to get 16-byte alignnment, instead of malloc or whatever you are doing now, since if the array is big enough the overhead of alignment is negligible. It depends on the importance of FFTs, of course, although there may well be other things where 16-byte alignment will make SIMD optimization easier. I don't know anything about Python/NumPy's memory management, however, so I don't know whether this suggestion is feasible. from my understanding of numpy (quite limited, I cannot emphasize this enough), the data buffer can easily be made 16 bytes aligned, since they are always allocated using a macro which is calling malloc, but could replaced by posix_memalign.
But then, there is another problem: many numpy arrays are just than one data buffer, are not always contiguous, etc... To interface with most C libraries, this means that a copy has to be made somewhat if the arrays are not contiguous or aligned; a whole infrastructure exists in numpy to do exactly this, but I do not know it really well yet. Whether replacing the current allocation with posix_memalign would enable the current infrastructure to always return 16 bytes aligned buffers is unclear to me, I should discuss this with people more knowledgeable than me.
On x86, malloc is guaranteed to be 8-byte aligned, so barring conspiracy the probability of 16-byte alignment should be 50%. On x86_64, on the other hand, I've read that malloc is guaranteed to be 16-byte aligned (for allocations > 16 bytes) by the ABI, so you should have no worries in this case as long as Python uses malloc or similar.
- I have tried FFTW_UNALIGNED + FFTW_ESTIMATE plans; unfortunately, I found that the performances were worse than using FFTW_MEASURE + copy (the copies are done into aligned buffers). I have since discover that this may be due to the totally broken architecture of my main workstation (a Pentium four): on my recent macbook (On linux, 32 bits, CoreDuo2), using no copy with FFTW_UNALIGNED is much better.
If you use FFTW_UNALIGNED, then essentially FFTW cannot use SIMD (with some exceptions). Whether a copy will be worth it for performance in this case will depend upon the problem size. For very large problems, I'm guessing it's probably not worth it, since SIMD has less of an impact there and cache has a bigger impact.
- The above problem is fixable if we add a mechanisme to choose plans (ESTIMATE vs MEASURE vs ... I found that for 1d cases at least, ESTIMATE vs MEASURE is what really count performance wise).
One thing that might help is to build in wisdom at least for a few small power-of-two sizes, created at build time. e.g. wisdom for sizes 128, 256, 512, 1024, 2048, 4096, and 8192 is < 2kB.
(We also need to work on the estimator heuristics...this has proven to be a hard problem in general, unfortunately.) This is the way matlab works, right ? If I understand correctly, wisdoms are a way to compute plans "offline". So for example, if you compute
Se below for my findings about this problem. plans with FFTW_MEASURE | FFTW_UNALIGNED, for inplace transforms and a set of sizes, you can record it in a wisdom, and reload it such as later calls with FFTW_MEASURE | FFTW_UNALIGNED will be fast ? Anyway, all this sounds like it should be solved by adding a better infrastructure the current wrappers (ala matlab).
Your claim that numpy arrays are almost never 16-byte aligned strikes me as odd; if true, it means that NumPy is doing something terribly weird in its allocation.
I just did a quick test, and on i386 with glibc (Debian GNU/Linux), malloc'ing 10000 arrays of double variables of random size, almost exactly 50% are 16-byte aligned as expected. And on x86_64, 100% are 16-byte aligned (because of the abovementioned ABI requirement). Well, then I must be doing something wrong in my tests (I attached the test program I use to test pointers returned by malloc). This always return 1 for buffers allocated with fftw_malloc, but always 0 for buffers allocated with malloc on my machine, which is quite similar to yours (ubuntu, x86, glibc). Not sure what I am missing (surely something obvious).
regards, David /* * test.c: allocate buffers of random size, and count the ones which are 16 bytes aligned */ #include <stdlib.h> #include <stdio.h> #include <string.h> #include <stddef.h> #include <math.h> #include <time.h> #ifdef FORCE_ALIGNED_MALLOC #include <fftw3.h> #define MALLOC(size) fftw_malloc((size)) #define FREE(ptr) fftw_free((ptr)) #else #define MALLOC(size) malloc((size)) #define FREE(ptr) free((ptr)) #endif /* Return 1 if the pointer x is 16 bytes aligned */ #define CHECK_SIMD_ALIGNMENT(x) \ ( ((ptrdiff_t)(x) & 0xf) == 0) const size_t MAXMEM = 1 << 26; int allocmem(size_t n); int main(void) { int i; int niter = 100000; size_t s; int acc = 0; double mean = 0, min = 1e100, max = 0; double r; /* not really random, but enough for here */ srand48(time(0)); for(i = 0; i < niter; ++i) { /* forcing the size to be in [1, MAXMEM] */ r = drand48() * MAXMEM; s = floor(r); if (s < 1) { s = 1; } else if (s > MAXMEM) { s = MAXMEM; } /* computing max and min allocated size for summary */ mean += r; if (max < s) { max = s; } if (min > s) { min = s; } /* allocating */ acc += allocmem(s); } fprintf(stdout, "ratio %d / %d; average size is %f (m %f, M %f)\n", acc, niter, mean / niter, min, max); return 0; } /* * Alloc n bytes, and return 0 if the memory allocated with this size was 16 * bytes aligned. */ int allocmem(size_t n) { char* tmp; int isal = 0; tmp = MALLOC(n); isal = CHECK_SIMD_ALIGNMENT(tmp); /* Not sure this is useful, but this should guarantee that tmp is really * allocated, and that the compiler is not getting too smart */ tmp[0] = 0; tmp[n-1] = 0; FREE(tmp); return isal; }
participants (1)
-
David Cournapeau