scipy.fft module slow for complex inputs when linked to fftw3

Hi there,

I noticed recently that the fft module of scipy is much slower (5-10 fold) than numpy for complex inputs (only in the 1d case) when linked to fftw3. This problem is reported in scipy ticket #1: http://projects.scipy.org/scipy/scipy/ticket/1

I am not sure, because the code is a bit difficult to read, but it looks like in the case of complex input + fftw3, the plan is recomputed for every call to zfft (file: zfft.c), whereas in the real case, or in the complex case + fftw2, the function drfft (file: drfft.c), called from zrfft (file: zrfft.c), uses a plan which is cached. I am trying to see how the caching is done, but I am not sure I will have the time to make it work for fftw3.

David

P.S.: I am wondering if there is a reason why the code is written with all those #ifdef, because it makes hacking on the module quite difficult. Why not implement each function once per fft library, and wrap them in the header files? Is it just a time constraint, or is there another reason?

On Fri, Aug 18, 2006 at 05:32:14PM +0900, David Cournapeau wrote:
Hi there,
I noticed recently that the fft module of scipy is much slower (5-10 fold) than numpy for complex inputs (only in the 1d case) when linked to fftw3. This problem is reported in scipy ticket #1: http://projects.scipy.org/scipy/scipy/ticket/1
I am not sure, because the code is a bit difficult to read, but it looks like in the case of complex input + fftw3, the plan is recomputed for every call to zfft (file: zfft.c), whereas in the real case, or in the complex case + fftw2, the function drfft (file: drfft.c), called from zrfft (file: zrfft.c), uses a plan which is cached. I am trying to see how the caching is done, but I am not sure I will have the time to make it work for fftw3.
Well, for fftw3 it uses FFTW_ESTIMATE for the plan. So it does a cheap estimate of what it needs. I tried changing it to FFTW_MEASURE, but it went slower. I'd have to dig into fftw3's source to see how it does the caching of wisdom. Basically, after playing around, I still have no idea why it's slow.
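For reference, a minimal sketch (not scipy code; the wisdom file name is made up) of how fftw3's wisdom can be exported and re-imported, so that an expensive FFTW_MEASURE plan only needs to be computed once across runs. Within a single process fftw3 accumulates wisdom on its own; the file round-trip only matters between runs.

#include <stdio.h>
#include <fftw3.h>

/* Sketch: load any previously saved wisdom, plan with FFTW_MEASURE
 * (fast if matching wisdom exists, slow the first time), then save
 * the accumulated wisdom for later runs. */
static fftw_plan plan_with_wisdom(int n, fftw_complex *in, fftw_complex *out)
{
    fftw_plan p;
    FILE *f = fopen("fftw.wisdom", "r");   /* file name is arbitrary */
    if (f) {
        fftw_import_wisdom_from_file(f);
        fclose(f);
    }
    p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);
    f = fopen("fftw.wisdom", "w");
    if (f) {
        fftw_export_wisdom_to_file(f);
        fclose(f);
    }
    return p;
}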
P.S.: I am wondering if there is a reason why the code is written with all those #ifdef, because it makes hacking on the module quite difficult. Why not implement each function once per fft library, and wrap them in the header files? Is it just a time constraint, or is there another reason?
I was looking at that code and was wondering the same thing :-) I'm thinking of writing separate wrappers for each library, so you can get at each one specifically (scipy.fftpack.impl.fftw3, for instance). This would be a big win, I think, for fftw3, where it would then be easier to handle the wisdom.

--
David M. Cooke
http://arbutus.physics.mcmaster.ca/dmc/
cookedm@physics.mcmaster.ca

David M. Cooke wrote:
On Fri, Aug 18, 2006 at 05:32:14PM +0900, David Cournapeau wrote:
Hi there,
I noticed recently that the fft module of scipy is much slower (5-10 fold) than numpy for complex inputs (only in the 1d case) when linked to fftw3. This problem is reported in scipy ticket #1: http://projects.scipy.org/scipy/scipy/ticket/1
I am not sure, because the code is a bit difficult to read, but it looks like in the case of complex input + fftw3, the plan is recomputed for every call to zfft (file: zfft.c), whereas in the real case, or in the complex case + fftw2, the function drfft (file: drfft.c), called from zrfft (file: zrfft.c), uses a plan which is cached. I am trying to see how the caching is done, but I am not sure I will have the time to make it work for fftw3.
Well, for fftw3 it uses FFTW_ESTIMATE for the plan. So it does a cheap estimate of what it needs.

Well, it depends what you mean by cheap. Compared to FFTW_MEASURE, yes. But compared to pre-computing the plan once and then doing multiple FFTs, it is not cheap: executing the FFT is negligible compared to computing the plan!
I paste a C file below which shows the difference, and which (in the function test_noncached) emulates the scipy.fft module, as I understand it: it times the transform of a complex array of 1024 elements, iterated 1000 times. It prints the number of cycles over all iterations, then the average per fft, then the minimum over all iterations. The only difference between the cached and non-cached cases is that the plan is recomputed for each iteration in the non-cached case (as scipy.fft does now when linked with fftw3).

Output:

testing cached
	 cycle is 69973016, 69973 per execution, min is 66416
testing non cached
	 cycle is 946186540, 946186 per execution, min is 918036

Code (cycle.h is available at http://www.fftw.org/cycle.h; it is the cycle counter fftw3 itself uses when timing candidate plans):

#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>

#include <fftw3.h>
#include "cycle.h"

#define MALLOC(size) fftw_malloc((size))
#define FREE(ptr) fftw_free((ptr))

typedef struct {
    double total;
    double min;
} result;

result test_cached(size_t size, size_t niter);
result test_noncached(size_t size, size_t niter);

int main(void)
{
    size_t niter = 1e3;
    size_t size = 1024;
    result res;

    printf("testing cached\n");
    res = test_cached(size, niter);
    printf("\t cycle is %lu, %lu per execution, min is %lu\n",
           (unsigned long)res.total, (unsigned long)(res.total / niter),
           (unsigned long)res.min);

    printf("testing non cached\n");
    res = test_noncached(size, niter);
    printf("\t cycle is %lu, %lu per execution, min is %lu\n",
           (unsigned long)res.total, (unsigned long)(res.total / niter),
           (unsigned long)res.min);

    fftw_cleanup();
    return 0;
}

/* plan computed once, outside the timed region */
result test_cached(size_t size, size_t niter)
{
    fftw_complex *in, *out;
    fftw_plan plan;
    ticks t0, t1;
    double res, min, acc;
    size_t i, j;
    result r;

    in = MALLOC(sizeof(*in) * size);
    out = MALLOC(sizeof(*out) * size);
    /* note: sign +1 is FFTW_BACKWARD; either direction is fine for timing */
    plan = fftw_plan_dft_1d((int)size, in, out, 1, FFTW_ESTIMATE);

    acc = 0;
    min = 1e100;
    for (i = 0; i < niter; ++i) {
        for (j = 0; j < size; ++j) {
            in[j][0] = (0.5 - (double)rand() / RAND_MAX);
            in[j][1] = 0.1 * j;
        }
        t0 = getticks();
        fftw_execute(plan);
        t1 = getticks();
        res = elapsed(t1, t0);
        if (res < min) {
            min = res;
        }
        acc += res;
    }
    r.total = acc;
    r.min = min;

    fftw_destroy_plan(plan);
    FREE(in);
    FREE(out);
    return r;
}

/* plan recomputed inside the timed region, as scipy.fft does with fftw3 */
result test_noncached(size_t size, size_t niter)
{
    fftw_complex *in, *out;
    fftw_plan plan;
    ticks t0, t1;
    double res, min, acc;
    size_t i, j;
    result r;

    in = MALLOC(sizeof(*in) * size);
    out = MALLOC(sizeof(*out) * size);

    acc = 0;
    min = 1e100;
    for (i = 0; i < niter; ++i) {
        for (j = 0; j < size; ++j) {
            in[j][0] = (0.5 - (double)rand() / RAND_MAX);
            in[j][1] = 0.1 * j;
        }
        t0 = getticks();
        plan = fftw_plan_dft_1d((int)size, in, out, 1, FFTW_ESTIMATE);
        fftw_execute(plan);
        fftw_destroy_plan(plan);
        t1 = getticks();
        res = elapsed(t1, t0);
        if (res < min) {
            min = res;
        }
        acc += res;
    }
    r.total = acc;
    r.min = min;

    FREE(in);
    FREE(out);
    return r;
}

Compile with:

gcc -c test.c -o test.o -W -Wall
gcc -lfftw3 -lm -o test test.o

On Fri, Aug 18, 2006 at 07:01:17PM +0900, David Cournapeau wrote:
David M. Cooke wrote:
On Fri, Aug 18, 2006 at 05:32:14PM +0900, David Cournapeau wrote:
Hi there,
I noticed recently that the fft module of scipy is much slower (5-10 fold) than numpy for complex inputs (only in the 1d case) when linked to fftw3. This problem is reported in scipy ticket #1: http://projects.scipy.org/scipy/scipy/ticket/1
I am not sure, because the code is a bit difficult to read, but it looks like in the case of complex input + fftw3, the plan is recomputed for every call to zfft (file: zfft.c), whereas in the real case, or in the complex case + fftw2, the function drfft (file: drfft.c), called from zrfft (file: zrfft.c), uses a plan which is cached. I am trying to see how the caching is done, but I am not sure I will have the time to make it work for fftw3.
Well, for fftw3 it uses FFTW_ESTIMATE for the plan. So it does a cheap estimate of what it needs.

Well, it depends what you mean by cheap. Compared to FFTW_MEASURE, yes. But compared to pre-computing the plan once and then doing multiple FFTs, it is not cheap: executing the FFT is negligible compared to computing the plan!
The only difference between the cached and non-cached cases is that the plan is recomputed for each iteration in the non-cached case (as scipy.fft does now when linked with fftw3):
The problem is that the plan depends on the input arrays! Caching it won't help with Python, unless you can guarantee that the same arrays are passed to successive calls. Getting around that will mean digging into the guru interface, I think (ugh). I'll have a clearer idea of what we can and can not do once I dig into fftw3.

--
David M. Cooke
http://arbutus.physics.mcmaster.ca/dmc/
cookedm@physics.mcmaster.ca
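For reference, fftw3 does offer a middle ground here: the new-array execute functions can apply an existing plan to arrays other than the ones it was created with, provided the size, layout, and alignment match. A minimal sketch (not scipy code), with that caveat spelled out:

#include <fftw3.h>

/* Sketch: apply a cached plan to a different array. 'cached' must have
 * been created for the same length, the same in-place/out-of-place
 * arrangement, and compatible alignment; otherwise the behaviour is
 * undefined. That alignment requirement is exactly what makes this
 * fragile, as discussed later in the thread. */
void fft_with_cached_plan(fftw_plan cached, fftw_complex *in, fftw_complex *out)
{
    fftw_execute_dft(cached, in, out);
}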

David M. Cooke wrote:
The problem is that the plan depends on the input arrays! Caching it won't help with Python, unless you can guarantee that the same arrays are passed to successive calls.
I know that, but it really makes no sense to use fftw3 the way it is used now... For moderate sizes, it is more than 10 times slower!
Getting around that will mean digging into the guru interface, I think (ugh).
I tried a dirty hack using the function fftw_execute_dft, which executes a given plan on different arrays, provided they have the same properties. The problem is that, because of the obfuscated way fftpack is coded right now, it is difficult to track what is going on; I have a small test program which calls zfft directly from the module _fftpack.so, and running it under valgrind shows no problem... So there is something going on in fftpack that I don't understand.

The other obvious approach is to copy the content of the array into a cached buffer, compute the fft on it, and copy the result back. This is stupid, but it is better than the current situation, I think. I implemented this solution; the speed is much better, and the tests succeed. Do you know how I am supposed to build a patch (I am not familiar with patch...)?

David
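A minimal sketch of that buffer-copy idea, with hypothetical names (the actual patch against zfft.c follows later in the thread):

#include <string.h>
#include <fftw3.h>

/* Hypothetical cache entry: an fftw-allocated (hence SIMD-aligned)
 * buffer plus an in-place plan built on it, reused for every
 * transform of length n. */
typedef struct {
    int n;
    fftw_complex *buf;   /* allocated with fftw_malloc */
    fftw_plan plan;      /* in-place plan created on buf */
} cache_entry;

/* Copy in, transform in place on the cached buffer, copy back out. */
static void cached_fft(cache_entry *e, fftw_complex *data)
{
    memcpy(e->buf, data, sizeof(fftw_complex) * e->n);
    fftw_execute(e->plan);
    memcpy(data, e->buf, sizeof(fftw_complex) * e->n);
}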

I tried a dirty hack using the function fftw_execute_dft, which executes a given plan on different arrays, provided they have the same properties. The problem is that, because of the obfuscated way fftpack is coded right now, it is difficult to track what is going on; I have a small test program which calls zfft directly from the module _fftpack.so, and running it under valgrind shows no problem... So there is something going on in fftpack that I don't understand.
The other obvious approach is to copy the content of the array into a cached buffer, compute the fft on it, and copy the result back. This is stupid, but it is better than the current situation, I think. I implemented this solution; the speed is much better, and the tests succeed. Do you know how I am supposed to build a patch (I am not familiar with patch...)?

I built a patch anyway; I am not sure the format is OK, as I am not familiar with
patch/diff options.

Before patch (latest numpy and scipy SVN), scipy.fftpack.test(10) gives:

 Fast Fourier Transform
 =================================================
       |    real input    |   complex input
 -------------------------------------------------
  size |  scipy |  numpy  |  scipy |  numpy
 -------------------------------------------------
   100 |  0.17  |  0.14   |  1.93  |  0.11   (secs for 7000 calls)
  1000 |  0.12  |  0.16   |  1.12  |  0.14   (secs for 2000 calls)
   256 |  0.28  |  0.28   |  3.09  |  0.22   (secs for 10000 calls)
   512 |  0.37  |  0.48   |  3.82  |  0.38   (secs for 10000 calls)
  1024 |  0.05  |  0.09   |  0.53  |  0.07   (secs for 1000 calls)
  2048 |  0.10  |  0.16   |  0.88  |  0.15   (secs for 1000 calls)
  4096 |  0.08  |  0.17   |  0.75  |  0.17   (secs for 500 calls)
  8192 |  0.20  |  0.53   |  1.39  |  0.47   (secs for 500 calls)

 Inverse Fast Fourier Transform
 ===============================================
       |    real input    |   complex input
 -----------------------------------------------
  size |  scipy |  numpy  |  scipy |  numpy
 -----------------------------------------------
   100 |  0.16  |  0.29   |  2.29  |  0.28   (secs for 7000 calls)
  1000 |  0.13  |  0.27   |  1.22  |  0.26   (secs for 2000 calls)
   256 |  0.29  |  0.57   |  3.43  |  0.51   (secs for 10000 calls)
   512 |  0.41  |  0.84   |  4.14  |  0.77   (secs for 10000 calls)
  1024 |  0.06  |  0.14   |  0.62  |  0.13   (secs for 1000 calls)
  2048 |  0.10  |  0.26   |  0.91  |  0.25   (secs for 1000 calls)
  4096 |  0.11  |  0.31   |  0.81  |  0.26   (secs for 500 calls)
  8192 |  0.23  |  0.78   |  1.53  |  0.72   (secs for 500 calls)

After patch (latest numpy and scipy SVN), scipy.fftpack.test(10) gives:

 Fast Fourier Transform
 =================================================
       |    real input    |   complex input
 -------------------------------------------------
  size |  scipy |  numpy  |  scipy |  numpy
 -------------------------------------------------
   100 |  0.17  |  0.14   |  0.12  |  0.11   (secs for 7000 calls)
  1000 |  0.12  |  0.16   |  0.10  |  0.14   (secs for 2000 calls)
   256 |  0.28  |  0.28   |  0.20  |  0.20   (secs for 10000 calls)
   512 |  0.36  |  0.47   |  0.27  |  0.36   (secs for 10000 calls)
  1024 |  0.05  |  0.08   |  0.05  |  0.07   (secs for 1000 calls)
  2048 |  0.09  |  0.16   |  0.08  |  0.14   (secs for 1000 calls)
  4096 |  0.10  |  0.17   |  0.10  |  0.16   (secs for 500 calls)
  8192 |  0.23  |  0.53   |  0.23  |  0.47   (secs for 500 calls)

 Inverse Fast Fourier Transform
 ===============================================
       |    real input    |   complex input
 -----------------------------------------------
  size |  scipy |  numpy  |  scipy |  numpy
 -----------------------------------------------
   100 |  0.17  |  0.29   |  0.14  |  0.26   (secs for 7000 calls)
  1000 |  0.13  |  0.27   |  0.15  |  0.25   (secs for 2000 calls)
   256 |  0.29  |  0.54   |  0.27  |  0.50   (secs for 10000 calls)
   512 |  0.38  |  0.81   |  0.43  |  0.73   (secs for 10000 calls)
  1024 |  0.06  |  0.14   |  0.08  |  0.13   (secs for 1000 calls)
  2048 |  0.09  |  0.24   |  0.13  |  0.23   (secs for 1000 calls)
  4096 |  0.09  |  0.24   |  0.13  |  0.23   (secs for 500 calls)
  8192 |  0.22  |  0.75   |  0.37  |  0.73   (secs for 500 calls)

This makes things much faster, particularly for small sizes (which is logical, considering the main cause of slowness is the building of plans).
http://projects.scipy.org/scipy/scipy/attachment/ticket/1/fftw3slow.patch

Pasted below:

--- scipy/Lib/fftpack/src/zfft.c	2006-08-18 20:45:50.000000000 +0900
+++ scipy-new/Lib/fftpack/src/zfft.c	2006-08-18 21:00:26.000000000 +0900
@@ -35,9 +35,21 @@
 #endif

 #if defined WITH_FFTW3
-/*
- *don't cache anything
- */
+GEN_CACHE(zfftw,(int n,int d)
+	  ,int direction;
+	   fftw_plan plan;
+	   fftw_complex* ptr;
+	  ,((caches_zfftw[i].n==n) &&
+	    (caches_zfftw[i].direction==d))
+	  ,caches_zfftw[id].direction = d;
+	   caches_zfftw[id].ptr = fftw_malloc(sizeof(fftw_complex)*(n));
+	   caches_zfftw[id].plan = fftw_plan_dft_1d(n, caches_zfftw[id].ptr,
+	                                            caches_zfftw[id].ptr,
+	                                            (d>0?FFTW_FORWARD:FFTW_BACKWARD),
+	                                            FFTW_ESTIMATE);
+	  ,fftw_destroy_plan(caches_zfftw[id].plan);
+	   fftw_free(caches_zfftw[id].ptr);
+	  ,10)
 #elif defined WITH_FFTW
 /**************** FFTW2 *****************************/
 GEN_CACHE(zfftw,(int n,int d)
@@ -73,6 +85,7 @@
   destroy_zdjbfft_caches();
 #endif
 #ifdef WITH_FFTW3
+  destroy_zfftw_caches();
 #elif defined WITH_FFTW
   destroy_zfftw_caches();
 #else
@@ -118,6 +131,7 @@
   if (f==0)
 #endif
 #ifdef WITH_FFTW3
+    plan = caches_zfftw[get_cache_id_zfftw(n,direction)].plan;
 #elif defined WITH_FFTW
     plan = caches_zfftw[get_cache_id_zfftw(n,direction)].plan;
 #else
@@ -147,11 +161,10 @@
     }
     else
 #endif
 #ifdef WITH_FFTW3
-      plan = fftw_plan_dft_1d(n, (fftw_complex*)ptr, (fftw_complex*)ptr,
-                              (direction>0?FFTW_FORWARD:FFTW_BACKWARD),
-                              FFTW_ESTIMATE);
-      fftw_execute(plan);
-      fftw_destroy_plan(plan);
+      ptrm = caches_zfftw[get_cache_id_zfftw(n,direction)].ptr;
+      memcpy(ptrm, ptr, sizeof(double)*2*n);
+      fftw_execute(plan);
+      memcpy(ptr, ptrm, sizeof(double)*2*n);
 #elif defined WITH_FFTW
       fftw_one(plan,(fftw_complex*)ptr,NULL);
 #else
@@ -181,11 +194,10 @@
     }
     else
 #endif
 #ifdef WITH_FFTW3
-      plan = fftw_plan_dft_1d(n, (fftw_complex*)ptr, (fftw_complex*)ptr,
-                              (direction>0?FFTW_FORWARD:FFTW_BACKWARD),
-                              FFTW_ESTIMATE);
-      fftw_execute(plan);
-      fftw_destroy_plan(plan);
+      ptrm = caches_zfftw[get_cache_id_zfftw(n,direction)].ptr;
+      memcpy(ptrm, ptr, sizeof(double)*2*n);
+      fftw_execute(plan);
+      memcpy(ptr, ptrm, sizeof(double)*2*n);
 #elif defined WITH_FFTW
       fftw_one(plan,(fftw_complex*)ptr,NULL);
 #else

This is really a quick/dirty hack, as I don't really know the mechanism. For example, I am not sure there is no memory leak (I did test the new zfft function under valgrind, though, with no leak reported). There should also be a way to avoid the two full copies, which I think account for a significant proportion of the computing time, but that would require a rewrite of the module, I guess.

David

On Aug 19, 2006, at 05:40, David Cournapeau wrote:
I built a patch anyway; I am not sure the format is OK, as I am not familiar with patch/diff options:
Before patch (latest numpy and scipy SVN), scipy.fftpack.test(10) gives:
 Fast Fourier Transform
 =================================================
       |    real input    |   complex input
 -------------------------------------------------
  size |  scipy |  numpy  |  scipy |  numpy
 -------------------------------------------------
   100 |  0.17  |  0.14   |  1.93  |  0.11   (secs for 7000 calls)
  1000 |  0.12  |  0.16   |  1.12  |  0.14   (secs for 2000 calls)
   256 |  0.28  |  0.28   |  3.09  |  0.22   (secs for 10000 calls)
   512 |  0.37  |  0.48   |  3.82  |  0.38   (secs for 10000 calls)
  1024 |  0.05  |  0.09   |  0.53  |  0.07   (secs for 1000 calls)
  2048 |  0.10  |  0.16   |  0.88  |  0.15   (secs for 1000 calls)
  4096 |  0.08  |  0.17   |  0.75  |  0.17   (secs for 500 calls)
  8192 |  0.20  |  0.53   |  1.39  |  0.47   (secs for 500 calls)

 Inverse Fast Fourier Transform
 ===============================================
       |    real input    |   complex input
 -----------------------------------------------
  size |  scipy |  numpy  |  scipy |  numpy
 -----------------------------------------------
   100 |  0.16  |  0.29   |  2.29  |  0.28   (secs for 7000 calls)
  1000 |  0.13  |  0.27   |  1.22  |  0.26   (secs for 2000 calls)
   256 |  0.29  |  0.57   |  3.43  |  0.51   (secs for 10000 calls)
   512 |  0.41  |  0.84   |  4.14  |  0.77   (secs for 10000 calls)
  1024 |  0.06  |  0.14   |  0.62  |  0.13   (secs for 1000 calls)
  2048 |  0.10  |  0.26   |  0.91  |  0.25   (secs for 1000 calls)
  4096 |  0.11  |  0.31   |  0.81  |  0.26   (secs for 500 calls)
  8192 |  0.23  |  0.78   |  1.53  |  0.72   (secs for 500 calls)
After patch (latest numpy and scipy SVN), scipy.fftpack.test(10) gives:
 Fast Fourier Transform
 =================================================
       |    real input    |   complex input
 -------------------------------------------------
  size |  scipy |  numpy  |  scipy |  numpy
 -------------------------------------------------
   100 |  0.17  |  0.14   |  0.12  |  0.11   (secs for 7000 calls)
  1000 |  0.12  |  0.16   |  0.10  |  0.14   (secs for 2000 calls)
   256 |  0.28  |  0.28   |  0.20  |  0.20   (secs for 10000 calls)
   512 |  0.36  |  0.47   |  0.27  |  0.36   (secs for 10000 calls)
  1024 |  0.05  |  0.08   |  0.05  |  0.07   (secs for 1000 calls)
  2048 |  0.09  |  0.16   |  0.08  |  0.14   (secs for 1000 calls)
  4096 |  0.10  |  0.17   |  0.10  |  0.16   (secs for 500 calls)
  8192 |  0.23  |  0.53   |  0.23  |  0.47   (secs for 500 calls)

 Inverse Fast Fourier Transform
 ===============================================
       |    real input    |   complex input
 -----------------------------------------------
  size |  scipy |  numpy  |  scipy |  numpy
 -----------------------------------------------
   100 |  0.17  |  0.29   |  0.14  |  0.26   (secs for 7000 calls)
  1000 |  0.13  |  0.27   |  0.15  |  0.25   (secs for 2000 calls)
   256 |  0.29  |  0.54   |  0.27  |  0.50   (secs for 10000 calls)
   512 |  0.38  |  0.81   |  0.43  |  0.73   (secs for 10000 calls)
  1024 |  0.06  |  0.14   |  0.08  |  0.13   (secs for 1000 calls)
  2048 |  0.09  |  0.24   |  0.13  |  0.23   (secs for 1000 calls)
  4096 |  0.09  |  0.24   |  0.13  |  0.23   (secs for 500 calls)
  8192 |  0.22  |  0.75   |  0.37  |  0.73   (secs for 500 calls)
This makes things much faster, particularly for small sizes (which is logical considering the main cause of slowness is the building of plans).
http://projects.scipy.org/scipy/scipy/attachment/ticket/1/fftw3slow.patch
The patch format is fine, btw. Although, it looks like you used diff; it's much easier to just do 'svn diff' (add the file or directory names to the end if you're only interested in some of what you've changed).

I'm working on splitting the _fftpack module into separate extensions for each fft library, so it's clearer. Then you could link in both fftw2 and fftw3, and use either (or fftpack). Along the way, I'd like to add more of their API functions (such as the wisdom functions), or routines that may have better performance (fftw3, for instance, can be more efficient [or not] if it's allowed to destroy its input array, or if the input and output arrays are different).

Also, it looks like using the guru interface, plans could be cached (but for each length there would possibly be two plans: one for aligned and one for unaligned data).

--
David M. Cooke
http://arbutus.physics.mcmaster.ca/dmc/
cookedm@physics.mcmaster.ca
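A rough sketch of that aligned/unaligned caching idea: FFTW_UNALIGNED and fftw_execute_dft are real fftw3 API, while the dual_plan structure and the 16-byte assumption are made up for illustration.

#include <stdint.h>
#include <fftw3.h>

/* Hypothetical per-(length, direction) cache entry holding two
 * in-place plans: one built on a 16-byte aligned buffer, and one
 * built with FFTW_UNALIGNED so it remains valid for any pointer. */
typedef struct {
    int n;
    fftw_plan aligned;    /* created on fftw_malloc'ed (SIMD-aligned) data */
    fftw_plan unaligned;  /* created with the FFTW_UNALIGNED flag */
} dual_plan;

static int is_simd_aligned(const void *p)
{
    return ((uintptr_t)p & 0xF) == 0;   /* 16-byte alignment check */
}

static void exec_cached(const dual_plan *dp, fftw_complex *ptr)
{
    /* pick the plan whose alignment assumption matches this array,
     * then apply it via the new-array execute function */
    fftw_execute_dft(is_simd_aligned(ptr) ? dp->aligned : dp->unaligned,
                     ptr, ptr);
}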

David M. Cooke wrote:
The patch format is fine, btw. Although, it looks like you used diff; it's much easier to just do 'svn diff' (add the file or directory names to the end if you're only interested in some of what you've changed).
I of course discovered svn diff a few minutes after having sent the patch... I am not really used to Subversion; I will do it this way next time.
I'm working on splitting the _fftpack module into separate extensions for each fft library, so it's clearer. Then you could link in both fftw2 and fftw3, and use either (or fftpack). Along the way, I'd like to add more of their API functions (such as the wisdom functions), or routines that may have better performance (fftw3, for instance, can be more efficient [or not] if it's allowed to destroy its input array, or if the input and output arrays are different).
I was thinking about doing it myself; maybe we can share our efforts?
Also, it looks like using the guru interface, plans could be cached (but for each length there would possibly be two plans: one for aligned and one for unaligned data).
When I tried that approach, it didn't work, but I didn't investigate for more than a few minutes. I was wondering if this could come from alignment problems, or some other issue. I don't know much about the C numpy interface: how is the memory allocated for arrays? Is the memory always 16-byte aligned? Is it an option? Can we check it?

David
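For the "can we check it" part, a minimal sketch using the numpy C API. PyArray_DATA is the real accessor; whether the data is 16-byte aligned is exactly the open question here (in general numpy only guarantees the alignment its allocator provides for the element type), which is why a per-array check is needed.

#include <stdint.h>
#include <Python.h>
#include <numpy/arrayobject.h>

/* Returns nonzero if this array's data pointer is 16-byte aligned,
 * i.e. safe for an fftw3 plan that assumes SIMD alignment. */
static int numpy_is_16byte_aligned(PyArrayObject *arr)
{
    return (((uintptr_t)PyArray_DATA(arr)) & 0xF) == 0;
}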

On Monday 28 August 2006 04:37, David Cournapeau wrote:
David M. Cooke wrote:
The patch format is fine, btw. Although, it looks like you used diff; it's much easier to just do 'svn diff' (add the file or directory names to the end if you're only interested in some of what you've changed).
I of course discovered svn diff a few minutes after having sent the patch... I am not really used to Subversion; I will do it this way next time.
You don't need to install subversion for this (although learning it is always a good thing). "diff -urN" will do the trick as well.

--
Francesc Altet
http://www.carabos.com/
Cárabos Coop. V. "Enjoy Data"

Francesc Altet wrote:
You don't need to install subversion for this (although learning it is always a good thing). "diff -urN" will do the trick as well.
That's what I used, because I didn't know about subversion's integrated diff (I am using arch myself for my projects). I was already using subversion anyway to track all the latest great things happening in numpy and scipy ;)

David
Participants (3):
- David Cournapeau
- David M. Cooke
- Francesc Altet