Good day! My name is Alexander Levin. Six months ago my colleague and I did a project on optimising the two-dimensional Fourier transform, taking your numpy fft2 implementation as the quality baseline. In the course of our research we found that the method rests on a mathematically far-from-optimal algorithm: as you may know, it applies a one-dimensional transform first along the rows and then along the columns. After spending some time on the mathematical literature on this topic, we found and implemented a different approach, the so-called Cooley-Tukey butterfly, optimised by Russian mathematicians in 2016. The output is a new approach that applies a genuinely two-dimensional operation in a single pass and saves a serious proportion of the time. We wrote a C++ package for Python, using Cython as a wrapper, and published an article on Medium describing the process. On tests with matrices ranging in size from 2048x512 to 8192x8192, our algorithm outperformed the NumPy transform by an average of 50% in time.

After discussing this matter with my colleague with whom we did the above development, we came to a common desire to share our results with you. Your company has been making a huge contribution to the IT community for almost 20 years on a pro bono basis. We share your philosophy, and so we want to make the CS world a better place by providing you with our code to improve the existing operation in your package. We would like to offer you our development version; it will need some minor improvements, which we would love to finish collaboratively with you, and we would value your thoughts on what we have discovered and done so far. We would be honoured to help and to become NumPy contributors. I trust that our vision will resonate with you and that you will agree with us. I invite you to read our short article about the process and to look at the fully functioning package, both linked below, and to share your views on our contribution to NumPy. We are willing to adapt the code to your standards, as we believe the best use of our work is to contribute to the development of technology in the world. Thank you for your time. We look forward to your response and opinion.

Medium article about the process: https://medium.com/p/7963c3b2f3c9
GitHub source code repository: https://github.com/2D-FFT-Project/2d-fft
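[Editorial note: for readers unfamiliar with the row-column scheme described above, here is a minimal sketch of how the baseline 2-D transform decomposes into 1-D passes; the array size is illustrative.]

```
# The row-column method: a 1-D FFT along each row, then along each
# column. The butterfly approach described above replaces these two
# passes with a single genuinely two-dimensional pass.
import numpy as np

x = np.random.rand(512, 512)

rows_then_cols = np.fft.fft(np.fft.fft(x, axis=1), axis=0)

# Mathematically identical to the built-in 2-D transform:
assert np.allclose(rows_then_cols, np.fft.fft2(x))
```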
On Wed, Feb 28, 2024 at 6:59 AM camrymanjr--- via NumPy-Discussion <numpy-discussion@python.org> wrote:
Thanks for sharing, Alexander.
Did you also benchmark against `scipy.fft` by any chance? SciPy uses a newer C++ version of PocketFFT. In NumPy we just switched over from the older C code of PocketFFT to the newer C++ code: https://github.com/numpy/numpy/pull/25536. That PR also solves the upcasting that your Medium article touches on; we now have proper float32/complex64 support. Note that SciPy also has a multi-threaded version, which you can enable with the `workers` keyword. It'd also be interesting perhaps to benchmark memory usage, since your article touches on eliminating temporary arrays.
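[Editorial note: as a concrete reference for the `workers` keyword and the float32 behaviour Ralf describes, a small sketch; the array size is illustrative.]

```
# scipy.fft.fft2 accepts a `workers` argument for multi-threaded
# execution; float32 input now stays in single precision (complex64)
# instead of being silently upcast.
import numpy as np
import scipy.fft

x = np.random.rand(4096, 4096).astype(np.float32)

out = scipy.fft.fft2(x, workers=-1)  # negative: use all available CPUs
print(out.dtype)                     # complex64 - no silent upcast
```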
> After discussing this matter with my colleague with whom we did the above [...]

Great to hear that. Note that we're not a company - more a loosely organized team of mostly volunteers, and some folks who get to work on NumPy as part of their day job at a company or university.
> I trust that our vision will resonate with you and that you will agree with [...]

Performance improvements are always of interest, so if this is an algorithmic improvement without loss of accuracy, then that sounds very relevant. And more contributors are always welcome!

Cheers,
Ralf
Good evening, Ralf! I beg your pardon - for some reason I didn't get the notification of your response and couldn't answer in a more timely fashion. We'll cover all the mentioned points as soon as possible (we also have some university and job projects), and I really appreciate that someone as experienced as you found the time to review our contribution. I'll follow up here with the results on all the mentioned points. Thanks for your time. Regards, Aleksandr
Good afternoon, Ralf. We have done some of the measurements you recommended. For your convenience we have created a separate folder with notebooks in which we measured memory usage and the performance of our implementation against SciPy, so you can run the timing tests and the memory measurements separately on your own hardware: https://github.com/2D-FFT-Project/2d-fft/tree/main/notebooks

We measured efficiency for four configurations - with and without multithreading, and with and without data type conversion. According to the results, our algorithm has the greatest lead with multithreading enabled and without type conversion (75%), and the smallest lead without multithreading and with type conversion (14%). In terms of memory usage we beat NumPy and SciPy by a factor of two in all cases, which I think is a solid achievement at this point. Overall our mathematical approach still has a serious advantage, although we always lose to SciPy on the inverse transform; we haven't figured out the reason yet, but we are discussing it and will fix it.

It is important to note that at this stage our algorithm shows the above performance only on matrices whose sizes are powers of two. This is a property of the mathematical butterfly formula. We are investigating ways to remove this limitation; we have already assessed zero imputation and column dropping, but the results are not accurate enough. Failing that, we can suggest enabling our version only for matrices of the mentioned sizes, which would still be an upgrade for NumPy.

At this point we are willing to keep working on and improving the existing version within our skills, knowledge and available resources. We still hope to add our implementation, or the underlying idea, to the existing NumPy package, since in terms of memory usage and efficiency it could give a serious advantage to other projects built on NumPy. Thank you for your time; we will continue our work and look forward to your review.
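[Editorial note: one reproducible way to run the memory comparison outside a notebook is the standard library's tracemalloc, which NumPy registers its array buffers with; a sketch, with an illustrative size.]

```
import tracemalloc
import numpy as np
import scipy.fft

x = np.random.rand(2048, 2048)

for label, fn in [("numpy", np.fft.fft2), ("scipy", scipy.fft.fft2)]:
    tracemalloc.start()
    fn(x)                                  # run the transform under tracing
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{label}: peak {peak / 2**20:.1f} MiB")
```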
I appreciate your correction - indeed you are right, it was my fault. I changed everything and I believe it is in the correct order now. Our current best result relative to SciPy is, for the FFT, -46% in time (no multithreading, no type conversions), with +0.37% (multithreaded, no type conversions) as the worst case; for the IFFT, -32% (no multithreading, no type conversions) best and +4.71% (multithreaded, no type conversions) worst. https://github.com/2D-FFT-Project/2d-fft/blob/main/notebooks/comparisons.ipy... Thanks to your corrections we were able to achieve even better results. We will take a closer look at the (M+, TC-) case, as it seems to be the worst performance of our algorithm. We'll be waiting for your further directions on what we could do. Please take a moment to review it again, and thanks for your time. Have a lovely day.
On Tue, 12 Mar 2024 11:34:40 -0000 via NumPy-Discussion <numpy-discussion@python.org> wrote:
> https://github.com/2D-FFT-Project/2d-fft/blob/main/notebooks/comparisons.ipy...
Hi,

Since you are using a notebook to perform the benchmark, I would advise you to use:
```
timing = %timeit -o some_function(*args)
```
because you are currently using the `time.time()` function, which is not very precise. `time.perf_counter()` would be better, but `timeit` is best. It also allows longer runs and thus more stable values (letting the CPU speed stabilize).

That said, I did not measure your code to be faster than the numpy or scipy implementations, even though you are using some inadvisable practices like "-O3" and "--fast-math" by default. This brings me back 20 years, to when I played with the Intel compiler: it was much faster per iteration (~+20% vs gcc), but surprisingly it also needed +50% more iterations to achieve a given precision. I remember losing six months of work because of these options. One should only activate them on a limited piece of code where they are known to be NOT harmful.

Cheers,
-- Jérôme Kieffer
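[Editorial note: the same measurement can be scripted outside IPython with the standard library; a sketch, with an illustrative function and size.]

```
# timeit uses time.perf_counter() and disables the garbage collector,
# avoiding both problems described above; repeat() exposes the spread
# across runs, analogous to %timeit -o.
import timeit
import numpy as np

x = np.random.rand(1024, 1024)

runs = timeit.repeat(lambda: np.fft.fft2(x), number=10, repeat=7)
print(f"best of 7: {min(runs) / 10 * 1e3:.2f} ms per call")
```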
Thanks for your extensive feedback. If I understood you correctly, we cannot claim outperformance in all cases, because it was measured with an insufficiently precise function over relatively short runs. I take your point and thank you for the observation; we will start fixing this and re-testing. I don't fully see your point about -O3 and --fast-math: as I understand these concepts, aggressive optimization trades computational accuracy for speed in the long run. Our operation has already been used by another project dealing with signal processing; that testing was successful and indeed showed better performance. As for computational accuracy, before publishing we ran a number of tests across different dimensions and sizes, some of which can be found in the repository.

Nevertheless, I would agree that our development is the result of the personal enthusiasm of two individuals, motivated by the observation that the mathematical side of the algorithm currently used in the field is far from ideal. We have spent time understanding the problem, analyzing articles on the matter, and implementing a more efficient mathematical algorithm, and we have spent time bringing the package to a state where it can be further worked on and improved. We have a great deal of respect for the motivation of more experienced colleagues working on open source to help us in this endeavor. As I mentioned earlier, we are ready to work within our abilities and knowledge, applying all the information available to us, and I believe that with our efforts and the guidance of colleagues from NumPy our operation can be tested and integrated into the package itself. We are ready to put in all the necessary effort from our side.

Thank you for your comments; we will switch to the timing approach you mentioned, re-run the tests, and be happy to share the results. I appreciate that someone so knowledgeable in the industry took the time to look at our work.
Good day, Ralf. I am sharing the results of the latest updates to our code. We have taken the comments below into account and now time with `%timeit -o` inside Jupyter, which gives us the best of 7 runs and the spread across them. Writing to summarise the intermediate results. The testing notebooks:

Memory usage - https://github.com/2D-FFT-Project/2d-fft/blob/testnotebook/notebooks/memory_...
Timing comparisons (updated) - https://github.com/2D-FFT-Project/2d-fft/blob/testnotebook/notebooks/compari...

Our version always loses to SciPy when multithreading is enabled. We also wondered whether to keep the data type conversions in the test metrics: they are needed to convert matrix values from int to complex128 (we will switch to complex64 if necessary) and back on output. For a more convenient user experience we preferred to leave the conversions in for testing; we would be interested in your opinion.

As for the results after all the updates: memory is stable, and our operation wins by a factor of two. On execution time and efficiency, I read the results as follows. With multithreading enabled we consistently lose; with it disabled we consistently win. The logical conclusion is that our algorithm is mathematically smarter, which lets it win steadily on memory usage and single-threaded performance, while the multithreading used by the SciPy authors is better and more efficient than ours, which is why our operation loses when threading is switched on. So our algorithm is still more performant, but it clearly needs a better multithreading layer.

Here we need your advice. In theory, we could design and write a smarter multithreading scheme than our current one. In practice, I am sure the best way forward would be to collaborate with someone responsible for FFT in SciPy or NumPy so that we can test our algorithm with their multithreading; I am sure that combination would give the best possible performance overall. I propose this instead of writing our own multithreading first, since the goal of our work is to be embedded in NumPy so that as many people as possible can use a more efficient algorithm, and if we wrote our own threading first we would have to switch to the NumPy version anyway. So I am asking for your feedback on our memory usage and efficiency results, to decide together on the next steps of our hopefully collaborative work, if you are interested. Thank you for your time. Regards, Alexander
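[Editorial note: on the type-conversion question raised above, one option is to keep the conversion in the package for convenience but exclude it from the timed region, so the benchmark isolates the transform itself. A sketch; `fft2d` stands in for the package's function and is not its real API.]

```
import numpy as np

ints = np.random.randint(0, 256, size=(2048, 2048))

# Convert once, outside the timed region; complex64 halves the memory
# traffic relative to complex128 if the extra precision is not needed.
pre = ints.astype(np.complex64)

# In a notebook, one would then time only the transform, e.g.:
#   timing = %timeit -o fft2d(pre)
```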
Hi Alexander,

On 2024-03-14 22:43:38, Alexander Levin via NumPy-Discussion <numpy-discussion@python.org> wrote:
I see these timings are still done only for power-of-two shaped arrays. This is the easiest case to optimize, and I wonder if you've given further thought to supporting other sizes? PocketFFT, e.g., implements the Bluestein / Chirp-Z algorithm to deal with cases where the sizes have large prime factors.

Your test matrix also only contains real values. In that case, you can use rfft, which might resolve the memory usage difference? I'd be surprised if PocketFFT uses that much more memory for the same calculation.

I saw that in the notebook code you have:

```
matr = np.zeros((n, m), dtype=np.complex64)
matr = np.random.rand(n, m)
```

Was the intent here to generate a complex random matrix?

Stéfan
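[Editorial note: the two quoted lines allocate a complex array and then immediately rebind the name to a real one, so the complex dtype is lost. If a complex test matrix was the intent, here is a sketch of one way to build it, along with the real-input transform suggested above; sizes are illustrative.]

```
import numpy as np

rng = np.random.default_rng()
n, m = 1024, 1024

# Complex random matrix with single-precision components:
matr = (rng.random((n, m)) + 1j * rng.random((n, m))).astype(np.complex64)

# For purely real input, rfft2 computes only the non-redundant half of
# the spectrum, roughly halving the work and the output storage:
spec = np.fft.rfft2(rng.random((n, m)))   # shape (n, m // 2 + 1)
```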
Hi Stefan, indeed you're right: the underlying formula was originally created by V. Tutatchikov for power-of-two matrices. The butterfly approach requires a recursive breakdown to 2x2 matrices in order to precalculate the roots of unity (which is exactly what provides the performance advantage mentioned earlier). I did my own research on implementations where the size is not a power of two but which still use the butterfly: usually they pad with zeros up to the closest power of two (which gets messy at larger sizes) or drop columns and rebuild them afterwards, which is not a brilliant solution either. At some point I found patent US 8,484,274 B2, which discusses possible padding options for such matrices, with the method picked depending on the actual size of the signal/data; to implement the butterfly that way, you would need to write several cases and check the size. Theoretically it's possible, though I can't say whether that particular approach would give more of a performance advantage, the same, or worse.

Yes, indeed the intent is to generate a random matrix, so that our tests are as objective as they can be. I appreciate your review and will also compare memory against rfft (I hope tomorrow). So for now I'm confident this method shows better performance on power-of-two square matrices as well as rectangular matrices of size 2^n x 2^m (the latter was tested during development). Regarding other sizes, I have some thoughts, as mentioned above, but it is very case-sensitive with respect to the specific type of signal being processed.

I would like to emphasize that, regardless of my genuine desire to create the best possible version for the user, my work is constrained by our capabilities, not by our desire. As you may have noticed earlier, the multithreading implemented by the SciPy authors surpasses ours. Given the butterfly method's sensitivity to matrix dimensions, I am ready to share my thoughts on specific data sizes and use cases, but I cannot readily provide an optimal solution for all cases; otherwise I would have implemented it first and then presented it to you. At the moment, I believe this is the only significant vulnerability in the mathematics we have presented. I can't give much feedback on Bluestein / Chirp-Z right now, but I'll research it, and if it solves our issue I will definitely implement it.

At this point, I believe that if our method still has a performance advantage after the corrections you have all suggested (even if only on power-of-two matrices), it is worth embedding, even behind size checks. Yet I'm only a contributor, which is why I'm discussing this with you. I appreciate your feedback and your time and will be waiting for your response. Regards, Alexander
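[Editorial note: for context on the zero-padding option discussed above, NumPy's own interface can pad to the next power of two via the `s` argument; a sketch with illustrative sizes. The padded transform is taken on a different spectral grid than the original size, so this only helps when the application tolerates that.]

```
import numpy as np

def next_pow2(n: int) -> int:
    return 1 << (n - 1).bit_length()

x = np.random.rand(1000, 600)                       # non-power-of-two size
s = (next_pow2(x.shape[0]), next_pow2(x.shape[1]))  # (1024, 1024)
spec = np.fft.fft2(x, s=s)                          # zero-pads, then transforms
```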
Hi Stéfan, an update: indeed, rfft2 has memory usage equal to our fft2d for real input - thanks, Stefan. At this point, I believe the results are as follows:

- in time, scipy outperforms us on rectangular signals with power-of-two sides;
- memory usage is equal to rfft2.

In my eyes it's worth trying to put our algorithm and scipy's multithreading together; given the previous results, I believe it would show a major performance improvement. If it does, I still think it's worth putting the Cooley-Tukey operation to work for the mentioned signal sizes, as sketched below. So I suggest we try testing our code as part of numpy/scipy - to be honest, I have lost track of whether this thread is about embedding in numpy or scipy. I believe that if we placed the butterfly algorithm into scipy behind a check on the size of the matrix, we would gain performance rather than lose it; any performance advantage of the algorithm matters as long as the balance of memory usage and time is preserved. In terms of algorithmic performance, one small step for a man is a leap for mankind for the projects built on top. Please let me know if you share this opinion and whether we should try testing. Regards, Alexander
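[Editorial note: a minimal sketch of the size check proposed above - route power-of-two shapes to the butterfly code and fall back to PocketFFT otherwise. `butterfly_fft2` is a stand-in for the 2d-fft package's function, not its real API.]

```
import numpy as np
import scipy.fft

def _is_pow2(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0

def butterfly_fft2(x):
    # placeholder for the 2d-fft package's transform
    return np.fft.fft2(x)

def fft2_dispatch(x):
    # fast path only for the shapes the butterfly formula supports
    if all(_is_pow2(dim) for dim in x.shape):
        return butterfly_fft2(x)
    return scipy.fft.fft2(x)   # general fallback (PocketFFT)
```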
participants (6)

- Alexander Levin
- camrymanjr@me.com
- george trojan
- Jerome Kieffer
- Ralf Gommers
- Stefan van der Walt