Changing FFT cache to a bounded LRU cache
Hi all,

I was told to take this to the mailing list. Relevant pull request: https://github.com/numpy/numpy/pull/7686

NumPy's FFT implementation caches some form of execution plan for each encountered input data length. This is currently implemented as a simple dictionary that can grow without bound, so calculating lots of FFTs of different lengths looks like a memory leak from the user's perspective. We encountered a real-world situation where this is an issue.

The PR linked above proposes to replace the simple dictionary with an LRU (least recently used) cache. It removes the least recently used entries once the cache grows beyond a specified size (currently an arbitrary limit of 100 MB per cache). Thus almost all users will still benefit from the caches, but their total memory consumption is now bounded. A minimal sketch of the idea follows below. Things to consider:

* This adds quite a bit of additional complexity, but I could not find a simpler way to achieve the same result.
* What is a good limit on the cache size? I used 100 MB because it works for my use cases.

Cheers!

Lion
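Roughly, the proposal amounts to something like the following sketch (simplified and with illustrative names, not the actual PR code; in the PR the keys are essentially the input lengths and the values the precomputed work arrays):

```python
from collections import OrderedDict


class LRUSizeCache(object):
    """Dict-like cache bounded by the total byte size of the stored
    numpy arrays.  Illustrative sketch only, not the actual PR code."""

    def __init__(self, max_size_bytes=100 * 1024 ** 2):  # ~100 MB budget
        self._max_size = max_size_bytes
        self._data = OrderedDict()

    def __getitem__(self, key):
        value = self._data.pop(key)  # Raises KeyError on a miss.
        self._data[key] = value      # Re-insert: now most recently used.
        return value

    def __setitem__(self, key, value):
        self._data.pop(key, None)
        self._data[key] = value
        # Evict least recently used entries until we fit the byte budget,
        # but always keep at least the entry that was just inserted.
        while (sum(v.nbytes for v in self._data.values()) > self._max_size
               and len(self._data) > 1):
            self._data.popitem(last=False)
```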
On Fri, 2016-05-27 at 22:51 +0200, Lion Krischer wrote:
I am +1 in principle, since I don't like that the cache can just grow forever; I personally see that as a bug right now. One that rarely bites, maybe, but a bug. If you have the time, the ideal thing would be to measure how much of a difference the cache actually makes. The cache mostly stores a working array, I think, so does it even help for large arrays, or is the time spent on allocation negligible by then anyway? We also have a small-array cache in numpy nowadays (not sure how small "small" is here). Maybe that already achieves everything the fftcache was designed for, and we could even disable it by default? The complexity addition is a bit annoying, I must admit; on Python 3, functools.lru_cache could be another option, but only there.

- Sebastian
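For reference, the Python 3 option mentioned above would look roughly like this (a sketch; `_twiddle_factors` is a stand-in for the per-length setup that numpy.fft actually caches, which is a C-level work array rather than this):

```python
import functools

import numpy as np


@functools.lru_cache(maxsize=32)  # Bounded by entry count, not by bytes.
def _twiddle_factors(n):
    # Stand-in for the per-length setup work that numpy.fft caches.
    return np.exp(-2j * np.pi * np.arange(n) / n)
```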
Hi,

I did a few simple timing tests (see the comment in the PR), which suggest it is hardly worth having the cache. Indeed, if one really worries about speed, one should probably use pyFFTW (scipy.fft is a bit faster too, but at least for me the way real FFT values are stored is just too inconvenient). So my suggestion would be to do away with the cache altogether. If we do keep it, I think the approach in the PR is nice, but I would advocate setting both a size and a number limit (e.g., by default no more than 8 entries or so, which should cover most repetitive use cases); see the sketch below.

All the best,

Marten

p.s. I do like having a quick fft routine in numpy. My main gripe is that it always casts to float64/complex128 rather than sticking with the input dtype. Hope to get around to making a PR for that...
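Combining a byte budget with an entry budget would only change the eviction condition in a sketch like the one earlier in the thread (illustrative, assuming the same `_data`/`_max_size` attributes):

```python
    def _shrink(self, max_entries=8):
        # Evict least recently used entries until both the entry budget
        # and the byte budget are satisfied; keep at least one entry.
        while ((len(self._data) > max_entries
                or sum(v.nbytes for v in self._data.values()) > self._max_size)
               and len(self._data) > 1):
            self._data.popitem(last=False)
```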
On 30/05/16 10:07, Joseph Martinot-Lagarde wrote:
I added a slightly more comprehensive benchmark to the PR; please have a look. It tests the total time for 100 FFTs with and without the cache. It is over 30 percent faster with the cache, which is totally worth it in my opinion, as repeated FFTs of the same size are a very common use case. Also, many people will not have enough knowledge to use FFTW or some other FFT implementation. A sketch of this kind of measurement follows below.
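The PR benchmark itself is not reproduced here, but the shape of such a measurement is roughly the following (a sketch; clearing numpy.fft.fftpack._fft_cache relies on an internal, version-dependent detail and is only used to simulate a cold cache):

```python
import timeit

import numpy as np
from numpy.fft import fftpack  # Internal module; layout varies by version.

x = np.random.random(1024)


def fft_cold():
    fftpack._fft_cache.clear()  # Force the setup work to be redone.
    np.fft.fft(x)


def fft_warm():
    np.fft.fft(x)  # Reuses the cached work array for this length.


for label, fn in [("warm cache", fft_warm), ("cold cache", fft_cold)]:
    best = min(timeit.repeat(fn, number=100, repeat=5))
    print("%s: %.4f s per 100 FFTs" % (label, best))
```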
Joseph Martinot-Lagarde <contrebasse@gmail.com> wrote:
The problem with FFTW is that its license is more restrictive (GPL), and because of this it may not be suitable everywhere numpy.fft is.
A lot of us use NumPy linked with MKL or Accelerate, both of which have some really nifty FFTs. And the license issue is hardly any worse than linking with them for BLAS and LAPACK, which we do anyway. We could extend numpy.fft to use MKL or Accelerate when they are available. Sturla
On 01/06/16 09:44, Gregor Thalhammer wrote:

Seems so. numpy/fft/__init__.py, when installed with conda, contains a thin optional wrapper around mklfft, e.g. this here: https://docs.continuum.io/accelerate/mkl_fft It is part of the Accelerate package from Continuum and thus not free.

Cheers!

Lion
On Tue, May 31, 2016 at 10:36 PM, Sturla Molden <sturla.molden@gmail.com> wrote:
That's what we used to do in scipy, but it was a PITA to maintain. Contrary to BLAS/LAPACK, FFT libraries do not have a standard API, so exposing a consistent API in Python, including the data layout, involved quite a bit of work. It is better to expose those through third-party packages.

David
On Sat, 28 May 2016 20:19:27 +0200 Sebastian Berg <sebastian@sipsolutions.net> wrote:
The complexity addition is a bit annoying, I must admit; on Python 3, functools.lru_cache could be another option, but only there.
You can backport the pure-Python version of lru_cache to Python 2 (or vendor the backport published here: https://pypi.python.org/pypi/backports.functools_lru_cache/); the usual import shim is sketched below. The advantage is that lru_cache is C-accelerated in Python 3.5 and upwards.

Regards

Antoine.
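The usual compatibility shim for that would be something like (sketch):

```python
try:
    # Python 3 standard library (C-accelerated from 3.5 onwards).
    from functools import lru_cache
except ImportError:
    # Python 2: pure-Python backport from PyPI.
    from backports.functools_lru_cache import lru_cache
```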
That's a pretty big backport. Speed also does not matter for this particular use case: the time for the actual FFT will dominate by far. Furthermore, the lru_cache decorator can only limit the cache by item count, not by memory size as the proposed solution does. I think the downsides outweigh the advantage of being able to use functionality from the stdlib.
participants (8): Antoine Pitrou, David Cournapeau, Gregor Thalhammer, Joseph Martinot-Lagarde, Lion Krischer, Marten van Kerkwijk, Sebastian Berg, Sturla Molden