
Hello,

This was started on a different thread, but I thought I would post a new thread focused on it. Currently, I have some existing code that implements the bandwidth selection algorithm from:

Z. I. Botev, J. F. Grotowski, and D. P. Kroese. Kernel density estimation via diffusion. The Annals of Statistics, 38(5):2916-2957, 2010.

Zdravko Botev implemented the code in MATLAB, which can be found here: http://www.mathworks.com/matlabcentral/fileexchange/14034-kernel-density-est...

My code for that is here: https://github.com/Daniel-B-Smith/KDE-for-SciPy

I assume I need to find a workaround to avoid the float128 in the function fixed_point before I can add it to SciPy. I wrote the code a couple of years ago, so it will take me a moment to map out the best workaround (there is a very large number being multiplied by a very small number; see the sketch in the P.S. below). I can also add the 2d version once I start integrating with SciPy.

I have a couple of questions remaining. First, should I implement this in SciPy? StatsModels? Both? Secondly, can I use Cython to generate C code for the function fixed_point, or do I need to write it in the Numpy C API?

If there is somewhere else I should post this and/or someone I should directly contact, I would greatly appreciate it.

Thanks, Daniel
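P.S. The workaround I have in mind is to combine the factors in log space so each term becomes a single, well-scaled exponential. A minimal sketch, assuming the offending product has the shape I**s * a2 * exp(-I*pi**2*t) as in the sum in Botev's MATLAB fixed_point (the exact expression still needs to be checked against my code):

    import numpy as np

    def stable_sum(I, a2, s, t):
        # Compute sum(I**s * a2 * exp(-I*pi**2*t)) without ever forming the
        # huge and tiny factors separately: add their logs, then exponentiate.
        I = np.asarray(I, dtype=float)
        a2 = np.asarray(a2, dtype=float)
        nz = a2 > 0  # zero coefficients contribute nothing; avoids log(0)
        log_terms = s * np.log(I[nz]) + np.log(a2[nz]) - I[nz] * np.pi**2 * t
        return np.exp(log_terms).sum()  # each term is one well-scaled exp()

Each log stays modest even when I**s alone would overflow float64, so float128 should not be needed.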

Hi Daniel,

That looks like a nice implementation. My concern about adding it to scipy is twofold:

1) Is this a well-known and well-proven technique, or is it more cutting-edge? My view is that scipy should not seek to implement every cutting-edge algorithm: in the long run this will lead to code bloat and difficulty of maintenance. If it's the latter, your code might be a better fit for statsmodels or another more specialized package.

2) The algorithm seems limited to one or maybe two dimensions. scipy.stats.gaussian_kde is designed for N dimensions, so it might be difficult to find a fit for this bandwidth selection method. One option might be to allow this bandwidth selection method via a flag in scipy.stats.gaussian_kde, and raise an error if the dimensionality is too high. To do that, your code would need to be reworked fairly extensively to fit in the gaussian_kde class.

I'd like other devs to weigh in on the algorithm, especially my concern #1, before any work starts on a scipy PR.

Thanks, Jake

On Wed, Jan 23, 2013 at 3:30 PM, Jake Vanderplas <vanderplas@astro.washington.edu> wrote:
Hi Daniel, That looks like a nice implementation. My concern about adding it to scipy is twofold:
1) Is this a well-known and well-proven technique, or is it more cutting-edge? My view is that scipy should not seek to implement every cutting-edge algorithm: in the long run this will lead to code bloat and difficulty of maintenance. If it's the latter, your code might be a better fit for statsmodels or another more specialized package.
The paper has 146 citations on Google Scholar since 2010, across many fields, and the MATLAB version has had 169 downloads in the last month. From what I can see in a few examples, the availability of the MATLAB code is increasing the number of citations. So it looks popular and it works, even if it's new. Using the FFT for KDE is old, but I haven't looked at the details yet.
2) The algorithm seems limited to one or maybe two dimensions. scipy.stats.gaussian_kde is designed for N dimensions, so it might be difficult to find a fit for this bandwidth selection method. One option might be to allow this bandwidth selection method via a flag in scipy.stats.gaussian_kde, and raise an error if the dimensionality is too high. To do that, your code would need to be reworked fairly extensively to fit in the gaussian_kde class.
My guess is that it doesn't make much sense to merge it into gaussian_kde. I doubt there will be much direct code sharing, and the implementation differs quite a bit. In statsmodels we have separate classes for univariate and multivariate kde (although most of the kernel density estimation and kernel regression in statsmodels is new and not settled yet). Josef
I'd like other devs to weigh in on the algorithm, especially my concern #1, before any work starts on a scipy PR. Thanks, Jake

On Wed, Jan 23, 2013 at 9:30 PM, Jake Vanderplas <vanderplas@astro.washington.edu> wrote:
Hi Daniel, That looks like a nice implementation. My concern about adding it to scipy is twofold:
1) Is this a well-known and well-proven technique, or is it more cutting-edge? My view is that scipy should not seek to implement every cutting-edge algorithm: in the long run this will lead to code bloat and difficulty of maintenance. If it's the latter, your code might be a better fit for statsmodels or another more specialized package.
It seems to be a new technique; however, the paper already has 150 citations. The algorithm also seems to be straightforward to implement, so I think it could be put into scipy.stats. statsmodels.nonparametric would also be a good place for it, though.
2) The algorithm seems limited to one or maybe two dimensions. scipy.stats.gaussian_kde is designed for N dimensions, so it might be difficult to find a fit for this bandwidth selection method. One option might be to allow this bandwidth selection method via a flag in scipy.stats.gaussian_kde, and raise an error if the dimensionality is too high. To do that, your code would need to be reworked fairly extensively to fit in the gaussian_kde class.
I'd like other devs to weigh in on the algorithm, especially my concern #1, before any work starts on a scipy PR.
I quickly browsed the paper and the original (BSD-licensed) code. My impression is that this can't be integrated with gaussian_kde: it's not a bandwidth estimation method but an adaptive density estimator. The method is 1-D only, but it will handle multimodal distributions in particular much better than gaussian_kde. My suggestion would be to implement the density estimator and do a good amount of performance testing, at least showing that the performance is as good as described in Table 1 of the paper. Then we can still decide where to put it. Ralf

Hi,

I am not a developer of SciPy per se, but I am currently looking into these KDE methods for bounded domains.

First, it looks like there are two parts to the algorithm: 1) the estimation of the bandwidth, and 2) the estimation of the density. It should be easy to separate them and use the bandwidth estimation without the density estimation. This would be interesting, as it seems that the density needs to be estimated on a regular mesh. It also allows comparison of the method with other estimators. For example, as stated in the paper, the method is equivalent to a reflection method with a Gaussian kernel. But the renormalisation method gives very similar results too, without enforcing that f'(0) = 0 (i.e. that the first derivative is always 0 on the boundaries).

I have a different concern though: is it normal that the density returned by the method is not normalized? (i.e. the integral of the density is far from being one; see the quick check in the P.S. below).

Also, can you generalise the bandwidth calculation to unbounded domains, or at least half-bounded domains (i.e. [a, oo) or (-oo, a])? It seems that it all depends on the domain being of finite size.

-- Barbier de Reuille Pierre
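P.S. To make the normalization question concrete, this is the kind of quick check I mean (a minimal sketch; mesh and density are placeholder names for the regular grid and the estimate returned on it):

    import numpy as np

    def check_normalization(mesh, density):
        # Integrate the gridded estimate; the result should be close to 1.
        area = np.trapz(density, mesh)
        return area, density / area  # and a renormalized copy, if needed

In my tests the integral comes out far from one, hence the question.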

Barbier de Reuille Pierre:
About this: it is incorrect. As you work with a DCT, it is equivalent to repeating the data on both sides by reflection, which means your method is equivalent to the reflection method. Also note this is pointed out in the paper itself. That being said, if there is enough "padding" on both sides (i.e. such that the tail of the kernel is almost 0), this has no effect. Also, you can replace the DCT with an FFT to get a cyclic density. I adapted your code for this and it works great!
You are correct. I had always ended up padding on each side and getting nonsense near the boundary. When I fixed the boundary correctly, it gave me nice answers. Could you send me your code for the cyclic density? I do some molecular dynamics work, and it would be really useful for making angular density plots.
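For reference, my understanding of the FFT variant is something like the sketch below (my own reconstruction of the idea, not your attached code; the name and interface are made up):

    import numpy as np

    def cyclic_kde(samples, bandwidth, n_grid=1024, period=2 * np.pi):
        # Bin the samples on a periodic grid.
        counts, edges = np.histogram(np.mod(samples, period),
                                     bins=n_grid, range=(0.0, period))
        f = counts / (len(samples) * (period / n_grid))  # raw binned density
        # Smooth in Fourier space: attenuate mode k by the Gaussian
        # characteristic function exp(-(k*h)**2 / 2).
        fk = np.fft.rfft(f)
        k = 2 * np.pi * np.fft.rfftfreq(n_grid, d=period / n_grid)
        fk *= np.exp(-0.5 * (k * bandwidth) ** 2)
        centers = edges[:-1] + 0.5 * period / n_grid
        return centers, np.fft.irfft(fk, n=n_grid)

By construction the result wraps smoothly at 0 and the period, which is exactly what I need for angular density plots.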
Back on the computation of the bandwidth, I would argue that you can compute it without computing the density itself. It's true that it makes sense to combine the binning, as it is useful for both, but I don't agree that it's necessary.
Let me rephrase my sentiment. I think we can't calculate the bandwidth without the moral equivalent of calculating the density. Basically, we need a mapping from our set of samples plus our bandwidth to the square norm of the n-th derivative. Last night, I came up with a far more efficient method that I think demonstrates the moral equivalence. With some clever calculus, we can write down the mapping from the samples plus bandwidth to the j-th DCT (or Fourier) component. We can simply iterate over the DCT components until the change in the derivative estimate falls below some threshold. That saves us the histogramming step (not that important), but it also means we almost assuredly don't need 2**14 DCT components. For all intents and purposes, we have also constructed an estimate of the density in our DCT components. Without working through the math exactly, I think every representation of our data that allows us to estimate the density derivative is going to be equivalent, up to isometry, to the density itself. All that is neither here nor there, but certainly let me know if you have an idea of how we could do such a calculation. I would be very interested in finding out that I'm wrong on this point.
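Here is a rough sketch of the iteration I have in mind (my own illustration; the normalization constants are deliberately left unchecked against the paper, and x is assumed rescaled to [0, 1]):

    import numpy as np

    def deriv_norm_sq(x, t, s, rtol=1e-12, max_j=2**14):
        # Accumulate sum_j (pi*j)**(2*s) * c_j**2 * exp(-(pi*j)**2 * t), where
        # c_j = mean(cos(pi*j*x)) is the empirical cosine coefficient computed
        # directly from the samples; no histogramming step needed.
        total = 0.0
        for j in range(1, max_j):
            cj = np.mean(np.cos(np.pi * j * x))
            term = (np.pi * j) ** (2 * s) * cj ** 2 * np.exp(-(np.pi * j) ** 2 * t)
            total += term
            # The exponential damping is monotone in j, so stop once terms are
            # negligible relative to the running sum (a few guard iterations
            # protect against a single small c_j early on).
            if j > 8 and term < rtol * total:
                break
        return total

The point is that we would only compute as many coefficients as the derivative estimate actually needs.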
Josef:
Besides the boundary problem in bounded domains, there is also the problem with unbounded domains: the tails might not be well captured by a KDE, especially heavier tails.
You are absolutely correct, but that is another problem completely. Let me know if you implement the Pareto tails idea.
How about ``dgp_density.py``? We put the current module in the sandbox during the merge, because we still need to adjust it as we get new use cases.
Cool. I'll get started on this over the weekend. Also, numpy.random has a whole bunch of distributions. We'll just need to combine them in clever ways to get our example distributions (see the sketch in the P.S. below). Thanks, Daniel
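P.S. For example, a sketch of one such combination, the Marron-Wand "claw" (a standard multimodal test density; whether it matches the paper's examples exactly would still need checking):

    import numpy as np

    def sample_claw(n, seed=None):
        # Claw density: 0.5*N(0, 1) plus five narrow N(j/2 - 1, 0.1) spikes,
        # each with weight 0.1. Pick a component per sample, then draw from it.
        rng = np.random.default_rng(seed)
        means = np.array([0.0, -1.0, -0.5, 0.0, 0.5, 1.0])
        sigmas = np.array([1.0, 0.1, 0.1, 0.1, 0.1, 0.1])
        weights = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])
        comp = rng.choice(len(weights), size=n, p=weights)
        return rng.normal(means[comp], sigmas[comp])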

On Fri, Jan 25, 2013 at 11:36 AM, Daniel Smith <smith.daniel.br@gmail.com> wrote:
It would be useful, for me and maybe to others, if you could use github to keep track of the different versions (your repo or gists). I would like to see how the boundary and periodicity are affected by the different FFT and DCT variants, since I run into this in other areas as well. Thanks, Josef

On 25 January 2013 17:36, Daniel Smith <smith.daniel.br@gmail.com> wrote:
You are correct. I had always ended up padding on each side and getting nonsense near the boundary. When I fixed the boundary correctly, it gave me nice answers. Could you send me your code for the cyclic density? I do some molecular dynamics work, and it would be really useful for making angular density plots.
Hey, I attach the code here. As for the other part, I will have to think about it, but essentially I came to the conclusion that the bandwidth estimation can use a sparser grid than the density estimation. Making some tests, a grid of 2^10 elements seems plenty (i.e. I get 4 significant digits compared to 2^14), and the computation time falls from ~250 ms to ~15 ms using a dataset with 1000 samples. And 15 ms to compute the bandwidth is perfectly acceptable for me. Now, if you have an adaptive method that can perform similarly, that would be awesome. The bandwidth can then be used in any context in which it makes sense. -- Barbier de Reuille Pierre

As for the other part, I will have to think about it, but essentially I came to the conclusion that the bandwidth estimation can use a sparser grid than the density estimation. Making some tests, a grid of 2^10 elements seems plenty (i.e. I get 4 significant digits compared to 2^14), and the computation time falls from ~250 ms to ~15 ms using a dataset with 1000 samples. And 15 ms to compute the bandwidth is perfectly acceptable for me. Now, if you have an adaptive method that can perform similarly, that would be awesome. The bandwidth can then be used in any context in which it makes sense.
I obviously did not write clearly enough; my apologies. You're absolutely right that the density estimation can use a sparser grid. The algorithm I was describing was designed to calculate the bandwidth estimate with the minimal grid necessary. You still need a density estimate, but not necessarily the 2^14-point estimate I have defaulted to. I will actually code that idea up to make it clearer. Thank you very much for your code.
It would be useful, for me and maybe to others, if you could use github to keep track of the different versions (your repo or gists).
I would like to see how the boundary and periodicity are affected by the different FFT and DCT variants, since I run into this in other areas as well.
Future updates should go to that same github link I sent earlier. I haven't written any new code; I simply generated some exponentially distributed data and set MIN to be 0. My periodic idea involves a different kernel, so it will take a moment to map out. I haven't played with the code Pierre Barbier de Reuille posted yet. I will add the distribution-generating code to that same git repository. Daniel

Hello,

I've updated the github repository with some new code: https://github.com/Daniel-B-Smith/KDE-for-SciPy

I added a data generating process class and subclassed it with the examples from the Botev, Grotowski, and Kroese paper. I could have subclassed scipy's rv_continuous, but I thought that class was a lot bigger and bulkier than we need. The script test_plots.py produces a set of PDFs that show how well the KDE estimate aligns with the true probability density function.

Next, I will write some code to compare the square error of the new method with the square error from SciPy's current implementation (see the sketch in the P.S. below). Finally, I am still planning on writing some Cython code to calculate the derivative using a minimal number of Fourier basis vectors.

Thanks, Daniel
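P.S. A minimal sketch of the square-error comparison I have in mind (using a normal sample and its exact pdf as a stand-in for the DGP classes; the new estimator's output would be evaluated on the same grid and passed through the same function):

    import numpy as np
    from scipy.stats import gaussian_kde, norm

    def ise(estimate, true_pdf, grid):
        # Integrated squared error of a gridded density estimate.
        return np.trapz((estimate - true_pdf(grid)) ** 2, grid)

    rng = np.random.default_rng(42)
    samples = rng.normal(size=500)
    grid = np.linspace(-4, 4, 2**10)
    scipy_est = gaussian_kde(samples)(grid)
    print("ISE (scipy.stats.gaussian_kde):", ise(scipy_est, norm.pdf, grid))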
participants (5)
- Daniel Smith
- Jake Vanderplas
- josef.pktd@gmail.com
- Pierre Barbier de Reuille
- Ralf Gommers