[SciPy-Dev] scipy.stats.kde

Sam Birch sam.m.birch at gmail.com
Fri Aug 27 16:37:03 EDT 2010


Quick question: norm <-> delta function?

Again, this is for work on the circle and for fairly dense data sets.

But in principle, the KDE as a function is the convolution of a forest

of delta functions, one per point, with the kernel. The conventional

way to evaluate this function at a point is simply to evaluate the

kernel once per data point and add them up. To evaluate this on a

grid, you evaluate the kernel once per grid point per data point and

add. Naturally this can be expensive.


Huh, usually I do it the other way around (changing the grid as needed
per-datapoint) because it saves lots of time for kernels with finite
support. Is truncating kernels often applied? It depends on your data
obviously but in many cases the Gaussian drops very rapidly (relative to the
size of the mesh).

To the rest: that's a very clever idea. I see your point about varying the
bandwidths etc. without recomputing the KDE. Maybe then there should be
a separation between "plotting" the KDE with a given kernel & bandwidth and
creating a KDE from a dataset (which would just determine the FCs of the
forest of delta functions). W.r.t. padding, perhaps doing it the other way
around (as I mentioned above) would negate the consequences of an expanded
mesh (I have no idea what I'm talking about really--shot in the dark)?

-Sam

On Fri, Aug 27, 2010 at 3:51 PM, Anne Archibald
<aarchiba at physics.mcgill.ca>wrote:

> On 27 August 2010 15:27, Sam Birch <sam.m.birch at gmail.com> wrote:
> >> Bandwidth selection is a hotly debated topic, at least in one
> >
> > dimension, so perhaps not just different methods but tools for
> >
> > diagnosing bandwidth selection problems would be nice - at the least,
> >
> > it should be made straightforward to vary the bandwidth (e.g. to plot
> >
> > the KDE with a range of different bandwidth values).
> >
> > Well by allowing them to use a custom bandwidth matrix they can vary it
> > themselves, no?
>
> Well, in principle, yes. But if the API forces them to construct an
> entirely new KDE object to change the bandwidth matrix, and if this
> object involves substantial additional data structures (e.g. a kd-tree
> holding the data points) this could be cumbersome.
>
> >> At the other end of the spectrum, for very dense KDEs, on the circle I
> >
> > found it extremely convenient to use Fourier transforms to carry out
> >
> > the convolution of kernel with points. In particular, I represented
> >
> > the KDE in terms of its Fourier coefficients, so that an inverse FFT
> >
> > immediately gave me the KDE evaluated on a grid (or, with some
> >
> > fiddling, integrated over the bins of a histogram). I don't know
> >
> > whether this is a useful optimization for KDEs on the line or in
> >
> > higher dimensions, since there's the problem of wrapping.
> >
> > That sounds very interesting. Sorry if I'm being dense (or just wrong, or
> > both), but do you convolve post-FFT or before? If before why does it make
> it
> > easier?
>
> Again, this is for work on the circle and for fairly dense data sets.
> But in principle, the KDE as a function is the convolution of a forest
> of delta functions, one per point, with the kernel. The conventional
> way to evaluate this function at a point is simply to evaluate the
> kernel once per data point and add them up. To evaluate this on a
> grid, you evaluate the kernel once per grid point per data point and
> add. Naturally this can be expensive.
>
> My idea is that convolution of functions is simply multiplication of
> the Fourier transforms of those functions. So instead of storing a
> list of data points in my KDE object, I store a representation of the
> forest of delta functions in terms of their Fourier coefficients (the
> nth Fourier coefficient of a delta function at phase p is exp(2 pi i n
> p)). This is necessarily approximate, since I store finitely many
> Fourier coefficients, but it's not hard to store "enough". Now when I
> want to convolve this forest by a kernel, I simply multiply these
> Fourier coefficients by those of the kernel. The easiest "kernel" is
> the sinc function, for which I simply truncate the Fourier
> coefficients (which is why it's easy to have enough). We actually use
> this "kernel" a lot, even though it's not positive everywhere. A
> mathematically-better choice is the von Mises distribution, whose PDF
> is proportional to exp(k cos(x)) and whose Fourier coefficients can be
> written in terms of Bessel functions.
>
> Once you have the Fourier coefficients of the KDE, you can evaluate it
> at a point by taking a sum of sinusoids, but the key idea is that you
> can evaluate it on a grid by taking an inverse FFT. If you want
> integrals over intervals, well, that you just get by integrating
> sinusoids over intervals, so there's a messy but easily-derived way to
> work out the area in terms of the Fourier coefficients. This too can
> be nicely worked out on a grid, by fiddling the Fourier coefficients
> and taking an inverse FFT.
>
>
> To construct the FCs of the forest of delta functions, if I have
> photon arrival phases I just take a sum (which can be slow, but this
> isn't really time-critical). But it's also perfectly reasonable to
> start from a histogram and take an FFT. Crucially, the histogram need
> not be a reasonable-looking histogram - you can never have too many
> bins, since it's not a problem if all the bin counts are either zero
> or one. The only drawback here is that you introduce an error
> averaging to half a bin width on each photon arrival phase. But the
> KDE provides a check on this too - if your kernel width is much larger
> than the width of the input bins, then the errors you introduced
> probably don't matter much (leaving aside nasty issues with Moire
> patterns in the likely case that your input times were already
> binned).
>
> One thing to note here is that once you have the FCs, you can try
> various kernels and bandwidths without going back to your original
> data. (You can also get uncertainties on all the various computed
> quantities, and in fact you can usually turn around and not only start
> from a histogram but start from an array of values with uncertainties.
> All this stuff is in a paper that's on my back burner right now.)
>
>
> The thing is, I don't really know how useful all this is for KDEs on a
> line or in R^n. The problem is that working with discrete Fourier
> coefficients implicitly wraps the KDE around at the ends of the
> interval, and it's not clear that this is still worth doing if you're
> going to "pad" your region enough that this isn't a problem: the
> padding forces you to evaluate at lots of points you don't care about
> and use lots more Fourier coefficients than you would otherwise have
> to.
>
>
> Anne
>
> > -Sam
> > On Fri, Aug 27, 2010 at 2:48 PM, Anne Archibald <
> aarchiba at physics.mcgill.ca>
> > wrote:
> >>
> >> My only experience with KDEs has been on the circle, where there seems
> >> to be little or no literature and the constraints are rather
> >> different.
> >>
> >> On 27 August 2010 14:38,  <josef.pktd at gmail.com> wrote:
> >> > On Fri, Aug 27, 2010 at 2:17 PM, Sam Birch <sam.m.birch at gmail.com>
> >> > wrote:
> >> >> Hi all,
> >> >> I was thinking of renovating the kernel density estimation package
> >> >> (although
> >> >> no promises; I'm leaving for college tomorrow morning!). I was
> >> >> wondering:
> >> >> a) whether anyone had started code in that direction
> >> >
> >> > Mike Crowe wrote code for kernel regression  and Skipper started a 1D
> >> > kernel density estimator in scikits.statsmodels, which cover a larger
> >> > number of kernels
> >> >
> >> > I don't think I have seen any higher dimensional kernel density
> >> > estimation in python besides scipy.stats.kde. The Gaussian kde in
> >> > scipy.stats is targeted to the underlying Fortran code for
> >> > multivariate normal cdf.
> >> > It's not clear to me what other n-dimensional kdes would require or
> >> > whether they would fit well with the current code.
> >> >
> >> > One extension that Robert also mentioned in the past that it would be
> >> > nice to have adaptive kernels, which I also haven't seen in python
> >> > yet.
> >> >
> >> >> b) what people want in it
> >> >> I was thinking (as an ideal, not necessarily goal):
> >> >> - Support for more than Gaussian kernels (e.g. custom,
> >> >> uniform, Epanechnikov, triangular, quartic, cosine, etc.)
> >> >> - More options for bandwidth selection (custom bandwidth matrices,
> >> >> AMISE
> >> >> optimization, cross-validation, etc.)
> >> >
> >> > definitely yes, I don't think they are even available for 1D yet.
> >>
> >> Bandwidth selection is a hotly debated topic, at least in one
> >> dimension, so perhaps not just different methods but tools for
> >> diagnosing bandwidth selection problems would be nice - at the least,
> >> it should be made straightforward to vary the bandwidth (e.g. to plot
> >> the KDE with a range of different bandwidth values).
> >>
> >> >> - Assorted conveniences: automatically generate the mesh, limit the
> >> >> kernel's
> >> >> support for speed
> >> >
> >> > Using scipy.spatial to limit the number of neighbors in a bounded
> >> > support kernel might be a good idea.
> >>
> >> Simply using it to find the neighbors that need to be used should
> >> speed things up. There may also be some shortcuts for
> >> unbounded-support kernels (no point adding a Gaussian a hundred sigma
> >> away if there's any points nearby).
> >>
> >> At the other end of the spectrum, for very dense KDEs, on the circle I
> >> found it extremely convenient to use Fourier transforms to carry out
> >> the convolution of kernel with points. In particular, I represented
> >> the KDE in terms of its Fourier coefficients, so that an inverse FFT
> >> immediately gave me the KDE evaluated on a grid (or, with some
> >> fiddling, integrated over the bins of a histogram). I don't know
> >> whether this is a useful optimization for KDEs on the line or in
> >> higher dimensions, since there's the problem of wrapping.
> >>
> >> Anne
> >>
> >> > (just some thought on the topic)
> >> >
> >> > Josef
> >> >
> >> >> So, thoughts anyone? I figure it's better to over-specify and then
> >> >> under-produce, so don't hold back.
> >> >> Thanks,
> >> >> Sam
> >> >> _______________________________________________
> >> >> SciPy-Dev mailing list
> >> >> SciPy-Dev at scipy.org
> >> >> http://mail.scipy.org/mailman/listinfo/scipy-dev
> >> >>
> >> >>
> >> > _______________________________________________
> >> > SciPy-Dev mailing list
> >> > SciPy-Dev at scipy.org
> >> > http://mail.scipy.org/mailman/listinfo/scipy-dev
> >> >
> >> _______________________________________________
> >> SciPy-Dev mailing list
> >> SciPy-Dev at scipy.org
> >> http://mail.scipy.org/mailman/listinfo/scipy-dev
> >
> >
> > _______________________________________________
> > SciPy-Dev mailing list
> > SciPy-Dev at scipy.org
> > http://mail.scipy.org/mailman/listinfo/scipy-dev
> >
> >
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20100827/28fa39c2/attachment.html>


More information about the SciPy-Dev mailing list