[SciPy-User] Weighted KDE

Joe Kington joferkington at gmail.com
Sun Jan 13 11:44:36 EST 2013


On Sun, Jan 13, 2013 at 10:08 AM, Jackson Li <sonicatedboom-s at yahoo.com> wrote:

>  <josef.pktd <at> gmail.com> writes:
>
> >
> > On Sun, May 13, 2012 at 1:07 PM, Zachary Pincus <zachary.pincus <at> yale.edu> wrote:
> > > Hello all,
> > >
> > > A while ago, someone asked on this list about whether it would be
> > > simple to modify scipy.stats.kde.gaussian_kde to deal with weighted data:
> > > http://mail.scipy.org/pipermail/scipy-user/2008-November/018578.html
> > >
> > > Anne and Robert assured the writer that this was pretty simple (modulo
> > > bandwidth selection), though I couldn't find any code that the original
> > > author may have generated based on that advice.
> > >
> > > I've got a problem that could (perhaps) be solved neatly with weighted
> > > KDE, so I'd like to give this a go. I assume that at a minimum, to get
> > > basic gaussian_kde.evaluate() functionality:
> > >
> > > (1) The covariance calculation would need to be replaced by a
> > > weighted-covariance calculation. (Simple enough.)
> > >
> > > (2) In evaluate(), the critical part looks like this (and a similar
> > > stanza that loops over the points instead):
> > > # if there are more points than data, so loop over data
> > > for i in range(self.n):
> > >    diff = self.dataset[:, i, newaxis] - points
> > >    tdiff = dot(self.inv_cov, diff)
> > >    energy = sum(diff*tdiff,axis=0) / 2.0
> > >    result = result + exp(-energy)
> > >
> > > I assume that, further, the 'diff' values ought to be scaled by the
> > > weights, too. Is this all that would need to be done? (For the
> > > integration and resampling, obviously, there would be a bit of other
> > > work...)
> >
> > it looks that way to me, scaled according to the weight of each dataset point
> >
> > I don't see what the norm_factor should be:
> >       self._norm_factor = sqrt(linalg.det(2*pi*self.covariance)) * self.n
> > there should be the weights somewhere in there, maybe just replace
> > self.n by sum(weights) given a constant covariance
> >
> > sampling doesn't look difficult, if we want biased sampling, then
> > instead of randint, we would need weighted randint (non-uniform)
> >
> > integration might require more work, or not (I never tried to
> > understand them)
> >
> > (I don't know if kde in statsmodels has weights on the schedule.)
> >
> > Josef
> > mostly guessing
> >
> > >
> > > Thanks,
> > > Zach
>
> Hi,
>
> I am facing the same problem as well, but can't figure out how the
> weighting should be done exactly.
>
> Has anybody successfully completed the modification of the code to allow a
> weighted kde? I am attempting to perform kde on a set of imaging data with
> X, Y, and an additional "temperature" column.
>
> Performing the kde on only the X,Y axes gives a working heatmap showing the
> spatial distribution of the data points, but I would also like to use them
> to see the "temperature" profile (the third axis), much like a geographical
> heatmap showing temperature or rainfall values over an X-Y map.
>
> I found another set of code from
> http://pastebin.com/LNdYCZgw
> which allows weighted kde, but when I tried it out with my data, it took
> much longer than the normal kde (>1 hour), when the original code took only
> about twenty seconds (despite claims that it was faster).
>
> Thanks,
> Jackson
>

For what it's worth, the code you linked to is much slower for small sample
sizes; it's only faster with large numbers (>1e4) of points.  It also has a
somewhat different use case than gaussian_kde: it's intended for making a
regularly gridded KDE of a very large number of points on a relatively fine
grid.  It bins the data onto a regular grid and convolves the result with
an appropriate gaussian kernel.  That's a reasonable approximation when
you're dealing with a large number of points, but not so reasonable if you
only have a handful.  Because the gaussian kernel has to be very large when
the sample size is small, the convolution becomes very slow in exactly that
case.  Also, if I recall correctly, there's a stray flipud that got left in
there; you'll want to take it out.  (Also, while I think that got posted
only a couple of years ago, I wrote it much longer ago than that... There's
some less-than-ideal code in there...)
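
Just for context, the basic idea there is roughly the sketch below. This is
an illustrative toy, not the pastebin code itself; the grid size and the
kernel width (in grid cells) are arbitrary placeholders, where a real
implementation would derive the bandwidth from the data covariance:

import numpy as np
from scipy import ndimage

def binned_kde_2d(x, y, gridsize=(256, 256), sigma=5.0):
    """Approximate 2D KDE: histogram the points onto a regular grid,
    then smooth the counts with a gaussian kernel.

    sigma is the kernel width in grid cells, purely for illustration;
    a real implementation would derive it from the data covariance
    (e.g. Scott's rule).
    """
    counts, xedges, yedges = np.histogram2d(x, y, bins=gridsize)
    density = ndimage.gaussian_filter(counts, sigma=sigma)
    # Normalize so the result integrates to ~1 over the grid
    cell_area = np.diff(xedges)[0] * np.diff(yedges)[0]
    density /= density.sum() * cell_area
    return density, xedges, yedges

(Here the smoothing is just ndimage.gaussian_filter for brevity; the slow
step in the linked code is the convolution itself, since the kernel gets
large relative to the grid when there are only a few points.)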

However, are you sure that you want a kernel density estimate?  What you're
describing sounds like interpolation, not a weighted KDE.

As an example, a weighted KDE would be used when you want to show the
density of point estimates while weighting each point by the error in its
location.
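
(If you do decide you need a weighted KDE, a rough, untested sketch of the
modification discussed in the quoted thread is below. The function name and
signature are just placeholders, not anything in scipy; each kernel's
contribution is multiplied by its weight, and sum(weights) takes the place
of self.n in the normalization, as Josef suggested.)

import numpy as np

def weighted_kde_evaluate(dataset, weights, points, cov):
    """Evaluate a gaussian KDE with one weight per data point.

    dataset : (d, n) array of samples
    weights : (n,) array of non-negative weights
    points  : (d, m) array of positions at which to evaluate
    cov     : (d, d) kernel covariance (bandwidth factor already applied)
    """
    weights = np.asarray(weights, dtype=float)
    inv_cov = np.linalg.inv(cov)
    # sum(weights) replaces self.n in gaussian_kde's _norm_factor
    norm_factor = np.sqrt(np.linalg.det(2 * np.pi * cov)) * weights.sum()

    result = np.zeros(points.shape[1])
    for i in range(dataset.shape[1]):  # loop over data, as in gaussian_kde.evaluate
        diff = dataset[:, i, np.newaxis] - points
        tdiff = np.dot(inv_cov, diff)
        energy = np.sum(diff * tdiff, axis=0) / 2.0
        result += weights[i] * np.exp(-energy)  # weight each kernel's contribution
    return result / norm_factor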

Instead, it sounds like you have a third variable that you want to make a
continuous map of based on irregularly sampled points.  If so, have a look
at scipy.interpolate (and particularly scipy.interpolate.Rbf).
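
For instance, something along these lines (the random sample data, the
grid, and the 'linear' basis function are just placeholders to show the
shape of the API):

import numpy as np
from scipy.interpolate import Rbf

# x, y are the sample coordinates; temp is the value measured at each point
x, y, temp = np.random.random((3, 50))

# Build a radial basis function interpolator from the scattered samples
rbf = Rbf(x, y, temp, function='linear')

# Evaluate it on a regular grid to get a continuous "temperature" map
xi, yi = np.mgrid[0:1:200j, 0:1:200j]
temp_map = rbf(xi, yi)

You could then display temp_map with imshow or pcolormesh, and optionally
mask out regions far from any sample so you aren't showing pure
extrapolation.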

Hope that helps,
-Joe

