Mailman 3 Weighted KDE - SciPy-User

Weighted KDE

Zachary Pincus

13 May 2012

10:37 p.m.

Hello all,

A while ago, someone asked on this list about whether it would be simple to modify scipy.stats.kde.gaussian_kde to deal with weighted data:
http://mail.scipy.org/pipermail/scipy-user/2008-November/018578.html

Anne and Robert assured the writer that this was pretty simple (modulo bandwidth selection), though I couldn't find any code that the original author may have generated based on that advice.

I've got a problem that could (perhaps) be solved neatly with weighted KDE, so I'd like to give this a go. I assume that at a minimum, to get basic gaussian_kde.evaluate() functionality:

(1) The covariance calculation would need to be replaced by a weighted-covariance calculation. (Simple enough.)

(2) In evaluate(), the critical part looks like this (and a similar stanza that loops over the points instead):

    # if there are more points than data, so loop over data
    for i in range(self.n):
        diff = self.dataset[:, i, newaxis] - points
        tdiff = dot(self.inv_cov, diff)
        energy = sum(diff*tdiff,axis=0) / 2.0
        result = result + exp(-energy)

I assume that, further, the 'diff' values ought to be scaled by the weights, too. Is this all that would need to be done? (For the integration and resampling, obviously, there would be a bit of other work...)

Thanks,
Zach
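A minimal sketch of the two changes listed in the message above, written against plain NumPy rather than the scipy source. It is not code from this thread. Assumptions worth flagging: the weight multiplies each kernel's contribution (the exp(-energy) term) rather than the 'diff' vector itself, which is the usual definition of a weighted KDE; the weights are normalised to sum to 1; and the bandwidth uses Scott's rule with an effective sample size 1/sum(w**2), a choice the thread leaves open. The name weighted_kde_evaluate and its parameters are hypothetical.

```python
import numpy as np

def weighted_kde_evaluate(dataset, weights, points, bw_factor=None):
    """Evaluate a weighted Gaussian KDE (illustrative sketch).

    dataset: (d, n) array of data, weights: (n,), points: (d, m).
    """
    dataset = np.atleast_2d(dataset).astype(float)
    points = np.atleast_2d(points).astype(float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # normalise so the weights sum to 1
    d, n = dataset.shape

    # Assumption: Scott's rule with an effective sample size in place of n.
    neff = 1.0 / np.sum(w ** 2)
    if bw_factor is None:
        bw_factor = neff ** (-1.0 / (d + 4))

    # Step (1): weighted covariance, with the usual bias correction.
    mean = dataset @ w
    centered = dataset - mean[:, None]
    cov = (centered * w) @ centered.T / (1.0 - np.sum(w ** 2))
    cov = np.atleast_2d(cov) * bw_factor ** 2
    inv_cov = np.linalg.inv(cov)
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * cov))

    # Step (2): same loop over the data as in evaluate(), but each
    # kernel is scaled by its weight (the diff itself is left alone).
    result = np.zeros(points.shape[1])
    for i in range(n):
        diff = dataset[:, i, None] - points
        tdiff = inv_cov @ diff
        energy = np.sum(diff * tdiff, axis=0) / 2.0
        result += w[i] * np.exp(-energy)
    return result / norm
```

With equal weights the weighted covariance reduces to the ordinary unbiased sample covariance and the effective sample size reduces to n, so the sketch falls back to the unweighted estimate.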


josef.pktd＠gmail.com

13 May

11:47 p.m.

On Sun, May 13, 2012 at 1:07 PM, Zachary Pincus <zachary.pincus@yale.edu> wrote:

I assume that, further, the 'diff' values ought to be scaled by the weights, too. Is this all that would need to be done? (For the integration and resampling, obviously, there would be a bit of other work...)

it looks to me that way, scaled according to weight by dataset points

I don't see what the norm_factor should be:

    self._norm_factor = sqrt(linalg.det(2*pi*self.covariance)) * self.n

there should be the weights somewhere in there, maybe just replace self.n by sum(weights) given a constant covariance

sampling doesn't look difficult, if we want biased sampling, then instead of randint, we would need weighted randint (non-uniform)

integration might require more work, or not (I never tried to understand them)

(I don't know if kde in statsmodels has weights on the schedule.)

Josef
mostly guessing
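The two guesses in this reply can be sketched as follows. This is an illustration under assumptions, not the scipy implementation: weighted_norm_factor simply swaps self.n for sum(weights) as suggested, and weighted_resample replaces the uniform randint draw with a draw whose probabilities are proportional to the weights. Both function names and the kernel_cov parameter are hypothetical.

```python
import numpy as np

def weighted_norm_factor(covariance, weights):
    """Normalisation with sum(weights) in place of n (the guess above)."""
    cov = np.atleast_2d(covariance)
    return np.sqrt(np.linalg.det(2.0 * np.pi * cov)) * np.sum(weights)

def weighted_resample(dataset, weights, kernel_cov, size, rng=None):
    """Draw `size` samples from a weighted Gaussian KDE (sketch).

    dataset: (d, n) array, weights: (n,), kernel_cov: (d, d) kernel covariance.
    """
    rng = np.random.default_rng() if rng is None else rng
    dataset = np.atleast_2d(dataset).astype(float)
    d, n = dataset.shape
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()

    # "Weighted randint": index i is drawn with probability p[i]
    # instead of uniformly from 0..n-1.
    indices = rng.choice(n, size=size, p=p)

    # Add Gaussian noise with the kernel covariance around each chosen point.
    noise = rng.multivariate_normal(np.zeros(d), np.atleast_2d(kernel_cov), size=size).T
    return dataset[:, indices] + noise
```

If all weights are 1, sum(weights) equals n and the index draw is uniform, so both helpers reduce to the unweighted behaviour quoted in the message.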

_______________________________________________
SciPy-User mailing list
SciPy-User@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user

Jackson Li

13 Jan

9:38 p.m.

Hi,

I am facing the same problem as well, but can't figure out how the weighting should be done exactly.

Has anybody successfully completed the modification of the code to allow a weighted KDE? I am attempting to perform KDE on a set of imaging data with X, Y, and an additional "temperature" column.

Performing the KDE on only the X,Y axes gives a working heatmap showing the spatial distribution of the data points, but I would also like to use them to see the "temperature" profile (the third axis), much like a geographical heatmap showing temperature or rainfall values over an X-Y map.

I found another set of code at http://pastebin.com/LNdYCZgw which allows weighted KDE, but when I tried it out with my data, it took much longer than the normal KDE (>1 hour), whereas the original code took only about twenty seconds (despite claims that it was faster).

Thanks,
Jackson

Joe Kington

10:01 p.m.

For what it's worth, the code you linked to is much slower for small sample sizes. It's only faster with large numbers (>1e4) of points.

It also has a bit of a different use case than gaussian_kde. It's only intended for making a regularly gridded KDE of a very large number of points on a relatively fine grid. It bins the data onto a regular grid and convolves it with an appropriate Gaussian kernel. This is a reasonable approximation when you're dealing with a large number of points, but not so reasonable if you only have a handful. Because the size of the Gaussian kernel can be very large when the sample size is low, the convolution can be very slow for small sample sizes.

Also, if I recall correctly, there's a stray flipud that got left in there. You'll want to take it out.

However, are you sure that you want a kernel density estimate? What you're describing sounds like interpolation, not a weighted KDE.

As an example, a weighted KDE would be used when you wanted to show the density of point estimates while weighting it by error in the location of the point.

Instead, it sounds like you have a third variable that you want to make a continuous map of based on irregularly sampled points. If so, have a look at scipy.interpolate (and particularly scipy.interpolate.Rbf).

Hope that helps,
-Joe
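The bin-then-convolve idea described in this message can be sketched roughly as below. This is not the pastebin code; it is a simplified illustration that assumes a fixed kernel width given in grid cells (the hypothetical sigma_cells parameter) instead of deriving the kernel from the data covariance. The function name and the bounds parameter are also hypothetical.

```python
import numpy as np

def gridded_weighted_kde(x, y, weights, gridsize=(100, 100), sigma_cells=3.0, bounds=None):
    """Approximate weighted 2-D KDE: bin the points, then smooth the grid."""
    # Weighted 2-D histogram: each cell holds the summed weight of its points.
    hist, xedges, yedges = np.histogram2d(x, y, bins=gridsize, range=bounds, weights=weights)

    # 1-D Gaussian kernel measured in grid cells, truncated at 4 sigma.
    # Assumes the kernel is shorter than each grid axis.
    half = int(np.ceil(4 * sigma_cells))
    offsets = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (offsets / sigma_cells) ** 2)
    kernel /= kernel.sum()

    # An axis-aligned Gaussian is separable, so smooth along each axis in turn.
    smooth = np.apply_along_axis(np.convolve, 0, hist, kernel, mode="same")
    smooth = np.apply_along_axis(np.convolve, 1, smooth, kernel, mode="same")

    # Convert summed weight per cell into weight per unit area.
    cell_area = (xedges[1] - xedges[0]) * (yedges[1] - yedges[0])
    return smooth / cell_area, xedges, yedges
```

The cost here depends on the grid and kernel sizes rather than on the number of points, which matches the remark that this approach only pays off for large samples; a wide kernel makes each convolution more expensive.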


Joe Kington

10:14 p.m.


For what it's worth, the code you linked to is much slower for small sample sizes. It's only faster with large numbers (>1e4) of points.

It also has a bit of a different use case than gaussian_kde. It's only intended for making a regularly gridded KDE of a very large number of points on a relatively fine grid. It bins the data onto a regular grid and convolves it with an appropriate Gaussian kernel. This is a reasonable approximation when you're dealing with a large number of points, but not so reasonable if you only have a handful. Because the size of the Gaussian kernel can be very large when the sample size is low, the convolution can be very slow for small sample sizes.

Also, if I recall correctly, there's a stray flipud that got left in there. You'll want to take it out. (Also, while I think that got posted only a couple of years ago, I wrote it much longer ago than that... There's some less-than-ideal code in there...)

However, are you sure that you want a kernel density estimate? What you're describing sounds like interpolation, not a weighted KDE.

As an example, a weighted KDE would be used when you wanted to show the density of point estimates while weighting it by error in the location of the point.

Instead, it sounds like you have a third variable that you want to make a continuous map of based on irregularly sampled points. If so, have a look at scipy.interpolate (and particularly scipy.interpolate.Rbf).

Hope that helps,
-Joe
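For the interpolation route suggested above, a small sketch with scipy.interpolate.Rbf might look like this. The data are synthetic stand-ins for the X, Y and "temperature" columns, and the choice of the "linear" basis function is an assumption made for illustration; other basis functions may suit real data better. Newer SciPy releases also provide RBFInterpolator as an alternative.

```python
import numpy as np
from scipy.interpolate import Rbf

# Synthetic, irregularly sampled points with a third "temperature" variable.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 40)
y = rng.uniform(0, 10, 40)
temperature = 20 + np.sin(x) + 0.5 * y

# Fit a radial basis function interpolant through the samples.
rbf = Rbf(x, y, temperature, function="linear")

# Evaluate on a regular grid to get a continuous map of the third variable.
xi, yi = np.meshgrid(np.linspace(0, 10, 50), np.linspace(0, 10, 50))
temperature_map = rbf(xi, yi)
```

Unlike a density estimate, the result stays in the units of the third variable and passes through the sampled values, regardless of how densely the points are clustered.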


Joe Kington

10:23 p.m.

On Sun, Jan 13, 2013 at 10:44 AM, Joe Kington <joferkington@gmail.com> wrote:


As an example, a weighted KDE would be used when you wanted to show the density of point estimates while weighting it by error in the location of the point.

I shouldn't have said "error in the location of the point". I guess it would be more like "confidence that the point exists" or, more accurately, "magnitude of the point". Otherwise, the size of the Gaussian kernel would have to change depending on the data involved.

As another (not exact) example, it can be handy when you want to sum some attribute over a map to yield a density estimate per unit area (e.g. population density, where you have populations of cities as your point measurements).

In other words, if you want your temperature values to be summed per unit area, then it's what you want. If you want to interpolate, it's not what you want.
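The "summed per unit area" reading can be illustrated with a one-dimensional toy version of the population example. All numbers below are made up, and the bandwidth is picked by hand rather than estimated from the data.

```python
import numpy as np

# Hypothetical cities along a line: position (km) and population.
position = np.array([10.0, 40.0, 45.0, 80.0])
population = np.array([5000.0, 20000.0, 1000.0, 8000.0])
bandwidth = 5.0  # km, chosen by hand for this illustration

grid = np.linspace(-50.0, 150.0, 2001)
dx = grid[1] - grid[0]

# Each city spreads its population over a Gaussian that integrates to 1,
# so the weights are deliberately NOT normalised here.
z = (grid[:, None] - position[None, :]) / bandwidth
kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
density = kernels @ population  # people per km at each grid position

# Integrating the density recovers the total population (up to grid error).
total = density.sum() * dx
```

The density near 40 to 45 km reflects both cities added together, which is the behaviour wanted for population density but not for mapping a quantity such as temperature.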


Zachary Pincus

14 Jan

9:38 p.m.


Here's a modification of the scipy KDE code that I made to perform weighting, as per the earlier discussion. No guarantees as to correctness, but it seems to be right-ish?

Zach
