Adding weights to cov and corrcoef

Hi all,

in Pull Request https://github.com/numpy/numpy/pull/3864 Noel Dawe suggested adding new parameters to our `cov` and `corrcoef` functions to implement weights, as `average` already supports (the PR still needs to be adapted).

The idea right now would be to add `weights` and `frequencies` keyword arguments to these functions.

In more detail: the situation is a bit more complex for `cov` and `corrcoef` than for `average`, because there are different types of weights. The current plan would be to add two new keyword arguments:

* weights: uncertainty weights, which cause `N` to be recalculated accordingly (this is R's `cov.wt` default, I believe).
* frequencies: when given, `N = sum(frequencies)` and the values are weighted by their frequency.

We chose these two because the uncertainty type of weights did not seem obvious, while other types of weights should be easy to implement by scaling `frequencies` (i.e. one may want `sum(frequencies) == len(data)`).

However, we may have missed something obvious, maybe it is already getting too statistical for NumPy, or the keyword arguments might be better named `uncertainties` and `frequencies`. So comments and insights are very welcome :).

Regards,

Sebastian
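[To make the proposal concrete, here is a minimal sketch of what the `frequencies` handling could compute. It is purely illustrative, not the PR's actual code; the helper name and signature are assumptions. It follows the `np.cov` convention that rows are variables and columns are observations.]

    import numpy as np

    def weighted_cov(m, frequencies=None, ddof=1):
        # Hypothetical sketch of the proposed `frequencies` behaviour:
        # N = sum(frequencies), and each observation is weighted by its frequency.
        m = np.asarray(m, dtype=float)            # rows = variables, columns = observations
        if frequencies is None:
            frequencies = np.ones(m.shape[1])
        f = np.asarray(frequencies, dtype=float)
        N = f.sum()
        mean = (m * f).sum(axis=1) / N            # frequency-weighted mean of each variable
        d = m - mean[:, None]
        return (d * f) @ d.T / (N - ddof)         # N is sum(frequencies), not the column count

With `frequencies` equal to all ones this reduces to `np.cov(m)`.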

On Wed, Mar 5, 2014 at 4:45 PM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
Hi all,
in Pull Request https://github.com/numpy/numpy/pull/3864 Noel Dawe suggested adding new parameters to our `cov` and `corrcoef` functions to implement weights, as `average` already supports (the PR still needs to be adapted).
The idea right now would be to add `weights` and `frequencies` keyword arguments to these functions.
In more detail: the situation is a bit more complex for `cov` and `corrcoef` than for `average`, because there are different types of weights. The current plan would be to add two new keyword arguments:
* weights: uncertainty weights, which cause `N` to be recalculated accordingly (this is R's `cov.wt` default, I believe).
* frequencies: when given, `N = sum(frequencies)` and the values are weighted by their frequency.
I don't understand this description at all. One of them recalculates N, and the other sets N according to some calculation?

Is there a standard reference on how these are supposed to be interpreted? When you talk about per-value uncertainties, I start imagining that we're trying to estimate a population covariance given a set of samples each corrupted by independent measurement noise, and then there's some natural hierarchical Bayesian model one could write down and get an ML estimate of the latent covariance via empirical Bayes or something. But this requires a bunch of assumptions, and is that really what we want to do? (Or maybe it collapses down into something simpler if the measurement noise is Gaussian or something?)

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

On Thu, Mar 6, 2014 at 2:51 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Wed, Mar 5, 2014 at 4:45 PM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
Hi all,
in Pull Request https://github.com/numpy/numpy/pull/3864 Noel Dawe suggested adding new parameters to our `cov` and `corrcoef` functions to implement weights, as `average` already supports (the PR still needs to be adapted).
The idea right now would be to add `weights` and `frequencies` keyword arguments to these functions.
In more detail: the situation is a bit more complex for `cov` and `corrcoef` than for `average`, because there are different types of weights. The current plan would be to add two new keyword arguments:
* weights: uncertainty weights, which cause `N` to be recalculated accordingly (this is R's `cov.wt` default, I believe).
* frequencies: when given, `N = sum(frequencies)` and the values are weighted by their frequency.
I don't understand this description at all. One of them recalculates N, and the other sets N according to some calculation?
Is there a standard reference on how these are supposed to be interpreted? When you talk about per-value uncertainties, I start imagining that we're trying to estimate a population covariance given a set of samples each corrupted by independent measurement noise, and then there's some natural hierarchical Bayesian model one could write down and get an ML estimate of the latent covariance via empirical Bayes or something. But this requires a bunch of assumptions and is that really what we want to do? (Or maybe it collapses down into something simpler if the measurement noise is gaussian or something?)
I think the idea is that if you write formulas involving correlation or covariance using matrix notation, then these formulas can be generalized in several different ways by inserting some non-negative or positive diagonal matrices into the formulas in various places. The diagonal entries could be called 'weights'. If they are further restricted to sum to 1 then they could be called 'frequencies'. Or maybe this is too cynical and the jargon has a more standard meaning in this context.
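[For what it's worth, here is a small NumPy illustration of that "insert a diagonal matrix" reading. This is my own sketch of the idea above, not an agreed-on definition: with non-negative weights w summing to 1, the weighted (biased) covariance is the usual formula with diag(w) inserted.]

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.randn(3, 50)                   # 3 variables, 50 observations
    w = np.full(50, 1.0 / 50)              # non-negative weights summing to 1

    mean = X @ np.diag(w) @ np.ones(50)    # weighted mean, with diag(w) made explicit
    Xc = X - mean[:, None]
    C = Xc @ np.diag(w) @ Xc.T             # weighted (biased) covariance matrix

    # With equal weights this is just the ordinary biased estimate:
    assert np.allclose(C, np.cov(X, bias=True))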

On Thu, 2014-03-06 at 19:51 +0000, Nathaniel Smith wrote:
On Wed, Mar 5, 2014 at 4:45 PM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
Hi all,
in Pull Request https://github.com/numpy/numpy/pull/3864 Noel Dawe suggested adding new parameters to our `cov` and `corrcoef` functions to implement weights, as `average` already supports (the PR still needs to be adapted).
The idea right now would be to add `weights` and `frequencies` keyword arguments to these functions.
In more detail: the situation is a bit more complex for `cov` and `corrcoef` than for `average`, because there are different types of weights. The current plan would be to add two new keyword arguments:
* weights: uncertainty weights, which cause `N` to be recalculated accordingly (this is R's `cov.wt` default, I believe).
* frequencies: when given, `N = sum(frequencies)` and the values are weighted by their frequency.
I don't understand this description at all. One of them recalculates N, and the other sets N according to some calculation?
Is there a standard reference on how these are supposed to be interpreted? When you talk about per-value uncertainties, I start imagining that we're trying to estimate a population covariance given a set of samples each corrupted by independent measurement noise, and then there's some natural hierarchical Bayesian model one could write down and get an ML estimate of the latent covariance via empirical Bayes or something. But this requires a bunch of assumptions and is that really what we want to do? (Or maybe it collapses down into something simpler if the measurement noise is gaussian or something?)
I had really hoped someone who knows this stuff very well would show up ;). I think these weights were uncertainties under a Gaussian assumption, and the other types of weights are different; see `aweights` here: http://www.stata.com/support/faqs/statistics/weights-and-summary-statistics/. But I did not check a statistics book, and I don't have one here right now (e.g. Wikipedia is less than helpful).

Frankly, unless there is some "obviously right" thing (for a statistician), I would be careful about adding such new features. And while I thought before that this might be the case, it isn't clear to me.

- Sebastian
-n

On Thu, Mar 6, 2014 at 2:51 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Wed, Mar 5, 2014 at 4:45 PM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
Hi all,
in Pull Request https://github.com/numpy/numpy/pull/3864 Noel Dawe suggested adding new parameters to our `cov` and `corrcoef` functions to implement weights, as `average` already supports (the PR still needs to be adapted).
The idea right now would be to add `weights` and `frequencies` keyword arguments to these functions.
In more detail: the situation is a bit more complex for `cov` and `corrcoef` than for `average`, because there are different types of weights. The current plan would be to add two new keyword arguments:
* weights: uncertainty weights, which cause `N` to be recalculated accordingly (this is R's `cov.wt` default, I believe).
* frequencies: when given, `N = sum(frequencies)` and the values are weighted by their frequency.
I don't understand this description at all. One of them recalculates N, and the other sets N according to some calculation?
Is there a standard reference on how these are supposed to be interpreted? When you talk about per-value uncertainties, I start imagining that we're trying to estimate a population covariance given a set of samples each corrupted by independent measurement noise, and then there's some natural hierarchical Bayesian model one could write down and get an ML estimate of the latent covariance via empirical Bayes or something. But this requires a bunch of assumptions and is that really what we want to do? (Or maybe it collapses down into something simpler if the measurement noise is gaussian or something?)
In general, going mostly based on Stata:

Frequency weights are just a shortcut if you have repeated observations. In my unit tests, the results are the same as using np.repeat, IIRC. The total number of observations is the sum of the weights.

aweights and pweights are mainly like weights in WLS, reflecting the uncertainty of each observation. The number of observations is equal to the number of rows. (Stata internally rescales the weights.) One explanation is that observations are measured with different noise; another is that observations represent the means of subsamples with different numbers of observations.

There is an additional degrees-of-freedom correction in one of the proposed calculations, modeled after other packages, that I never figured out.

(Aside: statsmodels does not normalize the scale in WLS, in contrast to Stata, and it is now equivalent to GLS with diagonal sigma. The meaning of weight=1 depends on the user. nobs is the number of rows.)

No Bayesian analysis involved, but I guess someone could come up with a Bayesian interpretation.

I think the two proposed weight types, weights and frequencies, should be able to handle almost all cases.

Josef
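[A quick check of the np.repeat equivalence described above -- my own sketch, assuming `N = sum(freq)` and the usual `ddof=1` correction, which is what makes the two results line up:]

    import numpy as np

    rng = np.random.RandomState(42)
    X = rng.randn(2, 6)                          # 2 variables, 6 observations
    freq = np.array([1, 3, 2, 1, 4, 2])          # integer repeat counts

    # Frequency-weighted covariance with N = sum(freq) and denominator N - 1.
    N = freq.sum()
    mean = (X * freq).sum(axis=1) / N
    Xc = X - mean[:, None]
    C_weighted = (Xc * freq) @ Xc.T / (N - 1)

    # Same result from repeating each observation `freq` times and calling np.cov.
    C_repeated = np.cov(np.repeat(X, freq, axis=1))
    assert np.allclose(C_weighted, C_repeated)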
-n
--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

On Fri, Mar 7, 2014 at 12:06 AM, <josef.pktd@gmail.com> wrote:
On Thu, Mar 6, 2014 at 2:51 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Wed, Mar 5, 2014 at 4:45 PM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
Hi all,
in Pull Request https://github.com/numpy/numpy/pull/3864 Noel Dawe suggested adding new parameters to our `cov` and `corrcoef` functions to implement weights, as `average` already supports (the PR still needs to be adapted).
The idea right now would be to add `weights` and `frequencies` keyword arguments to these functions.
In more detail: the situation is a bit more complex for `cov` and `corrcoef` than for `average`, because there are different types of weights. The current plan would be to add two new keyword arguments:
* weights: uncertainty weights, which cause `N` to be recalculated accordingly (this is R's `cov.wt` default, I believe).
* frequencies: when given, `N = sum(frequencies)` and the values are weighted by their frequency.
I don't understand this description at all. One of them recalculates N, and the other sets N according to some calculation?
Is there a standard reference on how these are supposed to be interpreted? When you talk about per-value uncertainties, I start imagining that we're trying to estimate a population covariance given a set of samples each corrupted by independent measurement noise, and then there's some natural hierarchical Bayesian model one could write down and get an ML estimate of the latent covariance via empirical Bayes or something. But this requires a bunch of assumptions and is that really what we want to do? (Or maybe it collapses down into something simpler if the measurement noise is gaussian or something?)
In general, going mostly based on Stata:
Frequency weights are just a shortcut if you have repeated observations. In my unit tests, the results are the same as using np.repeat, IIRC. The total number of observations is the sum of the weights.
aweights and pweights are mainly like weights in WLS, reflecting the uncertainty of each observation. The number of observations is equal to the number of rows. (Stata internally rescales the weights.) One explanation is that observations are measured with different noise; another is that observations represent the means of subsamples with different numbers of observations.
There is an additional degrees-of-freedom correction in one of the proposed calculations, modeled after other packages, that I never figured out.
I found the missing proof: http://stats.stackexchange.com/questions/47325/bias-correction-in-weighted-v...

Josef
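[For the record, the correction discussed in that answer, as I read it -- treat the exact formula as an assumption here, not necessarily what the PR does -- replaces the `N - 1` denominator for "reliability"-type weights with `V1 - V2/V1`, where `V1 = sum(w)` and `V2 = sum(w**2)`:]

    import numpy as np

    def reliability_weighted_cov(X, w):
        # Unbiased weighted covariance for "reliability"/aweight-style weights,
        # using the V1 - V2/V1 denominator instead of N - 1.
        X = np.asarray(X, dtype=float)     # rows = variables, columns = observations
        w = np.asarray(w, dtype=float)
        V1 = w.sum()
        V2 = (w ** 2).sum()
        mean = (X * w).sum(axis=1) / V1
        Xc = X - mean[:, None]
        return (Xc * w) @ Xc.T / (V1 - V2 / V1)

    # With equal weights this reduces to the ordinary unbiased np.cov:
    X = np.random.RandomState(1).randn(3, 20)
    assert np.allclose(reliability_weighted_cov(X, np.ones(20)), np.cov(X))

Note the result does not change if w is rescaled by a constant, which fits the remark above that Stata internally rescales aweights.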
(Aside: statsmodels does not normalize the scale in WLS, in contrast to Stata, and it is now equivalent to GLS with diagonal sigma. The meaning of weight=1 depends on the user. nobs is the number of rows.)
No Bayesian analysis involved, but I guess someone could come up with a Bayesian interpretation.
I think the two proposed weight types, weights and frequencies, should be able to handle almost all cases.
Josef
-n
--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
Participants (4):
- alex
- josef.pktd@gmail.com
- Nathaniel Smith
- Sebastian Berg