
On Thu, Mar 6, 2014 at 2:51 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Wed, Mar 5, 2014 at 4:45 PM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
Hi all,
in Pull Request https://github.com/numpy/numpy/pull/3864 Noel Dawe suggested adding new parameters to our `cov` and `corrcoef` functions to implement weights, which `average` already supports (the PR still needs to be adapted).
The idea right now would be to add `weights` and `frequencies` keyword arguments to these functions.
In more detail: the situation is a bit more complex for `cov` and `corrcoef` than for `average`, because there are different types of weights. The current plan would be to add two new keyword arguments:
* weights: uncertainty weights, which cause `N` to be recalculated accordingly (this is R's `cov.wt` default, I believe).
* frequencies: when given, `N = sum(frequencies)` and the values are weighted by their frequency.
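(For concreteness, here is a minimal sketch of what the two keywords might compute, written only from the descriptions above; the normalization choices, in particular the effective-N correction for `weights`, are my assumptions modeled on R's `cov.wt`, not code from the PR:)

    import numpy as np

    def cov_sketch(x, weights=None, frequencies=None, ddof=1):
        # Rough sketch only; variables in rows, observations in columns,
        # as in np.cov.  Handles one weight type at a time.
        x = np.asarray(x, dtype=float)
        if weights is not None and frequencies is not None:
            raise ValueError("use either weights or frequencies, not both")
        if frequencies is not None:
            w = np.asarray(frequencies, dtype=float)
            n_eff = w.sum()                    # N = sum(frequencies)
        elif weights is not None:
            w = np.asarray(weights, dtype=float)
            w = w / w.sum()                    # normalize, cov.wt-style
            n_eff = 1.0 / np.sum(w ** 2)       # "effective" N (my assumption)
        else:
            w = np.ones(x.shape[1])
            n_eff = x.shape[1]
        mean = (x * w).sum(axis=1) / w.sum()
        xc = x - mean[:, None]
        c = (xc * w) @ xc.T / w.sum()          # weighted second moment
        return c * n_eff / (n_eff - ddof)      # ddof correction on effective N

With frequencies this reduces to repeating each observation frequencies[i] times; with weights and ddof=1 I believe it matches cov.wt's "unbiased" normalization, though the PR may well pick a different convention.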
I don't understand this description at all. One of them recalculates N, and the other sets N according to some calculation?
Is there a standard reference on how these are supposed to be interpreted? When you talk about per-value uncertainties, I start imagining that we're trying to estimate a population covariance given a set of samples each corrupted by independent measurement noise, and then there's some natural hierarchical Bayesian model one could write down and get an ML estimate of the latent covariance via empirical Bayes or something. But this requires a bunch of assumptions and is that really what we want to do? (Or maybe it collapses down into something simpler if the measurement noise is gaussian or something?)
In general, going mostly based on Stata: frequency weights are just a shortcut if you have repeated observations. In my unit tests, the results are the same as using np.repeat, IIRC (see the quick check below). The total number of observations is the sum of the weights.

aweights and pweights are mainly like weights in WLS, reflecting the uncertainty of each observation. The number of observations is equal to the number of rows. (Stata internally rescales the weights.) One explanation is that observations are measured with different noise; another is that observations represent the means of subsamples with different numbers of observations. There is an additional degrees-of-freedom correction in one of the proposed calculations, modeled after other packages, that I never figured out.

(Aside: statsmodels does not normalize the scale in WLS, in contrast to Stata, so it is equivalent to GLS with a diagonal sigma. The meaning of weight=1 depends on the user. nobs is the number of rows.)

No Bayesian analysis involved, but I guess someone could come up with a Bayesian interpretation.

I think the two proposed weight types, weights and frequencies, should be able to handle almost all cases.

Josef
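(A small self-contained check of the np.repeat equivalence; the hand-rolled formula below is my own sketch, not Stata or statsmodels code:)

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(2, 5))            # 2 variables, 5 observations
    f = np.array([1, 3, 2, 1, 4])          # frequency weights

    # Explicitly repeat each observation f[i] times and use plain np.cov.
    reference = np.cov(np.repeat(x, f, axis=1))

    # Frequency-weighted covariance with N = sum(f) and ddof = 1.
    n = f.sum()
    mean = (x * f).sum(axis=1) / n
    xc = x - mean[:, None]
    weighted = (xc * f) @ xc.T / (n - 1)

    print(np.allclose(reference, weighted))    # True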
-n
--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org