weighted mean; weighted standard error of the mean (sem)
I am looking for some really basic statistical tools. I have some sample data, some sample weights for those measurements, and I want to calculate a mean and a standard error of the mean.

Here are obvious places to look: numpy, scipy.stats, statsmodels.

It seems to me that numpy's "mean" and "average" functions have their names backwards. That is, a mean is often defined more generally than an average, and includes the possibility of weighting, but in this case it is "average" that has a weights argument. Can these functions be merged/renamed/deprecated in the future? It's clear to me that "mean" should allow for weights.

None of the modules above offers a standard error of the mean that incorporates weights. scipy.stats.sem() doesn't, and that's the closest thing. numpy's "var" doesn't allow weights, and there aren't any weighted variances in the above modules either.

Again, are there favoured codes for these functions? Can they be incorporated appropriately in the future?

Most immediately, I'd love to get code for a weighted sem. I'll write it otherwise, but it might be crude and dumb...

Thanks!
Chris Barrington-Leigh
UBC
Excerpts from cpblpublic's message of Thu Sep 09 22:22:05 -0400 2010:
I am looking for some really basic statistical tools. I have some sample data, some sample weights for those measurements, and I want to calculate a mean and a standard error of the mean.
This code below should do what you want. It is part of the stat sub-package of esutil http://code.google.com/p/esutil/ (add "import numpy" to run it standalone).

Hope this helps,
Erin Scott Sheldon
Brookhaven National Laboratory

import numpy

def wmom(arrin, weights_in, inputmean=None, calcerr=False, sdev=False):
    """
    NAME:
      wmom()

    PURPOSE:
      Calculate the weighted mean, error, and optionally standard
      deviation of an input array.  By default error is calculated
      assuming the weights are 1/err^2, but if you send calcerr=True
      this assumption is dropped and the error is determined from the
      weighted scatter.

    CALLING SEQUENCE:
      wmean, werr = wmom(arr, weights, inputmean=None, calcerr=False,
                         sdev=False)

    INPUTS:
      arr: A numpy array or a sequence that can be converted.
      weights: A set of weights for each element in the array.

    OPTIONAL INPUTS:
      inputmean: An input mean value, around which the mean is
        calculated.
      calcerr=False: Calculate the weighted error.  By default the
        error is calculated as 1/sqrt( weights.sum() ).  If
        calcerr=True it is calculated as
        sqrt( (w**2 * (arr-mean)**2).sum() )/weights.sum()
      sdev=False: If True, also return the weighted standard deviation
        as a third element in the tuple.

    OUTPUTS:
      wmean, werr: A tuple of the weighted mean and error.  If
        sdev=True the tuple will also contain sdev: wmean, werr, wsdev

    REVISION HISTORY:
      Converted from IDL: 2006-10-23.  Erin Sheldon, NYU
    """

    # no copy made if they are already arrays
    arr = numpy.array(arrin, ndmin=1, copy=False)

    # weights is forced to be type double; all resulting calculations
    # will also be double
    weights = numpy.array(weights_in, ndmin=1, dtype='f8', copy=False)

    wtot = weights.sum()

    # has the user input a mean value?
    if inputmean is None:
        wmean = (weights*arr).sum()/wtot
    else:
        wmean = float(inputmean)

    # how should the error be calculated?
    if calcerr:
        werr2 = (weights**2 * (arr-wmean)**2).sum()
        werr = numpy.sqrt(werr2)/wtot
    else:
        werr = 1.0/numpy.sqrt(wtot)

    # should the output include the weighted standard deviation?
    if sdev:
        wvar = (weights*(arr-wmean)**2).sum()/wtot
        wsdev = numpy.sqrt(wvar)
        return wmean, werr, wsdev
    else:
        return wmean, werr
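For quick reference, the two error conventions used in wmom can be collapsed into a standalone sketch (the function name and the test data below are mine, not from esutil):

```python
import numpy as np

def weighted_mean_err(y, w, from_scatter=False):
    """Weighted mean and its error, mirroring the two conventions above.

    from_scatter=False: assumes w = 1/sigma**2, so err = 1/sqrt(sum(w)).
    from_scatter=True:  err from the weighted scatter,
                        err = sqrt(sum(w**2 * (y - mean)**2)) / sum(w).
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    wtot = w.sum()
    mean = (w * y).sum() / wtot
    if from_scatter:
        err = np.sqrt((w**2 * (y - mean)**2).sum()) / wtot
    else:
        err = 1.0 / np.sqrt(wtot)
    return mean, err

# illustrative data: four measurements with known per-point errors
y = np.array([10.1, 9.4, 10.3, 10.0])
sigma = np.array([1.0, 2.0, 0.5, 1.5])
m, e = weighted_mean_err(y, 1.0 / sigma**2)
```

Passing from_scatter=True corresponds to calcerr=True above.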
On Thu, Sep 9, 2010 at 10:22 PM, cpblpublic
I am looking for some really basic statistical tools. I have some sample data, some sample weights for those measurements, and I want to calculate a mean and a standard error of the mean.
Here are obvious places to look:
numpy scipy.stats statsmodels
It seems to me that numpy's "mean" and "average" functions have their names backwards. That is, often a mean is defined more generally than average, and includes the possibility of weighting, but in this case it is "average" that has a weights argument. Can these functions be merged/renamed/deprecated in the future? It's clear to me that "mean" should allow for weights.
I think of weighted mean and weighted average pretty much as synonyms; changing the names would break backwards compatibility without any real benefit, in my opinion.
None of these modules, above, offer standard error of the mean which incorporates weights. scipy.stats.sem() doesn't, and that's the closest thing. numpy's "var" doesn't allow weights. There aren't any weighted variances in the above modules.
for weighted statistics, I usually refer to ticket 604 http://projects.scipy.org/scipy/attachment/ticket/604/Statistics.py but I didn't see weighted sem in it
Again, are there favoured codes for these functions? Can they be incorporated appropriately in the future?
Most immediately, I'd love to get code for weighted sem. I'll write it otherwise, but it might be crude and dumb...
Just a thought, I still have to check the details: estimating statsmodels.WLS with just a constant should give all the result statistics on the mean, e.g. bse for the variance of the constant, t() for the t-statistic.

Josef
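If I understand Josef's suggestion, WLS on a constant regressor reduces to simple algebra; this numpy sketch shows what I believe that fit computes (function name is mine, and it has not been checked against statsmodels itself):

```python
import numpy as np

def wls_constant(y, w):
    """Weighted mean and its standard error via WLS-on-a-constant algebra:
    beta = sum(w*y)/sum(w), var(beta) = sigma2_hat/sum(w), where
    sigma2_hat = sum(w*(y - beta)**2)/(n - 1) is the weighted
    residual variance."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    beta = (w * y).sum() / w.sum()                 # the weighted mean
    n = y.size
    sigma2 = (w * (y - beta)**2).sum() / (n - 1)   # weighted residual variance
    se = np.sqrt(sigma2 / w.sum())                 # candidate weighted sem
    return beta, se
```

The t-statistic against zero would then be beta/se, which should match t() on the constant if this algebra is right.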
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Thu, Sep 9, 2010 at 7:22 PM, cpblpublic
I am looking for some really basic statistical tools. I have some sample data, some sample weights for those measurements, and I want to calculate a mean and a standard error of the mean.
How about using a bootstrap? Array and weights:
import numpy as np

a = np.arange(100)
w = np.random.rand(100)
w = w / w.sum()
Initialize:
n = 1000
ma = np.zeros(n)
Save mean of each bootstrap sample:
for i in range(n):
    idx = np.random.randint(0, 100, 100)
    ma[i] = np.dot(a[idx], w[idx])
Error in mean:
ma.std()
3.854023384833674
Sanity check:
np.dot(w, a)
49.231127299096954
ma.mean()
49.111478821225127
Hmm...should w[idx] be renormalized to sum to one in each bootstrap sample?
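For what it's worth, here is a sketch of the renormalized variant (same setup as above, but seeded so it repeats; the comparison would otherwise change run to run):

```python
import numpy as np

rng = np.random.RandomState(0)   # seeded for repeatability
a = np.arange(100)
w = rng.rand(100)
w = w / w.sum()

n = 1000
ma = np.zeros(n)
for i in range(n):
    idx = rng.randint(0, 100, 100)
    wi = w[idx] / w[idx].sum()   # renormalize within each bootstrap sample
    ma[i] = np.dot(a[idx], wi)

print(ma.std())  # error in the mean, renormalized version
```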
On Thu, Sep 9, 2010 at 8:07 PM, Keith Goodman
Hmm...should w[idx] be renormalized to sum to one in each bootstrap sample?
Or perhaps there is no uncertainty about the weights, in which case:
for i in range(n):
    idx = np.random.randint(0, 100, 100)
    ma[i] = np.dot(a[idx], w)

ma.std()
3.2548815339711115
On Thu, Sep 9, 2010 at 11:32 PM, Keith Goodman
Hmm...should w[idx] be renormalized to sum to one in each bootstrap sample?
Or perhaps there is no uncertainty about the weights, in which case:
for i in range(n):
    idx = np.random.randint(0, 100, 100)
    ma[i] = np.dot(a[idx], w)

ma.std()
3.2548815339711115
Or maybe `w` reflects an underlying sampling scheme and you should sample in the bootstrap according to w?

If the weighted average is a sum of linear functions of (normally) distributed random variables, it still depends on whether the individual observations have the same or different variances, e.g. http://en.wikipedia.org/wiki/Weighted_mean#Statistical_properties

What I can't figure out is: if you assume sigma_i = sigma for all observations i, do we use the weighted or the unweighted variance to get an estimate of sigma? And I'm not able to replicate with simple calculations what statsmodels.WLS gives me.

???

Josef
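For the fixed-weights, equal-variance case, the page linked above gives Var(xbar_w) = sigma^2 * sum(w_i^2) / (sum(w_i))^2, which a quick simulation can check (all numbers below are illustrative):

```python
import numpy as np

rng = np.random.RandomState(1)
sigma = 2.0
w = rng.rand(50)          # fixed, arbitrary positive weights

# theoretical variance of the weighted mean when sigma_i = sigma:
var_theory = sigma**2 * (w**2).sum() / w.sum()**2

# simulate many draws of 50 iid N(0, sigma^2) observations
reps = 20000
y = rng.normal(0.0, sigma, size=(reps, 50))
wmeans = y.dot(w) / w.sum()
var_sim = wmeans.var()    # should be close to var_theory
```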
On Thu, Sep 9, 2010 at 8:44 PM,
or maybe `w` reflects an underlying sampling scheme and you should sample in the bootstrap according to w ?
Yes....
if weighted average is a sum of linear functions of (normal) distributed random variables, it still depends on whether the individual observations have the same or different variances, e.g. http://en.wikipedia.org/wiki/Weighted_mean#Statistical_properties
...lots of possibilities. As you have shown the problem is not yet well defined. Not much specification needed for the weighted mean, lots needed for the standard error of the weighted mean.
What I can't figure out is whether, if you assume sigma_i = sigma for all observations i, we use the weighted or the unweighted variance to get an estimate of sigma. And I'm not able to replicate with simple calculations what statsmodels.WLS gives me.
My guess: if all you want is sigma of the individual i and you know sigma is the same for all i, then I suppose you don't care about the weight.
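A small simulation sketch along the lines of Keith's guess (numbers illustrative): with sigma_i = sigma for all i, the unweighted sample variance (ddof=1) is unbiased for sigma^2, while the weighted variance about the weighted mean is, if I have the algebra right, biased by a factor (1 - sum(wn**2)) for normalized weights wn:

```python
import numpy as np

rng = np.random.RandomState(2)
sigma = 1.5
wn = rng.rand(40)
wn = wn / wn.sum()        # normalized weights

reps = 5000
unw = np.empty(reps)
wtd = np.empty(reps)
for i in range(reps):
    y = rng.normal(0.0, sigma, 40)    # equal variances: sigma_i = sigma
    unw[i] = y.var(ddof=1)            # unweighted sample variance
    mw = np.dot(wn, y)                # weighted mean
    wtd[i] = np.dot(wn, (y - mw)**2)  # weighted variance about it

# E[unw] = sigma**2;  E[wtd] = sigma**2 * (1 - (wn**2).sum())
```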
I am looking for some really basic statistical tools. I have some sample data, some sample weights for those measurements, and I want to calculate a mean and a standard error of the mean.
I have been working through an R training class, converting the examples and exercises to numpy and scipy. Some of what you want may be there.

http://www.oplnk.net/~ajackson/software/

--
-----------------------------------------------------------------------
| Alan K. Jackson   | To see a World in a Grain of Sand       |
| alan@ajackson.org | And a Heaven in a Wild Flower,          |
| www.ajackson.org  | Hold Infinity in the palm of your hand  |
| Houston, Texas    | And Eternity in an hour. - Blake        |
-----------------------------------------------------------------------
participants (5):
- alan@ajackson.org
- cpblpublic
- Erin Sheldon
- josef.pktd@gmail.com
- Keith Goodman