weighted mean; weighted standard error of the mean (sem)
I am looking for some really basic statistical tools. I have some sample data, some sample weights for those measurements, and I want to calculate a mean and a standard error of the mean.

Here are obvious places to look: numpy, scipy.stats, statsmodels.

It seems to me that numpy's "mean" and "average" functions have their names backwards. That is, a mean is often defined more generally than an average, and includes the possibility of weighting, but in this case it is "average" that has a weights argument. Can these functions be merged/renamed/deprecated in the future? It's clear to me that "mean" should allow for weights.

None of the modules above offers a standard error of the mean that incorporates weights. scipy.stats.sem() doesn't, and that's the closest thing. numpy's "var" doesn't allow weights, and there aren't any weighted variances in the above modules either.

Again, are there favoured codes for these functions? Can they be incorporated appropriately in the future?

Most immediately, I'd love to get code for a weighted sem. I'll write it otherwise, but it might be crude and dumb...

Thanks!
Chris Barrington-Leigh
UBC
Excerpts from cpblpublic's message of Thu Sep 09 22:22:05 -0400 2010:
I am looking for some really basic statistical tools. I have some sample data, some sample weights for those measurements, and I want to calculate a mean and a standard error of the mean.
This code below should do what you want. It is part of the stat sub-package of esutil http://code.google.com/p/esutil/ (add "import numpy" to run it standalone).

Hope this helps,
Erin Scott Sheldon
Brookhaven National Laboratory

import numpy

def wmom(arrin, weights_in, inputmean=None, calcerr=False, sdev=False):
    """
    NAME:
      wmom()

    PURPOSE:
      Calculate the weighted mean, error, and optionally standard
      deviation of an input array.  By default error is calculated
      assuming the weights are 1/err^2, but if you send calcerr=True
      this assumption is dropped and the error is determined from the
      weighted scatter.

    CALLING SEQUENCE:
      wmean, werr = wmom(arr, weights, inputmean=None, calcerr=False,
                         sdev=False)

    INPUTS:
      arr: A numpy array or a sequence that can be converted.
      weights: A set of weights for each element in the array.

    OPTIONAL INPUTS:
      inputmean: An input mean value, around which the mean is
        calculated.
      calcerr=False: Calculate the weighted error.  By default the
        error is calculated as 1/sqrt( weights.sum() ).  If
        calcerr=True it is calculated as
        sqrt( (w**2 * (arr-mean)**2).sum() )/weights.sum()
      sdev=False: If True, also return the weighted standard deviation
        as a third element in the tuple.

    OUTPUTS:
      wmean, werr: A tuple of the weighted mean and error.  If
        sdev=True the tuple will also contain sdev: wmean, werr, wsdev

    REVISION HISTORY:
      Converted from IDL: 2006-10-23.  Erin Sheldon, NYU
    """

    # no copy made if they are already arrays
    arr = numpy.array(arrin, ndmin=1, copy=False)

    # weights is forced to be type double; all resulting calculations
    # will also be double
    weights = numpy.array(weights_in, ndmin=1, dtype='f8', copy=False)

    wtot = weights.sum()

    # has the user input a mean value?
    if inputmean is None:
        wmean = (weights*arr).sum()/wtot
    else:
        wmean = float(inputmean)

    # how should the error be calculated?
    if calcerr:
        werr2 = (weights**2 * (arr-wmean)**2).sum()
        werr = numpy.sqrt(werr2)/wtot
    else:
        werr = 1.0/numpy.sqrt(wtot)

    # should the output include the weighted standard deviation?
    if sdev:
        wvar = (weights*(arr-wmean)**2).sum()/wtot
        wsdev = numpy.sqrt(wvar)
        return wmean, werr, wsdev
    else:
        return wmean, werr
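For quick reference, the two error conventions used in wmom can be collapsed into a standalone sketch (the function name and the test data below are mine, not from esutil):

```python
import numpy as np

def weighted_mean_err(y, w, from_scatter=False):
    """Weighted mean and its error, mirroring the two conventions above.

    from_scatter=False: assumes w = 1/sigma**2, so err = 1/sqrt(sum(w)).
    from_scatter=True:  err from the weighted scatter,
                        err = sqrt(sum(w**2 * (y - mean)**2)) / sum(w).
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    wtot = w.sum()
    mean = (w * y).sum() / wtot
    if from_scatter:
        err = np.sqrt((w**2 * (y - mean)**2).sum()) / wtot
    else:
        err = 1.0 / np.sqrt(wtot)
    return mean, err

# illustrative data: four measurements with known per-point errors
y = np.array([10.1, 9.4, 10.3, 10.0])
sigma = np.array([1.0, 2.0, 0.5, 1.5])
m, e = weighted_mean_err(y, 1.0 / sigma**2)
```

Passing from_scatter=True corresponds to calcerr=True above.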
On Thu, Sep 9, 2010 at 10:22 PM, cpblpublic
I am looking for some really basic statistical tools. I have some sample data, some sample weights for those measurements, and I want to calculate a mean and a standard error of the mean.
Here are obvious places to look:
numpy scipy.stats statsmodels
It seems to me that numpy's "mean" and "average" functions have their names backwards. That is, often a mean is defined more generally than average, and includes the possibility of weighting, but in this case it is "average" that has a weights argument. Can these functions be merged/renamed/deprecated in the future? It's clear to me that "mean" should allow for weights.
I think of weighted mean and weighted average pretty much as synonyms; changing the names would break backwards compatibility without any real benefit, in my opinion.
None of these modules, above, offer standard error of the mean which incorporates weights. scipy.stats.sem() doesn't, and that's the closest thing. numpy's "var" doesn't allow weights. There aren't any weighted variances in the above modules.
for weighted statistics, I usually refer to ticket 604 http://projects.scipy.org/scipy/attachment/ticket/604/Statistics.py but I didn't see weighted sem in it
Again, are there favoured codes for these functions? Can they be incorporated appropriately in the future?
Most immediately, I'd love to get code for weighted sem. I'll write it otherwise, but it might be crude and dumb...
Just a thought, I still have to check the details: estimating statsmodels.WLS with just a constant should give all the result statistics on the mean, e.g. bse for the variance of the constant, t() for the t-statistic.

Josef
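If I understand Josef's suggestion, WLS on a constant regressor reduces to simple algebra; this numpy sketch shows what I believe that fit computes (function name is mine, and it has not been checked against statsmodels itself):

```python
import numpy as np

def wls_constant(y, w):
    """Weighted mean and its standard error via WLS-on-a-constant algebra:
    beta = sum(w*y)/sum(w), var(beta) = sigma2_hat/sum(w), where
    sigma2_hat = sum(w*(y - beta)**2)/(n - 1) is the weighted
    residual variance."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    beta = (w * y).sum() / w.sum()                 # the weighted mean
    n = y.size
    sigma2 = (w * (y - beta)**2).sum() / (n - 1)   # weighted residual variance
    se = np.sqrt(sigma2 / w.sum())                 # candidate weighted sem
    return beta, se
```

The t-statistic against zero would then be beta/se, which should match t() on the constant if this algebra is right.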
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Thu, Sep 9, 2010 at 7:22 PM, cpblpublic
I am looking for some really basic statistical tools. I have some sample data, some sample weights for those measurements, and I want to calculate a mean and a standard error of the mean.
How about using a bootstrap? Array and weights:
import numpy as np

a = np.arange(100)
w = np.random.rand(100)
w = w / w.sum()
Initialize:
n = 1000
ma = np.zeros(n)
Save mean of each bootstrap sample:
for i in range(n):
    idx = np.random.randint(0, 100, 100)
    ma[i] = np.dot(a[idx], w[idx])
Error in mean:
ma.std()
3.854023384833674
Sanity check:
np.dot(w, a)
49.231127299096954
ma.mean()
49.111478821225127
Hmm...should w[idx] be renormalized to sum to one in each bootstrap sample?
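For what it's worth, here is a sketch of the renormalized variant (same setup as above, but seeded so it repeats; the comparison would otherwise change run to run):

```python
import numpy as np

rng = np.random.RandomState(0)   # seeded for repeatability
a = np.arange(100)
w = rng.rand(100)
w = w / w.sum()

n = 1000
ma = np.zeros(n)
for i in range(n):
    idx = rng.randint(0, 100, 100)
    wi = w[idx] / w[idx].sum()   # renormalize within each bootstrap sample
    ma[i] = np.dot(a[idx], wi)

print(ma.std())  # error in the mean, renormalized version
```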
On Thu, Sep 9, 2010 at 8:07 PM, Keith Goodman
Hmm...should w[idx] be renormalized to sum to one in each bootstrap sample?
Or perhaps there is no uncertainty about the weights, in which case:
for i in range(n):
    idx = np.random.randint(0, 100, 100)
    ma[i] = np.dot(a[idx], w)

ma.std()
3.2548815339711115
On Thu, Sep 9, 2010 at 11:32 PM, Keith Goodman
Hmm...should w[idx] be renormalized to sum to one in each bootstrap sample?
Or perhaps there is no uncertainty about the weights, in which case:
for i in range(n):
    idx = np.random.randint(0, 100, 100)
    ma[i] = np.dot(a[idx], w)

ma.std()
3.2548815339711115
Or maybe `w` reflects an underlying sampling scheme and you should sample in the bootstrap according to w?

If the weighted average is a sum of linear functions of (normally) distributed random variables, it still depends on whether the individual observations have the same or different variances, e.g. http://en.wikipedia.org/wiki/Weighted_mean#Statistical_properties

What I can't figure out is: if you assume sigma_i = sigma for all observations i, do we use the weighted or the unweighted variance to get an estimate of sigma? And I'm not able to replicate with simple calculations what statsmodels.WLS gives me.

???

Josef
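For the fixed-weights, equal-variance case, the page linked above gives Var(xbar_w) = sigma^2 * sum(w_i^2) / (sum(w_i))^2, which a quick simulation can check (all numbers below are illustrative):

```python
import numpy as np

rng = np.random.RandomState(1)
sigma = 2.0
w = rng.rand(50)          # fixed, arbitrary positive weights

# theoretical variance of the weighted mean when sigma_i = sigma:
var_theory = sigma**2 * (w**2).sum() / w.sum()**2

# simulate many draws of 50 iid N(0, sigma^2) observations
reps = 20000
y = rng.normal(0.0, sigma, size=(reps, 50))
wmeans = y.dot(w) / w.sum()
var_sim = wmeans.var()    # should be close to var_theory
```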
On Thu, Sep 9, 2010 at 8:44 PM,
or maybe `w` reflects an underlying sampling scheme and you should sample in the bootstrap according to w ?
Yes....
if weighted average is a sum of linear functions of (normal) distributed random variables, it still depends on whether the individual observations have the same or different variances, e.g. http://en.wikipedia.org/wiki/Weighted_mean#Statistical_properties
...lots of possibilities. As you have shown the problem is not yet well defined. Not much specification needed for the weighted mean, lots needed for the standard error of the weighted mean.
What I can't figure out is whether, if you assume sigma_i = sigma for all observations i, we use the weighted or the unweighted variance to get an estimate of sigma. And I'm not able to replicate with simple calculations what statsmodels.WLS gives me.
My guess: if all you want is sigma of the individual i and you know sigma is the same for all i, then I suppose you don't care about the weight.
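A small simulation sketch along the lines of Keith's guess (numbers illustrative): with sigma_i = sigma for all i, the unweighted sample variance (ddof=1) is unbiased for sigma^2, while the weighted variance about the weighted mean is, if I have the algebra right, biased by a factor (1 - sum(wn**2)) for normalized weights wn:

```python
import numpy as np

rng = np.random.RandomState(2)
sigma = 1.5
wn = rng.rand(40)
wn = wn / wn.sum()        # normalized weights

reps = 5000
unw = np.empty(reps)
wtd = np.empty(reps)
for i in range(reps):
    y = rng.normal(0.0, sigma, 40)    # equal variances: sigma_i = sigma
    unw[i] = y.var(ddof=1)            # unweighted sample variance
    mw = np.dot(wn, y)                # weighted mean
    wtd[i] = np.dot(wn, (y - mw)**2)  # weighted variance about it

# E[unw] = sigma**2;  E[wtd] = sigma**2 * (1 - (wn**2).sum())
```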
I am looking for some really basic statistical tools. I have some sample data, some sample weights for those measurements, and I want to calculate a mean and a standard error of the mean.
I have been working through an R training class, converting the examples and exercises to numpy and scipy. Some of what you want may be there.

http://www.oplnk.net/~ajackson/software/

--
-----------------------------------------------------------------------
| Alan K. Jackson   | To see a World in a Grain of Sand       |
| alan@ajackson.org | And a Heaven in a Wild Flower,          |
| www.ajackson.org  | Hold Infinity in the palm of your hand  |
| Houston, Texas    | And Eternity in an hour. - Blake        |
-----------------------------------------------------------------------
participants (5):
- alan@ajackson.org
- cpblpublic
- Erin Sheldon
- josef.pktd@gmail.com
- Keith Goodman