[Numpy-discussion] weighted mean; weighted standard error of the mean (sem)

Fri Sep 17 13:19:18 EDT 2010

On Fri, Sep 10, 2010 at 3:01 PM,  <josef.pktd at gmail.com> wrote:
> On Fri, Sep 10, 2010 at 1:58 PM, Christopher Barrington-Leigh
> <cpblpublic+numpy at gmail.com> wrote:
>> Interesting. Thanks Erin, Josef and Keith.
>
> thanks to the stata page at least I figured out that WLS is aweights
> with asumption mu_i = mu
>
> import numpy as np
> from scikits.statsmodels import WLS
> w0 = np.arange(20) % 4
> w = 1.*w0/w0.sum()
> y = 2 + np.random.randn(20)
>
>>>> res = WLS(y, np.ones(20), weights=w).fit()
>>>> print res.params, res.bse
> [ 2.29083069] [ 0.17562867]
>>>> m = np.dot(w, y)
>>>> m
> 2.2908306865128401
>>>> s2u = 1/(nobs-1.) * np.dot(w, (y - m)**2)
>>>> s2u
> 0.030845429945278956
>>>> np.sqrt(s2u)
> 0.17562867062435722
>
>
>>
>> There is a nice article on this at
>> http://www.stata.com/support/faqs/stat/supweight.html. In my case, the
>> model I've in mind is to assume that the expected value (mean) is the same
>> for each sample, and that the weights are/should be normalised, whence a
>> consistent estimator for sem is straightforward (if second moments can
>> be assumed to be
>> well behaved?). I suspect that this (survey-like) case is also one of
>> the two most standard/most common
>> expression that people want when they ask for an s.e. of the mean for
>> a weighted dataset. The other would be when the weights are not to be
>> normalised, but represent standard errors on the individual
>> measurements.
>>
>> Surely what one wants, in the end, is a single function (or whatever)
>> called mean or sem which calculates different values for different
>> specified choices of model (assumptions)? And where possible that it has a
>> default model in mind for when none is specified?
>
> I find aweights and pweights still confusing, plus necessary auxillary
> assumptions.
>
> I don't find Stata docs very helpful, I almost never find a clear
> description of the formulas (and I don't have any Stata books).
>
> If you have or write some examples that show or apply in the different
> cases, then this would be very helpful to get a structure into this
> area, weighting and survey sampling, and population versus clustered
> or stratified sample statistics.
>
> I'm still pretty lost with the literature on surveys.

I found the formula collection for SPSS
http://support.spss.com/productsext/statistics/documentation/19/clientindex.html#Manuals
pdf file for algorithms

Not much explanation, and sometimes it's not really clear what a
variable stands for exactly, but a useful summary of formulas. Also
the formulas might not always be for a general case, e.g. formulas for
non-parametric tests seem to be missing tie-handling (from a quick
look).

More compressed than the details descriptions in SAS, but much more
explicit than Stata and R without buying the books.

The chapter on T Test Algorithm carries population frequencies
throughout, this should work for weighted statistics, but maybe not
for different complex sampling schemes (a-weights, p-weights,...).

Josef

>
> Josef
>
>
>>
>> thanks,
>> Chris
>>
>> On Thu, Sep 9, 2010 at 9:13 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>> >>>> ma.std()
>>> >>   3.2548815339711115
>>> >
>>> > or maybe `w` reflects an underlying sampling scheme and you should
>>> > sample in the bootstrap according to w ?
>>>
>>> Yes....
>>>
>>> > if weighted average is a sum of linear functions of (normal)
>>> > distributed random variables, it still depends on whether the
>>> > individual observations have the same or different variances, e.g.
>>> > http://en.wikipedia.org/wiki/Weighted_mean#Statistical_properties
>>>
>>> ...lots of possibilities. As you have shown the problem is not yet
>>> well defined. Not much specification needed for the weighted mean,
>>> lots needed for the standard error of the weighted mean.
>>>
>>> > What I can't figure out is whether if you assume simga_i = sigma for
>>> > all observation i, do we use the weighted or the unweighted variance
>>> > to get an estimate of sigma. And I'm not able to replicate with simple
>>> > calculations what statsmodels.WLS gives me.
>>>
>>> My guess: if all you want is sigma of the individual i and you know
>>> sigma is the same for all i, then I suppose you don't care about the
>>> weight.
>>>
>>> >
>>> > ???
>>> >
>>> > Josef
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>