[Numpy-discussion] weighted mean; weighted standard error of the mean (sem)

Sun Sep 19 10:28:41 EDT 2010

On Fri, Sep 17, 2010 at 1:19 PM,  <josef.pktd at gmail.com> wrote:
> On Fri, Sep 10, 2010 at 3:01 PM,  <josef.pktd at gmail.com> wrote:
>> On Fri, Sep 10, 2010 at 1:58 PM, Christopher Barrington-Leigh
>> <cpblpublic+numpy at gmail.com> wrote:
>>> Interesting. Thanks Erin, Josef and Keith.
>>
>> thanks to the stata page at least I figured out that WLS is aweights
>> with asumption mu_i = mu
>>
>> import numpy as np
>> from scikits.statsmodels import WLS
>> w0 = np.arange(20) % 4
>> w = 1.*w0/w0.sum()
>> y = 2 + np.random.randn(20)
>>
>>>>> res = WLS(y, np.ones(20), weights=w).fit()
>>>>> print res.params, res.bse
>> [ 2.29083069] [ 0.17562867]
>>>>> m = np.dot(w, y)
>>>>> m
>> 2.2908306865128401
>>>>> s2u = 1/(nobs-1.) * np.dot(w, (y - m)**2)
>>>>> s2u
>> 0.030845429945278956
>>>>> np.sqrt(s2u)
>> 0.17562867062435722
>>
>>
>>>
>>> There is a nice article on this at
>>> http://www.stata.com/support/faqs/stat/supweight.html. In my case, the
>>> model I've in mind is to assume that the expected value (mean) is the same
>>> for each sample, and that the weights are/should be normalised, whence a
>>> consistent estimator for sem is straightforward (if second moments can
>>> be assumed to be
>>> well behaved?). I suspect that this (survey-like) case is also one of
>>> the two most standard/most common
>>> expression that people want when they ask for an s.e. of the mean for
>>> a weighted dataset. The other would be when the weights are not to be
>>> normalised, but represent standard errors on the individual
>>> measurements.
>>>
>>> Surely what one wants, in the end, is a single function (or whatever)
>>> called mean or sem which calculates different values for different
>>> specified choices of model (assumptions)? And where possible that it has a
>>> default model in mind for when none is specified?
>>
>> I find aweights and pweights still confusing, plus necessary auxillary
>> assumptions.
>>
>> I don't find Stata docs very helpful, I almost never find a clear
>> description of the formulas (and I don't have any Stata books).
>>
>> If you have or write some examples that show or apply in the different
>> cases, then this would be very helpful to get a structure into this
>> area, weighting and survey sampling, and population versus clustered
>> or stratified sample statistics.
>>
>> I'm still pretty lost with the literature on surveys.
>
> I found the formula collection for SPSS
> http://support.spss.com/productsext/statistics/documentation/19/clientindex.html#Manuals
> pdf file for algorithms
>
> Not much explanation, and sometimes it's not really clear what a
> variable stands for exactly, but a useful summary of formulas. Also
> the formulas might not always be for a general case, e.g. formulas for
> non-parametric tests seem to be missing tie-handling (from a quick
> look).
>
> More compressed than the details descriptions in SAS, but much more
> explicit than Stata and R without buying the books.
>
> The chapter on T Test Algorithm carries population frequencies
> throughout, this should work for weighted statistics, but maybe not
> for different complex sampling schemes (a-weights, p-weights,...).

here is the corresponding SAS documentation

http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#ttest_toc.htm

and here is my first draft of it in statsmodels,

http://bazaar.launchpad.net/~josef-pktd/statsmodels/statsmodels-josef-experimental-gsoc/annotate/head%3A/scikits/statsmodels/sandbox/stats/weightstats.py

A few hours of translating the SPSS formula collection into
sci-python, still buggy, insufficiently tested and insufficiently
documented, and it doesn't have the extras yet that SAS has. It uses
my currently preferred pattern of lazily evaluated classes (although I
might switch decorators).

Josef

>
> Josef
>
>
>>
>> Josef
>>
>>
>>>
>>> thanks,
>>> Chris
>>>
>>> On Thu, Sep 9, 2010 at 9:13 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>>> >>>> ma.std()
>>>> >>   3.2548815339711115
>>>> >
>>>> > or maybe `w` reflects an underlying sampling scheme and you should
>>>> > sample in the bootstrap according to w ?
>>>>
>>>> Yes....
>>>>
>>>> > if weighted average is a sum of linear functions of (normal)
>>>> > distributed random variables, it still depends on whether the
>>>> > individual observations have the same or different variances, e.g.
>>>> > http://en.wikipedia.org/wiki/Weighted_mean#Statistical_properties
>>>>
>>>> ...lots of possibilities. As you have shown the problem is not yet
>>>> well defined. Not much specification needed for the weighted mean,
>>>> lots needed for the standard error of the weighted mean.
>>>>
>>>> > What I can't figure out is whether if you assume simga_i = sigma for
>>>> > all observation i, do we use the weighted or the unweighted variance
>>>> > to get an estimate of sigma. And I'm not able to replicate with simple
>>>> > calculations what statsmodels.WLS gives me.
>>>>
>>>> My guess: if all you want is sigma of the individual i and you know
>>>> sigma is the same for all i, then I suppose you don't care about the
>>>> weight.
>>>>
>>>> >
>>>> > ???
>>>> >
>>>> > Josef
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>
>