[SciPy-User] An extra parameter to stats.chisquare ?

Mon Aug 3 15:10:36 EDT 2009

On Mon, Aug 3, 2009 at 2:29 PM, Pierre GM<pgmdevlist at gmail.com> wrote:
>
> On Aug 3, 2009, at 11:12 AM, josef.pktd at gmail.com wrote:
>
>> On Sun, Aug 2, 2009 at 10:43 PM, Pierre GM<pgmdevlist at gmail.com>
>> wrote:
>>>
>>>
>>> Well, I guess we'd need a "real" statistician. From what I gathered,
>>> when you fit your N observations to a distribution with p parameters
>>> (eg, 2 for normal, 1 for logseries), the ddof is N-(p+1): http://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm
>>> However, that works as long as all the parameters are independent. If
>>> one depends on the others or can be related to the others, we
>>> switch from p independent parameters to (p-1), thus giving a dof of
>>> N-p. So, wikipedia looks right.
>>
>>
>> how about this change, then
>>
>> def chisquare(f_obs, f_exp=None, ddof=0):
>>      ....
>>      return chisq, chisqprob(chisq, k-1-ddof)
>>
>> default is when no parameters are estimated (dof=k-1), e.g. create
>> random sample and compare to distribution with *given* parameters.
>
> Looks cool.

I will prepare the change and check whether chisquare is fully tested.
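
For illustration only, here is a minimal pure-Python sketch of what the
proposed signature could look like. It is not the actual scipy code:
chi2_sf is a hypothetical stand-in for chisqprob, restricted to integer
degrees of freedom, and the default of equal expected frequencies when
f_exp is None is an assumption of this sketch.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df >= 1.

    Hypothetical stand-in for chisqprob, built on the recurrence
    Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1) with y = x / 2.
    """
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)                # Q(1, y) = exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))     # Q(1/2, y) = erfc(sqrt(y))
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def chisquare(f_obs, f_exp=None, ddof=0):
    """Chi-square statistic and p-value with k - 1 - ddof degrees of freedom."""
    k = len(f_obs)
    if f_exp is None:
        # Assumption of this sketch: all cells are equally likely.
        mean = sum(f_obs) / float(k)
        f_exp = [mean] * k
    chisq = sum((o - e) ** 2 / float(e) for o, e in zip(f_obs, f_exp))
    return chisq, chi2_sf(chisq, k - 1 - ddof)

obs = [16, 18, 16, 14, 12, 12]
print(chisquare(obs))          # no estimated parameters, dof = k - 1
print(chisquare(obs, ddof=1))  # one estimated parameter, dof = k - 2
```

For the same statistic, a larger ddof leaves fewer degrees of freedom and
therefore a smaller p-value.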

>
>> I didn't find a reference for your statement that the parameter
>> (estimators) have to be independent.
>
> Check the example here:
> http://books.google.com/books?id=3rC5A1doUcwC&pg=PA12
>
> The model has 2 parameters, but only one is important, so the final
> ddof is k-2 instead of k-(2+1)

OK, I misinterpreted the term "independent". I thought it meant
statistically independent (like the beta and sigma estimates in OLS),
rather than functionally independent, i.e. that no parameter is just a
deterministic function of the other ones.

The reference looks good; it uses the power discrepancy in the following section.

>
>> If you are testing discrete distribution, then there is also a
>> helper function
>> in test_discrete_basic.py in stats/tests
>
> Oh OK. Didn't know about that.
>
>> def check_discrete_chisquare(distfn, arg, rvs, alpha, msg):
>>    '''perform chisquare test for random sample of a discrete
>> distribution
>>
>> The main point of the function is to do an equal-weight binning, to
>> maintain a minimum expected frequency in each cell, which is
>> recommended (>=5 expected observations for the chisquare distribution
>> to be an appropriate approximation).
>
> Bah, this binning doesn't really matter when you wanna use X2 to
> compare a sample to an actual distribution, does it ?

Yes it does.
If there is not a minimum number of expected frequency counts in each
cell, then the chisquare distribution is not a good approximation for
the distribution of the test statistic.
In your book example the expected cell count is around 8. If there
were fewer observations so that the expected cell count drops below 5,
then the literature recommends combining cells.
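
A rough sketch of the cell-combining step, assuming the cells are ordered
and that merging neighbours is acceptable. The function name
merge_sparse_cells and the left-to-right merging strategy are
illustrative choices, not what test_discrete_basic.py does.

```python
def merge_sparse_cells(f_obs, f_exp, min_expected=5.0):
    """Merge adjacent cells until each merged cell has f_exp >= min_expected."""
    merged_obs, merged_exp = [], []
    acc_obs, acc_exp = 0.0, 0.0
    for o, e in zip(f_obs, f_exp):
        acc_obs += o
        acc_exp += e
        if acc_exp >= min_expected:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs, acc_exp = 0.0, 0.0
    # A leftover sparse tail is folded into the last merged cell.
    if acc_exp > 0 or acc_obs > 0:
        if merged_exp:
            merged_obs[-1] += acc_obs
            merged_exp[-1] += acc_exp
        else:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
    return merged_obs, merged_exp

obs, exp = merge_sparse_cells([1, 3, 9, 12, 4, 1],
                              [2.0, 4.0, 8.0, 10.0, 4.0, 2.0])
print(obs, exp)
```

The number of cells k used for the degrees of freedom is then the number
of merged cells, not the original number.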

The worst case is discrete distributions with unbounded support, e.g.
Poisson: there will always be integers in the tail(s) without
observations, so observations have to be binned. Similarly, in the
continuous distribution case, you cannot compare the pdf/pmf pointwise
in the chisquare test if the probability of each point is very small.
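
As a sketch of the Poisson case, one way to handle the unbounded tail is
a single open-ended cell for all values above some cutoff. The helper
names and the cutoff max_value are hypothetical, and mu is taken as
given rather than estimated from the sample.

```python
import math

def poisson_expected_counts(mu, n, max_value):
    """Expected counts for cells 0, 1, ..., max_value and one tail cell."""
    pmf = [math.exp(-mu) * mu ** k / math.factorial(k)
           for k in range(max_value + 1)]
    tail = 1.0 - sum(pmf)   # probability of X > max_value
    return [n * p for p in pmf] + [n * tail]

def bin_observations(sample, max_value):
    """Count observations per cell, lumping values > max_value together."""
    counts = [0] * (max_value + 2)
    for x in sample:
        counts[min(x, max_value + 1)] += 1
    return counts

f_exp = poisson_expected_counts(mu=2.0, n=100, max_value=5)
f_obs = bin_observations([0, 1, 1, 2, 5, 9], max_value=3)
print(f_exp)
print(f_obs)
```

With the tail cell included, the expected counts sum to n, so observed
and expected frequencies are comparable cell by cell.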

> And anyway, this
> function suffers from the same problem as stats.chisquare: the ddof
> are not taken into account.

None of my functions takes estimated parameters into account, since I
didn't think about this issue, but they can easily be changed in the
same way as stats.chisquare.

> All in all, I like your chisquare function best.
stats.chisquare is inherited, not mine.

Josef