[SciPy-User] scipy.stats.fit inquiry

josef.pktd at gmail.com josef.pktd at gmail.com
Wed Oct 21 10:55:52 EDT 2009


On Tue, Oct 20, 2009 at 2:41 PM, Anne Archibald
<peridot.faceted at gmail.com> wrote:
> 2009/10/20  <josef.pktd at gmail.com>:
>> On Tue, Oct 20, 2009 at 11:13 AM, Anne Archibald
>> <peridot.faceted at gmail.com> wrote:
>>>
>>> Incidentally, I have some code implementing the Kuiper test, a
>>> modified K-S test that is sensitive to different aspects of the shape
>>> of the distribution, and (more importantly for me) is invariant under
>>> shifting a distribution or sample modulo 1. I haven't submitted it for
>>> inclusion because the interface I used is a little different from that
>>> used by scipy's K-S test, but if there's interest I'd be happy to
>>> contribute it.
>>
>> More tests in scipy.stats are always good (at least as long as I don't
>> have to chase down the references to write the tests for the tests).
>> How do you get the pvalue or critical values for Kuiper, since the
>> distribution of the test statistic is not a standard distribution?
>
> There's a special function, like the one for the K-S test (but
> obviously different in details), which I implemented. I have tests for
> the Kuiper test too. Most of the code is at home and in any case needs
> some polishing, but I've attached kuiper.py just so you can see what's
> there. (There's also some extra stuff for dealing with different
> exposure times for different phases, which doesn't need to go in.)

I just did a quick Monte Carlo to see how well sized the test is.
It looks good for the normal distribution, and it also has some power
when testing against the t distribution with fixed df, loc and scale.

DGP standard normal
nsim = 10000
rejection rates at alpha = 0.01, 0.05, 0.1

nobs = 500
[[ 0.0094  0.0496  0.0958]   normal
 [ 0.0529  0.1759  0.287 ]]  t, df=10

nobs = 100
[[ 0.0072  0.0425  0.0831]   normal
 [ 0.0132  0.0615  0.1191]]  t, df=10

nobs = 100
[[ 0.0081  0.0417  0.0841]   normal
 [ 0.0315  0.1233  0.2128]]  t, df=5

When the parameters of the t distribution
are estimated, the test has no power.
I don't know whether adjusting the critical values
would help, at least for comparing similar
distributions like norm and t.
(Same problem with kstest.)

----------
if __name__ == '__main__':
    import numpy as np
    from scipy import stats
    from kuiper import kuiper  # Anne's attached kuiper.py

    # data generating process is standard normal; test against the
    # normal cdf (size) and against a t cdf (power)
    tcdf = lambda x: stats.t.cdf(x, 5)
    mcres = []
    nsim = 100
    nobs = 500
    est = 0  # set to 1 to estimate the t parameters from each sample
    for i in range(nsim):
        rvsn = np.random.normal(size=nobs)
        d, p = kuiper(rvsn, stats.norm.cdf)
        if est:
            targs = stats.t.fit(rvsn)
            tcdf = lambda x: stats.t.cdf(x, *targs)
        dt, pt = kuiper(rvsn, tcdf)
        mcres.append([d, p, dt, pt])
    mcres = np.array(mcres)
    # rejection rates of the two tests at alpha = 0.01, 0.05, 0.1
    rej = (mcres[:, (1, 3)][:, :, None]
           < np.array([0.01, 0.05, 0.1])[None, None, :]).sum(0) / float(nsim)
    print(rej)

---------

To avoid the lambda function, the cdf could take args given as
an argument to kuiper. Maybe the specification of the cdf argument
could be taken from kstest, but not (the generality of) the rvs argument
in kstest.
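
Something along these lines (a hypothetical signature, not your actual
code; the statistic is the standard V = D+ + D-) would make the lambda
unnecessary:

----------
import numpy as np

def kuiper(data, cdf, args=()):
    # kstest-style args handling: extra distribution parameters are
    # forwarded to the cdf, so no lambda wrapper is needed
    data = np.sort(data)
    n = len(data)
    cdfv = cdf(data, *args)
    dplus = (np.arange(1., n + 1) / n - cdfv).max()
    dminus = (cdfv - np.arange(0., n) / n).max()
    return dplus + dminus  # p-value from Kuiper's distribution omitted

# usage: kuiper(rvsn, stats.t.cdf, args=(5,)) instead of the lambda
----------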

Your cdf_from_intervals looks nice for a univariate step-function
distribution. There are several pieces for this on the mailing list,
but I never got around to collecting them.
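
Just to check my understanding, a guess at what such a helper might look
like (hypothetical, not your code): a step-function density over given
breakpoints yields a piecewise-linear cdf,

----------
import numpy as np

def cdf_from_intervals_guess(breaks, weights):
    # step-function density on len(weights) intervals bounded by
    # len(weights)+1 increasing breakpoints -> piecewise-linear cdf
    breaks = np.asarray(breaks, float)
    cw = np.concatenate(([0.], np.cumsum(weights)))
    cw /= cw[-1]  # normalize the total mass to 1
    return lambda x: np.interp(x, breaks, cw)  # clamps to 0 and 1 outside

cdf = cdf_from_intervals_guess([0., 0.3, 1.], [0.5, 0.5])
print(cdf(np.array([-1., 0.15, 0.65, 2.])))  # [ 0.    0.25  0.75  1.  ]
----------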

I haven't quite figured out your interval functions. Is fold_intervals
supposed to convert a set of overlapping intervals into a non-overlapping
partition with associated weights? For example, for merging
two histograms with non-identical bounds?
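
In code, my guess of what that means (certainly not your implementation):

----------
import numpy as np

def fold_intervals_guess(intervals, weights):
    # overlapping weighted intervals -> non-overlapping partition,
    # with the weights of overlapping pieces summed
    edges = np.unique(np.concatenate([np.asarray(iv, float)
                                      for iv in intervals]))
    out = np.zeros(len(edges) - 1)
    for (a, b), w in zip(intervals, weights):
        out[(edges[:-1] >= a) & (edges[1:] <= b)] += w
    return edges, out

print(fold_intervals_guess([(0, 2), (1, 3)], [1., 1.]))
# (array([ 0.,  1.,  2.,  3.]), array([ 1.,  2.,  1.]))
----------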

I haven't figured out the shift invariance.

>
> Some references are Paltani 2004, "Searching for periods in X-ray
> observations using Kuiper's test. Application to the ROSAT PSPC
> archive", and Kuiper 1962, "Tests concerning random points on a
> circle" (which I haven't read).
>
> I also have some code for the H test (essentially looking at the
> Fourier coefficients to find how many harmonics are worth including
> and what the significance is; de Jager et al. 1989 "A powerful test for
> weak periodic signals with unknown light curve shape in sparse data").
> But circular statistics seem to be a bit obscure, so I'm not sure how
> much effort should go into putting this in scipy.

They are certainly obscure to me, but there are a few circular
descriptive statistics in stats.morestats, and I saw a matlab toolbox
on the file exchange. I have figured out by now that there are some
pretty different statistics used in various fields.
I guess it's all up to you.

>
>> (From what I understand from Sherpa, is that Cash is used as
>> objective function for the maximum likelihood estimation of a
>> Poisson process.)
>
> In principle it's more general, but it's used, for example, in the
> X-ray spectral fitting program XSPEC for when you only have a handful
> of photons in each energy bin.
>
>> Looking for Kuiper, I found a nice overview of a large list of goodness
>> of fit tests, used in natural sciences.
>>
>> http://www.ge.infn.it/statisticaltoolkit/gof/deployment/userdoc/statistics/applicability.html
>>
>> with article in
>> http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=29603&arnumber=1344284&count=103&index=22
>
> Hmm, I'll have to look into the Watson statistic, I haven't run into it before.
>
>> And apropos circular statistic, which I know nothing about:
>>
>> stats.distributions.vonmises is the only distribution that has bounded
>> (or circular) support but doesn't define the bounds (a, b). This
>> screws up numerical integration, e.g. for the moment calculation.
>>
>> Is vonmises actually used on circular support?
>> To enable integration, we need to define a and b (e.g [-pi,pi] as
>> in numpy random) or introduce new bounds specifically for
>> integration in this case.
>
> I use circular statistics quite a bit, since it's the appropriate
> toolkit for working with X-ray pulsar pulse profiles. In particular
> the von Mises distribution is handy for producing simulated pulse
> profiles (there's also a sense in which it is the circular analog of
> the normal distribution; it's maximum entropy for a fixed first moment
> IIRC). Unfortunately the generic stats interface is quite clumsy for
> this particular case. If I recall correctly as of 0.6.0 the interface
> was kind of miserable to use, since the CDF had a discontinuity at -pi
> and pi, and using loc moved the discontinuity along with the function.
> (I think there was also a severe bug for large values of the shape
> parameter.) I'm not sure what the interface is like in more recent
> versions, since 0.6.0 is what's on the computers here at work, but I
> think it's better.

If you are still on 0.6.0, then you are missing all of my cleanup
of stats.distributions, especially the fixes for the problems with
the generic methods. However, I didn't touch vonmises, since I didn't
know how to interpret the support. In the current generic framework,
it would be just another distribution with finite support, where
generic moments are just calculated over that interval.

From your description (below), I would think that for circular
distributions we would need different generic functions that
don't fit in the current distribution classes: integration on a
circle (?) instead of integration on the real line.

If I define boundaries vonmises.a and vonmises.b, then
I think you would not be able to calculate pdf(x), cdf(x)
and so on for x outside of the support [a, b]. I don't know
whether it is possible to define a and b but not enforce them
in _argcheck and only use them as integration bounds.
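
A quick illustration of the trade-off (my reading of the generic
machinery in distributions.py; the pdf gets clipped to [a, b] once the
support is set):

----------
import numpy as np
from scipy import stats

# define the support; generic integration then runs over [-pi, pi] ...
stats.vonmises.a = -np.pi
stats.vonmises.b = np.pi
# ... but pdf/cdf are clipped outside the support instead of wrapping
print(stats.vonmises.pdf(np.pi + 1, 2))  # 0.0 instead of the wrapped value
----------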

I checked briefly: vonmises.moment and vonmises.stats
work (integrating over the ppf, not the pdf), while the
generic moment calculation with the pdf (_mom0_sc) and
entropy fail. fit seems to work with the normal random
variable, but I thought I got lots of "weird" fit results
before.

For the real line, this looks "strange"
>>> stats.vonmises.pdf(np.linspace(0,3*np.pi,10), 2)
array([ 0.51588541,  0.18978364,  0.02568442,  0.00944877,  0.02568442,
        0.18978364,  0.51588541,  0.18978364,  0.02568442,  0.00944877])
>>> stats.vonmises.cdf(np.linspace(0,3*np.pi,10), 2)
array([  0.5       ,   0.89890776,   0.98532805,   1.        ,
         7.28318531,   7.28318531,   7.28318531,   7.28318531,
         7.28318531,  13.56637061])

ppf, calculated generically, works and looks only a little bit strange.

>>> stats.vonmises.ppf([0, 1e-10,1-1e-10,1],2)
array([       -Inf, -3.14159264,  3.14159264,         Inf])

>
> It's worth defining the boundaries, but I don't think you'll get
> useful moment calculations out of it, since circular moments are
> defined differently from linear moments: rather than int(x**n*pdf(x))
> they're int(exp(2j*pi*n*x)*pdf(x)). There are explicit formulas for
> these for the von Mises distribution, though, and they're useful
> because they are essentially the Fourier coefficients of the pdf.
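
For the von Mises distribution those explicit formulas are Bessel
function ratios; a quick numerical check (zero-centered on [-pi, pi],
so the moments are real):

----------
import numpy as np
from scipy import stats, special
from scipy.integrate import quad

kappa = 2.0
for n in range(1, 4):
    # circular moment E[exp(1j*n*x)] = I_n(kappa) / I_0(kappa)
    explicit = special.iv(n, kappa) / special.iv(0, kappa)
    numeric = quad(lambda x: np.cos(n * x) * stats.vonmises.pdf(x, kappa),
                   -np.pi, np.pi)[0]  # sine part vanishes by symmetry
    print(n, explicit, numeric)
----------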
>
> For context, I've done some work with representing circular
> probability distributions (= pulse profiles) in terms of their moments
> (= Fourier series), and in particular using a kernel density estimator
> with either a sinc kernel (= truncation) or a von Mises kernel (=
> scaling coefficients by von Mises moments). The purpose was to get
> not-too-biased estimates of the modulation amplitudes of X-ray
> pulsations; the paper has been kind of sidetracked by other work.

Sounds like statistics for seasonal time series to me, except you
might have a lot more regularity in the repeated pattern than in
economics or climate research.

Josef

>
>
> Anne
>