
On 06/02/2011 01:15 PM, Roger Lew wrote:
Hi Josef,
Thanks for the feedback. The "choice" of using the cdf is more of a carryover from how the algorithm is described by Gleason. Perhaps it would be best to have it match your intuition and accept the survival function?
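For concreteness, accepting the survival function could be as simple as a thin wrapper around the existing cdf-based call. A hedged sketch only: the wrapper name qsturng_sf is hypothetical, and the import assumes the qsturng-py module layout.

# Hypothetical sketch: accept the upper-tail (survival) probability directly
# by wrapping the existing cdf-based qsturng. Names here are illustrative only.
from qsturng import qsturng, psturng  # assumes the qsturng-py module layout

def qsturng_sf(alpha, r, v):
    # critical value for upper-tail probability alpha, r means, v degrees of freedom
    return qsturng(1.0 - alpha, r, v)

q = qsturng_sf(0.05, 3, 16)   # same value as qsturng(0.95, 3, 16)
p = psturng(q, 3, 16)         # ~0.05; psturng already returns the upper tail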
Feel free to treat it like your own for statsmodels. I will definitely check out some of your multcomp module so I'm not reinventing the wheel.
In the grand scheme, I could see these having a home in scipy.special (after more extensive review of course). That is where I went to look for it when I didn't find it in distributions.
Roger
On Thu, Jun 2, 2011 at 1:38 AM, <josef.pktd@gmail.com> wrote:
On Thu, Jun 2, 2011 at 12:53 AM, Roger Lew <rogerlew@gmail.com> wrote:
> Hi,
> I have implemented some approximations for studentized range quantiles and
> probabilities based on John R. Gleason's (1999) "An accurate, non-iterative
> approximation for studentized range quantiles," Computational Statistics &
> Data Analysis, (31), 147-158.
> Probability approximations rely on scipy.optimize.fminbound. The functions
> accept both scalars and array-like data, thanks to numpy.vectorize. A fair
> amount of validation and testing has been conducted on the code. More
> details can be found here: http://code.google.com/p/qsturng-py/
> I welcome any thoughts as to whether you all think this might be useful to
> add to SciPy or to make into a scikit. Any general comments would be helpful
> as well. I should mention that I'm a cognitive neuroscientist by trade, so
> my use of statistical jargon probably isn't that good.
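To make the fminbound-based probability inversion concrete, here is a minimal sketch of the general idea only, with the normal quantile standing in for Gleason's studentized-range approximation; this is not the actual qsturng-py code.

# Sketch of recovering a probability from a quantile approximation via fminbound.
# q_approx is a stand-in (normal quantile); qsturng-py uses Gleason's (1999)
# studentized-range approximation instead.
from scipy.optimize import fminbound
from scipy.stats import norm

def q_approx(p):
    return norm.ppf(p)

def p_approx(x):
    # search p in (0, 1) for the probability whose approximate quantile matches x
    return fminbound(lambda p: (q_approx(p) - x) ** 2, 1e-6, 1 - 1e-6, xtol=1e-10)

print(p_approx(1.959964))   # ~0.975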
Hi Roger,
I'm very interested in using this in scikits.statsmodels. The table that I am currently using is very limited: http://statsmodels.sourceforge.net/devel/generated/scikits.statsmodels.sandb...
From a quick look it looks very good. What I found a bit confusing is that qsturng takes the probability of the cdf and not of the survival function. Without reading the docstring carefully enough, I interpreted it as a p-value (upper tail), especially since psturng returns the upper-tail probability:
>>> import numpy as np
>>> from qsturng import qsturng, psturng  # module name as in the qsturng-py project
>>> import scikits.statsmodels.sandbox.stats.multicomp as smmc
>>> for i in range(3, 10):
...     x = qsturng(0.95, i, 16)
...     x, psturng(x, i, 16), smmc.get_tukeyQcrit(i, 16, 0.05), smmc.tukey_pvalues(x*np.ones(i), 16)[0]
...
(3.647864471854692, 0.049999670839029453, array(3.6499999999999999), 0.050092818925981608)
(4.0464124382823847, 0.050001178443752514, array(4.0499999999999998), 0.037164602483501508)
(4.3332505094058114, 0.049999838126148499, array(4.3300000000000001), 0.029954033157223781)
(4.5573603020371234, 0.049999276281813887, array(4.5599999999999996), 0.025276987281047769)
(4.7410585998112742, 0.049998508166777755, array(4.7400000000000002), 0.022010630154416622)
(4.8965400268915289, 0.04999983345598491, array(4.9000000000000004), 0.019614841752159107)
(5.0312039650945257, 0.049999535359310343, array(5.0300000000000002), 0.017721848279719898)
The last column is (in my interpretation) supposed to be 0.05. I was trying to get the p-values for Tukey's range statistic through the multivariate t-distribution, but the unit test only checks one point (and I ran out of time to work on this during the Christmas break). Either there is a bug (it's still in the sandbox) or my interpretation is wrong.
The advantage of the multivariate t-distribution is that it allows for arbitrary correlation, but it's not a substitute for pre-calculated tables for standard cases/distributions because it's much too slow.
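As a rough, table-independent sanity check on which interpretation is right, the upper-tail probability of the studentized range can also be simulated directly. A crude Monte Carlo sketch, far too slow for real use and included for illustration only:

import numpy as np

def psturng_mc(q, k, df, nrep=200000, seed=0):
    # Monte Carlo upper-tail probability of the studentized range:
    # range of k iid N(0, 1) means divided by an independent chi-based scale.
    rng = np.random.RandomState(seed)
    z = rng.standard_normal((nrep, k))
    s = np.sqrt(rng.chisquare(df, nrep) / df)
    w = (z.max(axis=1) - z.min(axis=1)) / s
    return (w > q).mean()

print(psturng_mc(3.6479, 3, 16))   # should be close to 0.05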
------------
As a bit of background on the status of multiple testing / multiple comparisons in statsmodels:
The tukeyhsd test has one test case against R, but it has too many options that still need to be checked (it allows unequal variances and unequal sample sizes):
http://statsmodels.sourceforge.net/devel/generated/scikits.statsmodels.sandb...
What I did manage to finish and verify against R:
http://statsmodels.sourceforge.net/devel/generated/scikits.statsmodels.sandb...
Multiple testing for general linear models is still very incomplete.
And as an aside: I'm not a statistician, and if the module in the statsmodels sandbox is still a mess, it's because it took me a long time and many functions to figure out what's going on.
----------
scipy.special has a nice collection of standard distribution functions, but it would be very useful to have some additional distributions available either in scipy or scikits.statsmodels, like your studentized range statistic (and maybe some others used in multiple comparisons, like Duncan's and Dunnett's), Anderson-Darling, and ...
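For reference, an Anderson-Darling normality test with tabulated critical values already exists as scipy.stats.anderson, although not as a reusable distribution object. A quick sketch:

import numpy as np
from scipy import stats

x = np.random.normal(size=100)
# returns the A^2 statistic plus tabulated critical values and significance levels
statistic, critical_values, significance_levels = stats.anderson(x, dist='norm')
print(statistic, critical_values, significance_levels)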
Thanks,
Josef
> Regards,
> Roger
>
> Roger Lew
The problem I have is that this is still an approximation, though it probably covers most situations. Do you have the Copenhaver & Holland (1988) approach, or know of any BSD-licensed Python code for it?
Also, the code probably needs to conform to the numpy/scipy coding standards (I don't remember the links). You also have a few less desirable features:
1) hard-coded numbers like '1e38' - numpy does define infinity (np.inf);
2) comparisons to 'inf' ('v == inf'), which are not desirable.

Bruce
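A minimal illustration of the numpy idioms being suggested (sketch only):

import numpy as np

v = np.array([1.0, np.inf, 2.5])

# instead of a hard-coded sentinel like 1e38, use np.inf;
# instead of 'v == inf', use the dedicated predicates:
print(np.isinf(v))      # [False  True False]
print(np.isfinite(v))   # [ True False  True]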