
Hi,

I have implemented some approximations for studentized range quantiles and probabilities based on John R. Gleason's (1999) "An accurate, non-iterative approximation for studentized range quantiles." Computational Statistics & Data Analysis, 31, 147-158. The probability approximations rely on scipy.optimize.fminbound. The functions accept both scalars and array-like data thanks to numpy.vectorize. A fair amount of validation and testing has been conducted on the code. More details can be found here: http://code.google.com/p/qsturng-py/

I welcome any thoughts as to whether you all think this might be useful to add to SciPy or to make into a scikit. Any general comments would be helpful as well. I should mention that I'm a cognitive neuroscientist by trade, so my use of statistical jargon probably isn't that good.

Regards,
Roger Lew
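The core idea the message describes — recovering a probability by numerically inverting the quantile approximation, which qsturng-py does with scipy.optimize.fminbound — can be sketched with a stand-in monotone quantile function. Everything below is illustrative: `invert_quantile` and the toy Cauchy quantile are not part of qsturng-py.

```python
import math

def invert_quantile(qfunc, x, lo=1e-6, hi=1.0 - 1e-6, tol=1e-12):
    """Recover p such that qfunc(p) == x by bisection.

    qsturng-py performs the analogous inversion of its quantile
    approximation with scipy.optimize.fminbound; plain bisection works
    here because a quantile function is monotone increasing in p.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if qfunc(mid) < x:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Toy check: the standard Cauchy quantile has a closed form,
# q(p) = tan(pi*(p - 0.5)), so q(0.75) = tan(pi/4) = 1 exactly.
cauchy_q = lambda p: math.tan(math.pi * (p - 0.5))
p = invert_quantile(cauchy_q, 1.0)
print(round(p, 6))  # 0.75
```

The same pattern gives `psturng` from `qsturng`: minimize the squared difference between the quantile approximation and the observed statistic over p.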

Hi Roger,

I'm very interested in using this in scikits.statsmodels. The table that I am currently using is very limited:
http://statsmodels.sourceforge.net/devel/generated/scikits.statsmodels.sandb...

From a quick look it looks very good. What I found a bit confusing is that qsturng takes the probability of the cdf and not of the survival function. Without reading the docstring carefully enough, I interpreted it as a p-value (upper tail), especially since psturng returns the upper-tail probability:
>>> import scikits.statsmodels.sandbox.stats.multicomp as smmc
>>> for i in range(3, 10):
...     x = qsturng(0.95, i, 16)
...     x, psturng(x, i, 16), smmc.get_tukeyQcrit(i, 16, 0.05), smmc.tukey_pvalues(x*np.ones(i), 16)[0]
(3.647864471854692, 0.049999670839029453, array(3.6499999999999999), 0.050092818925981608)
(4.0464124382823847, 0.050001178443752514, array(4.0499999999999998), 0.037164602483501508)
(4.3332505094058114, 0.049999838126148499, array(4.3300000000000001), 0.029954033157223781)
(4.5573603020371234, 0.049999276281813887, array(4.5599999999999996), 0.025276987281047769)
(4.7410585998112742, 0.049998508166777755, array(4.7400000000000002), 0.022010630154416622)
(4.8965400268915289, 0.04999983345598491, array(4.9000000000000004), 0.019614841752159107)
(5.0312039650945257, 0.049999535359310343, array(5.0300000000000002), 0.017721848279719898)

The last column is (in my interpretation) supposed to be 0.05. I was trying to get the p-values for Tukey's range statistic through the multivariate t-distribution, but the unit test looks only at one point (and I ran out of time to work on this during the Christmas break). Either there is a bug (it's still in the sandbox) or my interpretation is wrong.

The advantage of the multivariate t-distribution is that it allows for arbitrary correlation, but it's not a substitute for pre-calculated tables for standard cases/distributions because it's much too slow.

------------

As a bit of background on the multiple testing / multiple comparison status in statsmodels:

The tukeyhsd test has one test case against R, but it has too many options (it allows unequal variances and unequal sample sizes) that still need to be checked:
http://statsmodels.sourceforge.net/devel/generated/scikits.statsmodels.sandb...

What I did manage to finish and verify against R:
http://statsmodels.sourceforge.net/devel/generated/scikits.statsmodels.sandb...

Multiple testing for general linear models is very incomplete.

As an aside: I'm not a statistician, and if the module in the statsmodels sandbox is still a mess, it's because it took me a long time and many functions to figure out what's going on.
----------

scipy.special has a nice collection of standard distribution functions, but it would be very useful to have some additional distributions available, either in scipy or scikits.statsmodels, like your studentized range statistic (and maybe some others used in multiple comparisons, like Duncan, Dunnett), Anderson-Darling, and so on.

Thanks,
Josef
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-dev

Hi Josef,

Thanks for the feedback. The "choice" of using the cdf is more a carryover from how the algorithm is described by Gleason. Perhaps it would be best to have it match your intuition and accept the survival function?

Feel free to treat it like your own for statsmodels. I will definitely check out some of your multcomp module so I'm not reinventing the wheel.

In the grand scheme, I could see these having a home in scipy.special (after more extensive review, of course). That is where I went to look for it when I didn't find it in distributions.

Roger

The problem I have is that this is still an approximation, one that probably covers most situations. Do you have the Copenhaver & Holland (1988) approach, or do you know of any BSD-licensed Python code for it?
Also, the code probably needs to conform to the numpy/scipy coding standards (I don't remember the links). The code also has a few less desirable features: 1) hard-coded numbers like '1e38' -- numpy does define infinity (np.inf); 2) comparisons to 'inf' ('v==inf'), which are not desirable.

Bruce
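Bruce's two points can be illustrated without numpy: np.inf is the same IEEE-754 infinity as Python's float('inf'), and math.isinf is the idiomatic membership test. A small sketch, not code from qsturng-py:

```python
import math

v = float('inf')           # np.inf is this same IEEE-754 value
print(v == float('inf'))   # True: equality does work for scalar floats...
print(math.isinf(v))       # True: ...but isinf states the intent directly
print(1e38 < v)            # True: 1e38 is a large finite number, not infinity
```

The `v == inf` comparisons in the code would work for scalars, but `math.isinf(v)` (or `np.isinf` for arrays) is more robust and self-documenting.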

Hi Bruce,

My comparisons were made with R's qtukey. qtukey is GPL v2 or later and uses the Copenhaver & Holland (1988) approach. From testing I have found that qtukey has weaknesses of its own: it doesn't do well when v is small and p is close to 1. When compared to Harter's (1960) tables, qtukey severely underestimates quantiles with p=.999 and v=2. For r values in the table (ranging from 2 to 100), qtukey is off by the following percentages, 100 * (qtukey - table) / table:

-22.32 -29.80 -33.68 -36.10 -37.79 -39.04 -40.01 -40.79 -41.45 -41.97 -42.46 -42.86 -43.26 -43.55 -43.88 -44.15 -44.35 -44.57 -44.78 -46.17 -47.01 -47.96 -48.56 -48.96

With p=.999 and v=3, qtukey overestimates quantiles and fails to converge for r=80 and r=100:

-0.02 1.65 3.82 6.15 8.18 9.84 11.17 12.18 13.02 13.69 14.24 14.72 15.14 15.50 15.78 16.06 16.32 16.51 16.72 18.03 18.70 19.46 nan nan

At p=.975 and v=2, qtukey is still biased but much closer (in percentages):

-0.24 -0.03 -0.01 0.02 -0.10 -0.23 -0.38 -0.55 -0.70 -0.83 -0.94 -1.02 -1.08 -1.16 -1.26 -1.34 -1.38 -1.45 -1.46 -1.82 -2.00 -2.23 -2.35 -2.45

qtukey also has a hard time converging in certain parameter ranges. When r=70 and v=inf:

p, q
0.5, 4.707268
0.5050404, 4.715069
0.5100808, NaN
0.5151212, NaN
0.5201616, 4.738577
0.525202, 4.746453
0.5302424, NaN
0.5352828, NaN
0.5403232, NaN
0.5453636, NaN
0.550404, NaN
0.5554444, NaN
0.5604848, NaN
0.5655253, NaN
0.5705657, NaN
0.5756061, NaN
0.5806465, NaN
0.5856869, 4.843067
0.5907273, 4.851339
0.5957677, 4.859652
0.6008081, 4.86801
0.6058485, 4.876412
0.6108889, 4.884861
0.6159293, 4.893358
0.6209697, 4.901907

I think the case could be made that qtukey (the Copenhaver & Holland approach) may be better for some things (higher precision given certain qualifications), and Gleason's approach may be better for others, but I don't think that one is clearly superior to the other. (Copenhaver & Holland with 128-bit floats may be a different story.)
I think the case can be made that for multiple comparison testing Gleason's approach may be better. The fact that it is fit to Harter's tables means "apples to apples" comparisons for people still using "paper tables." Secondly, Gleason's approach may be preferred because the empirical data seem to suggest it is less biased for small v and cdf values close to 1.

I should have documented it better, but the reason 1e38 is used instead of inf is that scipy.stats.t.isf has what I would describe as a loss of entropy when v == inf:
>>> scipy.stats.t.isf(.1, inf), scipy.stats.t.isf(.1, 1e38)
(-1e+100, 1.2815515655446004)
>>> scipy.stats.t.isf(.2, inf), scipy.stats.t.isf(.2, 1e38)
(-1e+100, 0.84162123357291441)
>>> scipy.stats.t.isf(.5, inf), scipy.stats.t.isf(.5, 1e38)
(-6.637989916170446e-17, -6.637989916170446e-17)
>>> scipy.stats.t.isf(.8, inf), scipy.stats.t.isf(.8, 1e38)
(-1e+100, -0.84162123357291441)
>>> scipy.stats.t.isf(.95, inf), scipy.stats.t.isf(.95, 1e38)
(-1e+100, -1.6448536269514722)
>>> scipy.stats.t.isf(.999, inf), scipy.stats.t.isf(.999, 1e38)
(-1e+100, -3.0902323061678132)
For consistency, 1e38 was used throughout. It is also difficult to apply interpolation between 120 and inf. If there is a better method of getting the t inverse, I'm willing to rework the code. Given how much the distribution changes in that range relative to the overall precision of the algorithm, my personal opinion is that it is a small issue. I do think it should be better commented and added to the list of caveats. v is always a scalar for the v==inf comparisons, but I don't have a problem using math.isinf if that is the preferred method.

Roger
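One way around the scipy.stats.t.isf anomaly would be to special-case an infinite v and return the exact normal limit of the t distribution (t converges to N(0, 1) as v goes to infinity). The sketch below is hypothetical, not code from qsturng-py: it uses the stdlib statistics.NormalDist (Python 3.8+) for the infinite-df branch, and the finite-v implementation is passed in as a parameter standing in for scipy.stats.t.isf.

```python
import math
from statistics import NormalDist

def t_isf_guarded(p, v, t_isf_impl):
    """Inverse survival function of the t distribution with an inf guard.

    For infinite degrees of freedom, return the standard normal quantile
    directly (the exact v -> inf limit) instead of calling the t
    implementation, which is where the spurious -1e+100 results in the
    session above come from. Otherwise delegate to t_isf_impl, e.g.
    scipy.stats.t.isf.
    """
    if math.isinf(v):
        return NormalDist().inv_cdf(1.0 - p)
    return t_isf_impl(p, v)

# The infinite-df branch reproduces the sensible v=1e38 values from the
# session above, e.g. isf(0.1) is about 1.2815515655446.
print(t_isf_guarded(0.1, float('inf'), None))
```

This keeps the tables free of magic constants like 1e38 while still giving finite, correct quantiles at v == inf.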

Hi Roger,

I'm in the process of including qsturng-py in statsmodels. I needed to add an OrderedDict backport for Python < 2.7, but everything looks as good as in my first impression.

Thank you for making this available,
Josef
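The Python < 2.7 compatibility shim Josef mentions is usually written as a conditional import. The `ordereddict` fallback name below is the common PyPI backport and is an assumption here, not a line taken from statsmodels:

```python
try:
    from collections import OrderedDict  # Python 2.7+ / 3.x
except ImportError:
    # Hypothetical fallback: the pure-Python backport on PyPI,
    # or a vendored copy shipped inside the project.
    from ordereddict import OrderedDict

# OrderedDict remembers insertion order, which plain dict did not
# guarantee before Python 3.7.
d = OrderedDict([('b', 1), ('a', 2)])
print(list(d.keys()))  # ['b', 'a']
```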
participants (3)

- Bruce Southey
- josef.pktd@gmail.com
- Roger Lew