Scipy's probplot compared to R's qqplot
Hey folks, I've taken more of an interest in statistics and Scipy lately and decided to compare the scipy.stats.probplot() function to R's qqplot(). For a given dataset, the results are slightly different. Here's a link to the script I wrote to do the comparison. http://dpaste.com/167464/ Basically, it does the following: -Uses numpy to generate some fake, noramlly distributed data -Uses both R and Scipy to compute the values needed for quantile/probability plot -Computes linear regressions on the quantile data with both R and Scipy. -prints some output to compare the two My initial conclusions: 1) R's lm(y~x) and scipy.stats.linregress(x,y) yield the same slope and intercept of a linear model. (good) 2) R and Scipy compute the quantiles of a dataset in slightly different manners (??) Any clue as to why the discrepancy in #2 occurs? Would you consider it a big deal? I'm using: Python v2.6.2 (XP) and v2.6.4 (Karmic and Snow Leopard) Scipy v0.7.1 Numpy v1.4.0 R v2.10.0 Rpy2 v2.0.8 Thanks, -Paul H.
On Wed, Mar 3, 2010 at 2:09 PM, <PHobson@geosyntec.com> wrote:
Hey folks,
I've taken more of an interest in statistics and Scipy lately and decided to compare the scipy.stats.probplot() function to R's qqplot(). For a given dataset, the results are slightly different.
Here's a link to the script I wrote to do the comparison. http://dpaste.com/167464/
Basically, it does the following: -Uses numpy to generate some fake, noramlly distributed data -Uses both R and Scipy to compute the values needed for quantile/probability plot -Computes linear regressions on the quantile data with both R and Scipy. -prints some output to compare the two
My initial conclusions: 1) R's lm(y~x) and scipy.stats.linregress(x,y) yield the same slope and intercept of a linear model. (good) 2) R and Scipy compute the quantiles of a dataset in slightly different manners (??)
Any clue as to why the discrepancy in #2 occurs? Would you consider it a big deal?
I would consider any significant deviation a big deal, unless we know that there are differences in the definitions or underlying assumptions. I'm not sure what's going on since I never looked at the details of probplot. However, when I plot the quantiles
plt.plot(np.sort(qR)) plt.plot(qS[0]) plt.show()
then the graph looks almost the same except for the first and last point. qS[0]-np.sort(qR) differs in the second decimal, except for first and last observation. My guess would be that there are some differences for example in the continuity correction, or similar. The boundary points, however, look suspicious. Thanks for checking this, Josef I'm using:
Python v2.6.2 (XP) and v2.6.4 (Karmic and Snow Leopard) Scipy v0.7.1 Numpy v1.4.0 R v2.10.0 Rpy2 v2.0.8
Thanks, -Paul H. _______________________________________________ SciPy-User mailing list SciPy-User@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-user
On Wed, Mar 3, 2010 at 2:09 PM, <PHobson@geosyntec.com> wrote:
Hey folks,
I've taken more of an interest in statistics and Scipy lately and decided to compare the scipy.stats.probplot() function to R's qqplot(). For a given dataset, the results are slightly different.
Here's a link to the script I wrote to do the comparison. http://dpaste.com/167464/
Basically, it does the following: -Uses numpy to generate some fake, noramlly distributed data -Uses both R and Scipy to compute the values needed for quantile/probability plot -Computes linear regressions on the quantile data with both R and Scipy. -prints some output to compare the two
My initial conclusions: 1) R's lm(y~x) and scipy.stats.linregress(x,y) yield the same slope and intercept of a linear model. (good) 2) R and Scipy compute the quantiles of a dataset in slightly different manners (??)
Any clue as to why the discrepancy in #2 occurs? Would you consider it a big deal?
From: scipy-user-bounces@scipy.org [mailto:scipy-user-bounces@scipy.org] On Behalf Of josef.pktd@gmail.com I would consider any significant deviation a big deal, unless we know that there are differences in the definitions or underlying assumptions.
I'm not sure what's going on since I never looked at the details of probplot. However, when I plot the quantiles
plt.plot(np.sort(qR)) plt.plot(qS[0]) plt.show()
then the graph looks almost the same except for the first and last point.
Yes. When I plotted them, I could not visually distinguish them (see attached). I forgot to mention that.
qS[0]-np.sort(qR)
differs in the second decimal, except for first and last observation. My guess would be that there are some differences for example in the continuity correction, or similar.
The boundary points, however, look suspicious.
Thanks for looking further into this. When I saw that the slopes and intercepts were different, I immediately inspected just the max and min values (laziness, sorry). If I find some time next week, I'll dig around in the source and see if I can't figure out what's happening at those points. -Paul H.
On Wed, Mar 3, 2010 at 2:49 PM, <PHobson@geosyntec.com> wrote:
On Wed, Mar 3, 2010 at 2:09 PM, <PHobson@geosyntec.com> wrote:
Hey folks,
I've taken more of an interest in statistics and Scipy lately and decided to compare the scipy.stats.probplot() function to R's qqplot(). For a given dataset, the results are slightly different.
Here's a link to the script I wrote to do the comparison. http://dpaste.com/167464/
Basically, it does the following: -Uses numpy to generate some fake, noramlly distributed data -Uses both R and Scipy to compute the values needed for quantile/probability plot -Computes linear regressions on the quantile data with both R and Scipy. -prints some output to compare the two
My initial conclusions: 1) R's lm(y~x) and scipy.stats.linregress(x,y) yield the same slope and intercept of a linear model. (good) 2) R and Scipy compute the quantiles of a dataset in slightly different manners (??)
Any clue as to why the discrepancy in #2 occurs? Would you consider it a big deal?
From: scipy-user-bounces@scipy.org [mailto:scipy-user-bounces@scipy.org] On Behalf Of josef.pktd@gmail.com I would consider any significant deviation a big deal, unless we know that there are differences in the definitions or underlying assumptions.
I'm not sure what's going on since I never looked at the details of probplot. However, when I plot the quantiles
plt.plot(np.sort(qR)) plt.plot(qS[0]) plt.show()
then the graph looks almost the same except for the first and last point.
Yes. When I plotted them, I could not visually distinguish them (see attached). I forgot to mention that.
qS[0]-np.sort(qR)
differs in the second decimal, except for first and last observation. My guess would be that there are some differences for example in the continuity correction, or similar.
The boundary points, however, look suspicious.
Thanks for looking further into this. When I saw that the slopes and intercepts were different, I immediately inspected just the max and min values (laziness, sorry). If I find some time next week, I'll dig around in the source and see if I can't figure out what's happening at those points.
my prime candidate for the 2nd decimal differences, are differences in the correction Ui[1:-1] = (i-0.3175)/(N+0.365) There are several conventions, David Huard posted a list of them attached to a ticket (?), for empirical cdf. There might be another correction for boundary points that is different. Ui[-1] = 0.5**(1.0/N) Ui[0] = 1-Ui[-1] But for graphical inspection, R and scipy look close enough. Josef
-Paul H.
_______________________________________________ SciPy-User mailing list SciPy-User@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-user
On Wed, Mar 3, 2010 at 13:09, <PHobson@geosyntec.com> wrote:
Hey folks,
I've taken more of an interest in statistics and Scipy lately and decided to compare the scipy.stats.probplot() function to R's qqplot(). For a given dataset, the results are slightly different.
Here's a link to the script I wrote to do the comparison. http://dpaste.com/167464/
Basically, it does the following: -Uses numpy to generate some fake, noramlly distributed data -Uses both R and Scipy to compute the values needed for quantile/probability plot -Computes linear regressions on the quantile data with both R and Scipy. -prints some output to compare the two
My initial conclusions: 1) R's lm(y~x) and scipy.stats.linregress(x,y) yield the same slope and intercept of a linear model. (good) 2) R and Scipy compute the quantiles of a dataset in slightly different manners (??)
Any clue as to why the discrepancy in #2 occurs?
There are several, slightly different but mostly reasonable ways of computing quantiles.
Would you consider it a big deal?
Probably not, but I'm happy to entertain arguments to the contrary if you would care to explain how R is computing the quantiles. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Wed, Mar 3, 2010 at 13:09, <PHobson@geosyntec.com> wrote:
2) R and Scipy compute the quantiles of a dataset in slightly different manners (??)
Any clue as to why the discrepancy in #2 occurs?
From: scipy-user-bounces@scipy.org [mailto:scipy-user-bounces@scipy.org] On Behalf Of Robert Kern There are several, slightly different but mostly reasonable ways of computing quantiles.
Good to know. Thanks.
Would you consider it a big deal?
Probably not, but I'm happy to entertain arguments to the contrary if you would care to explain how R is computing the quantiles.
After reading Josef's response, I'm only concerned with the min/max values. R, like you, mentions that several methods are available to compute quantiles. By default, it uses what it calls Type 7: """ All sample quantiles are defined as weighted averages of consecutive order statistics. Sample quantiles of type i are defined by: Q[i](p) = (1 - gamma) x[j] + gamma x[j+1], where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, the value of gamma is a function of j = floor(np + m) and g = np + m - j, and m is a constant determined by the sample quantile type. [snip] *Continuous sample quantile types 4 through 9* For types 4 through 9, Q[i](p) is a continuous function of p, with gamma = g and m given below. The sample quantiles can be obtained equivalently by linear interpolation between the points (p[k],x[k]) where x[k] is the kth order statistic. Specific expressions for p[k] are given below. [snip] Type 7 m = 1-p. p[k] = (k - 1) / (n - 1). In this case, p[k] = mode[F(x[k])]. This is used by S. """ I haven't really had time to look into Scipy's method yet. I'll dig a little deeper as soon as I get the chance. Thanks for the insight, -Paul H.
participants (3)
-
josef.pktd@gmail.com
-
PHobson@Geosyntec.com
-
Robert Kern