[Numpy-discussion] scipy.stats.qqplot and scipy.stats.probplot axis labeling

Mark Gawron gawron at mail.sdsu.edu
Sat Jun 11 14:49:20 EDT 2016


Thanks, Josef.  This is very helpful.  And I will direct this
to one of the other mailing lists, once I read the previous posts.

Regarding your remark:  Maybe I'm having a terminology problem.  It seems to me that once you do

>> osm = dist.ppf(osm_uniform)

you’re back in the value space for the particular distribution. So this
gives you known probability intervals, but not UNIFORM probability
intervals (the interval between 0 and 1 STD covers a bigger probability
interval than the interval between 1 and 2).  And the idea of a quantile is
that it’s a division point in a UNIFORM division of the probability axis.
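
A minimal sketch of the point above, assuming a standard normal for dist (the variable names are illustrative only and are not taken from the scipy source). It compares the probability mass between 0 and 1 STD with the mass between 1 and 2 STD, and shows that equally spaced probabilities come back from ppf as unequally spaced values:

```python
from scipy import stats

dist = stats.norm  # standard normal: mean 0, standard deviation 1

# Probability mass between 0 and 1 STD, and between 1 and 2 STD.
mass_0_to_1 = float(dist.cdf(1.0) - dist.cdf(0.0))
mass_1_to_2 = float(dist.cdf(2.0) - dist.cdf(1.0))
print(round(mass_0_to_1, 4), round(mass_1_to_2, 4))  # 0.3413 0.1359

# Equally spaced probabilities map to unequally spaced values under ppf.
probs = [0.1, 0.3, 0.5, 0.7, 0.9]
values = dist.ppf(probs)
print(values)
```

The gaps between successive entries of values shrink toward the median and grow in the tails, which is the non-uniformity in question.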

Mark
On Jun 11, 2016, at 10:03 AM, josef.pktd at gmail.com wrote:

> 
> 
> On Sat, Jun 11, 2016 at 8:53 AM, Ralf Gommers <ralf.gommers at gmail.com> wrote:
> Hi Mark,
> 
> Note that the scipy-dev or scipy-user mailing list would have been more appropriate for this question. 
> 
> 
> On Fri, Jun 10, 2016 at 9:06 AM, Mark Gawron <gawron at mail.sdsu.edu> wrote:
> 
> 
> The scipy.stats.qqplot and scipy.stats.probplot functions plot expected values versus actual data values for visualization of fit to a distribution.  First a one-D array of expected percentiles is generated for a sample of size N; then that is passed to dist.ppf, the percent point function for the chosen distribution, to return an array of expected values.  The visualized data points are pairs of expected and actual values, and a linear regression is done on these to produce the line that data points from this distribution should lie on.
> 
> Where x is the input data array and dist the chosen distribution we have:
> 
>> osr = np.sort(x)
>> osm_uniform = _calc_uniform_order_statistic_medians(len(x))
>> osm = dist.ppf(osm_uniform)
>> slope, intercept, r, prob, sterrest = stats.linregress(osm, osr)
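
The four quoted lines can be run end to end roughly as follows. This is a sketch under stated assumptions: the helper below is a hypothetical stand-in for the private _calc_uniform_order_statistic_medians, written from Filliben's estimate as described in the probplot documentation, and the sample x is synthetic data made up for illustration. The result is checked against the public scipy.stats.probplot:

```python
import numpy as np
from scipy import stats

def uniform_order_statistic_medians(n):
    """Filliben's estimate of the uniform order statistic medians.

    Hypothetical stand-in for scipy's private helper; assumes n >= 2.
    """
    v = np.empty(n)
    v[-1] = 0.5 ** (1.0 / n)
    v[0] = 1.0 - v[-1]
    i = np.arange(2, n)
    v[1:-1] = (i - 0.3175) / (n + 0.365)
    return v

# Synthetic sample, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)
dist = stats.norm

osr = np.sort(x)                                      # ordered sample values
osm_uniform = uniform_order_statistic_medians(len(x))  # probabilities in (0, 1)
osm = dist.ppf(osm_uniform)                           # theoretical quantiles
slope, intercept, r, prob, sterrest = stats.linregress(osm, osr)

# Public API that performs the same computation.
(osm_ref, osr_ref), (slope_ref, intercept_ref, r_ref) = stats.probplot(x, dist="norm")
print(np.allclose(osm, osm_ref), np.allclose(osr, osr_ref))
```

For a normal sample, the fitted slope and intercept estimate the scale and location of the data, so they should land near 2 and 5 here.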
> 
> My question concerns the plot display.  
> 
>> ax.plot(osm, osr, 'bo', osm, slope*osm + intercept, 'r-')
> 
> 
> The x-axis of the resulting plot is labeled quantiles, but the xticks and xticklabels produced by qqplot and probplot do not seem correct for their intended interpretation.  First, the numbers on the x-axis do not represent quantiles; the intervals between them do not in general contain equal numbers of points.  For a normal distribution with sigma=1, they represent standard deviations.  Changing the label on the x-axis does not seem like a very good solution, because the interpretation of the values on the x-axis will be different for different distributions.  Rather, the right solution seems to be to actually show quantiles on the x-axis.  The numbers on the x-axis can stay as they are, representing quantile indexes, but they need to be spaced so as to show the actual division points that carve the population up into groups of the same size.  This can be done in something like the following way.
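
One possible reading of this proposal, as a rough sketch only: it is not the code from the original message and not how probplot labels its axis. The function name equal_probability_ticks and the choice of probability labels are assumptions made for illustration. Ticks are placed at dist.ppf of evenly spaced probabilities, so adjacent ticks bracket equal shares of the population:

```python
import numpy as np
from scipy import stats

def equal_probability_ticks(dist, n_groups=10):
    """Tick positions that split the distribution into equal-probability groups.

    Hypothetical helper, not part of scipy.
    """
    probs = np.arange(1, n_groups) / n_groups  # 0.1, 0.2, ..., 0.9 for 10 groups
    positions = dist.ppf(probs)                # division points on the value axis
    labels = [f"{p:.1f}" for p in probs]       # label each tick by its probability
    return positions, labels

positions, labels = equal_probability_ticks(stats.norm, n_groups=10)
print(positions)
print(labels)

# With a matplotlib Axes object `ax` holding the plot, the ticks could be applied via:
#   ax.set_xticks(positions)
#   ax.set_xticklabels(labels)
```

The tick positions crowd together near the median and spread out in the tails, while the data points stay where probplot put them.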
> 
> The ticks are correct I think, but they're theoretical quantiles and not sample quantiles. This was discussed in [1] and is consistent with R [2] and statsmodels [3]. I see that we just forgot to add "theoretical" to the x-axis label (mea culpa). Does adding that resolve your concern?
> 
> [1] https://github.com/scipy/scipy/issues/1821
> [2] http://data.library.virginia.edu/understanding-q-q-plots/
> [3] http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.gofplots.qqplot.html?highlight=qqplot#statsmodels.graphics.gofplots.qqplot
> 
> Ralf
> 
> 
> As a related link: http://phobson.github.io/mpl-probscale/tutorial/closer_look_at_viz.html
> 
> Paul Hobson has done a lot of work on getting different probability scales attached to pp-plots or generalized versions of probability plots. I think qqplots are less ambiguous because they are on the original or standardized scale.
> 
> I haven't worked my way through the various interpretations of the probability axis yet because I find it "not obvious". It might be easier for fields that have a tradition of using probability papers.
> 
> It's planned to be added to the statsmodels probability plots so that there will be a large choice of axis labels and scales.
> 
> Josef
>  
>  
> 
> 
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
> 
