On Sun, May 9, 2010 at 4:49 AM, <josef.pktd@gmail.com> wrote:
On Sun, May 9, 2010 at 1:01 AM, T J <tjhnson@gmail.com> wrote:
The docstring for np.pareto says:
This is a simplified version of the Generalized Pareto distribution (available in SciPy), with the scale set to one and the location set to zero. Most authors default the location to one.
and also:
The probability density for the Pareto distribution is
.. math:: p(x) = \frac{am^a}{x^{a+1}}
where :math:`a` is the shape and :math:`m` the location
These two statements seem to be in contradiction. I think what was meant is that m is the scale, rather than the location. For if m were equal to zero, as the first portion of the docstring states, then the entire pdf would be zero for all shapes a>0.
----
Also, I'm not quite understanding how the stated pdf is actually the same as the pdf for the generalized pareto with the scale=1 and location=0. By the wikipedia definition of the generalized Pareto distribution, if we take \sigma=1 (scale equal to one) and \mu=0 (location equal to zero), then we get:
(1 + a x)^(-1/a - 1)
which is normalized over $x \in (0, \infty)$. If we compare this to the distribution stated in the docstring (with m=1)
a x^{-a-1}
we see that it is normalized over $x \in (1, \infty)$. And indeed, the distribution requires x > scale = 1.
If we integrate the generalized Pareto (with scale=1, location=0) over $x \in (1, \infty)$ then we have to re-normalize. So should the docstring say:
This is a simplified version of the Generalized Pareto distribution (available in Scipy), with the scale set to one, the location set to zero, and the distribution re-normalized over the range (1, \infty). Most authors default the location to one.
I think this is the same point, I was trying to make last year.
Instead of renormalizing, my conclusion was the following, (copied from the mailinglist August last year)
""" my conclusion: --------------------- What numpy.random.pareto actually produces, are random numbers from a pareto distribution with lower bound m=1, but location parameter loc=-1, that shifts the distribution to the left.
To actually get useful random numbers (that are correct in the usual usage http://en.wikipedia.org/wiki/Pareto_distribution), we need to add 1 to them. stats.distributions doesn't use mtrand.pareto is thr rvs_pareto = 1 + numpy.random.pareto(a, size)
"""
I still have to work though the math of your argument, but maybe we can come to an agreement how the docstrings (or the function) should be changed, and what numpy.random.pareto really means.
Josef (grateful, that there are another set of eyes on this)
Maybe this is obvious, but what I would suggest is:
0) Determine precisely what the existing code is doing 1) Decide if that is the desired behavior 2) If it is, make the docstring conform 3) If it isn't, file a "bug" ticket 4) If disposed to do so, fix the code 4) Ensure that the docstring documents the desired behavior (The last two have the same number because they may be done in either order, or even concurrently.) Again, sorry if the above is obvious... DG