The docstring for np.pareto says:
This is a simplified version of the Generalized Pareto distribution (available in SciPy), with the scale set to one and the location set to zero. Most authors default the location to one.
and also:
The probability density for the Pareto distribution is
.. math:: p(x) = \frac{am^a}{x^{a+1}}
where :math:`a` is the shape and :math:`m` the location
These two statements seem to be in contradiction. I think what was meant is that m is the scale, rather than the location. For if m were equal to zero, as the first portion of the docstring states, then the entire pdf would be zero for all shapes a>0.
----
Also, I'm not quite understanding how the stated pdf is actually the same as the pdf for the generalized pareto with the scale=1 and location=0. By the wikipedia definition of the generalized Pareto distribution, if we take \sigma=1 (scale equal to one) and \mu=0 (location equal to zero), then we get:
(1 + a x)^(-1/a - 1)
which is normalized over $x \in (0, \infty)$. If we compare this to the distribution stated in the docstring (with m=1)
a x^{-a-1}
we see that it is normalized over $x \in (1, \infty)$. And indeed, the distribution requires x > scale = 1.
If we integrate the generalized Pareto (with scale=1, location=0) over $x \in (1, \infty)$ then we have to re-normalize. So should the docstring say:
This is a simplified version of the Generalized Pareto distribution (available in Scipy), with the scale set to one, the location set to zero, and the distribution re-normalized over the range (1, \infty). Most authors default the location to one.
On Sun, May 9, 2010 at 1:01 AM, T J tjhnson@gmail.com wrote:
The docstring for np.pareto says:
This is a simplified version of the Generalized Pareto distribution (available in SciPy), with the scale set to one and the location set to zero. Most authors default the location to one.
and also:
The probability density for the Pareto distribution is
.. math:: p(x) = \frac{am^a}{x^{a+1}}
where :math:`a` is the shape and :math:`m` the location
These two statements seem to be in contradiction. I think what was meant is that m is the scale, rather than the location. For if m were equal to zero, as the first portion of the docstring states, then the entire pdf would be zero for all shapes a>0.
Also, I'm not quite understanding how the stated pdf is actually the same as the pdf for the generalized pareto with the scale=1 and location=0. By the wikipedia definition of the generalized Pareto distribution, if we take \sigma=1 (scale equal to one) and \mu=0 (location equal to zero), then we get:
(1 + a x)^(-1/a - 1)
which is normalized over $x \in (0, \infty)$. If we compare this to the distribution stated in the docstring (with m=1)
a x^{-a-1}
we see that it is normalized over $x \in (1, \infty)$. And indeed, the distribution requires x > scale = 1.
If we integrate the generalized Pareto (with scale=1, location=0) over $x \in (1, \infty)$ then we have to re-normalize. So should the docstring say:
This is a simplified version of the Generalized Pareto distribution (available in Scipy), with the scale set to one, the location set to zero, and the distribution re-normalized over the range (1, \infty). Most authors default the location to one.
I think this is the same point, I was trying to make last year.
Instead of renormalizing, my conclusion was the following, (copied from the mailinglist August last year)
""" my conclusion: --------------------- What numpy.random.pareto actually produces, are random numbers from a pareto distribution with lower bound m=1, but location parameter loc=-1, that shifts the distribution to the left.
To actually get useful random numbers (that are correct in the usual usage http://en.wikipedia.org/wiki/Pareto_distribution), we need to add 1 to them. stats.distributions doesn't use mtrand.pareto
rvs_pareto = 1 + numpy.random.pareto(a, size)
"""
I still have to work though the math of your argument, but maybe we can come to an agreement how the docstrings (or the function) should be changed, and what numpy.random.pareto really means.
Josef (grateful, that there are another set of eyes on this)
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Sun, May 9, 2010 at 4:49 AM, josef.pktd@gmail.com wrote:
On Sun, May 9, 2010 at 1:01 AM, T J tjhnson@gmail.com wrote:
The docstring for np.pareto says:
This is a simplified version of the Generalized Pareto distribution (available in SciPy), with the scale set to one and the location set
to
zero. Most authors default the location to one.
and also:
The probability density for the Pareto distribution is
.. math:: p(x) = \frac{am^a}{x^{a+1}}
where :math:`a` is the shape and :math:`m` the location
These two statements seem to be in contradiction. I think what was meant is that m is the scale, rather than the location. For if m were equal to zero, as the first portion of the docstring states, then the entire pdf would be zero for all shapes a>0.
Also, I'm not quite understanding how the stated pdf is actually the same as the pdf for the generalized pareto with the scale=1 and location=0. By the wikipedia definition of the generalized Pareto distribution, if we take \sigma=1 (scale equal to one) and \mu=0 (location equal to zero), then we get:
(1 + a x)^(-1/a - 1)
which is normalized over $x \in (0, \infty)$. If we compare this to the distribution stated in the docstring (with m=1)
a x^{-a-1}
we see that it is normalized over $x \in (1, \infty)$. And indeed, the distribution requires x > scale = 1.
If we integrate the generalized Pareto (with scale=1, location=0) over $x \in (1, \infty)$ then we have to re-normalize. So should the docstring say:
This is a simplified version of the Generalized Pareto distribution (available in Scipy), with the scale set to one, the location set to
zero,
and the distribution re-normalized over the range (1, \infty). Most authors default the location to one.
I think this is the same point, I was trying to make last year.
Instead of renormalizing, my conclusion was the following, (copied from the mailinglist August last year)
""" my conclusion:
What numpy.random.pareto actually produces, are random numbers from a pareto distribution with lower bound m=1, but location parameter loc=-1, that shifts the distribution to the left.
To actually get useful random numbers (that are correct in the usual usage http://en.wikipedia.org/wiki/Pareto_distribution), we need to add 1 to them. stats.distributions doesn't use mtrand.pareto is thr rvs_pareto = 1 + numpy.random.pareto(a, size)
"""
I still have to work though the math of your argument, but maybe we can come to an agreement how the docstrings (or the function) should be changed, and what numpy.random.pareto really means.
Josef (grateful, that there are another set of eyes on this)
Maybe this is obvious, but what I would suggest is:
0) Determine precisely what the existing code is doing
1) Decide if that is the desired behavior
2) If it is, make the docstring conform
3) If it isn't, file a "bug" ticket
4) If disposed to do so, fix the code
4) Ensure that the docstring documents the desired behavior
(The last two have the same number because they may be done in either order, or even concurrently.)
Again, sorry if the above is obvious...
DG
On Sun, May 9, 2010 at 4:49 AM, josef.pktd@gmail.com wrote:
I think this is the same point, I was trying to make last year.
Instead of renormalizing, my conclusion was the following, (copied from the mailinglist August last year)
""" my conclusion:
What numpy.random.pareto actually produces, are random numbers from a pareto distribution with lower bound m=1, but location parameter loc=-1, that shifts the distribution to the left.
To actually get useful random numbers (that are correct in the usual usage http://en.wikipedia.org/wiki/Pareto_distribution), we need to add 1 to them. stats.distributions doesn't use mtrand.pareto
rvs_pareto = 1 + numpy.random.pareto(a, size)
"""
I still have to work though the math of your argument, but maybe we can come to an agreement how the docstrings (or the function) should be changed, and what numpy.random.pareto really means.
Josef (grateful, that there are another set of eyes on this)
Yes, I think my "renormalizing" statement is incorrect as it is really just sampling from a different pdf altogether. See the following image:
http://www.dumpt.com/img/viewer.php?file=q9tfk7ehxsw865vn067c.png
It plots histograms of the various implementations against the pdfs. Summarizing:
The NumPy implementation is based on (Devroye p. 262). The pdf listed there is:
a / (1+x)^(a+1)
This differs from the "standard" Pareto pdf:
a / x^(a+1)
It also differs from the pdf of the generalized Pareto distribution, with scale=1 and location=0:
(1 + a x)^(-1/a - 1)
And it also differs from the pdf of the generalized Pareto distribution with scale=1 and location=-1 or location=1.
random.paretovariate and scipy.stats.pareto sample from the standard Pareto, and this is the desired behavior, IMO. Its true that "1 + np.random.pareto" provides the fix, but I think we're better off changing the underlying implementation. Devroye has a more recent paper:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.85.8760
which states the Pareto distribution in the standard way. So I think it is safe to make this change. Backwards compatibility might be the only argument for not making this change. So here is my proposal:
1) Remove every mention of the generalized Pareto distribution from the docstring. As far as I can see, the generalized Pareto distribution does not reduce to the "standard" Pareto at all. We can still mention scipy.stats.distributions.genpareto and scipy.stats.distributions.pareto. The former is something different and the latter will (now) be equivalent to the NumPy function.
2) Modify numpy/random/mtrand/distributions.c in the following way:
double rk_pareto(rk_state *state, double a) { //return exp(rk_standard_exponential(state)/a) - 1; return 1.0 / rk_double(state)**(1.0 / a); }
Does this sound good?
On Mon, May 10, 2010 at 11:14 AM, T J tjhnson@gmail.com wrote:
On Sun, May 9, 2010 at 4:49 AM, josef.pktd@gmail.com wrote:
I think this is the same point, I was trying to make last year.
Instead of renormalizing, my conclusion was the following, (copied from the mailinglist August last year)
""" my conclusion:
What numpy.random.pareto actually produces, are random numbers from a pareto distribution with lower bound m=1, but location parameter loc=-1, that shifts the distribution to the left.
To actually get useful random numbers (that are correct in the usual usage http://en.wikipedia.org/wiki/Pareto_distribution), we need to add 1 to them. stats.distributions doesn't use mtrand.pareto
rvs_pareto = 1 + numpy.random.pareto(a, size)
"""
I still have to work though the math of your argument, but maybe we can come to an agreement how the docstrings (or the function) should be changed, and what numpy.random.pareto really means.
Josef (grateful, that there are another set of eyes on this)
Yes, I think my "renormalizing" statement is incorrect as it is really just sampling from a different pdf altogether. See the following image:
http://www.dumpt.com/img/viewer.php?file=q9tfk7ehxsw865vn067c.png
It plots histograms of the various implementations against the pdfs. Summarizing:
The NumPy implementation is based on (Devroye p. 262). The pdf listed there is:
a / (1+x)^(a+1)
This differs from the "standard" Pareto pdf:
a / x^(a+1)
It also differs from the pdf of the generalized Pareto distribution, with scale=1 and location=0:
(1 + a x)^(-1/a - 1)
And it also differs from the pdf of the generalized Pareto distribution with scale=1 and location=-1 or location=1.
random.paretovariate and scipy.stats.pareto sample from the standard Pareto, and this is the desired behavior, IMO. Its true that "1 + np.random.pareto" provides the fix, but I think we're better off changing the underlying implementation. Devroye has a more recent paper:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.85.8760
which states the Pareto distribution in the standard way. So I think it is safe to make this change. Backwards compatibility might be the only argument for not making this change. So here is my proposal:
- Remove every mention of the generalized Pareto distribution from
the docstring. As far as I can see, the generalized Pareto distribution does not reduce to the "standard" Pareto at all. We can still mention scipy.stats.distributions.genpareto and scipy.stats.distributions.pareto. The former is something different and the latter will (now) be equivalent to the NumPy function.
- Modify numpy/random/mtrand/distributions.c in the following way:
double rk_pareto(rk_state *state, double a) { //return exp(rk_standard_exponential(state)/a) - 1; return 1.0 / rk_double(state)**(1.0 / a); }
Does this sound good? _______________________________________________
Whatever the community decides, don't forget to please go through the formal procedure of submitting a "bug" ticket so all of this is recorded in the "right" way in the "right" place. Thanks!
DG
On Mon, May 10, 2010 at 2:14 PM, T J tjhnson@gmail.com wrote:
On Sun, May 9, 2010 at 4:49 AM, josef.pktd@gmail.com wrote:
I think this is the same point, I was trying to make last year.
Instead of renormalizing, my conclusion was the following, (copied from the mailinglist August last year)
""" my conclusion:
What numpy.random.pareto actually produces, are random numbers from a pareto distribution with lower bound m=1, but location parameter loc=-1, that shifts the distribution to the left.
To actually get useful random numbers (that are correct in the usual usage http://en.wikipedia.org/wiki/Pareto_distribution), we need to add 1 to them. stats.distributions doesn't use mtrand.pareto
rvs_pareto = 1 + numpy.random.pareto(a, size)
"""
I still have to work though the math of your argument, but maybe we can come to an agreement how the docstrings (or the function) should be changed, and what numpy.random.pareto really means.
Josef (grateful, that there are another set of eyes on this)
Yes, I think my "renormalizing" statement is incorrect as it is really just sampling from a different pdf altogether. See the following image:
http://www.dumpt.com/img/viewer.php?file=q9tfk7ehxsw865vn067c.png
It plots histograms of the various implementations against the pdfs. Summarizing:
The NumPy implementation is based on (Devroye p. 262). The pdf listed there is:
a / (1+x)^(a+1)
This differs from the "standard" Pareto pdf:
a / x^(a+1)
It also differs from the pdf of the generalized Pareto distribution, with scale=1 and location=0:
(1 + a x)^(-1/a - 1)
And it also differs from the pdf of the generalized Pareto distribution with scale=1 and location=-1 or location=1.
random.paretovariate and scipy.stats.pareto sample from the standard Pareto, and this is the desired behavior, IMO. Its true that "1 + np.random.pareto" provides the fix, but I think we're better off changing the underlying implementation. Devroye has a more recent paper:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.85.8760
which states the Pareto distribution in the standard way. So I think it is safe to make this change. Backwards compatibility might be the only argument for not making this change. So here is my proposal:
1) Remove every mention of the generalized Pareto distribution from the docstring. As far as I can see, the generalized Pareto distribution does not reduce to the "standard" Pareto at all. We can still mention scipy.stats.distributions.genpareto and scipy.stats.distributions.pareto. The former is something different and the latter will (now) be equivalent to the NumPy function.
2) Modify numpy/random/mtrand/distributions.c in the following way:
double rk_pareto(rk_state *state, double a) { //return exp(rk_standard_exponential(state)/a) - 1; return 1.0 / rk_double(state)**(1.0 / a); }
Does this sound good?
I went googling and found a new interpretation
numpy.random.pareto is actually the Lomax distribution also known as Pareto 2, Pareto (II) or Pareto Second Kind distribution
http://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/pa2pdf.htm http://www.mathwave.com/articles/pareto_2_lomax_distribution.html http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/VGAM/html/lomax.html
which is different from the Pareto (First Kind) distribution http://www.mathwave.com/help/easyfit/html/analyses/distributions/pareto.html http://en.wikipedia.org/wiki/Pareto_distribution#Density_function
The R-VGAM docs have the reference to http://books.google.ca/books?id=pGsAZ3W7uEAC&pg=PA226&lpg=PA226&...
which states on page 227
X is distributed as Lomax(b,q) <=> X + b is distributed Pareto(b, q)
where b is the scale parameter b=1 in numpy.random.pareto notation and q is the shape parameter alpha
quote: "... the Pareto (II) distribution - after all, it's just a shifted classical Pareto distribution - ..."
So, from this it looks like numpy.random does not have a Pareto distribution, only Lomax, and the confusion came maybe because somewhere in the history the (II) (second kind) got dropped in the explanations.
and actually it is in scipy.stats.distributions, but without rvs
# LOMAX (Pareto of the second kind.) # Special case of Pareto of the first kind (location=-1.0)
class lomax_gen(rv_continuous): def _pdf(self, x, c): return c*1.0/(1.0+x)**(c+1.0) def _cdf(self, x, c): return 1.0-1.0/(1.0+x)**c def _ppf(self, q, c): return pow(1.0-q,-1.0/c)-1 def _stats(self, c): mu, mu2, g1, g2 = pareto.stats(c, loc=-1.0, moments='mvsk') return mu, mu2, g1, g2 def _entropy(self, c): return 1+1.0/c-log(c) lomax = lomax_gen(a=0.0, name="lomax", longname="A Lomax (Pareto of the second kind)", shapes="c", extradoc="""
Lomax (Pareto of the second kind) distribution
lomax.pdf(x,c) = c / (1+x)**(c+1) for x >= 0, c > 0. """
There are too many distribution, and too many different names.
So here is my proposal:
- Remove every mention of the generalized Pareto distribution from
the docstring. As far as I can see, the generalized Pareto distribution does not reduce to the "standard" Pareto at all. We can still mention scipy.stats.distributions.genpareto and scipy.stats.distributions.pareto. The former is something different and the latter will (now) be equivalent to the NumPy function.
agreed, although I haven't figured out yet why Pareto and generalized Pareto have similar names
- Modify numpy/random/mtrand/distributions.c in the following way:
double rk_pareto(rk_state *state, double a) { //return exp(rk_standard_exponential(state)/a) - 1; return 1.0 / rk_double(state)**(1.0 / a); }
I'm not an expert on random number generator, but using the uniform distribution as in http://en.wikipedia.org/wiki/Pareto_distribution#Generating_a_random_sample_... and your Devroy reference seems better, than based on the relationship to the exponential distribution http://en.wikipedia.org/wiki/Pareto_distribution#Relation_to_the_exponential...
I think without changing the source we can rewrite the docstring that this is Lomax (or Pareto of the Second Kind), so that at least the documentation is less misleading.
But I find calling it Pareto very confusing, and I'm not the only one anymore, (and I don't know if anyone has used it assuming it is classical Pareto), so my preferred solution would be
* rename numpy.random.pareto to numpy.random.lomax * and create a real (classical, first kind) pareto distribution (even though it's just adding or subtracting 1, ones we know it)
(and I'm adding the _rvs to scipy.stats.lomax)
What's the backwards compatibility policy with very confusing names in numpy?
Josef
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Mon, May 10, 2010 at 8:37 PM, josef.pktd@gmail.com wrote:
I went googling and found a new interpretation
numpy.random.pareto is actually the Lomax distribution also known as Pareto 2, Pareto (II) or Pareto Second Kind distribution
Great!
So, from this it looks like numpy.random does not have a Pareto distribution, only Lomax, and the confusion came maybe because somewhere in the history the (II) (second kind) got dropped in the explanations.
and actually it is in scipy.stats.distributions, but without rvs
# LOMAX (Pareto of the second kind.) # Special case of Pareto of the first kind (location=-1.0)
I understand the point with this last comment, but I think it can be confusing in that the Pareto (of the first kind) has no "location" parameter and people might think you are referring to the Generalized Pareto distribution. I think its much clearer to say:
# Special case of the Pareto of the first kind, but shifted to the left by 1. x --> x + 1
2) Modify numpy/random/mtrand/distributions.c in the following way:
double rk_pareto(rk_state *state, double a) { //return exp(rk_standard_exponential(state)/a) - 1; return 1.0 / rk_double(state)**(1.0 / a); }
I'm not an expert on random number generator, but using the uniform distribution as in http://en.wikipedia.org/wiki/Pareto_distribution#Generating_a_random_sample_... and your Devroy reference seems better, than based on the relationship to the exponential distribution http://en.wikipedia.org/wiki/Pareto_distribution#Relation_to_the_exponential...
Correct. The exp relationship was for the existing implementation (which corresponds to the Lomax). I commented that line out and just used 1/U^(1/a).
I think without changing the source we can rewrite the docstring that this is Lomax (or Pareto of the Second Kind), so that at least the documentation is less misleading.
But I find calling it Pareto very confusing, and I'm not the only one anymore, (and I don't know if anyone has used it assuming it is classical Pareto), so my preferred solution would be
- rename numpy.random.pareto to numpy.random.lomax
- and create a real (classical, first kind) pareto distribution (even
though it's just adding or subtracting 1, ones we know it)
I personally have used numpy.random.pareto thinking it was the Pareto distribution of the first kind---which led to this post in the first place. So, I'm in strong agreement. While doing this, perhaps we should increase functionality and allow users the ability to specify the scale of the distribution (instead of just the shape)?
I can make a ticket for this and give a stab at creating the necessary patch.
What's the backwards compatibility policy with very confusing names in numpy?
It seems reasonable that we might have to follow the deprecation route, but I'd be happier with a "faster" fix.
1.5 - Provide numpy.random.lomax. Make numpy.random.pareto raise a DeprecationWarning and then call lomax. 2.0 (if there is no 1.6) - Make numpy.random.pareto behave as Pareto distribution of 1st kind.
Immediately though, we can modify the docstring that is currently in there to make the situation clear, instructing users how they can generate samples from the "standard" Pareto distribution. This is the first patch I'll submit. Perhaps it is better to only change the docstring and then save all changes in functionality for 2.0. Deferring to others on this one...
On Tue, May 11, 2010 at 12:23 AM, T J tjhnson@gmail.com wrote:
On Mon, May 10, 2010 at 8:37 PM, josef.pktd@gmail.com wrote:
I went googling and found a new interpretation
numpy.random.pareto is actually the Lomax distribution also known as
Pareto 2,
Pareto (II) or Pareto Second Kind distribution
Great!
So, from this it looks like numpy.random does not have a Pareto distribution, only Lomax, and the confusion came maybe because somewhere in the history the (II) (second kind) got dropped in the explanations.
and actually it is in scipy.stats.distributions, but without rvs
# LOMAX (Pareto of the second kind.) # Special case of Pareto of the first kind (location=-1.0)
I understand the point with this last comment, but I think it can be confusing in that the Pareto (of the first kind) has no "location" parameter and people might think you are referring to the Generalized Pareto distribution. I think its much clearer to say:
# Special case of the Pareto of the first kind, but shifted to the left by 1. x --> x + 1
- Modify numpy/random/mtrand/distributions.c in the following way:
double rk_pareto(rk_state *state, double a) { //return exp(rk_standard_exponential(state)/a) - 1; return 1.0 / rk_double(state)**(1.0 / a); }
I'm not an expert on random number generator, but using the uniform
distribution
as in
http://en.wikipedia.org/wiki/Pareto_distribution#Generating_a_random_sample_...
and your Devroy reference seems better, than based on the relationship to the exponential distribution
http://en.wikipedia.org/wiki/Pareto_distribution#Relation_to_the_exponential...
Correct. The exp relationship was for the existing implementation (which corresponds to the Lomax). I commented that line out and just used 1/U^(1/a).
I think without changing the source we can rewrite the docstring that this is Lomax (or Pareto of the Second Kind), so that at least the documentation is less misleading.
But I find calling it Pareto very confusing, and I'm not the only one
anymore,
(and I don't know if anyone has used it assuming it is classical Pareto), so my preferred solution would be
- rename numpy.random.pareto to numpy.random.lomax
- and create a real (classical, first kind) pareto distribution (even
though it's just adding or subtracting 1, ones we know it)
I personally have used numpy.random.pareto thinking it was the Pareto distribution of the first kind---which led to this post in the first place. So, I'm in strong agreement. While doing this, perhaps we should increase functionality and allow users the ability to specify the scale of the distribution (instead of just the shape)?
I can make a ticket for this and give a stab at creating the necessary patch.
What's the backwards compatibility policy with very confusing names in
numpy?
It seems reasonable that we might have to follow the deprecation route, but I'd be happier with a "faster" fix.
1.5
- Provide numpy.random.lomax. Make numpy.random.pareto raise a
DeprecationWarning and then call lomax. 2.0 (if there is no 1.6)
- Make numpy.random.pareto behave as Pareto distribution of 1st kind.
Immediately though, we can modify the docstring that is currently in there to make the situation clear, instructing users how they can generate samples from the "standard" Pareto distribution. This is the first patch I'll submit. Perhaps it is better to only change the docstring and then save all changes in functionality for 2.0. Deferring to others on this one...
Elsewhere in the mailing list, it has been stated that our "policy" is to document desired/intended behavior, when such differs from actual (current) behavior. This can be done in advance of a code fix to implement the desired behavior, but we have discouraged (to the point of saying "don't do it") documenting current behavior when it is known that this should (and presumably will) be changed.
DG
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Tue, 11 May 2010 00:23:52 -0700, T J wrote: [clip]
It seems reasonable that we might have to follow the deprecation route, but I'd be happier with a "faster" fix.
1.5
- Provide numpy.random.lomax. Make numpy.random.pareto raise a DeprecationWarning and then call lomax.
2.0 (if there is no 1.6)
- Make numpy.random.pareto behave as Pareto distribution of 1st kind.
I think the next Numpy release will be 2.0.
How things were done with changes in the histogram function were:
1) Add a "new=False" keyword argument, and raise a DeprecationWarning if new==False. The user then must call it with "pareto(..., new=True)" to get the correct behaviour.
2) In the next release, change the default to "new=True".
Another option would be to add a correct implementation with a different name, e.g. `pareto1` to signal it's the first kind, and deprecate the old function altogether.
A third option would be just to silently fix the bug. In any case the change should be mentioned noticeably in the release notes.
On Tue, May 11, 2010 at 4:14 AM, Pauli Virtanen pav@iki.fi wrote:
A third option would be just to silently fix the bug. In any case the change should be mentioned noticeably in the release notes.
I see this as two bugs: the Lomax distribution was named incorrectly and the Parato distribution was incorrect or confusingly labeled. Both should be fixed and clearly documented. Unlike cases of changing tastes and preferences, it seems unduly complicated and confusing to perseverate with backward compatibility shims. The next release is NumPy 2.0, which will have other known and well advertised API and ABI incompatibilities.
Just my 2e-10 cents, -Kevin
On Tue, May 11, 2010 at 8:11 AM, Kevin Jacobs jacobs@bioinformed.com bioinformed@gmail.com wrote:
On Tue, May 11, 2010 at 4:14 AM, Pauli Virtanen pav@iki.fi wrote:
A third option would be just to silently fix the bug. In any case the change should be mentioned noticeably in the release notes.
I see this as two bugs: the Lomax distribution was named incorrectly and the Parato distribution was incorrect or confusingly labeled. Both should be fixed and clearly documented. Unlike cases of changing tastes and preferences, it seems unduly complicated and confusing to perseverate with backward compatibility shims. The next release is NumPy 2.0, which will have other known and well advertised API and ABI incompatibilities. Just my 2e-10 cents, -Kevin
I would have also considered it as a bug fix, except that there might be users who use the correction (+1) as a workaround. In that case, just changing the behavior without raising an exception for the current usage will introduce hard to find bugs. (It's difficult to see whether the random numbers are correct or as expected without proper testing.)
For example, we use the work-around in the docstring of http://docs.scipy.org/numpy/docs/numpy.random.mtrand.RandomState.power/
and actually, reading the numpy.random.pareto docstring again more carefully, the example does the correction also::
Draw samples from the distribution:
a, m = 3., 1. # shape and mode s = np.random.pareto(a, 1000) + m
But it's very confusing, also there is a relationship between Pareto/Lomax and GPD, but I'm not sure yet my algebra is correct. (and I have misplaced the graphs and tables with the relationships between different distributions)
To minimize backwards compatibility problems we could attach a *big* warning text to pareto ("use at your own risk") and create new random variates, as Pauli proposed
pareto1 - classical pareto pareto2 or lomax - with random variates the same as current pareto
both could then get clear, unambiguous descriptions.
and a note to using the uniform distribution for the generation of random numbers. python 2.5 random.py uses the half open uniform distribution to avoid division by zero, I don't know how numpy.random handles boundary values
def paretovariate(self, alpha): """Pareto distribution. alpha is the shape parameter.""" # Jain, pg. 495
u = 1.0 - self.random() return 1.0 / pow(u, 1.0/alpha)
Josef
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Assuming no typos
Relationship between Pareto, Pareto(II)/Lomax and Generalized Pareto,GPD
import sympy as sy x = sy.Symbol('x') k = sy.Symbol('k') c = sy.Symbol('c') a = sy.Symbol('a') m = sy.Symbol('m') mgpd = sy.Symbol('mgpd')
gpd0 = (1 - c*x/k)**(1/c - 1)/k #JKB notation (c reversed sign) gpd = 1/k/(1 + c*(x-mgpd)/k)**(1/c + 1) #similar to Wikipedia par = a*k**a/(x-m)**(1+a) #JKB lom = a/k/(1+x/k)**(1+a)
lom.subs(k,1) #Pareto(II), Lomax (loc=0, scale=1)
a*(1 + x)**(-1 - a)
par.subs(k,1).subs(m,-1) #Pareto with loc=-1, scale=1
a*(1 + x)**(-1 - a)
gpd.subs(c,1/a).subs(k,1/a).subs(mgpd,0) #GPD with loc=0, scale=1/a
a*(1 + x)**(-1 - a)
par.subs(k,1).subs(m,0) # standard Pareto (loc=0, scale=1)
a*x**(-1 - a)
gpd.subs(c,1/a).subs(k,1/a).subs(mgpd,1) #GPD with loc=1, scale=1/a
a*x**(-1 - a)
Josef