On Mon, May 10, 2010 at 11:14 AM, T J <tjhnson@gmail.com> wrote:
On Sun, May 9, 2010 at 4:49 AM,  <josef.pktd@gmail.com> wrote:
>
> I think this is the same point, I was trying to make last year.
>
> Instead of renormalizing, my conclusion was the following,
> (copied from the mailinglist August last year)
>
> """
> my conclusion:
> ---------------------
> What numpy.random.pareto actually produces, are random numbers from a
> pareto distribution with lower bound m=1, but location parameter
> loc=-1, that shifts the distribution to the left.
>
> To actually get useful  random numbers (that are correct in the usual
> usage http://en.wikipedia.org/wiki/Pareto_distribution), we need to
> add 1 to them.
> stats.distributions doesn't use mtrand.pareto
>
> rvs_pareto = 1 + numpy.random.pareto(a, size)
>
> """
>
> I still have to work though the math of your argument, but maybe we
> can come to an agreement how the docstrings (or the function) should
> be changed, and what numpy.random.pareto really means.
>
> Josef
> (grateful, that there are another set of eyes on this)
>
>


Yes, I think my "renormalizing" statement is incorrect as it is really
just sampling from a different pdf altogether.  See the following image:

http://www.dumpt.com/img/viewer.php?file=q9tfk7ehxsw865vn067c.png

It plots histograms of the various implementations against the pdfs.
Summarizing:

The NumPy implementation is based on (Devroye p. 262).  The pdf listed
there is:

   a / (1+x)^(a+1)

This differs from the "standard" Pareto pdf:

   a / x^(a+1)

It also differs from the pdf of the generalized Pareto distribution,
with scale=1 and location=0:

   (1 + a x)^(-1/a - 1)

And it also differs from the pdf of the generalized Pareto
distribution with scale=1 and location=-1  or location=1.

random.paretovariate and scipy.stats.pareto sample from the standard
Pareto, and this is the desired behavior, IMO.  Its true that "1 +
np.random.pareto" provides the fix, but I think we're better off
changing the underlying implementation.  Devroye has a more recent
paper:

 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.85.8760

which states the Pareto distribution in the standard way.  So I think
it is safe to make this change.  Backwards compatibility might be the
only argument for not making this change.  So here is my proposal:

 1) Remove every mention of the generalized Pareto distribution from
the docstring.  As far as I can see, the generalized Pareto
distribution does not reduce to the "standard" Pareto at all.  We can
still mention scipy.stats.distributions.genpareto and
scipy.stats.distributions.pareto.  The former is something different
and the latter will (now) be equivalent to the NumPy function.

 2) Modify numpy/random/mtrand/distributions.c in the following way:

double rk_pareto(rk_state *state, double a)
{
  //return exp(rk_standard_exponential(state)/a) - 1;
  return 1.0 / rk_double(state)**(1.0 / a);
}

Does this sound good?
_______________________________________________

Whatever the community decides, don't forget to please go through the formal procedure of submitting a "bug" ticket so all of this is recorded in the "right" way in the "right" place.  Thanks!

DG
--
Mathematician: noun, someone who disavows certainty when their uncertainty set is non-empty, even if that set has measure zero.