[Numpy-discussion] Deprecate zipf distribution?

Warren Weckesser warren.weckesser at gmail.com
Sat Oct 7 14:15:17 EDT 2017


On Sat, Oct 7, 2017 at 11:29 AM, Charles R Harris <charlesr.harris at gmail.com> wrote:

> Hi All,
>
> The current NumPy implementation of the truncated zipf distribution has
> several drawbacks.
>
>
>    - Extremely poor performance when the parameter `a` is near 1. For
>    instance, when `a = 1.000001` a simple change in the implementation speeds
>    things up by a factor of 1,657. When the parameter is closer to 1, the
>    algorithm effectively hangs.
>    - Because the distribution is truncated, say to integers in the range
>    of int64, the parameter could be allowed to take all values > 0, even
>    though the untruncated series diverges. There is some indication that such
>    values of `a` can be useful in modeling because of the heavy distribution
>    in the tail.
>
> Because fixing these problems will change the output stream, I suggest
> implementing a truncated zeta distribution, which is an alternative name
> for the same distribution, and deprecating the zipf distribution.
> Furthermore, rather than truncate at the value of C long, which varies,
> truncate at max(int64), or some possibly smaller value, say 2**44, which
> allows all integers up to that value to be realized with approximately
> correct probabilities when using double precision for the intermediate
> computations.
>
> Thoughts?
>
>
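To illustrate the second point in the quoted message: once the support is truncated, weights proportional to k**(-a) can be normalized for any a > 0, including a <= 1. The following is only a minimal sketch, not the NumPy implementation and not the proposed one. The helper names and the small truncation point `n_max` are assumptions made so that the whole table of weights fits in memory.

```python
import numpy as np


def truncated_zipf_pmf(a, n_max):
    """Hypothetical helper: pmf on k = 1..n_max with weights k**(-a)."""
    if a <= 0:
        raise ValueError("a must be positive")
    k = np.arange(1, n_max + 1, dtype=np.float64)
    weights = k ** (-a)
    # The sum is finite because the support is truncated, even when a <= 1.
    return weights / weights.sum()


def sample_truncated_zipf(a, n_max, size, seed=None):
    """Hypothetical helper: draw variates by sampling from the explicit pmf."""
    rng = np.random.RandomState(seed)
    pmf = truncated_zipf_pmf(a, n_max)
    return rng.choice(np.arange(1, n_max + 1), size=size, p=pmf)


# a = 0.5 is not accepted by the untruncated distribution (which requires a > 1).
samples = sample_truncated_zipf(a=0.5, n_max=1000, size=10, seed=0)
```

Building the full table is only practical for small `n_max`; truncating at something like 2**44 would need a different sampling method.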
It is time that the 'random' API is extended to include some means of
selecting a version of the random number generation algorithm.  This has
come up in discussions on github (e.g.
https://github.com/numpy/numpy/pull/5158#issuecomment-58185802).  Then
instead of deprecating the existing 'zipf' function, the user has the
option of selecting which version of the code to use.  Current users that
are satisfied with the existing 'zipf' implementation are not affected.
But I'm not against deprecating 'zipf' if the code is bad enough that the
best long-term option is removing it.
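
As a rough sketch of what such version selection could look like (purely hypothetical: the `VersionedRandom` class, the `version` argument, and the stand-in "version 1" algorithm are assumptions for illustration, not an existing or proposed NumPy API):

```python
import numpy as np


def _zipf_legacy(rng, a, size):
    # Version 0: delegate to the existing NumPy implementation unchanged.
    return rng.zipf(a, size=size)


def _zipf_truncated(rng, a, size, n_max=10_000):
    # Version 1: stand-in for a revised algorithm (an assumption for this
    # sketch); it samples from an explicit pmf on 1..n_max.
    k = np.arange(1, n_max + 1)
    weights = k.astype(np.float64) ** (-a)
    return rng.choice(k, size=size, p=weights / weights.sum())


_ZIPF_VERSIONS = {0: _zipf_legacy, 1: _zipf_truncated}


class VersionedRandom:
    """Hypothetical wrapper that dispatches on an algorithm version."""

    def __init__(self, seed=None, version=0):
        if version not in _ZIPF_VERSIONS:
            raise ValueError("unknown version: %r" % (version,))
        self._rng = np.random.RandomState(seed)
        self._version = version

    def zipf(self, a, size=None):
        return _ZIPF_VERSIONS[self._version](self._rng, a, size)


# Version 0 reproduces the stream existing users already get.
old_stream = VersionedRandom(seed=123, version=0).zipf(2.0, size=5)
# Version 1 opts in to the new behaviour and a different stream.
new_stream = VersionedRandom(seed=123, version=1).zipf(2.0, size=5)
```

The point of the design is that the default version keeps the output stream for a given seed unchanged, while new code can opt in to a fixed algorithm.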

Something like this will be needed if there is interest in merging a pull
request that I just submitted (https://github.com/numpy/numpy/pull/9834)
that fixes (and improves the performance of) the generation of
hypergeometric variates when the number of samples drawn is small.

Warren



> Chubk
>

I think Chuck just got a new hip-hop name. :)



>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>