[Numpy-discussion] Deprecate zipf distribution?

Charles R Harris charlesr.harris at gmail.com
Sat Oct 7 11:29:12 EDT 2017


Hi All,

The current NumPy implementation of the truncated zipf distribution has
several drawbacks.


   - Extremely poor performance when the parameter `a` is near 1. For
   instance, when `a = 1.000001` a simple change in the implementation speeds
   things up by a factor of 1,657. When the parameter is closer to 1, the
   algorithm effectively hangs.
   - Because the distribution is truncated, say to integers in the range of
   int64, the parameter could be allowed to take all values > 0, even though
   the untruncated series diverges for `a <= 1`. There is some indication that
   such values of `a` can be useful in modeling because of the heavy tail of
   the distribution.
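
To illustrate the second point, here is a minimal sketch showing that the
truncated distribution normalizes for any `a > 0`, including `a <= 1`. It is
not the NumPy implementation; the names `truncated_zeta_pmf` and
`sample_truncated_zeta` are made up for this example, and the brute-force
inverse-CDF approach is only practical for a small cutoff `n_max`.

```python
import bisect
import random
from itertools import accumulate


def truncated_zeta_pmf(a, n_max):
    """Return P(k) for k = 1..n_max, with P(k) proportional to k**-a.

    Hypothetical helper: the finite sum always converges, so any a > 0 works.
    """
    if a <= 0:
        raise ValueError("a must be > 0")
    weights = [k ** -a for k in range(1, n_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_truncated_zeta(a, n_max, size, rng=None):
    """Draw `size` samples by inverting the tabulated CDF (O(n_max) memory)."""
    rng = rng or random.Random()
    cdf = list(accumulate(truncated_zeta_pmf(a, n_max)))
    cdf[-1] = 1.0  # guard against rounding in the running sum
    # bisect_left finds the first k whose CDF value is >= the uniform draw.
    return [bisect.bisect_left(cdf, rng.random()) + 1 for _ in range(size)]


# a = 0.5 is invalid for the untruncated series but fine once truncated.
pmf = truncated_zeta_pmf(0.5, 1000)
samples = sample_truncated_zeta(0.5, 1000, 10, random.Random(0))
```

A real replacement would need a rejection or inversion scheme that does not
tabulate the CDF, since the proposed cutoffs are far too large for a table.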

Because fixing these problems will change the output stream, I suggest
implementing a truncated zeta distribution, which is an alternative name
for the same distribution, and deprecating the zipf distribution.
Furthermore, rather than truncate at the maximum value of a C long, which
varies by platform, truncate at max(int64), or some possibly smaller value,
say 2**44, which
allows all integers up to that value to be realized with approximately
correct probabilities when using double precision for the intermediate
computations.
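
As a rough illustration of the double-precision concern (this is only one way
to look at it, and the helper name `double_spacing` is invented for the
example), the snippet below checks the gap between adjacent doubles at a few
magnitudes. It assumes Python 3.9+ for `math.ulp`.

```python
import math


def double_spacing(bits):
    """Gap between 2**bits and the next larger double (hypothetical helper)."""
    return math.ulp(float(2 ** bits))


# Near 2**44 adjacent doubles are 2**-8 apart, so every integer is
# representable with 8 fractional bits left over for intermediate results.
spacing_44 = double_spacing(44)

# Near 2**63 adjacent doubles are 2048 apart, so most int64 values
# cannot be represented at all.
spacing_63 = double_spacing(63)

# max(int64) itself rounds up to 2**63 when converted to a double.
max_int64_rounds_up = float(2 ** 63 - 1) == float(2 ** 63)

# Integers just below 2**44 survive a round trip through a double.
round_trip_ok = int(float(2 ** 44 - 1)) == 2 ** 44 - 1
```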

Thoughts?

Chuck