[Numpy-discussion] Deprecate zipf distribution?
Charles R Harris
charlesr.harris at gmail.com
Sat Oct 7 11:29:12 EDT 2017
Hi All,
The current NumPy implementation of the truncated zipf distribution has
several drawbacks.
- Extremely poor performance when the parameter `a` is near 1. For
instance, when `a = 1.000001` a simple change in the implementation speeds
things up by a factor of 1,657. When the parameter is closer to 1, the
algorithm effectively hangs.
- Because the distribution is truncated, say to integers in the range of
int64, the parameter could be allowed to take all values > 0, even though
the untruncated series diverges for `a <= 1`. There is some indication that
such values of `a` can be useful in modeling because of the heavy weight
they put in the tail.
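
To illustrate the second point, here is a minimal sketch of a truncated
distribution with probabilities proportional to `k**(-a)` on `1..n_max`. It
is not the NumPy implementation: the names `truncated_zipf_pmf` and
`sample_truncated_zipf` are made up for this example, and the table-based
inverse-CDF sampling is only practical for a small `n_max`. It does show that
the normalizing sum is finite for any `a > 0` once the support is truncated.

```python
import bisect
import itertools
import math
import random


def truncated_zipf_pmf(a, n_max):
    """Probabilities of k = 1..n_max, proportional to k**(-a)."""
    weights = [k ** (-a) for k in range(1, n_max + 1)]
    total = math.fsum(weights)  # finite for every a > 0 because n_max is finite
    return [w / total for w in weights]


def sample_truncated_zipf(a, n_max, size, seed=None):
    """Draw samples by inverting a tabulated CDF (small n_max only)."""
    rng = random.Random(seed)
    cdf = list(itertools.accumulate(truncated_zipf_pmf(a, n_max)))
    samples = []
    for _ in range(size):
        u = rng.random()
        idx = bisect.bisect_right(cdf, u)
        # Clip in case rounding leaves cdf[-1] slightly below 1.
        samples.append(min(idx, n_max - 1) + 1)
    return samples


# a = 0.5 would be rejected by an untruncated zipf, but is fine here.
draws = sample_truncated_zipf(0.5, 1000, 10, seed=0)
```

A real implementation would need a rejection or rejection-inversion scheme
rather than a table, since the proposed support is far too large to tabulate.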
Because fixing these problems will change the output stream, I suggest
implementing a truncated zeta distribution, which is an alternative name
for the same distribution, and deprecating the zipf distribution.
Furthermore, rather than truncate at the value of C long, which varies,
truncate at max(int64), or some possibly smaller value, say 2**44, which
allows all integers up to that value to be realized with approximately
correct probabilities when using double precision for the intermediate
computations.
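
As a rough illustration of why the truncation point matters with double
precision, the sketch below compares the spacing of adjacent doubles near
2**44 and near the top of the int64 range. The helper `double_spacing` is
made up for this example, and this is only one plausible reading of the
choice of 2**44, not a derivation of it.

```python
import math


def double_spacing(x):
    """Gap between x and the next larger double, for positive normal x."""
    _, exponent = math.frexp(x)  # x = m * 2**exponent with 0.5 <= m < 1
    return math.ldexp(1.0, exponent - 53)


# Near 2**44 doubles still resolve fractions of an integer (gap is 2**-8).
gap_small = double_spacing(2.0 ** 44)

# Near 2**62 adjacent doubles are 1024 apart, so most integers in that
# range can never be produced by a computation carried out in doubles.
gap_large = double_spacing(2.0 ** 62)

# Consequence: neighbouring integers collapse to the same double up there.
collapses_near_int64_max = float(2 ** 62 + 1) == float(2 ** 62)
distinct_near_2_44 = float(2 ** 44 + 1) != float(2 ** 44)
```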
Thoughts?
Chuck