A few suggestions for the random module
I'd like to suggest what I consider a few minor improvements to Python's random module. 1. random.expovariate(lambd) The current implementation [1] is random = self.random u = random() while u <= 1e-7: u = random() return -_log(u)/lambd I'd suggest to simplify this to return -log(1.0 - self.random())/lambd self.random() returns a float in the half-open interval [0.0, 1.0), so 1.0 - self.random() is in the half-open interval (0.0, 1.0], which is exactly what we need here. Even if the random number gets as close to 1.0 as possible within the limits of double precision, taking the logarithm won't be any problem (the lowest occuring value of log(...) being roughly -36.7). The suggested implementation is not only simpler and faster, it is also more correct. The current implementation will never return 0.0, although this is a perfectly valid return value, and it chooses the arbitrary cut-off value 1e7, while the suggested implementation is only limited by floating point precision and the range of random.random(). [1]: http://hg.python.org/cpython/file/54fb77e0762c/Lib/random.py#l392 2. random.sample(population, k) The current implementation [2] chooses one of two algorithms, depending on whether the number of samples k is relatively small compared to the size n of the population. I suggest to add reservoir sampling [3] as a third algorithm, used in either of the following cases: * k is "close" to n (for example k > .75 * n) * population is not a sequence (but of course an iterable) Reservoir sampling would only use O(1) additional memory, compared to O(n) for the current algorithm for relativly large values of k. It would also facilitate to select samples from iterables without storing a complete list of all items in memory at any time. It could also be used for smaller values of k, trading reduced memory usage for increased execution time. Example implementation: it = iter(population) result = list(_islice(it, k)) self.shuffle(result) for i, x in enumerate(it, k + 1): j = randbelow(i) if j < k: result[j] = x We need to call self.shuffle(result) here to ensure that all subslices are valid random samples again. An alternative to adding this algorithm to the current sample() function is to add a new function sample_iterable() or similar. (This new function probably shouldn't call shuffle(), since the caller can always do this if required.) [2]: http://hg.python.org/cpython/file/54fb77e0762c/Lib/random.py#l268 [3]: http://en.wikipedia.org/wiki/Reservoir_sampling 3. random.choice(seq) Similarly to the above explanation, random.choice() could be generalised to arbitrary iterables. Of course the current O(1) alogrithm should be used if a sequence is passed. In case of positive feedback on one or more of those points, I'd be happy to prepare a patch. -- Sven
participants (4)
-
Daniel Stutzbach
-
Mark Dickinson
-
Raymond Hettinger
-
Sven Marnach