A few suggestions for the random module
I'd like to suggest what I consider a few minor improvements to Python's random module.
1. random.expovariate(lambd)
The current implementation [1] is
random = self.random u = random() while u <= 1e7: u = random() return _log(u)/lambd
I'd suggest to simplify this to
return log(1.0  self.random())/lambd
self.random() returns a float in the halfopen interval [0.0, 1.0), so 1.0  self.random() is in the halfopen interval (0.0, 1.0], which is exactly what we need here. Even if the random number gets as close to 1.0 as possible within the limits of double precision, taking the logarithm won't be any problem (the lowest occuring value of log(...) being roughly 36.7).
The suggested implementation is not only simpler and faster, it is also more correct. The current implementation will never return 0.0, although this is a perfectly valid return value, and it chooses the arbitrary cutoff value 1e7, while the suggested implementation is only limited by floating point precision and the range of random.random().
[1]: http://hg.python.org/cpython/file/54fb77e0762c/Lib/random.py#l392
2. random.sample(population, k)
The current implementation [2] chooses one of two algorithms, depending on whether the number of samples k is relatively small compared to the size n of the population. I suggest to add reservoir sampling [3] as a third algorithm, used in either of the following cases:
* k is "close" to n (for example k > .75 * n)
* population is not a sequence (but of course an iterable)
Reservoir sampling would only use O(1) additional memory, compared to O(n) for the current algorithm for relativly large values of k. It would also facilitate to select samples from iterables without storing a complete list of all items in memory at any time. It could also be used for smaller values of k, trading reduced memory usage for increased execution time.
Example implementation:
it = iter(population) result = list(_islice(it, k)) self.shuffle(result) for i, x in enumerate(it, k + 1): j = randbelow(i) if j < k: result[j] = x
We need to call self.shuffle(result) here to ensure that all subslices are valid random samples again.
An alternative to adding this algorithm to the current sample() function is to add a new function sample_iterable() or similar. (This new function probably shouldn't call shuffle(), since the caller can always do this if required.)
[2]: http://hg.python.org/cpython/file/54fb77e0762c/Lib/random.py#l268 [3]: http://en.wikipedia.org/wiki/Reservoir_sampling
3. random.choice(seq)
Similarly to the above explanation, random.choice() could be generalised to arbitrary iterables. Of course the current O(1) alogrithm should be used if a sequence is passed.
In case of positive feedback on one or more of those points, I'd be happy to prepare a patch.
 Sven
On Thu, Jun 23, 2011 at 7:53 PM, Sven Marnach sven@marnach.net wrote:
I'd like to suggest what I consider a few minor improvements to Python's random module. [...]
Agreed on point 1 (implementation of random.expovariate).
For all three issues, I suggest filing a report in the issue tracker. (Probably one separate report per item, though perhaps items 2 and 3 could be combined.)
Mark
+1 on all three suggestions.
As Mark said, please file these ideas at http://bugs.python.org so we can keep track of them. I think the second and third suggestions can be combined as one issue. It would also be helpful if you could supply a patch for the changes (and for the second and third suggestions, unit tests to exercise the new functionality).
On Jun 23, 2011, at 7:53 PM, Sven Marnach wrote:
I'd like to suggest what I consider a few minor improvements to Python's random module.
You can create a feature request on the bug tracker and assign to me.
 random.expovariate(lambd)
This seems reasonable
 random.sample(population, k)
This may be a unnecessary optimization (not worth the complexity), but I will look at it further.
 random.choice(seq)
It could be generalized to arbitrary iterables (Bentley provides an example of how to do this) but it is fragile (i.e. falls apart badly with weak random number generators) and doesn't correspond well with real use cases.
Raymond
Raymond Hettinger schrieb am Fr, 24. Jun 2011, um 00:02:20 +0200:
1. random.expovariate(lambd)
This seems reasonable
When I started to prepare a patch, I noticed you already implemented this one. Thanks!
2. random.sample(population, k)
This may be a unnecessary optimization (not worth the complexity), but I will look at it further.
The main point was the generalization to arbitrary iterables, the reduced memory usage being just a positive side effect. I'm not really convinced of the idea any more since it doesn't really fit well in the current `sample()` implementation. If someone wants to save some memory in the case that k is not much less than n, she can just copy the population, `shuffle()` it and delete the unneeded part.
The only question that remains is if it is worthwhile to introduce a new function for this purpose. While there definitely are use cases for the algorithm, I simply don't know if they are common enough to justify a function in the standard library.
3. random.choice(seq)
It could be generalized to arbitrary iterables (Bentley provides an example of how to do this) but it is fragile (i.e. falls apart badly with weak random number generators) and doesn't correspond well with real use cases.
Again, there definitely are real world use cases. If the above mentioned function would be introduced, this one would simply be the special case k = 1.
 Sven
participants (4)

Daniel Stutzbach

Mark Dickinson

Raymond Hettinger

Sven Marnach