[Python-ideas] A few suggestions for the random module

Sven Marnach sven at marnach.net
Thu Jun 23 19:53:29 CEST 2011

I'd like to suggest what I consider a few minor improvements to
Python's random module.

1. random.expovariate(lambd)

   The current implementation [1] is

       random = self.random
       u = random()
       while u <= 1e-7:
           u = random()
       return -_log(u)/lambd

   I'd suggest simplifying this to

       return -log(1.0 - self.random())/lambd

   self.random() returns a float in the half-open interval [0.0, 1.0),
   so 1.0 - self.random() is in the half-open interval (0.0, 1.0],
   which is exactly what we need here.  Even if the random number gets
   as close to 1.0 as possible within the limits of double precision,
   taking the logarithm won't be any problem (the lowest occurring
   value of log(...) being roughly -36.7).
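   That bound is easy to verify: the largest value random() can
   return is 1 - 2**-53, so the smallest argument ever passed to
   log() is 2**-53.  A quick check:

```python
from math import log

# Smallest possible value of 1.0 - random(), given that random()
# returns doubles in [0.0, 1.0) with 53 bits of precision.
smallest = 2.0 ** -53
print(log(smallest))  # roughly -36.7368
```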

   The suggested implementation is not only simpler and faster, it is
   also more correct.  The current implementation will never return
   0.0, although this is a perfectly valid return value, and it
   chooses the arbitrary cut-off value 1e-7, while the suggested
   implementation is only limited by floating point precision and the
   range of random.random().

   [1]: http://hg.python.org/cpython/file/54fb77e0762c/Lib/random.py#l392
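
   As a standalone sketch of the suggested implementation (using a
   Random instance directly rather than the bound self.random inside
   the class, and checking the mean empirically):

```python
from math import log
from random import Random

def expovariate(rng, lambd):
    # 1.0 - rng.random() lies in the half-open interval (0.0, 1.0],
    # so log() is always defined, and 0.0 is a possible return value
    # (reached when rng.random() returns exactly 0.0).
    return -log(1.0 - rng.random()) / lambd

rng = Random(42)
samples = [expovariate(rng, 2.0) for _ in range(100000)]
print(sum(samples) / len(samples))  # should be close to 1/lambd = 0.5
```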

2. random.sample(population, k)

   The current implementation [2] chooses one of two algorithms,
   depending on whether the number of samples k is relatively small
   compared to the size n of the population.  I suggest adding
   reservoir sampling [3] as a third algorithm, used in either of the
   following cases:

    * k is "close" to n (for example k > .75 * n)

    * population is not a sequence (but of course an iterable)

   Reservoir sampling would only use O(1) additional memory (beyond
   the k-item result), compared to O(n) for the current algorithm for
   relatively large values of k.  It would also make it possible to
   select samples from iterables without storing a complete list of
   all items in memory at any time.  It could also be used for
   smaller values of k, trading reduced memory usage for increased
   execution time.

   Example implementation:

       it = iter(population)
       # Fill the reservoir with the first k items.
       result = list(_islice(it, k))
       # Each later item (at 1-based position i) replaces a random
       # reservoir slot with probability k/i.
       for i, x in enumerate(it, k + 1):
           j = randbelow(i)
           if j < k:
               result[j] = x

   We need to call self.shuffle(result) here to ensure that all
   subslices are valid random samples again.

   An alternative to adding this algorithm to the current sample()
   function is to add a new function sample_iterable() or similar.
   (This new function probably shouldn't call shuffle(), since the
   caller can always do this if required.)

   [2]: http://hg.python.org/cpython/file/54fb77e0762c/Lib/random.py#l268
   [3]: http://en.wikipedia.org/wiki/Reservoir_sampling
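
   A self-contained version of the proposed algorithm, with
   random.randrange and random.shuffle standing in for the
   module-internal randbelow and self.shuffle (the function name
   sample_iterable is just for illustration):

```python
from itertools import islice
from random import randrange, shuffle

def sample_iterable(iterable, k):
    """Return k items sampled without replacement, using reservoir
    sampling: one pass over the input, no extra memory beyond the
    k-item result."""
    it = iter(iterable)
    result = list(islice(it, k))
    if len(result) < k:
        raise ValueError("sample larger than population")
    for i, x in enumerate(it, k + 1):
        # Item number i (1-based) survives with probability k/i.
        j = randrange(i)
        if j < k:
            result[j] = x
    # Shuffle so that every subslice is itself a valid random sample.
    shuffle(result)
    return result

print(sample_iterable(range(1000), 5))  # five distinct numbers below 1000
```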

3. random.choice(seq)

   Similarly to the above explanation, random.choice() could be
   generalised to arbitrary iterables.  Of course the current O(1)
   algorithm should be used if a sequence is passed.
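
   Such a generalisation is just reservoir sampling with k = 1; a
   minimal sketch (again with randrange playing the role of the
   internal randbelow, and choice_iterable a name made up for this
   example):

```python
from random import randrange

def choice_iterable(iterable):
    result = None
    i = 0
    for i, x in enumerate(iterable, 1):
        # The i-th item replaces the current choice with probability
        # 1/i, so after the loop each of the n items was selected
        # with equal probability 1/n.
        if randrange(i) == 0:
            result = x
    if i == 0:
        raise IndexError("choice from an empty iterable")
    return result

print(choice_iterable(x for x in range(10) if x % 2))  # one of 1, 3, 5, 7, 9
```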

In case of positive feedback on one or more of those points, I'd be
happy to prepare a patch.

 -- Sven
