[Python-ideas] A few suggestions for the random module

Sven Marnach sven at marnach.net
Thu Jun 23 19:53:29 CEST 2011

I'd like to suggest what I consider a few minor improvements to
Python's random module.

1. random.expovariate(lambd)

   The current implementation [1] is

       random = self.random
       u = random()
       while u <= 1e-7:
           u = random()
       return -_log(u)/lambd

   I'd suggest simplifying this to

       return -log(1.0 - self.random())/lambd

   self.random() returns a float in the half-open interval [0.0, 1.0),
   so 1.0 - self.random() is in the half-open interval (0.0, 1.0],
   which is exactly what we need here.  Even if the random number gets
   as close to 1.0 as possible within the limits of double precision,
   taking the logarithm won't be any problem (the lowest occurring
   value of log(...) being roughly -36.7).
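   That bound is easy to verify: the largest value random() can
   return is 1 - 2**-53, so the smallest argument ever passed to
   log() is 2**-53.  A quick check:

```python
from math import log

# Smallest possible value of 1.0 - random(), given that random()
# returns doubles in [0.0, 1.0) with 53 bits of precision.
smallest = 2.0 ** -53
print(log(smallest))  # roughly -36.7368
```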

   The suggested implementation is not only simpler and faster, it is
   also more correct.  The current implementation will never return
   0.0, although this is a perfectly valid return value, and it
   chooses the arbitrary cut-off value 1e-7, while the suggested
   implementation is only limited by floating point precision and the
   range of random.random().

   [1]: http://hg.python.org/cpython/file/54fb77e0762c/Lib/random.py#l392
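
   As a standalone sketch of the suggested implementation (using a
   Random instance directly rather than the bound self.random inside
   the class, and checking the mean empirically):

```python
from math import log
from random import Random

def expovariate(rng, lambd):
    # 1.0 - rng.random() lies in the half-open interval (0.0, 1.0],
    # so log() is always defined, and 0.0 is a possible return value
    # (reached when rng.random() returns exactly 0.0).
    return -log(1.0 - rng.random()) / lambd

rng = Random(42)
samples = [expovariate(rng, 2.0) for _ in range(100000)]
print(sum(samples) / len(samples))  # should be close to 1/lambd = 0.5
```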

2. random.sample(population, k)

   The current implementation [2] chooses one of two algorithms,
   depending on whether the number of samples k is relatively small
   compared to the size n of the population.  I suggest adding
   reservoir sampling [3] as a third algorithm, used in either of the
   following cases:

    * k is "close" to n (for example k > .75 * n)

    * population is not a sequence (but of course an iterable)

   Reservoir sampling would only use O(1) additional memory (beyond
   the k-item result), compared to O(n) for the current algorithm for
   relatively large values of k.  It would also make it possible to
   select samples from iterables without storing a complete list of
   all items in memory at any time.  It could also be used for
   smaller values of k, trading reduced memory usage for increased
   execution time.

   Example implementation:

       it = iter(population)
       # Fill the reservoir with the first k items.
       result = list(_islice(it, k))
       # Each later item (at 1-based position i) replaces a random
       # reservoir slot with probability k/i.
       for i, x in enumerate(it, k + 1):
           j = randbelow(i)
           if j < k:
               result[j] = x

   We need to call self.shuffle(result) here to ensure that all
   subslices are valid random samples again.

   An alternative to adding this algorithm to the current sample()
   function is to add a new function sample_iterable() or similar.
   (This new function probably shouldn't call shuffle(), since the
   caller can always do this if required.)

   [2]: http://hg.python.org/cpython/file/54fb77e0762c/Lib/random.py#l268
   [3]: http://en.wikipedia.org/wiki/Reservoir_sampling
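
   A self-contained version of the proposed algorithm, with
   random.randrange and random.shuffle standing in for the
   module-internal randbelow and self.shuffle (the function name
   sample_iterable is just for illustration):

```python
from itertools import islice
from random import randrange, shuffle

def sample_iterable(iterable, k):
    """Return k items sampled without replacement, using reservoir
    sampling: one pass over the input, no extra memory beyond the
    k-item result."""
    it = iter(iterable)
    result = list(islice(it, k))
    if len(result) < k:
        raise ValueError("sample larger than population")
    for i, x in enumerate(it, k + 1):
        # Item number i (1-based) survives with probability k/i.
        j = randrange(i)
        if j < k:
            result[j] = x
    # Shuffle so that every subslice is itself a valid random sample.
    shuffle(result)
    return result

print(sample_iterable(range(1000), 5))  # five distinct numbers below 1000
```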

3. random.choice(seq)

   Similarly to the above explanation, random.choice() could be
   generalised to arbitrary iterables.  Of course the current O(1)
   algorithm should be used if a sequence is passed.
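
   Such a generalisation is just reservoir sampling with k = 1; a
   minimal sketch (again with randrange playing the role of the
   internal randbelow, and choice_iterable a name made up for this
   example):

```python
from random import randrange

def choice_iterable(iterable):
    result = None
    i = 0
    for i, x in enumerate(iterable, 1):
        # The i-th item replaces the current choice with probability
        # 1/i, so after the loop each of the n items was selected
        # with equal probability 1/n.
        if randrange(i) == 0:
            result = x
    if i == 0:
        raise IndexError("choice from an empty iterable")
    return result

print(choice_iterable(x for x in range(10) if x % 2))  # one of 1, 3, 5, 7, 9
```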

In case of positive feedback on one or more of those points, I'd be
happy to prepare a patch.

 -- Sven
