[Python-ideas] Re: Add random.choice_iter for randomly selecting from an arbitrary iterable

July 13, 2020

      ...
On Mon, Jul 13, 2020, 7:53 AM Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
...
I posted this in the thread about indexing dict items but it seems to
have been buried there so I'm starting a new thread.
Maybe there could be a function in the random module for making a
random choice from an arbitrary (finite) iterable. This implementation
can choose a random element from an iterable by fully iterating over
it so is O(N) in terms of CPU operations but O(1) for memory usage:
On Mon, 13 Jul 2020 at 13:37, David Mertz <mertz@gnosis.cx> wrote:
This is an inefficient reservoir sampling. The optimized version does not need to call a random inclusion switch on every element, but can skip a geometrically ordered collection of (random) skip lengths to avoid most random inclusion decisions.
Obviously, all items must be iterated over no matter what, but if randrange() is significant compared to the cost of next(), the skip-then-decide version is much preferred, especially as size grows.
Yes, I was deliberately keeping the implementation simple because I
find the naive version understandable just by looking at it. I knew
there would be faster version but thanks for the pointer which quickly
lead me here:
https://en.wikipedia.org/wiki/Reservoir_sampling#An_optimal_algorithm

In Python that's:

from math import exp, log, floor
from random import random, randrange
from itertools import islice

def sample_iter(iterable, k=1):
    """Select k items uniformly from iterable.

    Returns the whole population if there are k or fewer items
    """
    iterator = iter(iterable)
    values = list(islice(iterator, k))

    W = exp(log(random())/k)
    while True:
        # skip is geometrically distributed
        skip = floor( log(random())/log(1-W) )
        selection = list(islice(iterator, skip, skip+1))
        if selection:
            values[randrange(k)] = selection[0]
            W *= exp(log(random())/k)
        else:
            return values

I haven't quite convinced myself that using floating point like this
for W is okay but this seems to work (maybe it would fail for really
large k or something...). If there was a proper implementation of a
geometric variate in the random module then that could be used
instead.

By my timings this can choose from a large dict in less time than a
conversion to list although the time taken depends on k:

In [2]: n = 6

In [3]: d = dict(zip(range(10**n), range(10**n)))

In [4]: %timeit sample_iter(d)
15.5 ms ± 326 µs per loop (mean ± std. dev. of 7 runs, 100 loops each

In [5]: %timeit list(d)
26.1 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit sample_iter(d, 2)
15.8 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit sample_iter(d, 20)
17.6 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit sample_iter(d, 100)
19.9 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

--
Oscar