On Mon, Jul 13, 2020, 7:53 AM Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
I posted this in the thread about indexing dict items but it seems to have been buried there so I'm starting a new thread.
Maybe there could be a function in the random module for making a random choice from an arbitrary (finite) iterable. This implementation can choose a random element from an iterable by fully iterating over it so is O(N) in terms of CPU operations but O(1) for memory usage:
On Mon, 13 Jul 2020 at 13:37, David Mertz <mertz@gnosis.cx> wrote:
This is an inefficient reservoir sampling. The optimized version does not need to call a random inclusion switch on every element, but can skip a geometrically ordered collection of (random) skip lengths to avoid most random inclusion decisions.
Obviously, all items must be iterated over no matter what, but if randrange() is significant compared to the cost of next(), the skip-then-decide version is much preferred, especially as size grows.
Yes, I was deliberately keeping the implementation simple because I find the naive version understandable just by looking at it. I knew there would be faster version but thanks for the pointer which quickly lead me here: https://en.wikipedia.org/wiki/Reservoir_sampling#An_optimal_algorithm In Python that's: from math import exp, log, floor from random import random, randrange from itertools import islice def sample_iter(iterable, k=1): """Select k items uniformly from iterable. Returns the whole population if there are k or fewer items """ iterator = iter(iterable) values = list(islice(iterator, k)) W = exp(log(random())/k) while True: # skip is geometrically distributed skip = floor( log(random())/log(1-W) ) selection = list(islice(iterator, skip, skip+1)) if selection: values[randrange(k)] = selection[0] W *= exp(log(random())/k) else: return values I haven't quite convinced myself that using floating point like this for W is okay but this seems to work (maybe it would fail for really large k or something...). If there was a proper implementation of a geometric variate in the random module then that could be used instead. By my timings this can choose from a large dict in less time than a conversion to list although the time taken depends on k: In [2]: n = 6 In [3]: d = dict(zip(range(10**n), range(10**n))) In [4]: %timeit sample_iter(d) 15.5 ms ± 326 µs per loop (mean ± std. dev. of 7 runs, 100 loops each In [5]: %timeit list(d) 26.1 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) In [6]: %timeit sample_iter(d, 2) 15.8 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [7]: %timeit sample_iter(d, 20) 17.6 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) In [8]: %timeit sample_iter(d, 100) 19.9 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) -- Oscar