This is an inefficient form of reservoir sampling. The optimized version does
not need to make a random inclusion decision for every element: instead it
draws a sequence of random skip lengths (geometrically distributed gaps) and
jumps straight over the skipped elements, avoiding most of the random
inclusion decisions.
Obviously, all items must still be iterated over no matter what, but if
randrange() is significant compared to the cost of next(), the
skip-then-decide version is much preferred, especially as the input size grows.
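A sketch of that skip-then-decide approach (my own sketch, following the shape of Li's "Algorithm L"; the function name is made up):

```python
import math
from random import random, randrange

def sample_iter_skip(population, k):
    # Reservoir sampling with random skip lengths.  Fill the
    # reservoir with the first k items, then repeatedly jump over
    # a geometrically growing random number of elements instead of
    # making an inclusion decision for each one.
    iterator = iter(population)
    values = []
    for _ in range(k):
        try:
            values.append(next(iterator))
        except StopIteration:
            raise ValueError("Too few items")
    w = math.exp(math.log(random()) / k)
    while True:
        # Number of elements to skip before the next replacement.
        skip = math.floor(math.log(random()) / math.log(1.0 - w))
        try:
            for _ in range(skip):
                next(iterator)
            values[randrange(k)] = next(iterator)
        except StopIteration:
            return values
        w *= math.exp(math.log(random()) / k)
```

Only O(k * (1 + log(n/k))) random draws are needed on average, rather than one per element, while still touching every item exactly once via next().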
On Mon, Jul 13, 2020, 7:53 AM Oscar Benjamin wrote:
I posted this in the thread about indexing dict items, but it seems to have been buried there, so I'm starting a new thread.
Maybe there could be a function in the random module for making a random choice from an arbitrary (finite) iterable. This implementation chooses a random element by fully iterating over the iterable, so it is O(N) in CPU operations but O(1) in memory usage:
# rnd.py
from random import randrange

def choice_iter(iterable):
    iterator = iter(iterable)
    try:
        value = next(iterator)
    except StopIteration:
        raise ValueError("Empty iterable")
    for n, candidate in enumerate(iterator, 2):
        if not randrange(n):
            value = candidate
    return value
You could use that to get a random choice from a dict, set, etc.:
>>> choice_iter({'q', 'w', 'e'})
'e'
>>> choice_iter({'q', 'w', 'e'})
'q'
>>> choice_iter({'q', 'w', 'e'})
'e'
>>> choice_iter({'q', 'w', 'e'})
'w'
You can make a random choice from the keys, values or items of a dict. For a large dict/set it will be slow compared to converting to a list, because it calls randrange once per item. It could probably be made a lot faster by doing something like calling getrandbits once and extracting the bits for multiple iterations of the loop.
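For illustration, one cheaper variant (my own sketch, simpler than the getrandbits batching mentioned above) replaces each randrange(n) call with a single float comparison, since random() * n < 1.0 holds with probability 1/n up to float rounding:

```python
from random import random

def choice_iter_fast(iterable):
    # Same algorithm as choice_iter, but the per-item inclusion
    # test is a float comparison instead of a randrange() call,
    # avoiding the integer-range machinery on every element.
    iterator = iter(iterable)
    try:
        value = next(iterator)
    except StopIteration:
        raise ValueError("Empty iterable")
    for n, candidate in enumerate(iterator, 2):
        if random() * n < 1.0:
            value = candidate
    return value
```

This trades a tiny amount of uniformity (float rounding) for speed; the getrandbits approach would avoid even the per-item random() call for most elements.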
Although choice_iter is slower than choice, it works for arbitrary iterables and has constant memory requirements when used with a lazy iterator. There are situations in which you would not want to build a list and use random.choice, and where the computational overhead is insignificant. For example, choice_iter can be used to choose a random line from an arbitrarily large text file without loading it entirely into memory:
>>> from rnd import choice_iter
>>> with open('rnd.py') as fin:
...     line = choice_iter(fin)
...
>>> line
'    except StopIteration:\n'
When reading from a large text file this is roughly optimal in performance terms because it does precisely the minimum amount of IO (one pass) while having constant memory overhead. Of course, if you wanted to select multiple lines from the file, calling choice_iter repeatedly is immediately suboptimal because it would read the file more than once. It's not hard, though, to generalise the function above into something that makes a selection of k values from an iterable in a single pass. Here is a version that samples without replacement:
from random import randrange

def sample_iter(population, k):
    iterator = iter(population)
    values = []
    for _ in range(k):
        try:
            value = next(iterator)
        except StopIteration:
            raise ValueError("Too few items")
        values.append(value)
    for n, candidate in enumerate(iterator, k+1):
        random_index = randrange(n)
        if random_index < k:
            values[random_index] = candidate
    return values  # maybe shuffle first?
The memory cost of this is O(k) regardless of the size of the input iterable.
--
Oscar
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-leave@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/4OZTRD...
Code of Conduct: http://python.org/psf/codeofconduct/