This is an inefficient form of reservoir sampling. An optimized version does not need to make a random inclusion decision for every element: it can instead draw random, geometrically distributed skip lengths and jump ahead, avoiding most of the per-element random decisions.
Obviously all items must still be iterated over no matter what, but if the cost of randrange() is significant compared to the cost of next(), the skip-then-decide version is much preferred, especially as the input grows.
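As a sketch of that skip-then-decide idea, here is a single-item version built on geometrically distributed random skips (essentially Li's "Algorithm L" specialised to a sample of size 1; the function name and details here are mine, and the measure-zero cases where random() returns exactly 0.0 are ignored):

```python
import math
from itertools import islice
from random import random

_SENTINEL = object()

def choice_iter_skip(iterable):
    """Choose one item uniformly at random from an iterable, O(1) memory.

    Instead of one randrange() call per item, draw a random skip length
    and jump ahead to the next item that replaces the current choice.
    """
    iterator = iter(iterable)
    try:
        value = next(iterator)
    except StopIteration:
        raise ValueError("Empty iterable")
    # w is distributed like the maximum of n uniform variates seen so far
    w = random()
    while True:
        # geometrically distributed number of items to skip before the
        # next item that could replace the current value
        skip = math.floor(math.log(random()) / math.log(1.0 - w))
        candidate = next(islice(iterator, skip, skip + 1), _SENTINEL)
        if candidate is _SENTINEL:
            return value  # iterator exhausted during the skip
        value = candidate
        w *= random()
```

The number of random draws is O(log N) in expectation rather than O(N), while the iterator itself is still consumed in full.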
I posted this in the thread about indexing dict items but it seems to
have been buried there so I'm starting a new thread.
Maybe there could be a function in the random module for making a
random choice from an arbitrary (finite) iterable. This implementation
can choose a random element from an iterable by fully iterating over
it, so it is O(N) in CPU operations but O(1) in memory usage:
# rnd.py
from random import randrange

def choice_iter(iterable):
    iterator = iter(iterable)
    try:
        value = next(iterator)
    except StopIteration:
        raise ValueError("Empty iterable")
    # the nth item replaces the current value with probability 1/n
    for n, candidate in enumerate(iterator, 2):
        if not randrange(n):
            value = candidate
    return value
You could use that to get a choice from a dict, set etc:
>>> choice_iter({'q', 'w', 'e'})
'e'
>>> choice_iter({'q', 'w', 'e'})
'q'
>>> choice_iter({'q', 'w', 'e'})
'e'
>>> choice_iter({'q', 'w', 'e'})
'w'
You can make a random choice from the keys, values or items of a dict.
For a large dict/set it will be slow compared to converting to a list
because it's calling randrange once per item. It can probably be made
a lot faster by doing something like calling getrandbits and extracting
the bits for multiple iterations of the loop.
Although choice_iter is slower than choice, it works for arbitrary
iterables and has constant memory requirements when used with a lazy
iterator. There are situations in which you would not want to build a
list and use random.choice and where the computational overhead is
insignificant. For example choice_iter can be used to randomly choose
a line from an arbitrarily large text file without loading it entirely
into memory:
>>> from rnd import choice_iter
>>> with open('rnd.py') as fin:
... line = choice_iter(fin)
...
>>> line
' except StopIteration:\n'
When reading from a large text file this is roughly optimal in
performance terms because it does precisely the minimum amount of IO
(one pass) while having constant memory overhead. Of course, if you
wanted to select multiple lines from the file, then calling choice_iter
repeatedly would be suboptimal because it would read the file more than
once. It's not hard, though, to generalise the function above to
something that e.g. makes a selection of k values from an iterable in a
single pass. Here is a version that samples without replacement:
def sample_iter(population, k):
    iterator = iter(population)
    values = []
    # fill the reservoir with the first k items
    for _ in range(k):
        try:
            value = next(iterator)
        except StopIteration:
            raise ValueError("Too few items")
        values.append(value)
    # the nth item replaces a random reservoir slot with probability k/n
    for n, candidate in enumerate(iterator, k + 1):
        random_index = randrange(n)
        if random_index < k:
            values[random_index] = candidate
    return values  # maybe shuffle first?
The memory cost of this is O(k) regardless of the size of the input iterable.
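To make that concrete, here is the function (repeated verbatim so the snippet runs on its own) applied to a one-shot generator; the demo code around it is mine:

```python
from random import randrange

def sample_iter(population, k):
    """Single-pass reservoir sample of k items, without replacement."""
    iterator = iter(population)
    values = []
    for _ in range(k):
        try:
            values.append(next(iterator))
        except StopIteration:
            raise ValueError("Too few items")
    for n, candidate in enumerate(iterator, k + 1):
        random_index = randrange(n)
        if random_index < k:
            values[random_index] = candidate
    return values

# a generator that can only be consumed once; sample_iter needs one pass
squares = (i * i for i in range(1000))
picks = sample_iter(squares, 5)
print(len(picks))  # 5 distinct squares, chosen uniformly
```

Note that the first k items start in order, so the result should be shuffled if a uniformly random *ordering* of the sample matters as well.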
--
Oscar
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/4OZTRD7FLXXZ6R6RU4BME6DYR3AXHOBD/