This is an inefficient form of reservoir sampling. The optimized version does
not need to make a random inclusion decision for every element: instead it
draws a sequence of random skip lengths (geometrically distributed gaps) and
jumps straight over the skipped elements, avoiding most of the random
inclusion decisions.
Obviously, all items must still be iterated over no matter what, but if
randrange() is significant compared to the cost of next(), the
skip-then-decide version is much preferred, especially as the input size grows.
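A sketch of that skip-then-decide approach (my own sketch, following the shape of Li's "Algorithm L"; the function name is made up):

```python
import math
from random import random, randrange

def sample_iter_skip(population, k):
    # Reservoir sampling with random skip lengths.  Fill the
    # reservoir with the first k items, then repeatedly jump over
    # a geometrically growing random number of elements instead of
    # making an inclusion decision for each one.
    iterator = iter(population)
    values = []
    for _ in range(k):
        try:
            values.append(next(iterator))
        except StopIteration:
            raise ValueError("Too few items")
    w = math.exp(math.log(random()) / k)
    while True:
        # Number of elements to skip before the next replacement.
        skip = math.floor(math.log(random()) / math.log(1.0 - w))
        try:
            for _ in range(skip):
                next(iterator)
            values[randrange(k)] = next(iterator)
        except StopIteration:
            return values
        w *= math.exp(math.log(random()) / k)
```

Only O(k * (1 + log(n/k))) random draws are needed on average, rather than one per element, while still touching every item exactly once via next().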
On Mon, Jul 13, 2020, 7:53 AM Oscar Benjamin wrote:
I posted this in the thread about indexing dict items, but it seems to have been buried there, so I'm starting a new thread.
Maybe there could be a function in the random module for making a random choice from an arbitrary (finite) iterable. This implementation chooses a random element by fully iterating over the iterable, so it is O(N) in CPU operations but O(1) in memory usage:
# rnd.py
from random import randrange

def choice_iter(iterable):
    iterator = iter(iterable)
    try:
        value = next(iterator)
    except StopIteration:
        raise ValueError("Empty iterable")
    for n, candidate in enumerate(iterator, 2):
        if not randrange(n):
            value = candidate
    return value
You could use that to get a random choice from a dict, set, etc.:
>>> choice_iter({'q', 'w', 'e'})
'e'
>>> choice_iter({'q', 'w', 'e'})
'q'
>>> choice_iter({'q', 'w', 'e'})
'e'
>>> choice_iter({'q', 'w', 'e'})
'w'
You can make a random choice from the keys, values or items of a dict. For a large dict/set it will be slow compared to converting to a list, because it calls randrange once per item. It could probably be made a lot faster by doing something like calling getrandbits once and extracting the bits for multiple iterations of the loop.
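For illustration, one cheaper variant (my own sketch, simpler than the getrandbits batching mentioned above) replaces each randrange(n) call with a single float comparison, since random() * n < 1.0 holds with probability 1/n up to float rounding:

```python
from random import random

def choice_iter_fast(iterable):
    # Same algorithm as choice_iter, but the per-item inclusion
    # test is a float comparison instead of a randrange() call,
    # avoiding the integer-range machinery on every element.
    iterator = iter(iterable)
    try:
        value = next(iterator)
    except StopIteration:
        raise ValueError("Empty iterable")
    for n, candidate in enumerate(iterator, 2):
        if random() * n < 1.0:
            value = candidate
    return value
```

This trades a tiny amount of uniformity (float rounding) for speed; the getrandbits approach would avoid even the per-item random() call for most elements.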
Although choice_iter is slower than choice, it works for arbitrary iterables and has constant memory requirements when used with a lazy iterator. There are situations in which you would not want to build a list and use random.choice, and where the computational overhead is insignificant. For example, choice_iter can be used to choose a random line from an arbitrarily large text file without loading it entirely into memory:
>>> from rnd import choice_iter
>>> with open('rnd.py') as fin:
...     line = choice_iter(fin)
...
>>> line
'    except StopIteration:\n'
When reading from a large text file this is roughly optimal in performance terms because it does precisely the minimum amount of IO (one pass) while having constant memory overhead. Of course, if you wanted to select multiple lines from the file, calling choice_iter repeatedly is immediately suboptimal because it would read the file more than once. It's not hard, though, to generalise the function above into something that makes a selection of k values from an iterable in a single pass. Here is a version that samples without replacement:
from random import randrange

def sample_iter(population, k):
    iterator = iter(population)
    values = []
    for _ in range(k):
        try:
            value = next(iterator)
        except StopIteration:
            raise ValueError("Too few items")
        values.append(value)
    for n, candidate in enumerate(iterator, k+1):
        random_index = randrange(n)
        if random_index < k:
            values[random_index] = candidate
    return values  # maybe shuffle first?
The memory cost of this is O(k) regardless of the size of the input iterable.
--
Oscar
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-leave@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/4OZTRD...
Code of Conduct: http://python.org/psf/codeofconduct/