Add random.choice_iter for randomly selecting from an arbitrary iterable
I posted this in the thread about indexing dict items but it seems to have been buried there, so I'm starting a new thread.

Maybe there could be a function in the random module for making a random choice from an arbitrary (finite) iterable. This implementation can choose a random element from an iterable by fully iterating over it, so it is O(N) in CPU operations but O(1) in memory usage:

    # rnd.py
    from random import randrange

    def choice_iter(iterable):
        iterator = iter(iterable)
        try:
            value = next(iterator)
        except StopIteration:
            raise ValueError("Empty iterable")
        for n, candidate in enumerate(iterator, 2):
            if not randrange(n):
                value = candidate
        return value

You could use that to get a choice from a dict, set etc.:
    >>> choice_iter({'q', 'w', 'e'})
    'e'
    >>> choice_iter({'q', 'w', 'e'})
    'q'
    >>> choice_iter({'q', 'w', 'e'})
    'e'
    >>> choice_iter({'q', 'w', 'e'})
    'w'
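As an illustration (my sketch, not part of the original post), the same function works directly on a dict and its views; choice_iter is restated here so the snippet runs on its own:

```python
from random import randrange

# Restating choice_iter from the post above so this example is self-contained.
def choice_iter(iterable):
    iterator = iter(iterable)
    try:
        value = next(iterator)
    except StopIteration:
        raise ValueError("Empty iterable")
    for n, candidate in enumerate(iterator, 2):
        if not randrange(n):
            value = candidate
    return value

d = {'q': 1, 'w': 2, 'e': 3}
key = choice_iter(d)                 # iterating a dict yields its keys
value = choice_iter(d.values())      # a random value
key2, value2 = choice_iter(d.items())  # a random (key, value) pair
```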
You can make a random choice from the keys, values or items of a dict. For a large dict/set it will be slow compared to converting to a list because it calls randrange once per item. It could probably be made a lot faster by something like calling randbits and extracting the bits for multiple iterations of the loop.

Although choice_iter is slower than choice, it works for arbitrary iterables and has constant memory requirements when used with a lazy iterator. There are situations in which you would not want to build a list and use random.choice, and where the computational overhead is insignificant. For example, choice_iter can be used to randomly choose a line from an arbitrarily large text file without loading it entirely into memory:
    >>> from rnd import choice_iter
    >>> with open('rnd.py') as fin:
    ...     line = choice_iter(fin)
    ...
    >>> line
    '    except StopIteration:\n'
When reading from a large text file this is roughly optimal in performance terms because it does precisely the minimum amount of IO (one pass) while having constant memory overhead. Of course, if you wanted to select multiple lines from the file then calling choice_iter repeatedly is immediately suboptimal because it would read the file more than once. It's not hard, though, to generalise the function above to something that e.g. makes a selection of k values from an iterable in a single pass. Here is a version that works without replacement:

    def sample_iter(population, k):
        iterator = iter(population)
        values = []
        for _ in range(k):
            try:
                value = next(iterator)
            except StopIteration:
                raise ValueError("Too few items")
            values.append(value)
        for n, candidate in enumerate(iterator, k+1):
            random_index = randrange(n)
            if random_index < k:
                values[random_index] = candidate
        return values  # maybe shuffle first?

The memory cost of this is O(k) regardless of the size of the input iterable.

--
Oscar
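[As a quick illustration of the O(k) memory claim (my example, not part of the original message), sample_iter can draw from a lazily generated population without ever materialising it as a list; sample_iter is restated so the snippet runs on its own:]

```python
from random import randrange

# sample_iter as in the message above, restated so this example is standalone.
def sample_iter(population, k):
    iterator = iter(population)
    values = []
    for _ in range(k):
        try:
            value = next(iterator)
        except StopIteration:
            raise ValueError("Too few items")
        values.append(value)
    for n, candidate in enumerate(iterator, k + 1):
        random_index = randrange(n)
        if random_index < k:
            values[random_index] = candidate
    return values  # maybe shuffle first?

# A lazy population: the squares below 100_000**2, never stored as a list.
squares = (i * i for i in range(100_000))
sample = sample_iter(squares, 5)  # holds only 5 values plus loop state
```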
This is an inefficient reservoir sampling. The optimized version does not need to make a random inclusion decision on every element: it can instead draw geometrically distributed (random) skip lengths, avoiding most of the per-item random decisions. Obviously all items must be iterated over no matter what, but if randrange() is significant compared to the cost of next(), the skip-then-decide version is much preferred, especially as size grows.

On Mon, Jul 13, 2020, 7:53 AM Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
[...]
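[For the k=1 case, the skip-then-decide scheme David describes looks roughly like this: a sketch of "Algorithm L" specialised to one item. The name skip_choice is mine, and the sketch ignores the astronomically unlikely edge case where random() returns exactly 0.0:]

```python
from math import floor, log
from random import random
from itertools import islice

def skip_choice(iterable):
    """Uniform random choice from an iterable via geometric skips (k=1)."""
    iterator = iter(iterable)
    try:
        value = next(iterator)
    except StopIteration:
        raise ValueError("Empty iterable")
    w = random()  # for k=1, Algorithm L's running weight W is just uniform
    while True:
        # Number of items to skip outright is geometrically distributed;
        # log(1 - random()) avoids log(0) since 1 - random() is in (0, 1].
        skip = floor(log(1.0 - random()) / log(1.0 - w))
        selection = list(islice(iterator, skip, skip + 1))
        if not selection:
            return value  # iterator exhausted during the skip
        value = selection[0]
        w *= random()
```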
If we get this function (which I would like), the version selecting k items (default 1) is much better. Some iterators cannot be repeated at all, so for k>1 calling it multiple times is not just slower, it's impossible.

On Mon, Jul 13, 2020, 8:37 AM David Mertz <mertz@gnosis.cx> wrote:
[...]
I think you all should get together and come up with a good implementation, and then petition Raymond Hettinger. Or maybe there is an existing open source 3rd party project that has code you can copy? I don’t recall if random has a C accelerator, but if it does, you should come up with C code as well.

—Guido

On Mon, Jul 13, 2020 at 05:40 David Mertz <mertz@gnosis.cx> wrote:
[...]
--
--Guido (mobile)
On Mon, Jul 13, 2020 at 7:44 AM Guido van Rossum <guido@python.org> wrote:
I think you all should get together and come up with a good implementation,
+1
and then petition Raymond Hettinger.
Is this Raymond's turf? I would think it belongs more in the random module than itertools, or is that Raymond's responsibility as well? In any case, a good implementation would be a lot easier to evaluate for inclusion.

Or maybe there is an existing open source 3rd party project that has code you can copy? I don’t recall if random has a C accelerator, but if it does, you should come up with C code as well.
Or Cython, at least to prototype it.

-CHB
On Mon, 13 Jul 2020 at 19:39, Christopher Barker <pythonchb@gmail.com> wrote:
On Mon, Jul 13, 2020 at 7:44 AM Guido van Rossum <guido@python.org> wrote:
I think you all should get together and come up with a good implementation,
In any case, a good implementation would be a lot easier to evaluate for inclusion.
Or maybe there is an existing open source 3rd party project that has code you can copy? I don’t recall if random has a C accelerator, but if it does, you should come up with C code as well.
I've already shown a second implementation that, after following David's suggestion, is plenty fast enough and pretty much optimal in a performance sense. It could possibly be improved by a more reliable routine for geometric random variates, which would also be a useful inclusion for the random module in its own right. I doubt it can be made any faster in the large-iterable case using Cython/C though: it's already limited by the speed at which islice can skip through items of the iterator, and there's no way to avoid consuming the iterator in full.

--
Oscar
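[The geometric-variate routine Oscar mentions could be sketched with inverse-CDF sampling; geometric_variate is a hypothetical name, not an existing random-module API:]

```python
from math import floor, log
from random import random

def geometric_variate(p):
    """Number of failures before the first success, success probability p.

    Inverse-CDF sampling: floor(log(U) / log(1 - p)) with U uniform on (0, 1].
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if p == 1.0:
        return 0  # success is certain, so zero failures
    # 1.0 - random() lies in (0.0, 1.0], so log() never sees zero.
    return floor(log(1.0 - random()) / log(1.0 - p))
```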
On Mon, 13 Jul 2020 at 13:37, David Mertz <mertz@gnosis.cx> wrote:
This is an inefficient reservoir sampling. The optimized version does not need to call a random inclusion switch on every element, but can skip a geometrically ordered collection of (random) skip lengths to avoid most random inclusion decisions.
Obviously, all items must be iterated over no matter what, but if randrange() is significant compared to the cost of next(), the skip-then-decide version is much preferred, especially as size grows.
Yes, I was deliberately keeping the implementation simple because I find the naive version understandable just by looking at it. I knew there would be a faster version, but thanks for the pointer, which quickly led me here:
https://en.wikipedia.org/wiki/Reservoir_sampling#An_optimal_algorithm

In Python that's:

    from math import exp, log, floor
    from random import random, randrange
    from itertools import islice

    def sample_iter(iterable, k=1):
        """Select k items uniformly from iterable.

        Returns the whole population if there are k or fewer items
        """
        iterator = iter(iterable)
        values = list(islice(iterator, k))
        W = exp(log(random())/k)
        while True:
            # skip is geometrically distributed
            skip = floor(log(random())/log(1-W))
            selection = list(islice(iterator, skip, skip+1))
            if selection:
                values[randrange(k)] = selection[0]
                W *= exp(log(random())/k)
            else:
                return values

I haven't quite convinced myself that using floating point like this for W is okay, but this seems to work (maybe it would fail for really large k or something...). If there were a proper implementation of a geometric variate in the random module then that could be used instead.

By my timings this can choose from a large dict in less time than a conversion to list, although the time taken depends on k:

    In [2]: n = 6

    In [3]: d = dict(zip(range(10**n), range(10**n)))

    In [4]: %timeit sample_iter(d)
    15.5 ms ± 326 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [5]: %timeit list(d)
    26.1 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [6]: %timeit sample_iter(d, 2)
    15.8 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [7]: %timeit sample_iter(d, 20)
    17.6 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [8]: %timeit sample_iter(d, 100)
    19.9 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

--
Oscar
participants (4)
- Christopher Barker
- David Mertz
- Guido van Rossum
- Oscar Benjamin