Allow random.choice, random.sample to work on iterators

Currently these functions fail if the supplied object has no len(). There are algorithms for this task that can work on any finite iterator (for example, files as a stream of lines), and the functions could fall back to these if there is no len().
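For context, a minimal illustration of the status quo this proposal wants to change: random.choice calls len() on its argument, so a plain iterator is rejected outright.

```python
import random

# A generator standing in for any iterator without len(), e.g. a file
# being read line by line.
lines = (s for s in ["alpha", "beta", "gamma"])

try:
    random.choice(lines)
except TypeError as exc:
    # choice() needs len() and indexing, which generators lack.
    print("rejected:", exc)
```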

On 2016-11-30 17:25, Random832 wrote:
I like the idea, as long as it does not add too much overhead to existing code. It could be a special code path for reservoir sampling (I assume) for both functions (with choice taking only one sample from the stream). -- Bernardo Sulzbach http://www.mafagafogigante.org/ mafagafogigante@gmail.com

This was also brought up back in April: https://mail.python.org/pipermail//python-ideas/2016-April/039707.html It got a few replies from Guido (https://mail.python.org/pipermail//python-ideas/2016-April/039713.html for one of them). It seems the idea got dropped due to problems with making it properly random (practically, the sampling has issues in cases of very large sequences, let alone infinite ones) and with performance (either large numbers of calls to random, or copying the sequence, each of which has its own problems). There are also issues with how it should behave on iterables that cannot be re-iterated (e.g., random.choice will consume the iterator, and could only be called once safely). Chris

On Wed, Nov 30, 2016 at 11:42 AM, Bernardo Sulzbach <mafagafogigante@gmail.com> wrote:

On Wed, Nov 30, 2016 at 11:52 AM, Chris Kaynor <ckaynor@zindagigames.com> wrote:
I meant to include a sample in my previous e-mail. Consider that this code will not produce the "correct" results (for a reasonable definition of correct):

    # Pretend this does something more interesting and isn't a trivial
    # generator - maybe a file object read line by line.
    a = (i for i in range(100))
    # May not be quite as obvious, e.g. the choices could be chosen in a
    # more complex loop.
    randomEntries = [random.choice(a) for i in range(10)]

randomEntries may not contain 10 items (or the generation may error due to having insufficient items, depending on implementation), it will not contain duplicates, and it will be sorted. This occurs because the first call to random.choice consumes elements from a until it picks one; the next call then consumes from there until it picks one, and so forth. Chris

On 2016-11-30 17:57, Chris Kaynor wrote:
In such a case you should explicitly use a sample. I see your example as the caller's fault: it ignored the fact that the iterator changes after each call to choice. Hold the first 10 (in this case). For every subsequent element, randomly choose whether to replace one of the "held" ones with it (with a diminishing probability). This does not offer the same performance as loading everything into memory, but it isn't meant to: if you need to / can / want to, you can just shove it all into a list and use what we currently have. -- Bernardo Sulzbach http://www.mafagafogigante.org/ mafagafogigante@gmail.com
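The hold-and-replace scheme described above is reservoir sampling ("Algorithm R"); here is a minimal sketch, with sample_stream as an illustrative name rather than a proposed API:

```python
import itertools
import random

def sample_stream(iterable, k):
    """Sample k items uniformly, without replacement, from a finite
    iterable of unknown length, in O(k) memory."""
    it = iter(iterable)
    held = list(itertools.islice(it, k))      # hold the first k items
    if len(held) < k:
        raise ValueError("iterable shorter than sample size")
    for i, item in enumerate(it, start=k):
        j = random.randrange(i + 1)           # 0 <= j <= i
        if j < k:                             # replace with probability k/(i+1)
            held[j] = item
    return held
```

Each element past the first k is kept with probability k/(i+1), which works out to a uniform k-subset overall, at the cost of one randrange call per element.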

Is the goal to allow them to consume a finite generator of *unknown* length (requires reservoir sampling https://en.wikipedia.org/wiki/Reservoir_sampling with N random calls, which seemed to be the rub before?), or just to consume a generator with known length that's not indexable (a rare beast?). Consuming iterables if they have a length, like below, wouldn't be so bad, but might be too niche.

    class X:
        def __init__(self, ele):
            self.ele = ele
        def __len__(self):
            return len(self.ele)
        def __iter__(self):
            return iter(self.ele)

    x = X([1, 2, 3, 4, 5])
    random.choice(x)  # TypeError: 'X' object does not support indexing

Would allowing an optional 'len' argument alongside the iterator to sample/choice be too narrow to be useful?

On Wed, Nov 30, 2016 at 2:21 PM, Bernardo Sulzbach <mafagafogigante@gmail.com> wrote:
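A sketch of the optional-length idea (choice_with_len is a hypothetical name, not an existing random-module function): with a caller-supplied length, one randrange call suffices, at the cost of iterating up to the chosen position.

```python
import itertools
import random

def choice_with_len(iterable, length):
    """Pick one element uniformly from an iterable, trusting the
    caller-supplied length (no len() or indexing required)."""
    i = random.randrange(length)              # one random call total
    # Advance to position i and return that element; raises
    # StopIteration if the supplied length was too large.
    return next(itertools.islice(iter(iterable), i, None))
```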

On 12/1/2016 3:19 AM, Sjoerd Job Postmus wrote:
On Wed, Nov 30, 2016 at 02:32:54PM -0600, Nick Timkovich wrote:
a generator with known length that's not indexable (a rare beast?).
I don't believe a generator is ever indexable.
It is also not a generator. (It is an iterable.) If an *arbitrary* choice (without replacement) from a set is sufficient, set.pop() works. Otherwise, make a list. If we wanted selection from sets to be easy, without making a list, we should add a method that accesses the internal indexable array. -- Terry Jan Reedy
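To illustrate the distinction drawn above: set.pop() removes *some* element (arbitrary and implementation-defined, not uniformly random), whereas a uniform choice today requires materializing a list first.

```python
import random

colors = {"red", "green", "blue"}

# Arbitrary choice without replacement: which element comes out
# is unspecified, and it is removed from the set.
item = colors.pop()

# Uniformly random choice today: copy into a sequence first.
pick = random.choice(list(colors))
```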

On Wed, Nov 30, 2016 at 12:21 PM, Bernardo Sulzbach <mafagafogigante@gmail.com> wrote:
It would be the caller's fault: they failed to follow the documentation. That said, it would be a fairly subtle difference, and in many cases may go unnoticed in simple testing. There is a good probability that if you needed 10 samples out of 1000 entries, my code would appear to work correctly (I don't care to do the math to figure out the actual chance). Only on deeper testing or analysis of the results would you notice that the probabilities are skewed. Most likely, my exact example would be fairly obvious, but imagine if the code were more complex, with the random.choice call inside a loop, or even another function.

If random.choice were updated to use reservoir sampling as a fallback, it also means performance is more likely to degrade with some types. Given a list (or other sequence), it would be O(1), but if a generator gets passed in, it falls back to O(n), with lots of calls to random.random, which are slow anyway.

All that said, I would not be opposed to Python including a random.reservoir_choice (probably not the best name) function *in addition* to random.choice. The algorithm has its uses, but enough drawbacks and gotchas that it likely is not a good candidate for a fallback.
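A sketch of what such a separate function might look like (reservoir_choice is the name floated above, not an existing API): the single-item form of reservoir sampling, one randrange call per element.

```python
import random

def reservoir_choice(iterable):
    """Return one item chosen uniformly from a finite iterable,
    consuming it entirely: O(n) time, O(1) memory."""
    it = iter(iterable)
    try:
        pick = next(it)
    except StopIteration:
        raise IndexError("cannot choose from an empty iterable") from None
    for i, item in enumerate(it, start=2):
        if random.randrange(i) == 0:   # keep the i-th item with probability 1/i
            pick = item
    return pick
```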

On 2016-11-30 19:11, Chris Kaynor wrote:
I think this may be the path of least resistance. Even if it implies one or two new functions in random, it may be better than changing random.choice. Whether these functions would be used by enough end users to justify the change is debatable, however. -- Bernardo Sulzbach http://www.mafagafogigante.org/ mafagafogigante@gmail.com

On Wed, Nov 30, 2016 at 11:57:46AM -0800, Chris Kaynor wrote:
Indeed. Given an iterator with 100 items, there's a 9% chance that the first choice will land in the last nine items, which implies that failures here will be common.
it will not contain duplicates, and will be sorted.
Right. In other words, it's a highly biased, non-random "random sample". It may be worth adding a "reservoir sample" function, but I think it is a mistake to try modifying the existing functions to deal with iterators. It's too hard to get the same characteristics when sampling from a sequence and an iterator. Better to keep them as separate functions so that the caller knows what they're getting.

Here's my first attempt at implementing Algorithm R from Wikipedia: https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R

    from random import randrange, shuffle
    import itertools

    def reservoir_sample(iterable, count=1):
        """Return a list of count items sampled without replacement
        from iterable, using reservoir sampling "Algorithm R".

        This will exhaust the iterable, which must be finite.
        """
        it = iter(iterable)
        reservoir = list(itertools.islice(it, count))
        if len(reservoir) < count:
            raise ValueError('iterator is too short to sample %d items' % count)
        shuffle(reservoir)
        for i, value in enumerate(it, count+1):
            j = randrange(0, i)
            if j < count:
                reservoir[j] = value
        return reservoir

-- Steve

participants (7)

- Bernardo Sulzbach
- Chris Kaynor
- Nick Timkovich
- Random832
- Sjoerd Job Postmus
- Steven D'Aprano
- Terry Reedy