random.sample should work better with iterators

The docs on random.sample indicate that it works with iterators:
However, when I try to use iterators other than range, like so:

    random.sample(itertools.product(range(height), range(width)), 0.5*height*width)

I get:

    TypeError: Population must be a sequence or set. For dicts, use list(d).

I don't know if Python Ideas is the right channel for this, but this seems overly constrained. The inability to handle dictionaries is especially puzzling. Randomly sampling from some population is often done because the entire population is impractically large, which is also a motivation for using iterators, so it seems natural that one would be able to sample from an iterator. A naive implementation could use a heap queue:

    import heapq
    import random

    def stream():
        # An endless stream of random keys to pair with the population.
        while True:
            yield random.random()

    def sample(population, size):
        # Keep the `size` (key, element) pairs with the largest random keys.
        q = [tuple()] * size
        for el in zip(stream(), population):
            if el > q[0]:
                heapq.heapreplace(q, el)
        # Drop untouched placeholders (population smaller than `size`).
        return [el[1] for el in q if el]

It would also be helpful to add a ratio version of the function:

    def sample(population, size=None, *, ratio=None):
        assert None in (size, ratio), "can't specify both sample size and ratio"
        if ratio:
            return [el for el in population if random.random() < ratio]
        ...

On Tue, Jun 26, 2018 at 05:36:51PM -0700, Abe Dillon wrote:
That doesn't mention anything about iterators.
However, when I try to use iterators other than range, like so:
range is not an iterator. Thinking it is one is a very common error, but it certainly is not. It is a lazily-generated *sequence*, not an iterator. The definition of an iterator is that the object must have an __iter__ method returning *itself*, and a __next__ method (the "iterator protocol"):

    py> obj = range(100)
    py> hasattr(obj, '__next__')
    False
    py> obj.__iter__() is obj
    False

However, it is a sequence:

    py> import collections
    py> isinstance(obj, collections.Sequence)
    True

(Aside: I'm surprised there's no inspect.isiterator and .isiterable functions.)
Puzzling in what way? If sample() supported dicts, should it return the keys or the values or both? Also consider this: https://bugs.python.org/issue33098
Is that an improvement over:

    sample(list(itertools.islice(population, size)))

and if so, please explain.
Helpful under what circumstances? Don't let the source speak for itself. Explain what it means.

I understand what sample(population, size=100) does. What would sample(population, ratio=0.25) do? (That's not a rhetorical question, I genuinely don't understand the semantics of this proposed ratio argument.)

-- Steve

Steven D'Aprano writes:
Same misconception, I suppose.
If sample() supported dicts, should it return the keys or the values or both?
I argue below that *if* we were going to make the change, it should be to consistently try list() on non-sequences. But "not every one-liner" and EIBTI:

    d = {'a': 1, 'b': 2}
But this is weird:
Oh, I see. Key views are "set-like", item views *may* be set-like, but value views are *not* set-like. Since views are all listable, why not try list() on them? In general, I would think it makes sense to define this as "Population must be a sequence or convertible to a sequence using list()."

And for most of the applications I can think of in my own use, sample(list(d)) is not particularly useful because it's a sample of keys. I usually want sample(list(d.values())). The ramifications are unclear to me, but I guess it's too late to change this because of the efficiency implications Tim describes in issue33098 (so EIBTI; thanks for the reference!). On the other hand, that issue says sets can't be sampled efficiently, so the current behavior seems to *promote* inefficient usage?

I would definitely change the error message. I think "use list(d)" is bad advice because I believe it's not even "almost always" what you'll want, and if keys and values are of the same type, it won't be obvious from the output that you're *not* getting a sample from d.values() if that's what you wanted and thought you were getting.
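Concretely, on the Python of this thread's era (3.6/3.7; later releases deprecated and then removed set support in sample()), the view behavior looks like this:

    import random

    d = {'a': 1, 'b': 2}

    random.sample(d.keys(), 1)           # works: the keys view is set-like
    random.sample(d.items(), 1)          # works here: the items view is set-like
    # random.sample(d.values(), 1)       # TypeError: the values view is not set-like
    random.sample(list(d.values()), 1)   # works: explicit conversion to a sequence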
I assume sample(pop, ratio=0.25) == sample(pop, size=0.25*len(pop)).

[Abe Dillon]
[Steven D'Aprano]
Is that an improvement over:
sample(list(itertools.islice(population, size)))
sample(list(itertools.islice(population, size)), size)

That is, it merely returns some permutation of the _initial_ `size` items in the iterable. The rest of the population is ignored. In Python today, the easiest way to spell Abe's intent is, e.g., along the following lines.
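A minimal sketch of that spelling, assuming heapq.nlargest keyed on random() as the next paragraph describes (the wrapper name sample_from_iterable is invented here):

    import heapq
    import random

    def sample_from_iterable(iterable, size):
        # Give every element an independent random key and keep the `size`
        # elements with the largest keys; nlargest returns them sorted by
        # key, i.e. already in shuffled order.
        return heapq.nlargest(size, iterable, key=lambda x: random.random())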
That also arranges to preserve `sample()`'s promise that all sub-slices of the result are valid random samples too (because `nlargest` sorts by the randomly generated keys before returning the list).

However, it does _not_ preserve - and nothing can preserve for arbitrary iterables - `sample()`'s promise to "[leave] the original population unchanged". We can't examine an arbitrary iterable's population at all without exhausting the iterable, and that can be destructive. So while this can indeed be useful, it would require changing `sample()` to break that promise in some cases.

BTW, using a heap for this is uncommon. Search on "reservoir sampling" for more-common ways. Most common is probably Vitter's "Algorithm R", which runs in O(len(iterable)) time (no additional log factor for a heap - it doesn't use a heap).

I'd prefer to leave `sample()` alone, and introduce some spelling of `possibly_destructive_sample()` for arbitrary iterables - if that's wanted enough for someone to do the work ;-)
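For reference, a minimal sketch of Algorithm R (not from the thread; reservoir_sample is a name chosen for illustration):

    import random

    def reservoir_sample(iterable, k):
        # Vitter's Algorithm R: one pass, O(n) time, O(k) extra space.
        reservoir = []
        for i, x in enumerate(iterable):
            if i < k:
                reservoir.append(x)
            else:
                # Element i survives with probability k / (i + 1).
                j = random.randrange(i + 1)
                if j < k:
                    reservoir[j] = x
        return reservoir

Note that, unlike sample() or the nlargest() spelling, Algorithm R's output order is not itself uniformly shuffled, so its sub-slices are not valid sub-samples without a final random.shuffle(reservoir).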

[Tim]
[Antoine Pitrou]
How could slicing return an invalid random sample?
For example, consider random.sample(range(2), 2). As a set, there is only one possible output, {0, 1}. But it doesn't return a set, it returns a list. So there are two possible outputs:

    [0, 1]
    [1, 0]

random.sample() promises to return each of those about equally often, so that, e.g., result[0:1] and result[1:2] are also random 1-samples. If it always returned, say, [0, 1], that's "a random" 2-sample, but its 1-slices are as far from random 1-samples as is possible to get.
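A quick way to check that promise empirically (a sketch, not from the thread):

    import random
    from collections import Counter

    # Both orderings should appear roughly 5000 times each.
    counts = Counter(tuple(random.sample(range(2), 2)) for _ in range(10000))
    print(counts)   # e.g. Counter({(0, 1): 5023, (1, 0): 4977})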

Let me start off by saying I agree with Tim Peters that it would be best to implement these changes in a new function (if ever).

On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
range is not an iterator.
My misunderstanding of the details of range objects was, indeed, a huge contributing factor to my confusion. I assumed range was more like a generator function when I initially discovered that random.sample doesn't permit iterators; however, the reason I'm proposing a version of random.sample that accepts iterators is still sound. It's even in the text for why one should use a range object: "To choose a sample from a range of integers, use a range() object as an argument."
As I claimed before: A major use-case for sampling is to avoid working with an impractically large population. This is also a major use-case for iterators.

On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
this seems overly constrained. The inability to handle dictionaries is especially puzzling.
Like in all other contexts where a dictionary is treated as a collection, it should be treated as a collection of keys. There is plenty of precedent for this:

    d = dict(zip(names, ages))
    chronological_names = sorted(d, key=d.get)
    name_list, name_set = list(d), set(d)
    print(*d)

On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
I respectfully disagree with the conclusion of that issue. It goes against the "consenting adults" ethos of Python. As long as the performance implications are expressly documented, and maybe even a warning thrown, I don't see a reason to prevent people from using a useful function. You can't protect programmers from writing inefficient programs.

Also, it seems like the dict interface could expose a way to get a sequence view of the keys. This would be very efficient given the current implementation of dictionaries in CPython. So it's not like it's fundamentally impossible for random.choice to work efficiently with dicts; it's more of an implementation detail.

On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
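To illustrate the idea, a hypothetical sequence view (the class name and pure-Python design here are invented; a real version could index CPython's dense entry table directly instead of copying):

    import random

    class KeySequenceView:
        # Hypothetical: an indexable, sized view over a dict's keys.
        # This sketch snapshots the keys once; a C implementation could
        # read CPython's compact entry array in O(1) per index instead.
        def __init__(self, d):
            self._keys = list(d)

        def __len__(self):
            return len(self._keys)

        def __getitem__(self, i):
            return self._keys[i]

    d = {'alice': 34, 'bob': 61, 'carol': 28}
    print(random.choice(KeySequenceView(d)))  # a random key, via the sequence protocol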
Do you mean: sample(list(itertools.islice(population, size)), size)? If so, then I'll refer you to Tim Peters's response; otherwise, please clarify what you meant.

On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
I wasn't aware of the linear-time reservoir sampling algorithms that Tim Peters suggested. Those make the ratio proposal less helpful. As you can see from the implementation I proposed, the ratio version would be able to work with iterators of undetermined size in linear time; however, it wouldn't satisfy the valid subsampling criteria (unless you shuffle the output, as in the sketch at the end of this message) and it would only return *roughly* ratio*len(population) elements instead of an exact number.

On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
It would return a sample of roughly 25% of the population.

[Stephen J. Turnbull]
Converting the input to a list is exactly what I'm trying to avoid. I'd like to sample from an enormous file that won't fit in memory or populate 5% of a large game-of-life grid without using up gigabytes of memory:

    width, height, ratio = 100000, 100000, 0.05
    live_set = {*random.sample(itertools.product(range(height), range(width)),
                               ratio*width*height)}
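A minimal sketch of how the ratio idea could serve that use with bounded extra memory (the ratio_sample name is invented here; Bernoulli selection only approximates the sample size, and the shuffle restores the every-slice-is-a-sample property):

    import itertools
    import random

    def ratio_sample(population, ratio):
        # Keep each element independently with probability `ratio`; the
        # output size is only *approximately* ratio * len(population).
        kept = [el for el in population if random.random() < ratio]
        random.shuffle(kept)
        return kept

    # Smaller than the post's 100000 x 100000 grid, so it runs quickly.
    width, height, ratio = 1000, 1000, 0.05
    live_set = set(ratio_sample(itertools.product(range(height), range(width)), ratio))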

participants (6)
- Abe Dillon
- Antoine Pitrou
- Franklin? Lee
- Stephen J. Turnbull
- Steven D'Aprano
- Tim Peters