random.sample should work better with iterators

The docs on random.sample indicate that it works with iterators:
However, when I try to use iterators other than range, like so:

    random.sample(itertools.product(range(height), range(width)), 0.5*height*width)

I get:

    TypeError: Population must be a sequence or set. For dicts, use list(d).

I don't know if Python Ideas is the right channel for this, but this seems overly constrained. The inability to handle dictionaries is especially puzzling. Randomly sampling from some population is often done because the entire population is impractically large, which is also a motivation for using iterators, so it seems natural that one would be able to sample from an iterator. A naive implementation could use a heap queue:

    import heapq
    import random

    def stream():
        # An endless stream of random keys to pair with the population.
        while True:
            yield random.random()

    def sample(population, size):
        # Keep the `size` (key, element) pairs with the largest random keys.
        q = [tuple()] * size
        for el in zip(stream(), population):
            if el > q[0]:
                heapq.heapreplace(q, el)
        # Drop untouched placeholders (population smaller than `size`).
        return [el[1] for el in q if el]

It would also be helpful to add a ratio version of the function:

    def sample(population, size=None, *, ratio=None):
        assert None in (size, ratio), "can't specify both sample size and ratio"
        if ratio:
            return [el for el in population if random.random() < ratio]
        ...

On Tue, Jun 26, 2018 at 05:36:51PM -0700, Abe Dillon wrote:
That doesn't mention anything about iterators.
However, when I try to use iterators other than range, like so:
range is not an iterator. Thinking it is one is a very common error, but it certainly is not. It is a lazily-generated *sequence*, not an iterator. The definition of an iterator is that the object must have an __iter__ method returning *itself*, and a __next__ method (the "iterator protocol"):

    py> obj = range(100)
    py> hasattr(obj, '__next__')
    False
    py> obj.__iter__() is obj
    False

However, it is a sequence:

    py> import collections
    py> isinstance(obj, collections.Sequence)
    True

(Aside: I'm surprised there's no inspect.isiterator and .isiterable functions.)
Puzzling in what way? If sample() supported dicts, should it return the keys or the values or both? Also consider this: https://bugs.python.org/issue33098
Is that an improvement over:

    sample(list(itertools.islice(population, size)))

and if so, please explain.
Helpful under what circumstances? Don't let the source speak for itself. Explain what it means.

I understand what sample(population, size=100) does. What would sample(population, ratio=0.25) do? (That's not a rhetorical question, I genuinely don't understand the semantics of this proposed ratio argument.)

-- Steve

Steven D'Aprano writes:
Same misconception, I suppose.
If sample() supported dicts, should it return the keys or the values or both?
I argue below that *if* we were going to make the change, it should be to consistently try list() on non-sequences. But "not every one-liner" and EIBTI:

    d = {'a': 1, 'b': 2}
But this is weird:
Oh, I see. Key views are "set-like", item views *may* be set-like, but value views are *not* set-like. Since views are all listable, why not try list() on them? In general, I would think it makes sense to define this as "Population must be a sequence or convertible to a sequence using list()."

And for most of the applications I can think of in my own use, sample(list(d)) is not particularly useful because it's a sample of keys. I usually want sample(list(d.values())). The ramifications are unclear to me, but I guess it's too late to change this because of the efficiency implications Tim describes in issue33098 (so EIBTI; thanks for the reference!). On the other hand, that issue says sets can't be sampled efficiently, so the current behavior seems to *promote* inefficient usage?

I would definitely change the error message. I think "use list(d)" is bad advice because I believe it's not even "almost always" what you'll want, and if keys and values are of the same type, it won't be obvious from the output that you're *not* getting a sample from d.values() if that's what you wanted and thought you were getting.
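Concretely, on the Python of this thread's era (3.6/3.7; later releases deprecated and then removed set support in sample()), the view behavior looks like this:

    import random

    d = {'a': 1, 'b': 2}

    random.sample(d.keys(), 1)           # works: the keys view is set-like
    random.sample(d.items(), 1)          # works here: the items view is set-like
    # random.sample(d.values(), 1)       # TypeError: the values view is not set-like
    random.sample(list(d.values()), 1)   # works: explicit conversion to a sequence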
I assume sample(pop, ratio=0.25) == sample(pop, size=0.25*len(pop)).

[Abe Dillon]
[Steven D'Aprano]
Is that an improvement over:
sample(list(itertools.islice(population, size)))
sample(list(itertools.islice(population, size)), size)

That is, it merely returns some permutation of the _initial_ `size` items in the iterable. The rest of the population is ignored. In Python today, the easiest way to spell Abe's intent is, e.g., along the following lines.
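A minimal sketch of that spelling, assuming heapq.nlargest keyed on random() as the next paragraph describes (the wrapper name sample_from_iterable is invented here):

    import heapq
    import random

    def sample_from_iterable(iterable, size):
        # Give every element an independent random key and keep the `size`
        # elements with the largest keys; nlargest returns them sorted by
        # key, i.e. already in shuffled order.
        return heapq.nlargest(size, iterable, key=lambda x: random.random())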
That also arranges to preserve `sample()`'s promise that all sub-slices of the result are valid random samples too (because `nlargest` sorts by the randomly generated keys before returning the list).

However, it does _not_ preserve - and nothing can preserve for arbitrary iterables - `sample()`'s promise to "[leave] the original population unchanged". We can't examine an arbitrary iterable's population at all without exhausting the iterable, and that can be destructive. So while this can indeed be useful, it would require changing `sample()` to break that promise in some cases.

BTW, using a heap for this is uncommon. Search on "reservoir sampling" for more-common ways. Most common is probably Vitter's "Algorithm R", which runs in O(len(iterable)) time (no additional log factor for a heap - it doesn't use a heap).

I'd prefer to leave `sample()` alone, and introduce some spelling of `possibly_destructive_sample()` for arbitrary iterables - if that's wanted enough for someone to do the work ;-)
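For reference, a minimal sketch of Algorithm R (not from the thread; reservoir_sample is a name chosen for illustration):

    import random

    def reservoir_sample(iterable, k):
        # Vitter's Algorithm R: one pass, O(n) time, O(k) extra space.
        reservoir = []
        for i, x in enumerate(iterable):
            if i < k:
                reservoir.append(x)
            else:
                # Element i survives with probability k / (i + 1).
                j = random.randrange(i + 1)
                if j < k:
                    reservoir[j] = x
        return reservoir

Note that, unlike sample() or the nlargest() spelling, Algorithm R's output order is not itself uniformly shuffled, so its sub-slices are not valid sub-samples without a final random.shuffle(reservoir).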

[Tim]
[Antoine Pitrou]
How could slicing return an invalid random sample?
For example, consider random.sample(range(2), 2). As a set, there is only one possible output, {0, 1}. But it doesn't return a set, it returns a list. So there are two possible outputs:

    [0, 1]
    [1, 0]

random.sample() promises to return each of those about equally often, so that, e.g., result[0:1] and result[1:2] are also random 1-samples. If it always returned, say, [0, 1], that's "a random" 2-sample, but its 1-slices are as far from random 1-samples as is possible to get.
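A quick way to check that promise empirically (a sketch, not from the thread):

    import random
    from collections import Counter

    # Both orderings should appear roughly 5000 times each.
    counts = Counter(tuple(random.sample(range(2), 2)) for _ in range(10000))
    print(counts)   # e.g. Counter({(0, 1): 5023, (1, 0): 4977})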

Let me start off by saying I agree with Tim Peters that it would be best to implement these changes in a new function (if ever).

On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
range is not an iterator.
My misunderstanding of the details of range objects was, indeed, a huge contributing factor to my confusion. I assumed range was more like a generator function when I initially discovered that random.sample doesn't permit iterators; however, the reason I'm proposing a version of random.sample that accepts iterators is still sound. It's even in the text for why one should use a range object: "To choose a sample from a range of integers, use a range() object as an argument."
As I claimed before: A major use-case for sampling is to avoid working with an impractically large population. This is also a major use-case for iterators.

On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
this seems overly constrained. The inability to handle dictionaries is especially puzzling.
Like in all other contexts where a dictionary is treated as a collection, it should be treated as a collection of keys. There is plenty of precedent for this:

    d = dict(zip(names, ages))
    chronological_names = sorted(d, key=d.get)
    name_list, name_set = list(d), set(d)
    print(*d)

On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
I respectfully disagree with the conclusion of that issue. It goes against the "consenting adults" ethos of Python. As long as the performance implications are expressly documented, and maybe even a warning thrown, I don't see a reason to prevent people from using a useful function. You can't protect programmers from writing inefficient programs.

Also, it seems like the dict interface could expose a way to get a sequence view of the keys. This would be very efficient given the current implementation of dictionaries in CPython. So it's not like it's fundamentally impossible for random.choice to work efficiently with dicts; it's more of an implementation detail.

On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
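To illustrate the idea, a hypothetical sequence view (the class name and pure-Python design here are invented; a real version could index CPython's dense entry table directly instead of copying):

    import random

    class KeySequenceView:
        # Hypothetical: an indexable, sized view over a dict's keys.
        # This sketch snapshots the keys once; a C implementation could
        # read CPython's compact entry array in O(1) per index instead.
        def __init__(self, d):
            self._keys = list(d)

        def __len__(self):
            return len(self._keys)

        def __getitem__(self, i):
            return self._keys[i]

    d = {'alice': 34, 'bob': 61, 'carol': 28}
    print(random.choice(KeySequenceView(d)))  # a random key, via the sequence protocol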
Do you mean: sample(list(itertools.islice(population, size)), size)? If so, then I'll refer you to Tim Peters's response; otherwise, please clarify what you meant.

On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
I wasn't aware of the linear-time reservoir sampling algorithms that Tim Peters suggested. Those make the ratio proposal less helpful. As you can see from the implementation I proposed, the ratio version would be able to work with iterators of undetermined size in linear time; however, it wouldn't satisfy the valid subsampling criteria (unless you shuffle the output, as in the sketch at the end of this message) and it would only return *roughly* ratio*len(population) elements instead of an exact number.

On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
It would return a sample of roughly 25% of the population.

[Stephen J. Turnbull]
Converting the input to a list is exactly what I'm trying to avoid. I'd like to sample from an enormous file that won't fit in memory or populate 5% of a large game-of-life grid without using up gigabytes of memory:

    width, height, ratio = 100000, 100000, 0.05
    live_set = {*random.sample(itertools.product(range(height), range(width)),
                               ratio*width*height)}
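A minimal sketch of how the ratio idea could serve that use with bounded extra memory (the ratio_sample name is invented here; Bernoulli selection only approximates the sample size, and the shuffle restores the every-slice-is-a-sample property):

    import itertools
    import random

    def ratio_sample(population, ratio):
        # Keep each element independently with probability `ratio`; the
        # output size is only *approximately* ratio * len(population).
        kept = [el for el in population if random.random() < ratio]
        random.shuffle(kept)
        return kept

    # Smaller than the post's 100000 x 100000 grid, so it runs quickly.
    width, height, ratio = 1000, 1000, 0.05
    live_set = set(ratio_sample(itertools.product(range(height), range(width)), ratio))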

participants (6)
- Abe Dillon
- Antoine Pitrou
- Franklin? Lee
- Stephen J. Turnbull
- Steven D'Aprano
- Tim Peters