<div dir="ltr">Let me start off by saying I agree with Tim Peters that it would be best to implement these changes in a new function (if ever).<div><br>On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:<blockquote class="gmail_quote" style="margin: 0;margin-left: 0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;">range is not an iterator.
<br></blockquote><div><br></div><div>My misunderstanding of the details of range objects was, indeed, a huge contributing factor to my confusion. I assumed range was more like a generator function when I initially discovered that random.sample doesn't permit iterators; however, the reason I'm proposing a version of random.sample that accepts iterators is still sound. It's even in the documentation's rationale for using a range object:</div><div><br></div><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><font size="2" face="arial, sans-serif"><span style="text-align: justify;">To choose a sample from a range of integers, use a </span><a class="reference internal" href="https://docs.python.org/3/library/stdtypes.html#range" title="range" style="color: rgb(99, 99, 187); text-align: justify;"><code class="xref py py-func docutils literal notranslate" style="background-color: transparent; padding-right: 1px; padding-left: 1px; border-radius: 3px;"><span class="pre" style="hyphens: none;">range()</span></code></a><span style="text-align: justify;"> object as an argument. <b>This is especially fast and space efficient for sampling from a large population</b></span></font></blockquote><div><br></div><div>As I claimed before, a major use case for sampling is to avoid working with an impractically large population. This is also a major use case for iterators.</div><div><br></div><div>On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:<br></div><blockquote class="gmail_quote" style="margin: 0;margin-left: 0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;">> this seems overly constrained. The inability to handle dictionaries is <br>>  especially puzzling. </blockquote><blockquote class="gmail_quote" style="margin: 0;margin-left: 0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;">
<br>Puzzling in what way?
<br>
<br>If sample() supported dicts, should it return the keys or the values or 
<br>both? </blockquote><div>As in every other context where a dictionary is treated as a collection, it should be treated as a collection of keys. There is plenty of precedent for this:<br><br></div><div class="prettyprint" style="background-color: rgb(250, 250, 250); border-color: rgb(187, 187, 187); border-style: solid; border-width: 1px; word-wrap: break-word;"><code class="prettyprint"><div class="subprettyprint"><span style="color: #000;" class="styled-by-prettify">d </span><span style="color: #660;" class="styled-by-prettify">=</span><span style="color: #000;" class="styled-by-prettify"> dict</span><span style="color: #660;" class="styled-by-prettify">(</span><span style="color: #000;" class="styled-by-prettify">zip</span><span style="color: #660;" class="styled-by-prettify">(</span><span style="color: #000;" class="styled-by-prettify">names</span><span style="color: #660;" class="styled-by-prettify">,</span><span style="color: #000;" class="styled-by-prettify"> ages</span><span style="color: #660;" class="styled-by-prettify">))</span><span style="color: #000;" class="styled-by-prettify"><br>chronological_names </span><span style="color: #660;" class="styled-by-prettify">=</span><span style="color: #000;" class="styled-by-prettify"> sorted</span><span style="color: #660;" class="styled-by-prettify">(</span><span style="color: #000;" class="styled-by-prettify">d</span><span style="color: #660;" class="styled-by-prettify">,</span><span style="color: #000;" class="styled-by-prettify"> key</span><span style="color: #660;" class="styled-by-prettify">=</span><span style="color: #000;" class="styled-by-prettify">d</span><span style="color: #660;" class="styled-by-prettify">.</span><span style="color: #008;" class="styled-by-prettify">get</span><span style="color: #660;" class="styled-by-prettify">)</span><span style="color: #000;" class="styled-by-prettify"><br>name_list</span><span style="color: #660;" class="styled-by-prettify">,</span><span style="color: #000;" 
class="styled-by-prettify"> name_set </span><span style="color: #660;" class="styled-by-prettify">=</span><span style="color: #000;" class="styled-by-prettify"> list</span><span style="color: #660;" class="styled-by-prettify">(</span><span style="color: #000;" class="styled-by-prettify">d</span><span style="color: #660;" class="styled-by-prettify">),</span><span style="color: #000;" class="styled-by-prettify"> </span><span style="color: #008;" class="styled-by-prettify">set</span><span style="color: #660;" class="styled-by-prettify">(</span><span style="color: #000;" class="styled-by-prettify">d</span><span style="color: #660;" class="styled-by-prettify">)</span><span style="color: #000;" class="styled-by-prettify"><br></span><span style="color: #008;" class="styled-by-prettify">print</span><span style="color: #660;" class="styled-by-prettify">(*</span><span style="color: #000;" class="styled-by-prettify">d</span><span style="color: #660;" class="styled-by-prettify">)</span></div></code></div><div><br></div><div><br>On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:<br></div><blockquote class="gmail_quote" style="margin: 0;margin-left: 0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;">Also consider this:
<br>
<br><a href="https://bugs.python.org/issue33098" target="_blank" rel="nofollow">https://bugs.python.org/<wbr>issue33098</a> <br></blockquote><div>I respectfully disagree with the conclusion of that issue. It goes against the "consenting adults" ethos of Python. As long as the performance implications are expressly documented, and maybe even a warning is issued, I don't see a reason to prevent people from using a useful function. You can't protect programmers from writing inefficient programs. Also, it seems like the dict interface could expose a way to get a sequence view of the keys. This would be very efficient given the current implementation of dictionaries in CPython. So it's not fundamentally impossible for random.choice to work efficiently with dicts; it's more of an implementation detail.</div><div><br></div><div>On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:<br></div><blockquote class="gmail_quote" style="margin: 0;margin-left: 0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;">> Randomly sampling from some population is often done because the entire 
<br>> population is impractically large which is also a motivation for using 
<br>> iterators, so it seems natural that one would be able to sample from an 
<br>> iterator. A naive implementation could use a heap queue: 
<br>>
<br>> import heapq
<br>> import random
<br>> 
<br>> def stream(): 
<br>>     while True: yield random.random()
<br>> 
<br>> def sample(population, size):
<br>>     q = [tuple()]*size
<br>>     for el in zip(stream(), population):
<br>>         if el > q[0]: heapq.heapreplace(q, el)
<br>>     return [el[1] for el in q if el]
<br>
<br>Is that an improvement over:
<br>
<br>sample(list(itertools.slice(<wbr>population, size)))
<br>
<br>and if so, please explain.</blockquote><div> </div><div>Do you mean: <font face="courier new, monospace">sample(list(itertools.islice(population, size)), size)</font>?</div><div>If so, then I'll refer you to Tim Peters's response; otherwise, please clarify what you meant.</div><div><br></div><div>On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:<br></div><blockquote class="gmail_quote" style="margin: 0;margin-left: 0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;">> It would also be helpful to add a ratio version of the function: 
<br>> 
<br>> def sample(population, size=None, *, ratio=None):
<br>>     assert None in (size, ratio), "can't specify both sample size and ratio"
<br>>     if ratio:
<br>>         return [el for el in population if random.random() < ratio]
<br>>     ...
<br>
<br>Helpful under what circumstances? </blockquote><div> <br>I wasn't aware of the linear-time reservoir sampling algorithms that Tim Peters suggested. Those make the ratio proposal less helpful.<br>As you can see from the implementation I proposed, the ratio version would work with iterators of undetermined size in linear time; however, it wouldn't satisfy the sub-slice criterion, that any slice of the result is itself a valid random sample (unless you shuffle the output), and it would only return <b>roughly</b> <font face="courier new, monospace">ratio*len(population)</font> elements instead of an exact number.</div><div><br></div><div>On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:<br></div><blockquote class="gmail_quote" style="margin: 0;margin-left: 0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;">Don't let the source speak for itself. Explain what it means. I 
<br>understand what sample(population, size=100) does. What would 
<br>sample(population, ratio=0.25) do? </blockquote><div> </div><div>It would return a sample of roughly 25% of the population.</div><div><br></div><div>[Stephen J. Turnbull]<br><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">I argue below that *if* we were going to make the change, it should be <br>to consistently try list() on non-sequences.  But "not every <br>one-liner" and EIBTI: </blockquote><div> <br>Converting the input to a list is exactly what I'm trying to avoid. I'd like to sample from an enormous file that won't fit in memory or populate 5% of a large game-of-life grid without using up gigabytes of memory:<br><br><div class="prettyprint" style="background-color: rgb(250, 250, 250); border-color: rgb(187, 187, 187); border-style: solid; border-width: 1px; word-wrap: break-word;"><code class="prettyprint"><div class="subprettyprint"><span style="color: #000;" class="styled-by-prettify">width</span><span style="color: #660;" class="styled-by-prettify">,</span><span style="color: #000;" class="styled-by-prettify"> height</span><span style="color: #660;" class="styled-by-prettify">,</span><span style="color: #000;" class="styled-by-prettify"> ratio </span><span style="color: #660;" class="styled-by-prettify">=</span><span style="color: #000;" class="styled-by-prettify"> </span><span style="color: #066;" class="styled-by-prettify">100000</span><span style="color: #660;" class="styled-by-prettify">,</span><span style="color: #000;" class="styled-by-prettify"> </span><span style="color: #066;" class="styled-by-prettify">100000</span><span style="color: #660;" class="styled-by-prettify">,</span><span style="color: #000;" class="styled-by-prettify"> </span><font color="#666600"><span style="color: #066;" class="styled-by-prettify">0.05</span></font><span style="color: #000;" class="styled-by-prettify"><br><br>live_set </span><span style="color: #660;" 
class="styled-by-prettify">=</span><span style="color: #000;" class="styled-by-prettify"> </span><span style="color: #660;" class="styled-by-prettify">{*</span><span style="color: #000;" class="styled-by-prettify">random</span><span style="color: #660;" class="styled-by-prettify">.</span><span style="color: #000;" class="styled-by-prettify">sample</span><span style="color: #660;" class="styled-by-prettify">(</span><span style="color: #000;" class="styled-by-prettify">itertools</span><span style="color: #660;" class="styled-by-prettify">.</span><span style="color: #000;" class="styled-by-prettify">product</span><span style="color: #660;" class="styled-by-prettify">(</span><span style="color: #000;" class="styled-by-prettify">range</span><span style="color: #660;" class="styled-by-prettify">(</span><span style="color: #000;" class="styled-by-prettify">height</span><span style="color: #660;" class="styled-by-prettify">),</span><span style="color: #000;" class="styled-by-prettify"> range</span><span style="color: #660;" class="styled-by-prettify">(</span><span style="color: #000;" class="styled-by-prettify">width</span><span style="color: #660;" class="styled-by-prettify">))</span><span style="color: #000;" class="styled-by-prettify"> </span><span style="color: #660;" class="styled-by-prettify">,</span><span style="color: #000;" class="styled-by-prettify"> ratio</span><span style="color: #660;" class="styled-by-prettify">*</span><span style="color: #000;" class="styled-by-prettify">width</span><span style="color: #660;" class="styled-by-prettify">*</span><span style="color: #000;" class="styled-by-prettify">height</span><span style="color: #660;" class="styled-by-prettify">)}</span></div></code></div><br></div></div></div></div>
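<div>For what it's worth, here is a minimal sketch of the kind of iterator-friendly function I have in mind. It assumes the textbook reservoir-sampling approach (Algorithm R), not necessarily the variant Tim Peters described, and the name <code>sample_iter</code> is hypothetical; nothing like it exists in the random module. The grid is scaled down so it runs quickly, and the sample size is converted to an int because a float count isn't accepted.</div>

```python
import itertools
import random


def sample_iter(iterable, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Hypothetical helper (not part of the random module), using reservoir
    sampling (Algorithm R): a single pass and O(k) extra memory.
    """
    if k < 0:
        raise ValueError("sample size must be non-negative")
    iterator = iter(iterable)
    # Fill the reservoir with the first k items.
    reservoir = list(itertools.islice(iterator, k))
    if len(reservoir) < k:
        raise ValueError("sample larger than population")
    # The item at 0-based position i is kept with probability k / (i + 1),
    # in which case it overwrites a uniformly chosen slot of the reservoir.
    for i, item in enumerate(iterator, start=k):
        j = rng.randrange(i + 1)
        if j < k:
            reservoir[j] = item
    return reservoir


# Scaled-down version of the grid example: 5% of a 100 x 100 grid.
width, height, ratio = 100, 100, 0.05
cells = itertools.product(range(height), range(width))
live_set = set(sample_iter(cells, int(ratio * width * height)))
```

<div>Note that the result of this sketch is not in random order (early items tend to stay in their original slots), so it would need a shuffle before sub-slices could be treated as valid samples. The memory cost is proportional to the sample size, not the population size.</div>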