[Python-ideas] random.sample should work better with iterators

Wed Jun 27 13:35:28 EDT 2018

[Tim]

> > In Python today, the easiest way to spell Abe's intent is, e.g.,
> >
> > >>> from heapq import nlargest # or nsmallest - doesn't matter
> > >>> from random import random
> > >>> nlargest(4, (i for i in range(100000)), key=lambda x: random())
> > [75260, 45880, 99486, 13478]
> > >>> nlargest(4, (i for i in range(100000)), key=lambda x: random())
> > [31732, 72288, 26584, 72672]
> > >>> nlargest(4, (i for i in range(100000)), key=lambda x: random())
> > [14180, 86084, 22639, 2004]
> >
> > That also arranges to preserve `sample()`'s promise that all sub-slices of
> > the result are valid random samples too (because `nlargest` sorts by the
> > randomly generated keys before returning the list).
>

[Antoine Pitrou]

> How could slicing return an invalid random sample?
>

For example, consider random.sample(range(2), 2).  As a set, there is only
one possible output, {0, 1}.  But it doesn't return a set, it returns a
list.  So there are two possible outputs:

[0, 1]
[1, 0]

random.sample() promises to return each of those about equally often, so
that, e.g., result[0:1] and result[1:2] are also random 1-samples.

If it always returned, say, [0, 1], that's "a random" 2-sample, but its
1-slices are as far from random 1-samples as is possible to get.
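
The difference is easy to see empirically. Here is a minimal sketch that counts how often each ordered result shows up. The helper name count_orderings, the seed, and the trial count are illustrative choices, and the "always sorted" sampler is a hypothetical stand-in for an implementation that returned a fixed order:

```python
from collections import Counter
from heapq import nlargest
from random import random, sample, seed


def count_orderings(sampler, trials=10_000):
    """Count how often each ordered result of sampler() occurs."""
    return Counter(tuple(sampler()) for _ in range(trials))


seed(12345)  # arbitrary seed, only to make the run repeatable

# random.sample() itself: both orderings should appear about equally often.
by_sample = count_orderings(lambda: sample(range(2), 2))

# The nlargest spelling, fed an iterator: the result is sorted by the
# random keys, so both orderings should again appear about equally often.
by_nlargest = count_orderings(
    lambda: nlargest(2, iter(range(2)), key=lambda x: random()))

# Hypothetical sampler that always sorts its output: a fine 2-sample when
# viewed as a set, but result[0:1] is always [0].
by_sorted = count_orderings(lambda: sorted(sample(range(2), 2)))

for name, counts in [("sample", by_sample),
                     ("nlargest", by_nlargest),
                     ("sorted", by_sorted)]:
    print(name, dict(counts))
```

The first two lines of output should show (0, 1) and (1, 0) each close to half the trials, while the last shows only (0, 1), so its first element never varies.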