[Python-ideas] random.sample should work better with iterators

Tim Peters tim.peters at gmail.com
Wed Jun 27 00:52:55 EDT 2018

[Abe Dillon]

>  Randomly sampling from some population is often done because the entire
> > population is impractically large which is also a motivation for using
> > iterators, so it seems natural that one would be able to sample from an
> > iterator. A naive implementation could use a heap queue:
> >
> > import heapq
> > import random
> >
> > def stream():
> >     while True: yield random.random()
> >
> > def sample(population, size):
> >     q = [tuple()]*size
> >     for el in zip(stream(), population):
> >         if el > q[0]: heapq.heapreplace(q, el)
> >     return [el[1] for el in q if el]

[Steven D'Aprano]

> Is that an improvement over:

sample(list(itertools.slice(population, size)))

> and if so, please explain.
> Different things entirely.  Your spelling is missing sample's required
second argument, and the difference should be clear if it's supplied:

  sample(list(itertools.slice(population, size)). size)

That is, it merely returns some permutation of the _initial_ `size` items
in the iterable.  The rest of the population is ignored.

In Python today, the easiest way to spell Abe's intent is, e.g.,

>>> from heapq import nlargest # or nsmallest - doesn't matter
>>> from random import random
>>> nlargest(4, (i for i in range(100000)), key=lambda x: random())
[75260, 45880, 99486, 13478]
>>> nlargest(4, (i for i in range(100000)), key=lambda x: random())
[31732, 72288, 26584, 72672]
>>> nlargest(4, (i for i in range(100000)), key=lambda x: random())
[14180, 86084, 22639, 2004]

That also arranges to preserve `sample()'s promise that all sub-slices of
the result are valid random samples too (because `nlargest` sorts by the
randomly generated keys before returning the list).

However, it does _not_ preserve - and nothing can preserve for arbitrary
iterables - `sample()`'s promise to "[leave] the original population
unchanged".  We can't examine an arbitrary iterable's population at all
without exhausting the iterable, and that can be destructive.

So while this can indeed be useful, it would require changing `sample()` to
break that promise in some cases.

BTW, using a heap for this is uncommon.  Search on "reservoir sampling" for
more-common ways   Most common is probably Vitter's "Algorithm R", which
runs in O(len(iterable)) time (no additional log factor for a heap - it
doesn't use a heap).

I'd prefer to leave `sample()` alone, and introduce some spelling of
`possibly_destructive_sample()` for arbitrary iterables - if that's wanted
enough for someone to do the work ;-)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20180626/14d1513a/attachment.html>

More information about the Python-ideas mailing list