[Python-ideas] Allow random.choice, random.sample to work on iterators
Steven D'Aprano
steve at pearwood.info
Wed Nov 30 18:14:35 EST 2016
On Wed, Nov 30, 2016 at 11:57:46AM -0800, Chris Kaynor wrote:
> Consider that this code will not produce the "correct" results (for a
> reasonable definition of correct):
>
> a = (i for i in range(100)) # Pretend this does something more
> interesting, and isn't a trivial generator - maybe a file object
> reading by line.
> randomEntries = [random.choice(a) for i in range(10)] # May not be
> quite as obvious, such as the choices could be chosen in a more
> complex loop.
>
> randomEntries may not contain 10 items (or the generation may error
> due to having insufficent items, depending on implementation),
Indeed. Given a iterator with 100 items, there's a 9% chance that the
first choice will be in the last nine items, which implies that failures
here will be common.
> it will
> not contain duplicates, and will be sorted.
Right. In other words, its a highly biased, non-random "random sample".
It may be worth adding a "reservoir sample" function, but I think it is
a mistake to try modifying the existing functions to deal with
iterators. Its too hard to get the same characteristics when sampling
from a sequence and an iterator. Better to keep them as separate
functions so that the caller knows what they're getting.
Here's my first attempt at implementing Algorithm R from Wikipedia:
https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R
from random import randrange, shuffle
import itertools
def reservoir_sample(iterable, count=1):
"""Return a list of count items sampled without replacement
from iterable, using reservoir sampling "Algorithm R".
This will exhaust the iterable, which must be finite.
"""
it = iter(iterable)
reservoir = list(itertools.islice(it, count))
if len(reservoir) < count:
raise ValueError('iterator is too short to sample %d items' % count)
shuffle(reservoir)
for i, value in enumerate(it, count+1):
j = randrange(0, i)
if j < count:
reservoir[j] = value
return reservoir
--
Steve
More information about the Python-ideas
mailing list