[Python-ideas] Allow random.choice, random.sample to work on iterators

Steven D'Aprano steve at pearwood.info
Wed Nov 30 18:14:35 EST 2016


On Wed, Nov 30, 2016 at 11:57:46AM -0800, Chris Kaynor wrote:

> Consider that this code will not produce the "correct" results (for a
> reasonable definition of correct):
> 
> a = (i for i in range(100)) # Pretend this does something more
> interesting, and isn't a trivial generator - maybe a file object
> reading by line.
> randomEntries = [random.choice(a) for i in range(10)] # May not be
> quite as obvious, such as the choices could be chosen in a more
> complex loop.
> 
> randomEntries may not contain 10 items (or the generation may error
> due to having insufficent items, depending on implementation),

Indeed. Given a iterator with 100 items, there's a 9% chance that the 
first choice will be in the last nine items, which implies that failures 
here will be common.


> it will
> not contain duplicates, and will be sorted.

Right. In other words, its a highly biased, non-random "random sample".

It may be worth adding a "reservoir sample" function, but I think it is 
a mistake to try modifying the existing functions to deal with 
iterators. Its too hard to get the same characteristics when sampling 
from a sequence and an iterator. Better to keep them as separate 
functions so that the caller knows what they're getting.

Here's my first attempt at implementing Algorithm R from Wikipedia:

https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R


from random import randrange, shuffle
import itertools

def reservoir_sample(iterable, count=1):
    """Return a list of count items sampled without replacement
    from iterable, using reservoir sampling "Algorithm R".

    This will exhaust the iterable, which must be finite.
    """
    it = iter(iterable)
    reservoir = list(itertools.islice(it, count))
    if len(reservoir) < count:
        raise ValueError('iterator is too short to sample %d items' % count)
    shuffle(reservoir)
    for i, value in enumerate(it, count+1):
        j = randrange(0, i)
        if j < count:
            reservoir[j] = value
    return reservoir



-- 
Steve


More information about the Python-ideas mailing list