[Python-ideas] random.sample should work better with iterators

Tue Jun 26 21:05:26 EDT 2018

On Tue, Jun 26, 2018 at 05:36:51PM -0700, Abe Dillon wrote:
> The docs on random.sample indicate that it works with iterators:
> 
> > To choose a sample from a range of integers, use a range() 
> > <https://docs.python.org/3/library/stdtypes.html#range> object as an 
> > argument. This is especially fast and space efficient for sampling from a 
> > large population: sample(range(10000000),k=60).

That doesn't mention anything about iterators.

> However, when I try to use iterators other than range, like so:

range is not an iterator.

Treating it as one is a very common error, but it certainly is not one. 
It is a lazily-generated *sequence*, not an iterator.

The definition of an iterator is that the object must have an __iter__ 
method returning *itself*, and a __next__ method (the "iterator 
protocol"):

py> obj = range(100)
py> hasattr(obj, '__next__')
False
py> obj.__iter__() is obj
False

However, it is a sequence:

py> import collections.abc
py> isinstance(obj, collections.abc.Sequence)
True

(Aside: I'm surprised there are no inspect.isiterator and 
inspect.isiterable functions.)
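
In the absence of such functions, the abstract base classes in 
collections.abc can do the job. A minimal sketch follows; is_iterable and 
is_iterator are hypothetical helper names, not part of inspect or any 
other standard module:

```python
from collections.abc import Iterable, Iterator, Sequence

def is_iterable(obj):
    """Hypothetical helper: True if obj defines __iter__.

    Note: this misses objects that are iterable only through the old
    __getitem__ protocol.
    """
    return isinstance(obj, Iterable)

def is_iterator(obj):
    """Hypothetical helper: True if obj defines both __iter__ and __next__."""
    return isinstance(obj, Iterator)

r = range(100)
it = iter(r)

# range: iterable and a sequence, but not an iterator
print(is_iterable(r), is_iterator(r), isinstance(r, Sequence))

# iter(range): iterable and an iterator, but not a sequence
print(is_iterable(it), is_iterator(it), isinstance(it, Sequence))
```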

> random.sample(itertools.product(range(height), range(width)), 
> 0.5*height*width)
> 
> I get:
> 
> TypeError: Population must be a sequence or set. For dicts, use list(d).
> 
> I don't know if Python Ideas is the right channel for this, but this seems 
> overly constrained. The inability to handle dictionaries is especially 
> puzzling.

Puzzling in what way?

If sample() supported dicts, should it return the keys or the values or 
both? Also consider this:

https://bugs.python.org/issue33098
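
To make the ambiguity concrete, here is a small sketch of the three 
possible readings, each spelled out explicitly by the caller. The dict d 
and the sample size are made up for illustration:

```python
import random

d = {"a": 1, "b": 2, "c": 3, "d": 4}

# Each call converts the dict to a list first, so the caller decides
# what is being sampled.
keys = random.sample(list(d), 2)              # list(d) gives the keys
values = random.sample(list(d.values()), 2)   # values only
items = random.sample(list(d.items()), 2)     # (key, value) pairs

print(keys, values, items)
```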

> Randomly sampling from some population is often done because the entire 
> population is impractically large which is also a motivation for using 
> iterators, so it seems natural that one would be able to sample from an 
> iterator. A naive implementation could use a heap queue: 
>
> import heapq
> import random
> 
> def stream(): 
>     while True: yield random.random()
> 
> def sample(population, size):
>     q = [tuple()]*size
>     for el in zip(stream(), population):
>         if el > q[0]: heapq.heapreplace(q, el)
>     return [el[1] for el in q if el]

Is that an improvement over:

sample(list(itertools.islice(population, size)))

and if so, please explain.
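
For comparison, a rough sketch of two ways to sample from an arbitrary 
iterable. sample_materialized and sample_streaming are hypothetical names, 
not part of the random module, and the streaming version is a tidied 
variant of the quoted heap idea built on heapq.nlargest, not the code as 
posted:

```python
import heapq
import itertools
import random

def sample_materialized(iterable, k):
    """Copy the whole iterable into a list, then use random.sample."""
    return random.sample(list(iterable), k)

def sample_streaming(iterable, k):
    """Single pass: tag each item with a random key, keep the k largest keys.

    Only about k tagged items are held at once. The counter breaks ties
    between equal keys so the items themselves are never compared.
    """
    counter = itertools.count()
    tagged = ((random.random(), next(counter), item) for item in iterable)
    return [item for _, _, item in heapq.nlargest(k, tagged)]

pairs = itertools.product(range(4), range(5))   # an iterator, not a sequence
print(sample_streaming(pairs, 10))
```

The two differ when k exceeds the number of items: random.sample raises 
ValueError, while this streaming sketch silently returns fewer items.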

> It would also be helpful to add a ratio version of the function: 
> 
> def sample(population, size=None, *, ratio=None):
>     assert None in (size, ratio), "can't specify both sample size and ratio"
>     if ratio:
>         return [el for el in population if random.random() < ratio]
>     ...

Helpful under what circumstances?

Don't let the source speak for itself. Explain what it means. I 
understand what sample(population, size=100) does. What would 
sample(population, ratio=0.25) do?

(That's not a rhetorical question, I genuinely don't understand the 
semantics of this proposed ratio argument.)
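
Taken literally, the quoted snippet keeps each element independently with 
probability ratio, so the length of the result varies from call to call 
instead of being fixed at ratio * len(population). That is only one 
reading; the proposal does not say whether it is the intended one. A 
sketch of that reading, with sample_by_ratio as a hypothetical name:

```python
import random

def sample_by_ratio(population, ratio):
    """Hypothetical name: keep each element independently with
    probability ratio, as the quoted list comprehension does."""
    return [el for el in population if random.random() < ratio]

# Five draws from the same population: the sizes differ between draws
# and between runs; only their average is close to 25.
sizes = [len(sample_by_ratio(range(100), 0.25)) for _ in range(5)]
print(sizes)
```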

-- 
Steve