[Python-ideas] random.sample should work better with iterators
Steven D'Aprano
steve at pearwood.info
Tue Jun 26 21:05:26 EDT 2018
On Tue, Jun 26, 2018 at 05:36:51PM -0700, Abe Dillon wrote:
> The docs on random.sample indicate that it works with iterators:
>
> > To choose a sample from a range of integers, use a range()
> > <https://docs.python.org/3/library/stdtypes.html#range> object as an
> > argument. This is especially fast and space efficient for sampling from a
> > large population: sample(range(10000000),k=60).
That doesn't mention anything about iterators.
> However, when I try to use iterators other than range, like so:
range is not an iterator.
Thinking that it is one is a very common error, but it certainly is not.
It is a lazily-generated *sequence*, not an iterator.
The definition of an iterator is that the object must have an __iter__
method returning *itself*, and a __next__ method (the "iterator
protocol"):
py> obj = range(100)
py> hasattr(obj, '__next__')
False
py> obj.__iter__() is obj
False
However, it is a sequence:
py> import collections.abc
py> isinstance(obj, collections.abc.Sequence)
True
(Aside: I'm surprised there's no inspect.isiterator and .isiterable
functions.)
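In the absence of such functions, the abstract base classes in
collections.abc can do the same job. The following is a minimal sketch,
not part of the original message; the helper names is_iterator and
is_iterable are hypothetical and chosen only to mirror the aside above.

```python
from collections.abc import Iterable, Iterator, Sequence

def is_iterator(obj):
    """Hypothetical helper: True if obj has both __iter__ and __next__."""
    return isinstance(obj, Iterator)

def is_iterable(obj):
    """Hypothetical helper: True if obj defines __iter__.

    Note: this does not detect objects that are only iterable through
    the legacy __getitem__ protocol.
    """
    return isinstance(obj, Iterable)

r = range(100)
print(is_iterator(r))           # range is iterable, but not an iterator
print(is_iterable(r))
print(isinstance(r, Sequence))  # it is a sequence
print(is_iterator(iter(r)))     # iter() hands back a real iterator
```

Calling iter() on a range returns a separate iterator object each time,
which is why the range itself can be indexed and traversed repeatedly.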
> random.sample(itertools.product(range(height), range(width)),
> 0.5*height*width)
>
> I get:
>
> TypeError: Population must be a sequence or set. For dicts, use list(d).
>
> I don't know if Python Ideas is the right channel for this, but this seems
> overly constrained. The inability to handle dictionaries is especially
> puzzling.
Puzzling in what way?
If sample() supported dicts, should it return the keys or the values or
both? Also consider this:
https://bugs.python.org/issue33098
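To make the ambiguity concrete, here is a small sketch (not from the
original message) of the three things a caller might mean. The dict d
and the sample size 2 are made-up illustration values. Being explicit
about which view is sampled is what the error message's "use list(d)"
hint asks for.

```python
import random

# Made-up example data.
d = {"a": 1, "b": 2, "c": 3, "d": 4}

# Sample keys: list(d) materialises the keys.
keys = random.sample(list(d), 2)

# Sample values.
values = random.sample(list(d.values()), 2)

# Sample (key, value) pairs.
items = random.sample(list(d.items()), 2)

print(keys, values, items)
```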
> Randomly sampling from some population is often done because the entire
> population is impractically large which is also a motivation for using
> iterators, so it seems natural that one would be able to sample from an
> iterator. A naive implementation could use a heap queue:
>
> import heapq
> import random
>
> def stream():
>     while True: yield random.random()
>
> def sample(population, size):
>     q = [tuple()]*size
>     for el in zip(stream(), population):
>         if el > q[0]: heapq.heapreplace(q, el)
>     return [el[1] for el in q if el]
Is that an improvement over:
sample(list(itertools.islice(population, size)))
and if so, please explain.
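For comparison, a sketch of the plain approach that works today:
materialise the iterator into a list and hand that to random.sample.
This is only one reading of the alternative above, and the values of
height and width are made up. Note that the sample size must be an int,
so the 0.5*height*width from the quoted call is replaced by integer
division here.

```python
import itertools
import random

# Made-up grid dimensions for illustration.
height, width = 4, 6

# itertools.product returns an iterator, which random.sample rejects.
population = itertools.product(range(height), range(width))

# Sample half of the cells; k has to be an integer.
k = (height * width) // 2

# Materialising costs O(n) memory, but gives sample() a real sequence.
cells = random.sample(list(population), k)
print(cells)
```

The cost is that the whole population has to fit in memory, which is the
case the heap-based proposal is trying to avoid.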
> It would also be helpful to add a ratio version of the function:
>
> def sample(population, size=None, *, ratio=None):
>     assert None in (size, ratio), "can't specify both sample size and ratio"
>     if ratio:
>         return [el for el in population if random.random() < ratio]
>     ...
Helpful under what circumstances?
Don't let the source speak for itself. Explain what it means. I
understand what sample(population, size=100) does. What would
sample(population, ratio=0.25) do?
(That's not a rhetorical question, I genuinely don't understand the
semantics of this proposed ratio argument.)
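Read literally, the quoted ratio branch keeps each element independently
with probability ratio, so the size of the result varies from call to
call instead of being fixed. The sketch below isolates that branch under
a hypothetical name, sample_by_ratio, to show the behaviour. It is an
interpretation of the quoted code, not a statement of what the proposer
intended.

```python
import random

def sample_by_ratio(population, ratio):
    """Hypothetical name for the quoted 'ratio' branch.

    Each element is kept independently with probability `ratio`, so the
    length of the result is random (about ratio * n on average) and the
    original order is preserved.
    """
    return [el for el in population if random.random() < ratio]

# Five runs over the same 100-element population: lengths will differ.
sizes = [len(sample_by_ratio(range(100), 0.25)) for _ in range(5)]
print(sizes)
```

By contrast, sample(population, size=100) always returns exactly 100
elements.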
--
Steve