[Python-ideas] random.sample should work better with iterators
Stephen J. Turnbull
turnbull.stephen.fw at u.tsukuba.ac.jp
Tue Jun 26 23:07:58 EDT 2018
Steven D'Aprano writes:
> > I don't know if Python Ideas is the right channel for this, but this seems
> > overly constrained. The inability to handle dictionaries is especially
> > puzzling.
>
> Puzzling in what way?
Same misconception, I suppose.
> If sample() supported dicts, should it return the keys or the values or
> both?
I argue below that *if* we were going to make the change, it should be
to consistently try list() on non-sequences. But "not every
one-liner" and EIBTI:
>>> d = {'a': 1, 'b': 2}
>>> sample(d.keys(),1)
['a']
>>> sample(d.items(),1)
[('a', 1)]
But this is weird:
>>> sample(d.values(),1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/random.py", line 314, in sample
    raise TypeError("Population must be a sequence or set. For dicts, use list(d).")
TypeError: Population must be a sequence or set. For dicts, use list(d).
Oh, I see. Key views are "set-like", item views *may* be set-like,
but value views are *not* set-like.
Since views are all listable, why not try "list" on them? In general,
I would think it makes sense to define this as "Population must be a
sequence or convertible to a sequence using list()." And for most of
the applications I can think of in my own use, sample(list(d)) is not
particularly useful because it's a sample of keys. I usually want
sample(list(d.values())).
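
A minimal sketch of the "try list() on non-sequences" idea, assuming a
hypothetical wrapper named sample_listable (it is not part of the
random module, and the copy costs O(n) time and memory):

```python
import random
from collections.abc import Sequence


def sample_listable(population, k):
    """Return k items sampled without replacement.

    Hypothetical helper: anything that is not already a sequence
    (dict views, sets, iterators) is first copied into a list.
    """
    if not isinstance(population, Sequence):
        population = list(population)
    return random.sample(population, k)


d = {'a': 1, 'b': 2}
print(sample_listable(d.values(), 1))  # a one-element list holding a value
print(sample_listable(d, 1))           # a dict iterates over its keys
```

Note that passing the dict itself still yields keys, so spelling out
d.values() or d.items() at the call site remains the explicit choice.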
The ramifications are unclear to me, but I guess it's too late to
change this because of the efficiency implications Tim describes in
issue33098 (so EIBTI; thanks for the reference!). On the other hand,
that issue says sets can't be sampled efficiently, so the current
behavior seems to *promote* inefficient usage?
I would definitely change the error message. I think "Use list(d)" is
bad advice because I believe it's not even "almost always" what you'll
want, and if keys and values are of the same type, it won't be obvious
from the output that you're *not* getting a sample from d.values() if
that's what you wanted and thought you were getting.
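
A contrived illustration of that trap (the dict contents here are made
up): when keys and values are both small ints, nothing in the result
tells you which of the two was sampled.

```python
import random

# Keys and values are drawn from the same set of ints.
d = {1: 3, 2: 1, 3: 2}

keys_sample = random.sample(list(d), 2)             # samples the keys
values_sample = random.sample(list(d.values()), 2)  # samples the values

# Both results are two-element lists of ints from {1, 2, 3}.
print(keys_sample, values_sample)
```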
> Don't let the source speak for itself. Explain what it means. I
> understand what sample(population, size=100) does. What would
> sample(population, ratio=0.25) do?
I assume sample(pop, ratio=0.25) == sample(pop, size=0.25*len(pop)).
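
Spelled out as a sketch, assuming a hypothetical helper named
sample_ratio (not an existing API). The product ratio*len(pop) is a
float, and as far as I know sample() needs an integer count, so the
truncation to int below is my own assumption about the intended
rounding:

```python
import random


def sample_ratio(population, ratio):
    """Sample roughly ratio * len(population) items without replacement.

    Hypothetical helper; the size is truncated toward zero, which is
    an assumption rather than anything specified in the thread.
    """
    if not 0 <= ratio <= 1:
        raise ValueError("ratio must be between 0 and 1")
    k = int(ratio * len(population))
    return random.sample(population, k)


pop = list(range(100))
print(len(sample_ratio(pop, 0.25)))
```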