[docs] [issue33114] random.sample() behavior is unexpected/unclear from docs

Scott Eilerman report at bugs.python.org
Mon Mar 26 11:01:36 EDT 2018


Scott Eilerman <scott.j.eilerman at gmail.com> added the comment:

Raymond, Tim, thanks for your replies so far. I understand (and for the most part, agree with) your points about not being able to list every behavior, and not wanting to cause uncertainty in users. However, let me argue my case one more time, and if you still disagree, feel free to close this.

1. It is expected (in fact, one might argue it's the entire point) that calling random.seed() with a fixed value will produce a repeatable set of results for a traditional random number generator. A user expects that calling the following should always produce the same sequence of numbers:

import random
random.seed(22)
random.random()
random.random()
random.random()
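
The repeatability described in point 1 can be checked directly. The following is a minimal sketch; draw_three is a hypothetical helper name, not something from the random module or from this thread:

```python
import random

def draw_three(seed):
    """Seed the module-level generator, then return three random() values."""
    random.seed(seed)
    return [random.random() for _ in range(3)]

# Reseeding with the same value replays the same sequence.
first = draw_three(22)
second = draw_three(22)
print(first == second)
```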

2. Based on that behavior for one of the most typical/traditional functions in the random module, a naive user (me) might assume that random.sample() is drawing from its population in a similar manner (i.e. that the sequence of returned items, regardless of how many you ask the function to return, is uniquely determined by the seed). While this is certainly an assumption...

2a. This assumption is somewhat validated by the introductory section of the random module docs, which states "Almost all module functions depend on the basic function random()..."

2b. More importantly, a user can "validate" this assumption by doing some simple tests, e.g.:

import random
choices = range(100)
random.seed(22)
random.sample(choices, 1)
random.seed(22)
random.sample(choices, 2)
random.seed(22)
random.sample(choices, 3)
... and so on

Because of the nature of the set/list optimization, it is VERY possible that a user could do due diligence in testing like this (a few different seeds, a few different sets of "choices", testing up to k=10) and never uncover the problematic behavior. You'd pretty much have to set up some loops like I did earlier in this thread, which I don't think many users would do unless they expect to find a problem. Even then, with certain selections of "choices", you might still get the "expected" results.
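
A loop of the kind mentioned above might look like the following. This is only a sketch, not the code posted earlier in the thread; prefix_mismatches is a hypothetical helper name. It reports every k for which sample(population, k) does not begin with the result of sample(population, k - 1) under the same seed. Whether and where a mismatch shows up depends on the seed, the population size, and the implementation of sample() in the Python version being run, so no output is shown here.

```python
import random

def prefix_mismatches(population, seed, max_k):
    """Return each k in 2..max_k where sample(population, k) does not
    start with sample(population, k - 1) when both use the same seed."""
    mismatches = []
    random.seed(seed)
    previous = random.sample(population, 1)
    for k in range(2, max_k + 1):
        random.seed(seed)
        current = random.sample(population, k)
        # Compare the first k - 1 items against the shorter sample.
        if current[:k - 1] != previous:
            mismatches.append(k)
        previous = current
    return mismatches

choices = range(100)
# A shallow check, similar to testing only up to k=10 by hand.
print(prefix_mismatches(choices, seed=22, max_k=10))
# A deeper check over a wider range of k.
print(prefix_mismatches(choices, seed=22, max_k=50))
```

Running the deeper check with a few different seeds and population sizes is one way to see whether the "prefix" assumption holds for a given setup.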

2c. If you suspected a problem, or really wanted to be sure the function does what you assume it will do, obviously you can open up random.py and take a look. However, I doubt many users do this for every built-in module and function they use; clearly the point of documentation is to avoid this scenario.

3. As Raymond mentioned, this does not appear to be a "common" problem, and perhaps that is enough to not add anything to the docs. However, due to the somewhat elusive nature of the behavior, it could certainly go undetected in many cases, potentially causing problems without anyone noticing. Perhaps I chose a very unorthodox implementation to get the results I desired; I easily could have used random.shuffle() or random.sample(pop, len(pop)) and picked the nth element. However, one could imagine cases in which you have a very large population and you want to optimize by using sample() to get the nth random draw rather than randomizing the entire list, so I don't think it's an entirely unjustified approach.
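
The shuffle-based alternative mentioned in point 3 could be sketched as below. The helper name nth_draw is hypothetical, and the sketch uses a private random.Random instance rather than the module-level generator so that other code is unaffected; that choice is an assumption, not something proposed in the thread. Because the whole population is shuffled once, the nth draw for a given seed does not depend on how many draws are eventually requested, at the cost of copying and randomizing the entire population.

```python
import random

def nth_draw(population, seed, n):
    """Return the nth item (0-based) of a seeded shuffle of population."""
    rng = random.Random(seed)   # private generator, seeded explicitly
    order = list(population)    # copy so the caller's data is untouched
    rng.shuffle(order)
    return order[n]

choices = range(100)
# The same seed and index always give the same item.
print(nth_draw(choices, seed=22, n=5) == nth_draw(choices, seed=22, n=5))
```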

4. Given the above points, I'd argue that a one-line insertion into the docs would help users steer clear of a hard-to-anticipate, potentially costly pitfall. My suggested language is a more direct identification of the possible consequences, though I agree that it is perhaps too worry-inducing without specifying the "cause" of the problem. Raymond's algorithmic note may be a better choice and would have been enough of an indicator for me to avoid the mistake I made.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue33114>
_______________________________________

