surprisingly poor performance of random.sample(set)
![](https://secure.gravatar.com/avatar/49e7036e064eae04ef31f1d7a79068b8.jpg?s=120&d=mm&r=g)
Hi, I noticed recently that running random.sample() on a set is measurably slower than doing so on a list or tuple with the same contents. I suspect that this is due to the "implementation detail" of converting the population argument to a tuple internally; that may not matter much for a single call, but when doing i.e. permutation testing, it is not unreasonable to call random.sample() on the same population 1000s of times in a tight loop. In that scenario, letting it cast the set to a tuple inside the loop makes it ~50x slower than casting once outside the loop and passing a tuple to random.sample() instead of a set. Assuming there is no easy way for random.sample() to work efficiently with a set, would it at least be possible to mention this behavior in the documentation? Right now there is no indication that the data structure used for the population makes any difference, which may lead people to just assume that random.sample() is slow, not realizing they could get a 50x boost with a single cast outside their loop. Regards, Alex Frase
![](https://secure.gravatar.com/avatar/3c46e59585526b90c088845402bc8407.jpg?s=120&d=mm&r=g)
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 08/12/2014 10:46 PM, Alex Frase wrote:
Hi,
I noticed recently that running random.sample() on a set is measurably slower than doing so on a list or tuple with the same contents. I suspect that this is due to the "implementation detail" of converting the population argument to a tuple internally; that may not matter much for a single call, but when doing i.e. permutation testing, it is not unreasonable to call random.sample() on the same population 1000s of times in a tight loop. In that scenario, letting it cast the set to a tuple inside the loop makes it ~50x slower than casting once outside the loop and passing a tuple to random.sample() instead of a set.
Assuming there is no easy way for random.sample() to work efficiently with a set, would it at least be possible to mention this behavior in the documentation? Right now there is no indication that the data structure used for the population makes any difference, which may lead people to just assume that random.sample() is slow, not realizing they could get a 50x boost with a single cast outside their loop.
Hi Alex, please open an issue on bugs.python.org to discuss this issue with the maintainers. cheers, Georg -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iEYEARECAAYFAlPq/LIACgkQN9GcIYhpnLBoZgCgpyUeE2j5dTR4r4yvSwiZQsNQ bDUAnR0zMqC1Op/YlP4TotvGfQMCzCqY =3CY/ -----END PGP SIGNATURE-----
participants (2)
-
Alex Frase
-
Georg Brandl