Default behavior for random.sample when no k
When writing some code now, I needed to produce a shuffled version of `range(10, 10 ** 5)`. This is one way to do it:

    shuffled_numbers = list(range(10, 10 ** 5))
    random.shuffle(shuffled_numbers)

I don't like it because (1) it's too imperative and (2) I'm calling the list "shuffled" even before it's shuffled.

Another solution is this:

    shuffled_numbers = random.sample(range(10, 10 ** 5), k=len(range(10, 10 ** 5)))

This is better because it solves the 2 points above. However, it is quite cumbersome.

I notice that the `random.sample` function doesn't have a default behavior set when you don't specify `k`. This is fortunate, because we could make that behavior just automatically take the length of the first argument. So we could do this:

    shuffled_numbers = random.sample(range(10, 10 ** 5))

What do you think?

Thanks,
Ram.
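For reference, the two existing spellings from the message above can be checked side by side. This is only a minimal sketch: it uses a smaller range than the one in the message to keep it quick, and the names `numbers`, `via_shuffle` and `via_sample` are illustrative, not from the original post.

```python
import random

numbers = range(10, 10 ** 3)

# Approach 1: copy into a list, then shuffle the copy in place.
via_shuffle = list(numbers)
random.shuffle(via_shuffle)

# Approach 2: draw a sample as large as the population,
# which returns a new list in random order.
via_sample = random.sample(numbers, k=len(numbers))

# Both results hold exactly the original items, in some random order.
print(sorted(via_shuffle) == list(numbers))
print(sorted(via_sample) == list(numbers))
```

Both spellings leave the source range untouched and return a plain list, so the difference between them is mostly one of readability.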
I agree that calling random.shuffle imperatively is annoying. But I don't think your proposed solution is readable. You're not taking a sample. A sample generally implies a strict subset, usually quite a small one.

I've often thought there should just be a `random.shuffled()` function which returns a shuffled copy, similar to `.sort()` and `sorted()` or `.reverse()` and `reversed()`.

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-leave@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/OHLXVK...
Code of Conduct: http://python.org/psf/codeofconduct/
I would also prefer a `random.shuffled` function. The reason I didn't propose it is because there's usually more resistance for adding new functions. But in my view that'll be the best solution. On Sat, Aug 1, 2020 at 9:17 PM Alex Hall <alex.mojaki@gmail.com> wrote:
I agree that calling random.shuffle imperatively is annoying. But I don't think your proposed solution is readable. You're not taking a sample. A sample generally implies a strict subset, usually quite a small one.
I've often thought there should just be a `random.shuffled()` function which returns a shuffled copy, similar to `.sort()` and `sorted()` or `.reverse()` and `reversed()`.
Can you not just use https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.rand... ? On Saturday, August 1, 2020 at 2:26:23 PM UTC-4 Ram Rachum wrote:
I would also prefer a `random.shuffled` function. The reason I didn't propose it is because there's usually more resistance for adding new functions. But in my view that'll be the best solution.
On Sat, Aug 01, 2020 at 08:54:16PM +0300, Ram Rachum wrote:
When writing some code now, I needed to produce a shuffled version of `range(10, 10 ** 5)`.
This is one way to do it:
shuffled_numbers = list(range(10, 10 ** 5)) random.shuffle(shuffled_numbers)
I don't like it because (1) it's too imperative and (2) I'm calling the list "shuffled" even before it's shuffled.
This is easily solved with a three-line helper:

    def shuffled(iterable):
        L = list(iterable)
        random.shuffle(L)
        return L

I have implemented this probably a half a dozen times, and I expect others have too. I agree with Alex that this would make a nice addition to the random module.

-- Steven
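To see how such a helper behaves in practice, here is a small usage sketch. The helper is the one from the message above with a docstring added; the `deck` data and the generator input are made up for illustration.

```python
import random


def shuffled(iterable):
    """Return a new list with the items of iterable in random order."""
    items = list(iterable)   # copy, so the input is left untouched
    random.shuffle(items)    # shuffle the copy in place
    return items


# An immutable tuple works, because the helper copies before shuffling.
deck = ("ace", "king", "queen", "jack")
hand = shuffled(deck)
print(hand)

# A generator works as well, since list() consumes any iterable.
print(shuffled(x * x for x in range(5)))
```

Because the input is copied first, the helper accepts anything iterable and never mutates the caller's data.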
I submitted a patch now, but Serhiy showed me that it's already been proposed before, and rejected by Raymond Hettinger and Terry Reedy in issues 26393 and 27964.
Steven D'Aprano wrote:
This is easily solved with a three-line helper: def shuffled(iterable): ... I have implemented this probably a half a dozen times, and I expect others have too.
FWIW, we've already documented a clean way to do it, https://docs.python.org/3/library/random.html#random.shuffle , "To shuffle an immutable sequence and return a new shuffled list, use sample(x, k=len(x)) instead."
    >>> data = 'random module'
    >>> ''.join(sample(data, len(data)))
    'uaemdor odmln'

Given that we already have shuffle() and sample(), I really don't think we need a third way to do it. How about we save API extensions for ideas that add genuinely new, useful capabilities?

Raymond
On Sun, Aug 2, 2020 at 8:05 PM <raymond.hettinger@gmail.com> wrote:
FWIW, we've already documented a clean way to do it, https://docs.python.org/3/library/random.html#random.shuffle , "To shuffle an immutable sequence and return a new shuffled list, use sample(x, k=len(x)) instead."
One downside of this is that it won't work on a non-sized iterable, but I suppose that's not really an important use case.

It is a use case, though, because while a shuffled collection is going to be sized by definition, the source could be a generator or some other non-sized iterable. But it's not hard to "realize" the iterable first by making it a list or tuple.

My other question was about performance. Without looking at the code, I thought it *might* be faster to shuffle than build up a list with multiple samples, but in profiling, the sample version is only about 30% slower (for this one example :-) ).

    In [13]: def shuffled_1(it):
        ...:     result = list(it)
        ...:     random.shuffle(result)
        ...:     return result

    In [14]: def shuffled_2(it):
        ...:     return random.sample(it, k=len(it))

    In [15]: %timeit shuffled_1(population)
    3.71 ms ± 30.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [16]: %timeit shuffled_2(population)
    4.23 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

where:

    In [17]: population
    Out[17]: range(0, 10000)

So yeah, this is a fine solution.

-CHB

--
Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
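The comparison above can be reproduced outside IPython with the standard `timeit` module. The following is a rough sketch rather than a careful benchmark: the repetition count of 20 is an arbitrary choice, and the absolute numbers and the ratio will vary with the machine and the Python version. It also shows the non-sized-iterable limitation mentioned at the start of the message.

```python
import random
import timeit


def shuffled_1(it):
    # Copy into a list, then shuffle in place; accepts any iterable.
    result = list(it)
    random.shuffle(result)
    return result


def shuffled_2(it):
    # Needs len(), so it only works for sized sequences.
    return random.sample(it, k=len(it))


population = range(0, 10000)

# Time each version over the same number of calls.
t1 = timeit.timeit(lambda: shuffled_1(population), number=20)
t2 = timeit.timeit(lambda: shuffled_2(population), number=20)
print(f"shuffle: {t1:.4f}s  sample: {t2:.4f}s  ratio: {t2 / t1:.2f}")

# A generator has no len(), so shuffled_2 rejects it with a TypeError.
try:
    shuffled_2(n for n in range(5))
except TypeError as exc:
    print("generator rejected:", exc)

# Realizing the generator into a list first makes it work.
print(shuffled_2(list(n for n in range(5))))
```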
On Mon, Aug 03, 2020 at 03:04:40AM -0000, raymond.hettinger@gmail.com wrote:
Steven D'Aprano wrote:
This is easily solved with a three-line helper: def shuffled(iterable): ... I have implemented this probably a half a dozen times, and I expect others have too.
FWIW, we've already documented a clean way to do it, https://docs.python.org/3/library/random.html#random.shuffle , "To shuffle an immutable sequence and return a new shuffled list, use sample(x, k=len(x)) instead."
Yes, I remember the last time I played poker with some friends, and the dealer handed me the deck of cards and asked me to take a sample of 52 cards *wink*

While you are technically correct that a sample of N from a sequence of length N is equivalent to shuffling, that's not a particularly obvious thing to do, and the semantics of shuffling and sampling are not the same. Hence the need to document it.

According to my testing in Python 3.8, the version with sample is about 10% slower than the "shuffled" helper I gave. That wouldn't be too bad if the operation was fast, but for a sequence of 30,000 items on my computer, that takes nearly half a second. So a 10% slowdown is quite significant.

I think I'll continue using my shuffled helper function, and while I personally won't re-raise this issue, I'll continue to give it my support next time somebody raises it.

-- Steven
On Mon, Aug 3, 2020 at 5:50 PM Steven D'Aprano <steve@pearwood.info> wrote:
According to my testing in Python 3.8, the version with sample is about 10% slower than the "shuffled" helper I gave.
I got similar results, but my conclusion was that 10% isn’t significant :-) Is it likely to be run in an inner loop?

- CHB
On Tue, Aug 4, 2020 at 10:54 AM Steven D'Aprano <steve@pearwood.info> wrote:
Yes, I remember the last time I played poker with some friends, and the dealer handed me the deck of cards and asked me to take a sample of 52 cards *wink*
Most dealers want you to shuffle the deck *in place*. Although I'd be highly amused to watch a group of computer scientists playing poker, and starting out with a Fisher-Yates...

For the case of "create a new list that is a random permutation of these items", I don't personally see a problem with (1) create a new list, and then (2) shuffle that new list. If the naming bothers you, don't call it shuffled_numbers at all - call it something based on its purpose later on!

ChrisA
Ram Rachum wrote:
I notice that the random.sample function doesn't have a default behavior set when you don't specify k. This is fortunate, because we could make that behavior just automatically take the length of the first argument. So we could do this:

    shuffled_numbers = random.sample(range(10, 10 ** 5))

What do you think?

This is bad API design. The most likely user mistake is to omit the *k* argument. We want that to be an error. It is common to sample from large populations; we don't want the default to do anything terrible. For example, you're in a Jupyter notebook and type "sample(range(10_000_000))" and forget to enter the sample size.

Also, having *k* default to the population size would be surprisingly inconsistent given that choices() has a default k=1. API design principle: don't have unexpectedly different defaults in related functions.

Lastly, in-line shuffling is not the primary use case. If there were a default argument, it should cater to the principal use case. API design principle: don't do anything weird or unexpected by default.

IMO you're trying too hard to jam a round peg into a square hole. There isn't a substantive problem being solved: being explicit by writing "sample(p, len(p))" instead of "sample(p)" isn't an undue burden.

Please also consider that we thought about all of this when sample() was first created. The current API is intentional. As you noted, this suggestion was also already rejected on the bug tracker. So, this thread seems like an attempt to second guess that outcome as well as the original design decision. If you're going to do something like that, save it for something important :-)

Raymond
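The current behavior described above is easy to check. A short sketch, assuming only the standard `random` module; the `population` name is illustrative:

```python
import random

population = range(10_000_000)

# Omitting k is an error today, which catches the
# "forgot the sample size" mistake right away.
try:
    random.sample(population)
except TypeError as exc:
    print("sample() without k:", exc)

# Being explicit about the sample size works, even for a huge range.
print(random.sample(population, 5))

# choices() defaults to k=1, so a population-sized default
# for sample() would differ from its sibling function.
print(random.choices(population))
```

With a population-sized default, the first call would instead try to build a ten-million-item list, which is the notebook accident the message warns about.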
On Mon, Aug 3, 2020 at 3:20 PM <raymond.hettinger@gmail.com> wrote:
This is bad API design. The most likely user mistake is to omit the *k* argument. We want that to be an error. It is common to sample from large populations, we don't want the default to do anything terrible — for example, you're in a Jupyter notebook and type "sample(range(10_000_000))" and forget to enter the sample size.
Also, having *k* default to the population size would be surprisingly inconsistent given that choices() has a default k=1. API design principle: don't have unexpectedly different defaults in related functions.
Hmm, yes, I agree with both these points. I do think that `sample(x, k=len(x))` is cumbersome when `x` is not a variable but defined inline. But I guess I'll let this one go.
On Mon, Aug 3, 2020 at 8:26 AM Ram Rachum <ram@rachum.com> wrote:
Hmm, yes, I agree with both these points.
I do think that `sample(x, k=len(x))` is cumbersome when `x` is not a variable but defined inline. But I guess I'll let this one go.
I've found it cumbersome in the past myself, but an easy way around that now is the walrus: `sample(_:=[1,2,3], len(_))`

---
Ricky.

"I've never met a Kentucky man who wasn't either thinking about going home or actually going home." - Happy Chandler
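Spelled out as a runnable snippet, the walrus trick looks roughly like this. It needs Python 3.8 or later; the names `nums` and `r` are placeholders chosen here for readability instead of `_`.

```python
import random

# Bind the inline list to a temporary name so len() can reuse it.
shuffled_numbers = random.sample(nums := [1, 2, 3], len(nums))
print(shuffled_numbers)  # some permutation of [1, 2, 3]

# The same trick with the range from the original post,
# parenthesized to make the assignment expression stand out.
shuffled_range = random.sample((r := range(10, 10 ** 5)), k=len(r))
print(len(shuffled_range))
```

One thing to keep in mind is that the temporary name stays bound in the enclosing scope after the call, which is part of why this reads as a workaround.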
On Mon, Aug 03, 2020 at 08:50:32AM -0000, raymond.hettinger@gmail.com wrote:
Please also consider that we thought about all of this when sample() was first created. The current API is intentional. As you noted, this suggestion was also already rejected on the bug tracker. So, this thread seems like an attempt to second guess that outcome as well as the original design decision. If you're going to do something like that, save it for something important :-)
The difficulty is judging when something is important or not, and that's part of the purpose of posting here :-) -- Steven
participants (8)
- Alex Hall
- Chris Angelico
- Christopher Barker
- Neil Girdhar
- Ram Rachum
- raymond.hettinger@gmail.com
- Ricky Teachey
- Steven D'Aprano