"Greg" == Greg Ewing greg.ewing@canterbury.ac.nz writes:
Greg> *All* shuffling algorithms are limited by that.
Greg> Think about it: A shuffling algorithm is a function from a random Greg> number to a permutation. There's no way you can get more permutations Greg> out than there are random numbers to put in.
Hi Greg
Maybe we should put a note to that effect in random.shuffle.__doc__ :-)
http://mail.python.org/pipermail/python-dev/2006-June/065815.html
Regards, Terry
Greg> Think about it: A shuffling algorithm is a function from a random Greg> number to a permutation. There's no way you can get more permutations Greg> out than there are random numbers to put in.
If our random number generator can produce more possible shuffles than there are atoms in the universe, I say you don't worry about it.
Raymond
On Wed, 25 Mar 2009 12:59:27 pm Raymond Hettinger wrote:
Greg> Think about it: A shuffling algorithm is a function from a random Greg> number to a permutation. There's no way you can get more permutations Greg> out than there are random numbers to put in.
If our random number generator can produce more possible shuffles than there are atoms in the universe, I say you don't worry about it.
No, I'm afraid that is a fallacy, because what is important is the number of permutations in the list, and that grows as the factorial of the number of items.
The Mersenne Twister has a period of 2**19937-1, which sounds huge, but it takes a list of only 2081 items for the number of permutations to exceed that.
To spell it out in tedious detail: that means that random.shuffle() can produce every permutation of a list of 2080 items, but for 2081 items approximately 98% of the possibilities can't be reached. For 2082 items, approx 99.999% will never be reached. And so on.
Don't get me wrong, random.shuffle() is perfectly adequate for any use-case I can think of. But beyond 2080 items in the list, it becomes greatly biased, and I think that's important to note in the docs. Those who need to know about it will be told, and those who don't care can continue to not care.
On Wed, Mar 25, 2009 at 13:28, Steven D'Aprano steve@pearwood.info wrote:
Don't get me wrong, random.shuffle() is perfectly adequate for any use-case I can think of. But beyond 2080 items in the list, it becomes greatly biased, and I think that's important to note in the docs. Those who need to know about it will be told, and those who don't care can continue to not care.
Why anyone would care? Orderings possible to obtain from a given good random number generator are quite uniformly distributed among all orderings. I bet you can't even predict any particular ordering which is impossible to obtain. There is no time to generate all orderings. The factorial of large numbers is just huge.
On Thu, 26 Mar 2009 08:41:27 am Marcin 'Qrczak' Kowalczyk wrote:
On Wed, Mar 25, 2009 at 13:28, Steven D'Aprano steve@pearwood.info
wrote:
Don't get me wrong, random.shuffle() is perfectly adequate for any use-case I can think of. But beyond 2080 items in the list, it becomes greatly biased, and I think that's important to note in the docs. Those who need to know about it will be told, and those who don't care can continue to not care.
Why anyone would care? Orderings possible to obtain from a given good random number generator are quite uniformly distributed among all orderings.
Yes, that holds true for n <= 2080, since Fisher-Yates is an unbiased shuffler. But I don't think it remains true for n > 2080 since the vast majority of possible permutations have probability zero. I'm not saying that this absolutely *will* introduce statistical bias into the shuffled lists, but it could, and those who care about that risk shouldn't have to read the source code to learn this.
I bet you can't even predict any particular ordering which is impossible to obtain.
A moral dilemma... should I take advantage of your innumeracy by taking you up on that bet, or should I explain why that bet is a sure thing for me? *wink*
Since the chances of me collecting on the bet is essentially near zero, I'll explain.
For a list with 2082 items, shuffle() chooses from a subset of approximately 0.001% of all possible permutations. This means that if I give you a list of 2082 items and tell you to shuffle it, and then guess that such-and-such a permutation of it will never be reached, I can only lose if by chance I guessed on the 1 in 100,000 permutations that shuffle() can reach. I have 99,999 chances to win versus 1 to lose: that's essentially a sure thing.
In practical terms, beyond (say) 2085 or so, it would be a bona fide miracle if I didn't win such a bet.
On Wed, Mar 25, 2009 at 5:58 PM, Steven D'Aprano steve@pearwood.info wrote:
On Thu, 26 Mar 2009 08:41:27 am Marcin 'Qrczak' Kowalczyk wrote:
Why anyone would care? Orderings possible to obtain from a given good random number generator are quite uniformly distributed among all orderings.
Yes, that holds true for n <= 2080, since Fisher-Yates is an unbiased shuffler. But I don't think it remains true for n > 2080 since the vast majority of possible permutations have probability zero. I'm not saying that this absolutely *will* introduce statistical bias into the shuffled lists, but it could, and those who care about that risk shouldn't have to read the source code to learn this.
If random.shuffle() is broken for lists more than 2080 then it should raise an error. Claiming it "might" be broken in the docs for moderately sized lists, without researching such a claim, is pointless fear mongering.
I bet you can't even predict any particular ordering which is impossible to obtain.
A moral dilemma... should I take advantage of your innumeracy by taking you up on that bet, or should I explain why that bet is a sure thing for me? *wink*
Since the chances of me collecting on the bet is essentially near zero, I'll explain.
For a list with 2082 items, shuffle() chooses from a subset of approximately 0.001% of all possible permutations. This means that if I give you a list of 2082 items and tell you to shuffle it, and then guess that such-and-such a permutation of it will never be reached, I can only lose if by chance I guessed on the 1 in 100,000 permutations that shuffle() can reach. I have 99,999 chances to win versus 1 to lose: that's essentially a sure thing.
In practical terms, beyond (say) 2085 or so, it would be a bona fide miracle if I didn't win such a bet.
Go ahead, pick a combination, then iterate through all 2**19937-1 permutations to prove you're correct. Don't worry, we can wait.
Of course a stronger analysis technique can prove it much quicker than brute force, but it's not a cryptographically secure PRNG, there's LOTS of information that can be found through such techniques.
So far the 2080 limit is random trivia, nothing more. It has no real significance, imposes no new threats, and does not change how correct code is written.
Adam Olsen writes:
If random.shuffle() is broken for lists more than 2080 then it should raise an error.
Not really. Assuming the initial state is drawn from a uniform distribution on all possible states, if all 2080-shuffles are equiprobable, then any statistic you care to calculate based on that will come out the same as if you had 2080 statistically independent draws without replacement. Another way to put it is "if you need a random shuffle, this one is good enough for *any* such purpose".
However, once you exceed that limit you have to ask whether it's good enough for the purpose at hand. For some purposes, the distribution of (2**19937-1)-shuffles might be good enough, even though they make up only 1/(2**19937-2) of the possible shuffles. (Yeah, I know, you can wait....)
Claiming it "might" be broken in the docs for moderately sized lists, without researching such a claim, is pointless fear mongering.
How about if it's phrased the way I did above? Ie, this is good enough for any N-shuffle for *any purpose whatsoever*, for N < 2081.
On Thu, Mar 26, 2009 at 4:31 AM, Stephen J. Turnbull stephen@xemacs.org wrote:
Adam Olsen writes:
> If random.shuffle() is broken for lists more than 2080 then it should > raise an error.
Not really. Assuming the initial state is drawn from a uniform distribution on all possible states, if all 2080-shuffles are equiprobable, then any statistic you care to calculate based on that will come out the same as if you had 2080 statistically independent draws without replacement. Another way to put it is "if you need a random shuffle, this one is good enough for *any* such purpose".
However, once you exceed that limit you have to ask whether it's good enough for the purpose at hand. For some purposes, the distribution of (2**19937-1)-shuffles might be good enough, even though they make up only 1/(2**19937-2) of the possible shuffles. (Yeah, I know, you can wait....)
Is it or is it not broken? That's all I want to know. "maybe" isn't good enough. "Not broken for small lists" implies it IS broken for large lists.
Disabling it (raising an exception for large lists) is of course just a stopgap measure. Better would be a PRNG with a much larger period.. but of course that'd require more CPU time and more seed.
On Thu, Mar 26, 2009 at 11:43:35AM -0600, Adam Olsen wrote:
Is it or is it not broken? That's all I want to know. "maybe" isn't good enough. "Not broken for small lists" implies it IS broken for large lists.
Disabling it (raising an exception for large lists) is of course just a stopgap measure. Better would be a PRNG with a much larger period.. but of course that'd require more CPU time and more seed.
It's only broken in a theoretical sense. It's fun to think about, but I wouldn't lose any sleep over it.
Is it or is it not broken? That's all I want to know. "maybe" isn't good enough. "Not broken for small lists" implies it IS broken for large lists.
Disabling it (raising an exception for large lists) is of course just a stopgap measure. Better would be a PRNG with a much larger period.. but of course that'd require more CPU time and more seed.
It's only broken in a theoretical sense. It's fun to think about, but I wouldn't lose any sleep over it.
It's not even broken in a theoretical sense. It does exactly what it says it does.
Besides, this whole conversation is somewhat senseless. You can't get any more randomness out of a generator than you put into the seed in the first place. If you're not putting thousands of digits in your seed, then no PRNG is going to give you an equal chance of producing every possible shuffle for a large list.
Raymond
"Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin." -- John von Neumann
On Thu, Mar 26, 2009 at 12:36 PM, Raymond Hettinger python@rcn.com wrote:
It's not even broken in a theoretical sense. It does exactly what it says it does.
Besides, this whole conversation is somewhat senseless. You can't get any more randomness out of a generator than you put into the seed in the first place. If you're not putting thousands of digits in your seed, then no PRNG is going to give you an equal chance of producing every possible shuffle for a large list.
Indeed, a million item list requires over 2 megabytes of entropy.
On Thu, Mar 26, 2009 at 11:36:28AM -0700, Raymond Hettinger wrote:
It's only broken in a theoretical sense. It's fun to think about, but I wouldn't lose any sleep over it.
It's not even broken in a theoretical sense. It does exactly what it says it does.
Besides, this whole conversation is somewhat senseless. You can't get any more randomness out of a generator than you put into the seed in the first place. If you're not putting thousands of digits in your seed, then no PRNG is going to give you an equal chance of producing every possible shuffle for a large list.
I agree with you--I just didn't want to make too strong of a statement. I certainly believe that any comment in the docs about this issue would be distracting and unhelpful.
[Adam Olsen]
Is it or is it not broken? That's all I want to know.
Then you first need to define what "broken" means to you. Anything short of a source of /true/ random numbers is "broken" for /some/ purposes.
Python's current generator is not broken for any purposes I care about, so my answer to your question is "no" -- but only if I ask your question of myself ;-)
Better would be a PRNG with a much larger period..
Not really. A larger period is necessary but not sufficient /if/ you're concerned about generating all permutations of bigger lists with equal probability -- see the old thread someone else pointed to for more info on that. The Mersenne Twister's provably superb "high-dimensional equidistribution" properties are far more important than its long period in this respect (the former is sufficient; the latter is merely necessary).
On Thu, Mar 26, 2009 at 11:43:35AM -0600, Adam Olsen wrote:
Is it or is it not broken? That's all I want to know. "maybe" isn't good enough. "Not broken for small lists" implies it IS broken for large lists.
Practicality beats purity, IMHO. If shuffle cannot process a list I cannot even fit into virtual memory - I don't care, really.
Oleg.
Oleg Broytmann wrote:
On Thu, Mar 26, 2009 at 11:43:35AM -0600, Adam Olsen wrote:
Is it or is it not broken? That's all I want to know. "maybe" isn't good enough. "Not broken for small lists" implies it IS broken for large lists.
Practicality beats purity, IMHO. If shuffle cannot process a list I cannot even fit into virtual memory - I don't care, really.
Oleg.
A list of 2081 items certainly fits into the memory of my machine. There is a very clear sense in which it is broken for lists > 2080 items. It *may* be broken in other ways and for certain use cases for shorter lists, I don't know enough about the properties of the the PRNG to say anything about that. I *think* someone mentioned that it was equidistributed in 623 dimensions and that this should mean it is as good as possible for any PRNG for any list up to 623 items. If this is true, it would be nice to have a note about it in the docs.
Documented limitations are always better than undocumented ones. (Explicit is better than implicit...)
- Jacob
Adam Olsen writes:
Is it or is it not broken?
What is so hard to understand about "depending on the statistical properties you demand, it may be broken and then again it may not?"
That's all I want to know. "maybe" isn't good enough.
"If you have to ask, you can't afford it."
Ie, you've defined your own answer: it's broken *for you*. The rest of us would like to be allowed to judge for ourselves, though.
"Not broken for small lists" implies it IS broken for large lists.
You're being contentious. It logically implies no such thing, nor is it idiomatically an implication among consenting adults. And in any case, the phrasing I recommended is "guaranteed to have uniform distribution of shuffles up to N". The implication of "no guarantee" is "have a mechanic inspect it before you buy", not "this is a lemon".
On Thu, Mar 26, 2009 at 9:17 PM, Stephen J. Turnbull stephen@xemacs.org wrote:
Adam Olsen writes: > "Not broken for small lists" implies it IS broken for large lists.
You're being contentious. It logically implies no such thing, nor is it idiomatically an implication among consenting adults. And in any case, the phrasing I recommended is "guaranteed to have uniform distribution of shuffles up to N". The implication of "no guarantee" is "have a mechanic inspect it before you buy", not "this is a lemon".
We'll have to agree to disagree there.
The irony is that we only seed with 128 bits, so rather than 2**19937 combinations, there's just 2**128. That drops our "safe" list size down to 34. Weee!
On Fri, Mar 27, 2009, Adam Olsen wrote:
The irony is that we only seed with 128 bits, so rather than 2**19937 combinations, there's just 2**128. That drops our "safe" list size down to 34. Weee!
That's probably worth a bug report or RFE if one doesn't already exist.
On Sat, Mar 28, 2009 at 9:40 AM, Aahz aahz@pythoncraft.com wrote:
On Fri, Mar 27, 2009, Adam Olsen wrote:
The irony is that we only seed with 128 bits, so rather than 2**19937 combinations, there's just 2**128. That drops our "safe" list size down to 34. Weee!
That's probably worth a bug report or RFE if one doesn't already exist.
It seems sufficient to me. We don't want to needlessly drain the system's entropy pool.
How about a counter proposal? We add an orange or red box in the random docs that explain a few things together:
* What a cryptographically secure RNG is, that ours isn't it, and that ours is unacceptable any time money or security is involved. * Specifically, 624 "iterates" allows you to predict the full state, and thus all future (and past?) output * The limitations of our default seed, and how it isn't a practical problem, overshadowed by the above two things * The limitations on shuffling a large list, how equidistance means it's not a practical problem, and is overshadowed by all of the above
Some of that already exists, but is inline. IMO, security issues deserve a few flashing lights. The context of other problems also gives the proper light to shuffling's limitations.
Just out of curiosity, would doing
l = range(2082) random.shuffle(l) random.shuffle(l)
give me (with a high probability) one of those permutations that is unreachable with a single shuffle? If so, I'd presume you could get any shuffle (in case you really cared) by calling random.shuffle repeatedly and reseeding the prng in between.
On Thu, Mar 26, 2009 at 2:19 PM, Jan Kanis jan.kanis@phil.uu.nl wrote:
Just out of curiosity, would doing
l = range(2082) random.shuffle(l) random.shuffle(l)
give me (with a high probability) one of those permutations that is unreachable with a single shuffle? If so, I'd presume you could get any shuffle (in case you really cared) by calling random.shuffle repeatedly and reseeding the prng in between.
If you reseed, yes. That injects new entropy into the system. As I said though, you can end up needing megabytes of entropy.
Jan Kanis wrote:
I'd presume you could get any shuffle (in case you really cared) by calling random.shuffle repeatedly and reseeding the prng in between.
But how are you going to reseed the prng? To get an equal likelihood of any shuffle, you need another prng with a big enough state to generate seeds for your first prng. But then you might just as well shuffle based on that larger prng in the first place.
On Thu, Mar 26, 2009 at 11:19 PM, Jan Kanis jan.kanis@phil.uu.nl wrote:
Just out of curiosity, would doing
l = range(2082) random.shuffle(l) random.shuffle(l)
give me (with a high probability) one of those permutations that is unreachable with a single shuffle? If so, I'd presume you could get any shuffle (in case you really cared) by calling random.shuffle repeatedly and reseeding the prng in between.
I'm a bit rusty on the math, but that doesn't have to be the case. If all the permutations produced by random.shuffle() form a subgroup, or lie in a subgroup, then what you'll get is just another permutation from that subgroup, regardless of the randomness you put inside.
On 27 Mar 2009, at 06:59, Imri Goldberg wrote:
On Thu, Mar 26, 2009 at 11:19 PM, Jan Kanis jan.kanis@phil.uu.nl wrote: Just out of curiosity, would doing
l = range(2082) random.shuffle(l) random.shuffle(l)
give me (with a high probability) one of those permutations that is unreachable with a single shuffle? If so, I'd presume you could get any shuffle (in case you really cared) by calling random.shuffle repeatedly and reseeding the prng in between.
I'm a bit rusty on the math, but that doesn't have to be the case. If all the permutations produced by random.shuffle() form a subgroup, or lie in a subgroup, then what you'll get is just another permutation from that subgroup, regardless of the randomness you put inside.
There is no reason that the set of shuffled permutations(S_n) will form a subgroup of the set of permutations (P_n) and it may well generate the whole of P_n. In fact you only need n transpositions to generate the whole of P_n.
However, any function generates a random permutation is a function from the set of possible states of the PRNG to the set of permutations. Whatever tricks you use, if there are fewer states in the PRNG than there are permutations, you won't be able to reach them all.
Arnaud Delobelle wrote:
On 27 Mar 2009, at 06:59, Imri Goldberg wrote:
On Thu, Mar 26, 2009 at 11:19 PM, Jan Kanis jan.kanis@phil.uu.nl wrote: Just out of curiosity, would doing
l = range(2082) random.shuffle(l) random.shuffle(l)
give me (with a high probability) one of those permutations that is unreachable with a single shuffle? If so, I'd presume you could get any shuffle (in case you really cared) by calling random.shuffle repeatedly and reseeding the prng in between.
I'm a bit rusty on the math, but that doesn't have to be the case. If all the permutations produced by random.shuffle() form a subgroup, or lie in a subgroup, then what you'll get is just another permutation from that subgroup, regardless of the randomness you put inside.
There is no reason that the set of shuffled permutations(S_n) will form a subgroup of the set of permutations (P_n) and it may well generate the whole of P_n. In fact you only need n transpositions to generate the whole of P_n.
True, it is extremely likely that the group G_n generated by S_n is P_n and not a subgroup.
However, any function generates a random permutation is a function from the set of possible states of the PRNG to the set of permutations. Whatever tricks you use, if there are fewer states in the PRNG than there are permutations, you won't be able to reach them all.
You are right, you won't be able to reach them all in a single call to shuffle. However, by repeated shuffling and reseeding like the OP suggested, you can in theory get to all elements of G_n *if you keep shuffling long enough*. Unfortunately you will need at least |G_n|/|S_n| shuffles which means it is not even remotely practical.
- Jacob
Jacob Holm wrote:
However, by repeated shuffling and reseeding like the OP suggested, you can in theory get to all elements of G_n
But then you need a sufficient number of distinct seed values, so you're back to the original problem.
Greg Ewing wrote:
Jacob Holm wrote:
However, by repeated shuffling and reseeding like the OP suggested, you can in theory get to all elements of G_n
But then you need a sufficient number of distinct seed values, so you're back to the original problem.
Ehr, no. Suppose my PRNG only has period two and the shuffle based on it can only generate the permutations [1, 0, 2] and [2, 1, 0] from [0, 1, 2]. Each time I reseed from a truly random source, the next shuffle will use one of those permutations at random. By shuffling and reseeding enough times I can get all combinations of those two permutations. This happens to be all 6 possible permutations of 3 elements.
- Jacob
Jacob Holm wrote:
Greg Ewing wrote:
Jacob Holm wrote:
However, by repeated shuffling and reseeding like the OP suggested, you can in theory get to all elements of G_n
But then you need a sufficient number of distinct seed values, so you're back to the original problem.
Ehr, no. Suppose my PRNG only has period two and the shuffle based on it can only generate the permutations [1, 0, 2] and [2, 1, 0] from [0, 1, 2]. Each time I reseed from a truly random source, the next shuffle will use one of those permutations at random. By shuffling and reseeding enough times I can get all combinations of those two permutations. This happens to be all 6 possible permutations of 3 elements.
Ok, I may have misinterpreted your statement. Yes, you need to reseed a lot. You just don't need the seeds to be different.
- Jacob
Jacob Holm wrote:
Each time I reseed from a truly random source,
If you have a "truly random source" on hand, then you have an infinite amount of entropy available and there is no problem. Just feed your truly random numbers straight into the shuffling algorithm.
We're talking about the case where you *don't* have truly random numbers, but only a PRNG with a limited amount of internal state.
Greg Ewing wrote:
Jacob Holm wrote:
Each time I reseed from a truly random source,
If you have a "truly random source" on hand, then you have an infinite amount of entropy available and there is no problem. Just feed your truly random numbers straight into the shuffling algorithm.
Of course.
We're talking about the case where you *don't* have truly random numbers, but only a PRNG with a limited amount of internal state.
As it happens, you don't really need the random source. As long as the set of shuffles you can get after reseeding generates the full set of permutations, all you need is to reseed in a way that will eventually have used all long enough sequences of possible seed values.
No this is not even remotely practical, and it has very little to do with randomness, but I think I said that right from the start. I was just reacting to the statement that you wouldn't be able to generate all permutations using shuffle+reseed. You almost certainly can, but it is a silly thing to do. If you want large random permutations, you need a PRNG with an *extremely* long period, and if you have that there is no need for repeated shuffles.
- Jacob
2009/3/25 Steven D'Aprano steve@pearwood.info:
On Thu, 26 Mar 2009 08:41:27 am Marcin 'Qrczak' Kowalczyk wrote:
On Wed, Mar 25, 2009 at 13:28, Steven D'Aprano steve@pearwood.info
wrote:
I bet you can't even predict any particular ordering which is impossible to obtain.
A moral dilemma... should I take advantage of your innumeracy by taking you up on that bet, or should I explain why that bet is a sure thing for me? *wink*
Your challenge was to exhibit a particular permutation which the algorithm will not generate. For good measure I think you should also join a proof that it won't be generated (since there isn't enough time or, probably, space, to test it).
Raymond Hettinger wrote:
Greg> Think about it: A shuffling algorithm is a function from a random Greg> number to a permutation. There's no way you can get more permutations Greg> out than there are random numbers to put in.
If our random number generator can produce more possible shuffles than there are atoms in the universe, I say you don't worry about it.
The "long int too large to convert to float" error that I got on my first attempt at printing 2080! in scientific notation is also something of a hint :)
The decimal module came to my rescue though:
+Decimal(math.factorial(2080))
Decimal('1.983139957541900373849131897E+6000')
That's one heck of a big number!
Cheers, Nick.