a note in random.shuffle.__doc__ ...
...claims: Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that "most" permutations of a long sequence can never be generated. Now -- why would the behavior of "most" random number generators be relevant here? The module's docs claim, for its specific Mersenne Twister generator, a period of 2**19997-1, which is (e.g.) a comfortable 130128673800676351960752618754658780303412233749552410245124492452914636 028095467780746435724876612802011164778042889281426609505759158196749438 742986040468247017174321241233929215223326801091468184945617565998894057 859403269022650639413550466514556014961826309062543 times larger than the number of permutations of 2000 items, which doesn't really feel to me like a "rather small len(x)" in this context (what I'm most often shuffling is just a pack of cards -- len(x)==52 -- for example). I suspect that the note is just a fossil from a time when the default random number generator was Whichman-Hill, with a much shorter period. Should this note just be removed, or instead somehow reworded to point out that this is not in fact a problem for the module's current default random number generator? Opinions welcome! Alex
Alex Martelli <aleaxit@gmail.com> wrote:
...claims:
Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that "most" permutations of a long sequence can never be generated.
[snip]
I suspect that the note is just a fossil from a time when the default random number generator was Whichman-Hill, with a much shorter period. Should this note just be removed, or instead somehow reworded to point out that this is not in fact a problem for the module's current default random number generator? Opinions welcome!
I'm recovering from a migraine, but here are my thoughts on the topic... The number of permutations of n items is n!, which is > (n/2)^(n/2). Solve: 2**19997 < (n/2)^(n/2) log_2(2**19997) < log_2((n/2)^(n/2)) 19997 < (n/2)*log(n/2) Certainly with n >= 4096, the above holds (2048 * 11 = 22528) - Josiah
Josiah Carlson <jcarlson@uci.edu> wrote:
Alex Martelli <aleaxit@gmail.com> wrote:
...claims:
Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that "most" permutations of a long sequence can never be generated.
[snip]
I suspect that the note is just a fossil from a time when the default random number generator was Whichman-Hill, with a much shorter period. Should this note just be removed, or instead somehow reworded to point out that this is not in fact a problem for the module's current default random number generator? Opinions welcome!
I'm recovering from a migraine, but here are my thoughts on the topic...
The number of permutations of n items is n!, which is > (n/2)^(n/2). Solve: 2**19997 < (n/2)^(n/2) log_2(2**19997) < log_2((n/2)^(n/2)) 19997 < (n/2)*log(n/2)
Certainly with n >= 4096, the above holds (2048 * 11 = 22528)
- Josiah
I would also point out that even if MT had a larger period, there would still be no guarantee that all permutations of a given sequence would be able to be generated from the PRNG given some aribtrary internal state. - Josiah
Alex Martelli wrote:
...claims:
Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that "most" permutations of a long sequence can never be generated.
Now -- why would the behavior of "most" random number generators be relevant here? The module's docs claim, for its specific Mersenne Twister generator, a period of 2**19997-1, which is (e.g.) a comfortable 130128673800676351960752618754658780303412233749552410245124492452914636 028095467780746435724876612802011164778042889281426609505759158196749438 742986040468247017174321241233929215223326801091468184945617565998894057 859403269022650639413550466514556014961826309062543 times larger than the number of permutations of 2000 items, which doesn't really feel to me like a "rather small len(x)" in this context (what I'm most often shuffling is just a pack of cards -- len(x)==52 -- for example).
I wouldn't be too comfortable with that margin. The fun thing about factorials is that they add up really quickly. The crossover point is between 2080 and 2081. In [28]: from scipy import * In [29]: def f(x): ....: return special.gammaln(x-1) - 19937*log(2) ....: In [30]: optimize.brentq(f, 2000, 3000) Out[30]: 2082.4031300820125 In [31]: import gmpy In [32]: mtperiod = 2**19937 - 1 In [33]: for i in range(2075, 2085): ....: if gmpy.fac(i) > mtperiod: ....: print i ....: break ....: ....: 2081 I think that documenting this boundary might be worthwhile rather than making vague references to "small len(x)." A note along the lines of Josiah wrote about there being no guarantees despite period size would also be useful. OTOH, isn't the exact PRNG algorithm considered an implementation detail? It certainly was when the module migrated from Wichmann-Hill to the Mersenne Twister. OTGH, I don't foresee the random module ever using an algorithm with worse characteristics than the Mersenne Twister. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
[Alex Martelli]
...claims:
Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that "most" permutations of a long sequence can never be generated.
Now -- why would the behavior of "most" random number generators be relevant here? The module's docs claim, for its specific Mersenne Twister generator, a period of 2**19997-1, which is (e.g.)
Oops! That's wrong. The period is 2**19937-1.
a comfortable 130128673800676351960752618754658780303412233749552410245124492452914636 028095467780746435724876612802011164778042889281426609505759158196749438 742986040468247017174321241233929215223326801091468184945617565998894057 859403269022650639413550466514556014961826309062543 times larger than the number of permutations of 2000 items, which doesn't really feel to me like a "rather small len(x)" in this context (what I'm most often shuffling is just a pack of cards -- len(x)==52 -- for example).
I suspect that the note is just a fossil from a time when the default random number generator was Whichman-Hill, with a much shorter period. Should this note just be removed, or instead somehow reworded to point out that this is not in fact a problem for the module's current default random number generator? Opinions welcome!
It should be removed now. I'll do that :-) WH's period was indeed so short that it couldn't generate even a tiny fraction of the permutations of a deck of cards, and that's why the note was added. While a long period is necessary to get a shot at all permutations, it's not sufficient, and I don't know what the true story is wrt the Twister. For example, a miserable PRNG that returns 0.1, 0.1, 0.2, 0.1, 0.2, 0.2, 0.1, 0.2, 0.2, 0.2, 0.1, 0.2, 0.2, 0.2, 0.2, ... has infinite period, but has few (O(N)) distinct subsequences of length N. That's a failure of so-called equidistribution in N dimensions (for sufficiently large N, some N-vectors appear more often than others across the whole period). "A long" period is necessary but not sufficient for high-dimensional equidistribution. Off the top of my head, then, since the Twister is provably equidistributed in 623 dimensions to 32-bit accuracy, I expect it should be able to "fairly" generate all permutations of a sequence of <= 623 elements (equidistribution in N dimensions implies equidistribution in all dimensions <= N). So I'm happy to leave a warning out until the casinos switch to 12-deck blackjack ;-)
On Jun 10, 2006, at 1:08 PM, Josiah Carlson wrote:
Josiah Carlson <jcarlson@uci.edu> wrote:
Alex Martelli <aleaxit@gmail.com> wrote:
...claims:
Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that "most" permutations of a long sequence can never be generated.
[snip]
I suspect that the note is just a fossil from a time when the default random number generator was Whichman-Hill, with a much shorter period. Should this note just be removed, or instead somehow reworded to point out that this is not in fact a problem for the module's current default random number generator? Opinions welcome!
I'm recovering from a migraine, but here are my thoughts on the topic...
The number of permutations of n items is n!, which is > (n/2)^(n/2). Solve: 2**19997 < (n/2)^(n/2) log_2(2**19997) < log_2((n/2)^(n/2)) 19997 < (n/2)*log(n/2)
Certainly with n >= 4096, the above holds (2048 * 11 = 22528)
- Josiah
I would also point out that even if MT had a larger period, there would still be no guarantee that all permutations of a given sequence would be able to be generated from the PRNG given some aribtrary internal state.
Sure. And an n of 2081 happens to suffice:
period = 2**19937 while gmpy.fac(i) < period: i = i +1 ... i 2081
Still, the note, as worded, is misleading -- it strongly suggests that for "even small len(x)" (no mention of whether that means dozens or thousands) the RNG can't generate all permutations, with no proof either way and just a misleading hint. "The values of N such that the RNG can generate all permutations of a sequence of len(N) are not precisely known at this time" might be closer to the truth (if this is, indeed, the state of our collective knowledge). Alex
Robert Kern wrote:
OTOH, isn't the exact PRNG algorithm considered an implementation detail?
It's questionable whether the PRNG being used *should* be an implementation detail. To anyone who cares even a little bit about its quality, knowing the algorithm (or at least some info about it, such as the period) is vital. PRNGs are not like sorting algorithms, where different ones all give the same result in the end. Different PRNGs have *wildly* different characteristics. -- Greg
Tim Peters wrote:
Off the top of my head, then, since the Twister is provably equidistributed in 623 dimensions to 32-bit accuracy, I expect it should be able to "fairly" generate all permutations of a sequence of <= 623 elements (equidistribution in N dimensions implies equidistribution in all dimensions <= N). So I'm happy to leave a warning out until the casinos switch to 12-deck blackjack ;-)
But isn't the problem with the Twister that for *some initial states* the period could be much *shorter* than the theoretical maximum? Or is the probability of getting such an initial state too small to worry about? -- Greg
[Greg Ewing]
But isn't the problem with the Twister that for *some initial states* the period could be much *shorter* than the theoretical maximum?
Or is the probability of getting such an initial state too small to worry about?
The Twister's state is held in a vector of 624 32-bit words. 31 of the bits aren't actually used, and the number of meaningful state bits is actually 624 * 32 - 31 = 19937. There are exactly 2 orbits in the state space under the state transformation operation (STO): 1. A trivial orbit of length 1, consisting of the state in which all meaningful bits are 0. That's a fixed point for the STO. There are no other fixed points. 2. All not-entirely-0 states are in the other orbit, of length 2**19937 - 1. All not-0 states are reachable from all other not-0 states, and you get back to the non-zero state you start from for the first time after exactly 2**19937 - 1 iterations of the STO. So as long as you don't start with the all-0 state, you're in the second orbit, and see the advertised period (2**19937 - 1).
From Python code, it's impossible to get the all-0 state. All Python-visible initialization methods guarantee there's at least one bit set in the meaningful bits of the state vector, so the probability of not seeing a period of 2**19937 - 1 from Python is exactly 0.
Hmm. Well, there _is_ a way to screw yourself here, but you have to work at it: you can force the all-0 state by hand-crafting the right input to random.setstate().
Alex Martelli wrote:
...claims:
Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that "most" permutations of a long sequence can never be generated.
Now -- why would the behavior of "most" random number generators be relevant here? The module's docs claim, for its specific Mersenne Twister generator, a period of 2**19997-1, which is (e.g.) a comfortable 130128673800676351960752618754658780303412233749552410245124492452914636 028095467780746435724876612802011164778042889281426609505759158196749438 742986040468247017174321241233929215223326801091468184945617565998894057 859403269022650639413550466514556014961826309062543 times larger than the number of permutations of 2000 items, which doesn't really feel to me like a "rather small len(x)" in this context (what I'm most often shuffling is just a pack of cards -- len(x)==52 -- for example).
I suspect that the note is just a fossil from a time when the default random number generator was Whichman-Hill, with a much shorter period. Should this note just be removed, or instead somehow reworded to point out that this is not in fact a problem for the module's current default random number generator? Opinions welcome!
I think the note is still useful, but the "rather small" wording should be replaced by something most precise (such as the value of n=len(x) where n! > 2**19997). Raymond
[Raymond Hettinger]
I think the note is still useful, but the "rather small" wording should be replaced by something most precise (such as the value of n=len(x) where n! > 2**19997).
Note that I already removed it, and I'm not putting it back. The period of W-H was "so short" you could get into trouble, based on period alone, with a list of only 16 elements. The Twister is so much more capable in respect of both period and high-dimensional equidistribution properties that I think anyone sophisticated enough to _understand_ an accurate warning correctly would have no need to be warned. Everyone else would find it a mix of confusing and scary, to no real end. None of this is to say it couldn't be useful to have a digestible introduction to issues raised by use of deterministic PRNGs. But I don't think that one note in one docstring is actually "better than nothing" in that regard ;-)
participants (6)
-
Alex Martelli -
Greg Ewing -
Josiah Carlson -
Raymond Hettinger -
Robert Kern -
Tim Peters