Re: [Python-ideas] Should our default random number generator be secure?

On 10.09.2015 19:04, Xavier Combelle wrote:
I think this is the major misunderstanding here:
The random module never suggested that it generates pseudo-random data of crypto quality.
I'm pretty sure people doing crypto will know and most others simply don't care :-)
Evidence: We used a Wichmann-Hill PRNG as default in random for a decade and people still got their work done. Mersenne was added in Python 2.3 and bumped the period from 6,953,607,871,644 (13 digits) to 2**19937-1 (6002 digits).
It is not a evidence, I have an evidence of the opposite: some people can and does use random.random() for generating session key or csrf tokens and it's an insecure default.
It all depends on what you consider "secure" or "secure enough" and points directly to another misunderstanding: that "secure" is a well-defined term :-) The random module seeds its global Random instance using urandom (if available on the system), so while the generator itself is deterministic, the seed used to kick off the pseudo-random series is not. For many purposes, this is secure enough. It's also easy to make the output of the random instance more secure by passing it through a crypto hash function. But back to the original question: What is "secure" ? In crypto terms, "secure" usually refers to "computationally infeasible to calculate before the sun goes dark" (to take one variant). More realistically, it can be defined as: Based on the public knowledge known today, it's impossible to run a program which allows converting the output of a crypto function back to its inputs within a reasonable time span. And this property will - based on today's knowledge - hold for at least the next 5-10 years. You may notice the many parameters in these definition attempts. It all depends on who you ask. With the advent of new technologies like quantum computers, it's not at all clear that any of those definitions will still hold in a couple of years. It's well possible that only quantum computers will be able to implement the necessary programs and it'll take a while for mobile phones to catch up and come with chips implementing those ;-) Now, leaving aside this bright future, what's reasonable today ? If you look at tools like untwister: https://github.com/bishopfox/untwister you can get a feeling for how long it takes to deduce the seed from an output sequence. Bare in mind, that in order to be reasonably sure that the seed is correct, the available output sequence has to be long enough. That's a known plain text attack, so you need access to lots of session keys to begin with. The tools is still running on an example set of 1000 32-bit numbers and it says it'll be done in 1.5 hours, i.e. before the sun goes down in my timezone. I'll leave it running to see whether it can find my secret key. Untwister is only slightly smarter than bruteforce. Given that MT has a seed size of 32 bits, it's not surprising that a tool can find the seed within a day. Perhaps it's time to switch to a better version of MT, e.g. a 64-bit version (with 64-bit internal state): http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt64.html or an even faster SIMD variant with better properties and 128 bit internal state: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/index.html Esp. the latter will help make brute force attacks practically impossible. Tim ? BTW: Looking at the sources of the _random module, I found that the seed function uses the hash of non-integers such as e.g. strings passed to it as seeds. Given the hash randomization for strings this will create non-deterministic results, so it's probably wise to only use 32-bit integers as seed values for portability, if you need to rely on seeding the global Python RNG. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 11 2015)
Python Projects, Coaching and Consulting ... http://www.egenix.com/ mxODBC Plone/Zope Database Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
2015-09-18: PyCon UK 2015 ... 7 days to go ::::: Try our mxODBC.Connect Python Database Interface for free ! :::::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On 11 September 2015 at 13:56, M.-A. Lemburg <mal@egenix.com> wrote:
The random module seeds its global Random instance using urandom (if available on the system), so while the generator itself is deterministic, the seed used to kick off the pseudo-random series is not. For many purposes, this is secure enough.
Secure enough for what purposes? Certainly not generating a password, or anything that is 'password equivalent' (e.g. session cookies). As you acknowledge in the latter portion of your email, one can predict the future output of a Mersenne Twister by observing lots of previous values. If I get to see the output of your RNG, I can dramatically constrain the search space of other things it generated. It is not hard to see how you can mount a pretty trivial attack against web software using this,
It's also easy to make the output of the random instance more secure by passing it through a crypto hash function.
Or...just use a CSPRNG and save yourself the computation overhead of the hash? Besides, anyone who knows enough to hash their random numbers surely knows enough to use a CSPRNG, so who does this help?
Perhaps it's time to switch to a better version of MT, e.g. a 64-bit version (with 64-bit internal state):
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt64.html
or an even faster SIMD variant with better properties and 128 bit internal state:
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/index.html
Esp. the latter will help make brute force attacks practically impossible.
Or, we can move to a CSPRNG and stop trying to move the goalposts on MT? Or, do both? Using a better Mersenne Twister does not mean we shouldn't switch the default.

On 11.09.2015 16:34, Cory Benfield wrote:
On 11 September 2015 at 13:56, M.-A. Lemburg <mal@egenix.com> wrote:
The random module seeds its global Random instance using urandom (if available on the system), so while the generator itself is deterministic, the seed used to kick off the pseudo-random series is not. For many purposes, this is secure enough.
Secure enough for what purposes? Certainly not generating a password, or anything that is 'password equivalent' (e.g. session cookies).
As you acknowledge in the latter portion of your email, one can predict the future output of a Mersenne Twister by observing lots of previous values. If I get to see the output of your RNG, I can dramatically constrain the search space of other things it generated. It is not hard to see how you can mount a pretty trivial attack against web software using this,
In theory, yes, in practice it's not all that easy. I suggest to give untwister a try... it started with telling me it needs about 1.5 hours, then flipped to more than a year, now it's back to 6 hours. I'll leave it running for while to see whether it finishes today :-)
It's also easy to make the output of the random instance more secure by passing it through a crypto hash function.
Or...just use a CSPRNG and save yourself the computation overhead of the hash? Besides, anyone who knows enough to hash their random numbers surely knows enough to use a CSPRNG, so who does this help?
There's a difference between taking a pseudo random number generator and applying a hash to it vs. using a CPRNG: A CPRNG will add entropy to its state at regular intervals, so there's no such thing as a seeded sequence. A RNG + hash still has the nice property of allowing to reproduce the sequence given the seed, but makes it much harder to determine the seed (brute force can be made arbitrarily hard via the hash function). The entropy in the output of the second variant is constant (only defined by the initial seed and the hash parameters), while it constantly increases in the CPRNG. Some more background on this: https://en.wikipedia.org/wiki/Entropy_%28information_theory%29
Perhaps it's time to switch to a better version of MT, e.g. a 64-bit version (with 64-bit internal state):
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt64.html
or an even faster SIMD variant with better properties and 128 bit internal state:
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/index.html
Esp. the latter will help make brute force attacks practically impossible.
Or, we can move to a CSPRNG and stop trying to move the goalposts on MT? Or, do both? Using a better Mersenne Twister does not mean we shouldn't switch the default.
I think it's worthwhile exposing the CPRNG from OpenSSL via the ssl module (see one of my earlier posts in this thread). People who need something as secure as their SSL implementation, can then get secure random numbers, while kids implementing coin flipping games can continue to use the well established API of the random module. Switching to a CPRNG in random would break the API, since some of the functions in the API would no longer be available (e.g. random.seed(), random.getstate(), random.setstate()). PS: Apart from the API issue, the default RNG in random would also have to be equidistributed and uniform, otherwise, the derivatives available in the module would no longer satisfy their expected distribution qualities. This is not needed when all you're interested in is to get some non-predictable random number for use in a session key :-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 11 2015)
Python Projects, Coaching and Consulting ... http://www.egenix.com/ mxODBC Plone/Zope Database Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
2015-09-18: PyCon UK 2015 ... 7 days to go ::::: Try our mxODBC.Connect Python Database Interface for free ! :::::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

[M.-A. Lemburg <mal@egenix.com>]
... Now, leaving aside this bright future, what's reasonable today ?
If you look at tools like untwister:
https://github.com/bishopfox/untwister
you can get a feeling for how long it takes to deduce the seed from an output sequence. Bare in mind, that in order to be reasonably sure that the seed is correct, the available output sequence has to be long enough.
That's a known plain text attack, so you need access to lots of session keys to begin with.
The tools is still running on an example set of 1000 32-bit numbers and it says it'll be done in 1.5 hours, i.e. before the sun goes down in my timezone. I'll leave it running to see whether it can find my secret key.
I'm only going to talk about current Python 3, because _any_ backward incompatible change is off limits for a bugfix release. So: 1. untwister appears _mostly_ to be probing for poor seeding schemes. Python 3's default "by magic" seeding is unimpeachable ;-) It's computationally infeasible to attack it. 2. If they knew they were targeting MT, and had 624 consecutive 32-bit outputs, they could compute MT's full internal state essentially instantly. #2 is hard to get, though. These "pick a passward" examples are only using a relative handful of bits from each 32-bit MT output. Attacks with such spotty info are "exponentially harder".
Untwister is only slightly smarter than bruteforce. Given that MT has a seed size of 32 bits, it's not surprising that a tool can find the seed within a day.
No no no. MT's state is 19937 bits, and current .seed() implementations use every bit you pass to .seed(). By default, current Python seeds the state with 2500 bytes (20000 bits) from the system .urandom() (if available). That's why it's computationally infeasible for "poor seeding" searches to attack the default seeding: they have a space of 2**19937-1 (the all-0 state can't occur) to search through, each of which is equally likely (assuming the system .urandom() is doing _its_ job). Of course the user can screw that up by using their _own_ seed. But, by default, current Pythons already do the best possible seeding job.
Perhaps it's time to switch to a better version of MT, e.g. a 64-bit version (with 64-bit internal state):
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt64.html
or an even faster SIMD variant with better properties and 128 bit internal state:
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/index.html
Esp. the latter will help make brute force attacks practically impossible.
Tim ?
We already have a 19937-bit internal state, and current seeding schemes don';t hide that. I would like to move to a different generator entirely someday, but not before some specific better-than-MT alternative gains significant traction outside the Python world ("better a follower than a leader" in this area).
BTW: Looking at the sources of the _random module, I found that the seed function uses the hash of non-integers such as e.g. strings passed to it as seeds. Given the hash randomization for strings this will create non-deterministic results, so it's probably wise to only use 32-bit integers as seed values for portability, if you need to rely on seeding the global Python RNG.
None of that applies to Python 3. `seed()` string inputs go through this path now: if isinstance(a, (str, bytes, bytearray)): if isinstance(a, str): a = a.encode() a += _sha512(a).digest() a = int.from_bytes(a, 'big') super().seed(a) IOW, a crypto hash is _appended_ to the string, but no info from the original string is lost (but, if you ask me, this particular step is useless - it adds no "new entropy"). The whole mess is converted to a giant integer, again with no loss of input information. And every bit of the giant integer affects what `super().seed(a) does`.

On 11.09.2015 18:36, Tim Peters wrote:
[M.-A. Lemburg <mal@egenix.com>]
... Now, leaving aside this bright future, what's reasonable today ?
If you look at tools like untwister:
https://github.com/bishopfox/untwister
you can get a feeling for how long it takes to deduce the seed from an output sequence. Bare in mind, that in order to be reasonably sure that the seed is correct, the available output sequence has to be long enough.
That's a known plain text attack, so you need access to lots of session keys to begin with.
The tools is still running on an example set of 1000 32-bit numbers and it says it'll be done in 1.5 hours, i.e. before the sun goes down in my timezone. I'll leave it running to see whether it can find my secret key.
I'm only going to talk about current Python 3, because _any_ backward incompatible change is off limits for a bugfix release.
So:
1. untwister appears _mostly_ to be probing for poor seeding schemes. Python 3's default "by magic" seeding is unimpeachable ;-) It's computationally infeasible to attack it.
2. If they knew they were targeting MT, and had 624 consecutive 32-bit outputs, they could compute MT's full internal state essentially instantly.
How would they do that ? MT's period is too large for things like rainbow tables.
#2 is hard to get, though. These "pick a passward" examples are only using a relative handful of bits from each 32-bit MT output. Attacks with such spotty info are "exponentially harder".
Untwister is only slightly smarter than bruteforce. Given that MT has a seed size of 32 bits, it's not surprising that a tool can find the seed within a day.
No no no. MT's state is 19937 bits, and current .seed() implementations use every bit you pass to .seed().
Ah, right. I was looking at init_genrand() in the C implementation which only takes a single 32-bit unsigned int as value. The init_by_array() function does take seeds which use all available bits. I guess untwister indeed only tries the 32-bit unsigned int seeding approach, as it keeps listing things like: Progress: 50.99% [2190032137 / 4294967295] ~128296.63/sec 4 hours 33 minutes 30 [-]
By default, current Python seeds the state with 2500 bytes (20000 bits) from the system .urandom() (if available). That's why it's computationally infeasible for "poor seeding" searches to attack the default seeding: they have a space of 2**19937-1 (the all-0 state can't occur) to search through, each of which is equally likely (assuming the system .urandom() is doing _its_ job).
Of course the user can screw that up by using their _own_ seed. But, by default, current Pythons already do the best possible seeding job.
Perhaps it's time to switch to a better version of MT, e.g. a 64-bit version (with 64-bit internal state):
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt64.html
or an even faster SIMD variant with better properties and 128 bit internal state:
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/index.html
Esp. the latter will help make brute force attacks practically impossible.
Tim ?
We already have a 19937-bit internal state, and current seeding schemes don';t hide that.
Ouch. I confused internal state with the output size. Sorry. So we're more than fine already and it's only the cracking tools that are apparently broken :-)
I would like to move to a different generator entirely someday, but not before some specific better-than-MT alternative gains significant traction outside the Python world ("better a follower than a leader" in this area).
Another candidate is the new WELL family: http://www.iro.umontreal.ca/~panneton/WELLRNG.html This has some nicer properties w/r to booting out of zeroland (as they call it: too many 0 bits in the seed).
BTW: Looking at the sources of the _random module, I found that the seed function uses the hash of non-integers such as e.g. strings passed to it as seeds. Given the hash randomization for strings this will create non-deterministic results, so it's probably wise to only use 32-bit integers as seed values for portability, if you need to rely on seeding the global Python RNG.
None of that applies to Python 3.
Well, it still does for the .seed() C implementation in _random.c, but since that's overridden in Python 3's Random class, you can't access it anymore :-)
`seed()` string inputs go through this path now:
if isinstance(a, (str, bytes, bytearray)): if isinstance(a, str): a = a.encode() a += _sha512(a).digest() a = int.from_bytes(a, 'big') super().seed(a)
IOW, a crypto hash is _appended_ to the string, but no info from the original string is lost (but, if you ask me, this particular step is useless - it adds no "new entropy"). The whole mess is converted to a giant integer, again with no loss of input information. And every bit of the giant integer affects what `super().seed(a) does`.
As far as I'm concerned this maps to case closed. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 11 2015)
Python Projects, Coaching and Consulting ... http://www.egenix.com/ mxODBC Plone/Zope Database Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
2015-09-18: PyCon UK 2015 ... 7 days to go ::::: Try our mxODBC.Connect Python Database Interface for free ! :::::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

[Tim]
... 2. If they knew they were targeting MT, and had 624 consecutive 32-bit outputs, they could compute MT's full internal state essentially instantly.
[Marc-Andre]
How would they do that ? MT's period is too large for things like rainbow tables.
It's not trivial to figure out how to do this, but once you do, it works ;-) No search, or tables, of any kind are required. It's just simple (albeit non-obvious!) bit-fiddling to invert MT's state-to-output transformations to get the state back. Here's a very nice writeup: https://jazzy.id.au/2010/09/22/cracking_random_number_generators_part_3.html

On 11.09.2015 20:52, Tim Peters wrote:
[Tim]
... 2. If they knew they were targeting MT, and had 624 consecutive 32-bit outputs, they could compute MT's full internal state essentially instantly.
[Marc-Andre]
How would they do that ? MT's period is too large for things like rainbow tables.
It's not trivial to figure out how to do this, but once you do, it works ;-) No search, or tables, of any kind are required. It's just simple (albeit non-obvious!) bit-fiddling to invert MT's state-to-output transformations to get the state back. Here's a very nice writeup:
https://jazzy.id.au/2010/09/22/cracking_random_number_generators_part_3.html
Indeed very nice. Thanks for the pointer. I wonder why untwister doesn't use this. I gave it 1000 32-bit integers, so it should have enough information to recover the seed in a short while, but it's still trying to find the seed. Oh, and it now shows: 5 days 21 hours left. I stopped it there. Anyone up for a random.recover_seed() function ? ;-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 11 2015)
Python Projects, Coaching and Consulting ... http://www.egenix.com/ mxODBC Plone/Zope Database Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
2015-09-18: PyCon UK 2015 ... 7 days to go ::::: Try our mxODBC.Connect Python Database Interface for free ! :::::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On 11.09.2015 22:44, M.-A. Lemburg wrote:
On 11.09.2015 20:52, Tim Peters wrote:
[Tim]
... 2. If they knew they were targeting MT, and had 624 consecutive 32-bit outputs, they could compute MT's full internal state essentially instantly.
[Marc-Andre]
How would they do that ? MT's period is too large for things like rainbow tables.
It's not trivial to figure out how to do this, but once you do, it works ;-) No search, or tables, of any kind are required. It's just simple (albeit non-obvious!) bit-fiddling to invert MT's state-to-output transformations to get the state back. Here's a very nice writeup:
https://jazzy.id.au/2010/09/22/cracking_random_number_generators_part_3.html
Indeed very nice. Thanks for the pointer.
I wonder why untwister doesn't use this. I gave it 1000 32-bit integers, so it should have enough information to recover the seed in a short while, but it's still trying to find the seed. Oh, and it now shows: 5 days 21 hours left. I stopped it there.
Anyone up for a random.recover_seed() function ? ;-)
Turns out this will have to be named random.recover_state(). Getting at the seed is too difficult, esp. for strings in Python 3, and not really worth the effort anyway. While implementing this, I found that there's a bit more trickery involved due to the fact that the MT RNG in Python writes the 624 words internal state in batches - once every 624 times the .getrandbits() function is called. So you may need up to 624*2 - 1 output values to determine a correct array of internal state values. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 12 2015)
Python Projects, Coaching and Consulting ... http://www.egenix.com/ mxODBC Plone/Zope Database Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
2015-09-18: PyCon UK 2015 ... 6 days to go 2015-10-21: Python Meeting Duesseldorf ... 39 days to go ::::: Try our mxODBC.Connect Python Database Interface for free ! :::::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

[Tim, on recovering MT state from outputs]
https://jazzy.id.au/2010/09/22/cracking_random_number_generators_part_3.html
[Marc-Andre]
Indeed very nice. Thanks for the pointer.
I wonder why untwister doesn't use this. I gave it 1000 32-bit integers, so it should have enough information to recover the seed in a short while, but it's still trying to find the seed. Oh, and it now shows: 5 days 21 hours left. I stopped it there.
As you went on to discover, while the writeup gives enough to convince you it's possible, there are always details ;-)
Turns out this will have to be named random.recover_state().
Getting at the seed is too difficult, esp. for strings in Python 3, and not really worth the effort anyway.
It's flatly impossible to ever know what the seed was, unless you _also_ know exactly how many times MT was invoked before the first output you captured. Think about that a bit, and I'm sure you'll see that's obvious. Even if you did know how many times, it would still be impossible without more assumptions, since seed arguments can contain any number of bits.
While implementing this, I found that there's a bit more trickery involved due to the fact that the MT RNG in Python writes the 624 words internal state in batches - once every 624 times the .getrandbits() function is called.
So you may need up to 624*2 - 1 output values to determine a correct array of internal state values.
Don't be too sure about that. From an information-theoretic view, "it's obvious" that 624 32-bit outputs is enough - indeed, that's 31 more bits than the internal state actually has. You don't need to reproduce Python's current internal MT state exactly, you only need to create _a_ MT state that will produce the same values forever after. Specifically, the index of the "current" vector element is an artifact of the implementation, and doesn't need to be reproduced. You're free to set that index to anything you like in _your_ MT state - the real goal is to get the same results.

[Marc-Andre] ...
While implementing this, I found that there's a bit more trickery involved due to the fact that the MT RNG in Python writes the 624 words internal state in batches - once every 624 times the .getrandbits() function is called.
So you may need up to 624*2 - 1 output values to determine a correct array of internal state values.
[Tim]
Don't be too sure about that. From an information-theoretic view, "it's obvious" that 624 32-bit outputs is enough - indeed, that's 31 more bits than the internal state actually has. You don't need to reproduce Python's current internal MT state exactly, you only need to create _a_ MT state that will produce the same values forever after. Specifically, the index of the "current" vector element is an artifact of the implementation, and doesn't need to be reproduced. You're free to set that index to anything you like in _your_ MT state - the real goal is to get the same results.
Concrete proof of concept. First code to reconstruct state from 624 consecutive 32-bit outputs: def invert(transform, output, n=100): guess = output for i in range(n): newguess = transform(guess) if newguess == output: return guess guess = newguess raise ValueError("%r not invertible in %s tries" % (output, n)) t1 = lambda y: y ^ (y >> 11) t2 = lambda y: y ^ ((y << 7) & 0x9d2c5680) t3 = lambda y: y ^ ((y << 15) & 0xefc60000) t4 = lambda y: y ^ (y >> 18) def invert_mt(y): y = invert(t4, y) y = invert(t3, y) y = invert(t2, y) y = invert(t1, y) return y def guess_state(vec): assert len(vec) == 624 return (3, tuple(map(invert_mt, vec)) + (624,), None) Now we can try it: import random for i in range(129): random.random() That loop was just to move MT into "the middle" of its internal vector. Now grab values: vec = [random.getrandbits(32) for i in range(624)] Note that the `guess_state()` function above _always_ sets the index to 624. When it becomes obvious _why_ it does so, all mysteries will vanish ;-) Now create a distinct generator and force its state to the deduced state: newrand = random.Random() newrand.setstate(guess_state(vec)) And some quick sanity checks: for i in range(1000000): assert random.random() == newrand.random() for i in range(1000000): assert random.getrandbits(32) == newrand.getrandbits(32) The internal states are _not_ byte-for-byte identical. But they don't need to be. The artificial `index` bookkeeping variable allows hundreds of distinct spellings of _semantically_ identical states.

My final thoughts on this entire topic is this: The suggestions made here, and in the other thread, are pointless api breaking changes that do no effect the stated target audience (people who actually need secure random numbers but are not getting them correctly - they will still find a way to do it wrong, changing the api wont fix that). The net effect is a longer support burden on 2.x - this proposes another porting headache for NO reason.

On 12.09.2015 05:23, Tim Peters wrote:
[Marc-Andre] ...
While implementing this, I found that there's a bit more trickery involved due to the fact that the MT RNG in Python writes the 624 words internal state in batches - once every 624 times the .getrandbits() function is called.
So you may need up to 624*2 - 1 output values to determine a correct array of internal state values.
[Tim]
Don't be too sure about that. From an information-theoretic view, "it's obvious" that 624 32-bit outputs is enough - indeed, that's 31 more bits than the internal state actually has. You don't need to reproduce Python's current internal MT state exactly, you only need to create _a_ MT state that will produce the same values forever after. Specifically, the index of the "current" vector element is an artifact of the implementation, and doesn't need to be reproduced. You're free to set that index to anything you like in _your_ MT state - the real goal is to get the same results.
Concrete proof of concept. First code to reconstruct state from 624 consecutive 32-bit outputs:
def invert(transform, output, n=100): guess = output for i in range(n): newguess = transform(guess) if newguess == output: return guess guess = newguess raise ValueError("%r not invertible in %s tries" % (output, n))
t1 = lambda y: y ^ (y >> 11) t2 = lambda y: y ^ ((y << 7) & 0x9d2c5680) t3 = lambda y: y ^ ((y << 15) & 0xefc60000) t4 = lambda y: y ^ (y >> 18)
def invert_mt(y): y = invert(t4, y) y = invert(t3, y) y = invert(t2, y) y = invert(t1, y) return y
def guess_state(vec): assert len(vec) == 624 return (3, tuple(map(invert_mt, vec)) + (624,), None)
Now we can try it:
import random for i in range(129): random.random()
That loop was just to move MT into "the middle" of its internal vector. Now grab values:
vec = [random.getrandbits(32) for i in range(624)]
Note that the `guess_state()` function above _always_ sets the index to 624. When it becomes obvious _why_ it does so, all mysteries will vanish ;-)
Now create a distinct generator and force its state to the deduced state:
newrand = random.Random() newrand.setstate(guess_state(vec))
And some quick sanity checks:
for i in range(1000000): assert random.random() == newrand.random() for i in range(1000000): assert random.getrandbits(32) == newrand.getrandbits(32)
The internal states are _not_ byte-for-byte identical. But they don't need to be. The artificial `index` bookkeeping variable allows hundreds of distinct spellings of _semantically_ identical states.
It's a rolling index, yes, but when creating the vector of output values, the complete internal state array will have undergone a recalc at one of the iterations. The guess_state(vec) function will thus return an internal state vector that is half state of the previous recalc run, half new recalc run, it is not obvious to me why you would still be able to get away with not synchronizing to the next recalc in order to have a complete state from the current recalc. Let's see... The values in the state array are each based on a) previous state[i] b) state[(i + 1) % 624] c) state[(i + 397) % 624] Since the calculation is forward looking, your trick will only work if you can make sure that i + 397 doesn't wrap around into the previous state before you trigger the recalc in newrand. Which is easy, of course, since you can control the current index of newrand and force it to do a recalc with the next call to .getrandbits() by setting it to 624. Clever indeed :-) Here's a better way to do the inversion without guess work: # 32-bits all set ALL_BITS_SET = 0xffffffffL def undo_bitshift_right_xor(value, shift, mask=ALL_BITS_SET): # Set shift high order bits; there's probably a better way to # do this, but this does the trick for now decoding_mask = (ALL_BITS_SET << (32 - shift)) & ALL_BITS_SET decoded_part = 0 result = 0 while decoding_mask > 0: decoded_part = (value ^ (decoded_part & mask)) & decoding_mask result |= decoded_part decoded_part >>= shift decoding_mask >>= shift return result def undo_bitshift_left_xor(value, shift, mask=ALL_BITS_SET): # Set shift low order bits decoding_mask = ALL_BITS_SET >> (32 - shift) decoded_part = 0 result = 0 while decoding_mask > 0: decoded_part = (value ^ (decoded_part & mask)) & decoding_mask result |= decoded_part decoded_part = (decoded_part << shift) & ALL_BITS_SET decoding_mask = (decoding_mask << shift) & ALL_BITS_SET return result def recover_single_state_value(value): value = undo_bitshift_right_xor(value, 18) value = undo_bitshift_left_xor(value, 15, 0xefc60000L) value = undo_bitshift_left_xor(value, 7, 0x9d2c5680L) value = undo_bitshift_right_xor(value, 11) return value def guess_state(data): if len(data) < 624: raise TypeError('not enough data to recover state') # Only work with the 624 last entries data = data[-624:] state = [recover_single_state_value(x) for x in data] return (3, tuple(state) + (624,), None) This is inspired by the work of James Roper, but uses a slightly faster approach for the undo functions. Not that it matters much. It was fun, that's what matters :-) Oh, in Python 3, you need to remove the 'L' after the constants. Too bad that it doesn't recognize those old annotations anymore. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 12 2015)
Python Projects, Coaching and Consulting ... http://www.egenix.com/ mxODBC Plone/Zope Database Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
2015-09-18: PyCon UK 2015 ... 6 days to go 2015-10-21: Python Meeting Duesseldorf ... 39 days to go ::::: Try our mxODBC.Connect Python Database Interface for free ! :::::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On 12.09.2015 13:31, M.-A. Lemburg wrote:
On 12.09.2015 05:23, Tim Peters wrote:
[Marc-Andre] ...
While implementing this, I found that there's a bit more trickery involved due to the fact that the MT RNG in Python writes the 624 words internal state in batches - once every 624 times the .getrandbits() function is called.
So you may need up to 624*2 - 1 output values to determine a correct array of internal state values.
[Tim]
Don't be too sure about that. From an information-theoretic view, "it's obvious" that 624 32-bit outputs is enough - indeed, that's 31 more bits than the internal state actually has. You don't need to reproduce Python's current internal MT state exactly, you only need to create _a_ MT state that will produce the same values forever after. Specifically, the index of the "current" vector element is an artifact of the implementation, and doesn't need to be reproduced. You're free to set that index to anything you like in _your_ MT state - the real goal is to get the same results.
Concrete proof of concept. First code to reconstruct state from 624 consecutive 32-bit outputs:
def invert(transform, output, n=100): guess = output for i in range(n): newguess = transform(guess) if newguess == output: return guess guess = newguess raise ValueError("%r not invertible in %s tries" % (output, n))
t1 = lambda y: y ^ (y >> 11) t2 = lambda y: y ^ ((y << 7) & 0x9d2c5680) t3 = lambda y: y ^ ((y << 15) & 0xefc60000) t4 = lambda y: y ^ (y >> 18)
def invert_mt(y): y = invert(t4, y) y = invert(t3, y) y = invert(t2, y) y = invert(t1, y) return y
def guess_state(vec): assert len(vec) == 624 return (3, tuple(map(invert_mt, vec)) + (624,), None)
Now we can try it:
import random for i in range(129): random.random()
That loop was just to move MT into "the middle" of its internal vector. Now grab values:
vec = [random.getrandbits(32) for i in range(624)]
Note that the `guess_state()` function above _always_ sets the index to 624. When it becomes obvious _why_ it does so, all mysteries will vanish ;-)
Now create a distinct generator and force its state to the deduced state:
newrand = random.Random() newrand.setstate(guess_state(vec))
And some quick sanity checks:
for i in range(1000000): assert random.random() == newrand.random() for i in range(1000000): assert random.getrandbits(32) == newrand.getrandbits(32)
The internal states are _not_ byte-for-byte identical. But they don't need to be. The artificial `index` bookkeeping variable allows hundreds of distinct spellings of _semantically_ identical states.
It's a rolling index, yes, but when creating the vector of output values, the complete internal state array will have undergone a recalc at one of the iterations.
The guess_state(vec) function will thus return an internal state vector that is half state of the previous recalc run, half new recalc run, it is not obvious to me why you would still be able to get away with not synchronizing to the next recalc in order to have a complete state from the current recalc.
Let's see...
The values in the state array are each based on
a) previous state[i] b) state[(i + 1) % 624] c) state[(i + 397) % 624]
Since the calculation is forward looking, your trick will only work if you can make sure that i + 397 doesn't wrap around into the previous state before you trigger the recalc in newrand.
Which is easy, of course, since you can control the current index of newrand and force it to do a recalc with the next call to .getrandbits() by setting it to 624.
Clever indeed :-)
Here's a better way to do the inversion without guess work:
# 32-bits all set ALL_BITS_SET = 0xffffffffL
def undo_bitshift_right_xor(value, shift, mask=ALL_BITS_SET):
# Set shift high order bits; there's probably a better way to # do this, but this does the trick for now decoding_mask = (ALL_BITS_SET << (32 - shift)) & ALL_BITS_SET decoded_part = 0 result = 0 while decoding_mask > 0: decoded_part = (value ^ (decoded_part & mask)) & decoding_mask result |= decoded_part decoded_part >>= shift decoding_mask >>= shift return result
def undo_bitshift_left_xor(value, shift, mask=ALL_BITS_SET):
# Set shift low order bits decoding_mask = ALL_BITS_SET >> (32 - shift) decoded_part = 0 result = 0 while decoding_mask > 0: decoded_part = (value ^ (decoded_part & mask)) & decoding_mask result |= decoded_part decoded_part = (decoded_part << shift) & ALL_BITS_SET decoding_mask = (decoding_mask << shift) & ALL_BITS_SET return result
def recover_single_state_value(value):
value = undo_bitshift_right_xor(value, 18) value = undo_bitshift_left_xor(value, 15, 0xefc60000L) value = undo_bitshift_left_xor(value, 7, 0x9d2c5680L) value = undo_bitshift_right_xor(value, 11) return value
def guess_state(data):
Hmm, the name doesn't fit anymore, better call it: def recover_state(data):
if len(data) < 624: raise TypeError('not enough data to recover state')
# Only work with the 624 last entries data = data[-624:] state = [recover_single_state_value(x) for x in data] return (3, tuple(state) + (624,), None)
This is inspired by the work of James Roper, but uses a slightly faster approach for the undo functions. Not that it matters much. It was fun, that's what matters :-)
Oh, in Python 3, you need to remove the 'L' after the constants. Too bad that it doesn't recognize those old annotations anymore.
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 12 2015)
Python Projects, Coaching and Consulting ... http://www.egenix.com/ mxODBC Plone/Zope Database Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
2015-09-18: PyCon UK 2015 ... 6 days to go 2015-10-21: Python Meeting Duesseldorf ... 39 days to go ::::: Try our mxODBC.Connect Python Database Interface for free ! :::::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

[Marc-Andre, puzzling over Tim's MT state-recovering hack]
... Since the calculation is forward looking, your trick will only work if you can make sure that i + 397 doesn't wrap around into the previous state before you trigger the recalc in newrand.
Which is easy, of course, since you can control the current index of newrand and force it to do a recalc with the next call to .getrandbits() by setting it to 624.
Clever indeed :-)
I'll suggest a different way to look at it: suppose you wanted to reproduce the state at _the start_ of the 624 values captured instead. Well, we'd do exactly the same thing, except set the index to 0. Then it's utterly obvious that your MT instance would spit out exactly the same 624 outputs as the ones captured. That's all the internals do when the index starts at 0: march through the state vector one word at a time, spitting out the tempered version of whichever 32-bit word is current. And increment the index each time (the only _mutation_ of any part of the MT internals). At the end of that, the only change to the internals is that the index would be left at 624. Which is exactly what "my code" sets it to. It acts exactly the same as if we had just finished generating the 624 captured outputs. Since we (in our heads) just reproduced enough bits to cover the entire internal state, it must be the case that we'll continue to reproduce all future outputs too. The "wrap around" is a red herring ;-)
Here's a better way to do the inversion without guess work:
"Better" depends. Despite the variable named "guess" in the code, it's not guessing about anything ;-) It's a single function that doesn't care (and can't even be told) whether a left or right shift is being used, what the shift count is, whether a mask is in use, or even what the word size is. In those senses it's "better": it can be used without change for "anything like this", including, e.g., the 64-bit variant of MT. Just paste the C tempering lines into the lambdas. Nothing about the inversion function needs to change. But why it works efficiently is far from obvious. It _can_ take as many (but not more) iterations as there are bits in a word, but that's almost never needed. IIRC, it can never require more than 8 iterations to invert any of the tempering functions in the 32-bit MT, and, e.g., always inverts the very weak "lambda y: y ^ (y >> 18)" 32-bit MT transform on the first try. Nevertheless, you can - as you did - be more efficient by writing distinct inversion functions for "left shift" and "right shift" cases, and wiring in the word size. But the expense of deducing the state is just plain trivial here either way. We're not consuming days or hours here, we're not even consuming an appreciable fraction of a second at Python speed :-)
participants (4)
-
Alexander Walters
-
Cory Benfield
-
M.-A. Lemburg
-
Tim Peters