Should our default random number generator be secure?

I've received several long emails from Theo de Raadt (OpenBSD founder) about Python's default random number generator. This is the random module, and it defaults to a Mersenne Twister (MT) seeded by 2500 bytes of entropy taken from os.urandom(). Theo's worry is that while the starting seed is fine, MT is not good when random numbers are used for crypto and other security purposes. I've countered that it's not meant for that (you should use random.SystemRandom() or os.urandom() for that) but he counters that people don't necessarily know that and are using the default random.random() setup for security purposes without realizing how wrong that is. There is already a warning in the docs for the random module that it's not suitable for security, but -- as the meme goes -- nobody reads the docs. Theo then went into technicalities that went straight over my head, concluding with a strongly worded recommendation of the OpenBSD version of arc4random() (which IIUC is based on something called "chacha", not on "RC4" despite that being in the name). He says it is very fast (but I don't know what that means). I've invited Theo to join this list but he's too busy. The two core Python experts on the random module have given me opinions suggesting that there's not much wrong with MT, so here I am. Who is right? What should we do? Is there anything we need to do? -- --Guido van Rossum (python.org/~guido)
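For concreteness, the stdlib alternatives Guido refers to look like this in use (a minimal sketch; the alphabet and sizes are arbitrary illustrations, not recommendations):

    import os
    import random
    import binascii

    sysrand = random.SystemRandom()      # crypto-strength, backed by os.urandom()

    session_id = binascii.hexlify(os.urandom(16))      # 16 unpredictable bytes

    alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
    password = "".join(sysrand.choice(alphabet) for _ in range(12))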

On September 9, 2015 at 12:36:16 PM, Guido van Rossum (guido@python.org) wrote:
Everyone is right :) MT is a fine algorithm for random numbers when you don't need them to be cryptographically safe; it is a disastrous algorithm if you do need them to be safe. As long as you only use MT (and the default ``random`` implementation) for things where it doesn't matter that the numbers you get aren't going to be quite random (e.g. they are actually predictable), and you use os.urandom/random.SystemRandom for everything where you need actual randomness, then everything is fine.

The problem boils down to: are people going to accidentally use the default random module when they really should use os.urandom or random.SystemRandom? It is my opinion (and I believe Theo's) that they are going to use the MT-backed random functions in random.py when they shouldn't be. However I don't have a great solution to what we should do about it.

One option is to add a new random.FastInsecureRandom class, and switch the "bare" random functions in that module over to using random.SystemRandom by default. Then if people want to opt into a faster random that isn't cryptographically secure they can use that class. This would essentially be inverting the relationship today, where it defaults to insecure and you have to opt in to secure. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
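A rough sketch of the inversion Donald describes -- module-level functions backed by the OS CSPRNG, with an explicitly named opt-in class for the fast, insecure generator. FastInsecureRandom is his hypothetical name; the wiring shown here is only an illustration of the idea, not a proposed patch:

    import random

    class FastInsecureRandom(random.Random):
        """Opt-in class: plain Mersenne Twister, fast but unsuitable
        for security purposes."""
        pass   # inherits the MT implementation unchanged

    # The proposed default inversion: "bare" module functions delegate to
    # the system CSPRNG instead of MT.
    _secure = random.SystemRandom()
    choice = _secure.choice
    randrange = _secure.randrange
    getrandbits = _secure.getrandbits

    # Anyone who wants the old behaviour has to ask for it by name:
    fast = FastInsecureRandom()
    print(fast.random(), choice("abc"))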

On 09.09.2015 18:53, Donald Stufft wrote:
I'm not an expert on this, but I agree with this assessment. You can determine easily whether your program runs fast enough. If not, you can fix it. You cannot determine easily whether something you made is cryptographically secure. The default should be as secure as possible. Best, Sven

[Guido]
There is nothing _right_ about MT in a crypto context - it's entirely unsuitable for any such purpose, and always was. Just to be clear about that ;-) But it's an excellent generator for almost all other purposes. So the real question is: whose use cases do you want to cater to by default?

If you answer "crypto", then realize the Python generator will have to change every time the crypto community changes its mind about what's _currently_ "good enough". There's a long history of that already. Indeed, there are already numerous "chacha" variants. For a brief overview, scroll down to the ChaCha20 section of this exceptionally readable page listing pros and cons of various generators: http://www.pcg-random.org/other-rngs.html

There are no answers to vital pragmatic questions (like "is it possible to supply a seed to get reproducible results?") without specifying whose implementation of which chacha variant you're asking about.

I've always thought Python should be a follower rather than a leader in this specific area. For example, I didn't push for the Twister before it was well on its way to becoming a de facto standard. Anyway, it's all moot until someone supplies a patch - and that sure ain't gonna be me ;-)

On September 9, 2015 at 1:11:22 PM, Tim Peters (tim.peters@gmail.com) wrote:
This is not really true, in the sense that Python wouldn't need to do anything if the blessed generator changed. We'd use /dev/urandom, one of the syscalls that do the same thing, or the CryptGenRandom API on Windows. Python should not have its own userland CSPRNG. Then it's up to the platform to decide which generator they are going to provide. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

[Tim]
[Donald Stufft <donald@stufft.io>]
This is not really true in that sense that Python would need to do anything if the blessed generator changed.
I read Guido's message as specifically asking about Theo's "strongly worded recommendation of [Python switching to] the OpenBSD version of arc4random()" as its default generator. In which case, yes, when that specific implementation falls out of favor, Python would need to change.
I read Guido's message as asking whether Python should indeed do just that.

On September 9, 2015 at 1:28:46 PM, Tim Peters (tim.peters@gmail.com) wrote:
arc4random changes as the underlying implementation changes too; the name is a historical accident, really. arc4random no longer uses arc4, it uses chacha, and when/if chacha needs to be replaced, arc4random will still be the name. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On Wed, Sep 9, 2015, at 13:43, Donald Stufft wrote:
The issue is: if the decision is made not to provide our own RNG [which would presumably be a forked copy of OpenBSD's current arc4random], what should Python do on systems that do not provide a function named arc4random? Use /dev/urandom (or CryptGenRandom) every time [more expensive, performs I/O]? rand48? random? rand? I don't see the issue with Python providing its own implementation. If the state of the art changes, we can have another discussion then.

[random832@fastmail.us]
I don't see the issue with Python providing its own implementation. If the state of the art changes,
It will. Over & over again. That's why it's called "art" ;-)
we can have another discussion then.
Also over & over again. If you volunteer to own responsibility for updating all versions of Python each time it changes (in a crypto context, an advance in the state of the art implies the prior state becomes "a bug"), and post a performance bond sufficient to pay someone else to do it if you vanish, then a major pragmatic objection would go away ;-)

Tim Peters <tim.peters@...> writes:
The OpenBSD devs could also publish arc4random as a library that works everywhere (like OpenSSH). That would be a nicer solution for everyone (except for the devs perhaps :). Stefan Krah

Stefan Krah <skrah@...> writes:
And naturally they're already doing that. I missed this in Theo's first mail: https://github.com/libressl-portable/openbsd/tree/master/src/lib/libcrypto/c... https://github.com/libressl-portable/portable/tree/master/crypto/compat So I guess the whole thing also depends on how popular these libraries will be. Stefan Krah

On Wed, Sep 9, 2015, at 14:31, Tim Peters wrote:
I don't see how "Changing Python's RNG implementation today to arc4random as it exists now" necessarily implies "Making a commitment to guarantee the cryptographic suitability of Python's RNG for all time". Those are two separate things.

[<random832@fastmail.us>]
Disagree. The _only_ point to switching today is "to guarantee the cryptographic suitability of Python's RNG" today. It misses the intent of the switch entirely to give a "but tomorrow? eh - that's a different issue" dodge. No, no rules of formal logic would be violated by separating the two - it would be a violation of the only _sense_ in making a switch at all. If you don't believe me, try asking Theo ;-)

On Wed, Sep 09, 2015 at 02:55:01PM -0400, random832@fastmail.us wrote:
Not really. Look at the subject line. It doesn't say "should we change from MT to arc4random?", it asks if the default random number generator should be secure. The only reason we are considering the change from MT to arc4random is to make the PRNG cryptographically secure. "Secure" is a moving target, what is secure today will not be secure tomorrow. Yes, in principle, we could make the change once, then never again. But why bother? We don't gain anything from changing to arc4random if there is no promise to be secure into the future. Question, aimed at anyone, not necessarily random832 -- one desirable property of PRNGs is that you can repeat a sequence of values if you re-seed with a known value. Does arc4random keep that property? I think that it is important that the default RNG be deterministic when given a known seed. (I'm happy for the default seed to be unpredictable.) -- Steve
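(For reference, this is the determinism property being asked about, as the current MT-backed API provides it -- a tiny sketch, not part of the thread's argument:)

    import random

    a = random.Random(12345)
    b = random.Random(12345)
    # Same seed => same sequence; this is what simulations rely on.
    assert [a.random() for _ in range(5)] == [b.random() for _ in range(5)]

    # random.SystemRandom, by contrast, has no reproducible state at all:
    s = random.SystemRandom(12345)   # the "seed" argument is silently ignored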

[Steven D'Aprano <steve@pearwood.info>]
"arc4random" is ill-defined. From what I gathered, it's the case that "pure chacha" variants can be seeded to get a reproducible sequence "in theory", but that not all implementations support that. Specifically, the OpenBSD implementation being "sold" here does not and cannot: http://www.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man3/arc4random.3 "Does not" because there is no API to either request or set a seed. "Cannot" because: The subsystem is re-seeded from the kernel random number subsystem using getentropy(2) on a regular basis Other variants skip that last part.

On Sep 9, 2015 12:21 PM, "Tim Peters" <tim.peters@gmail.com> wrote:
Another reason why it is important *not* to provide a seeding api for a crypto rng is that this means you can later swap out the underlying algorithms easily as the state of the art improves. By contrast, if you have a deterministic seeded mode, then swapping out the algorithm becomes a compatibility break. (You can provide a "mix this extra entropy into the pool" api, which looks rather similar to seeding, but has fundamentally different semantics.)

The only real problem that I see with switching the random module to use a crypto rng is exactly this backwards compatibility issue. For scientific users, reproducibility of output streams is really important. (Ironically, this is a variety of "important" that crypto people are very familiar with: universally acknowledged to be the right thing by everyone who's thought about it, a minority do religiously and rely on, and most people ignore out of ignorance. Education is ongoing...)

OTOH python has never made strong guarantees of output stream reproducibility -- 3.2 broke all seeds by default (you have to add 'version=1' to your seed call to get the same results on post-3.2 pythons -- which of course gives an error on older versions). And 99% of the methods are documented to be unstable across versions -- the only method that's guaranteed to produce reproducible results across versions is random.random(). In practice the other methods usually don't change so people get away with it, but. See: https://docs.python.org/3/library/random.html#notes-on-reproducibility

So in practice the stdlib random module is not super suitable for scientific work anyway. Not that this stops anyone from using it for this purpose... see above. (And to be fair even this limited determinism is still enough to be somewhat useful -- not all projects require reproducibility across years of different python versions.) Plus even a lot of people who know about the importance of seeding don't realize that the stdlib's support has these gotchas. (Numpy unsurprisingly puts higher priority on these issues -- our random module guarantees exact reproducibility of seeded outputs modulo rounding, across versions and systems, except for bugfixes necessary for correctness. This means that we carry around a bunch of old inefficient implementations of the distribution methods, but so be it...)

So, all that considered: I can actually see an argument for removing the seeding methods from the stdlib entirely, and directing those who need reproducibility to a third party library like numpy (/ pygsl / ...). This would be pretty annoying for some cases where someone really does have simple needs and wants just a little determinism without a binary extension, but on net it might even be an improvement, given how easy it is to misread the current api as guaranteeing more than it actually promises.

OTOH this would actually break the current promise, weak as it is. Keeping that promise in mind, an alternative would be to keep both generators around, use the cryptographically secure one by default, and switch to MT when someone calls

    seed(1234, generator="INSECURE LEGACY MT")

But this would justifiably get us crucified by the security community, because the above call would flip the insecure switch for your entire program, including possibly other modules that were depending on random to provide secure bits.
So if we were going to do this then I think it would have to be by switching the global RNG over unconditionally, and to fulfill the promise, provide the MT option as a separate class that the user would have to instantiate explicitly if they wanted it for backcompat. Document that you should replace

    import random
    random.seed(12345)
    if random.whatever(): ...

with

    from random import MTRandom
    random = MTRandom(12345)
    if random.whatever(): ...

As part of this transition I would also suggest making the seed method on non-seedable RNGs raise an error when given an explicit seed, instead of silently doing nothing like the current SystemRandom. -n

On Wed, Sep 9, 2015, at 17:02, Nathaniel Smith wrote:
Ideally, neither the crypto bits nor the science bits of a big program should be using the module-level functions. A small program either hasn't got both kinds of bits, or won't be using them at the same time. And if you've got non-science bits doing stuff with your RNG then your results probably aren't going to be reproducible anyway. Which suggests a solution: How about exposing a way to switch out the Random instance used by the module-level functions? The instance itself exists now as random._inst, but the module just spews its bound methods all over its namespace. (Long-term, it might make sense to deprecate the module-level functions)
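A sketch of what such a switch could look like, given how random.py already wires itself up (random._inst is real but private; set_default_instance is a made-up name for illustration):

    import random

    def set_default_instance(inst):
        """Hypothetical helper: rebind the module-level functions to the
        bound methods of `inst`, mirroring what random.py itself does at
        import time with its private _inst."""
        for name in ("random", "uniform", "randint", "randrange", "choice",
                     "shuffle", "sample", "seed", "getstate", "setstate"):
            setattr(random, name, getattr(inst, name))

    # e.g. make the module-level API cryptographically strong for this process:
    set_default_instance(random.SystemRandom())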

On Wed, Sep 9, 2015, at 17:02, Nathaniel Smith wrote:
I just realized, OpenBSD has precisely this functionality, for the rand/random/rand48 functions, in the "_deterministic" versions of their respective seed functions. So that's probably not a terrible path to go down: Make a Random class that uses a CSPRNG and/or os.urandom until/unless it is explicitly seeded. Use that class for the global instance. We could probably skip the "make a separate function name to show you really mean it" because unlike C, Python has never encouraged explicitly seeding with the {time, pid, four bytes from /dev/random} when one doesn't actually want determinism. (The default seed in C for rand/random is *1*; for rand48 it is an implementation-defined, but specified to be constant, value). For completeness, have getstate return a tuple of a boolean (for which mode it is in) and whatever state Random returns. setstate can accept either this tuple, or for compatibility whatever Random uses.
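A sketch of the class being described -- cryptographically strong until the first explicit seed(), deterministic MT afterwards, with getstate() returning a (mode, state) tuple. The name and details are invented; this only illustrates the switching behaviour:

    import random

    class SwitchingRandom(random.Random):
        """Hypothetical: os.urandom-backed until explicitly seeded,
        then a plain deterministic Mersenne Twister."""

        def __init__(self):
            self._sys = random.SystemRandom()
            super().__init__()            # this calls self.seed() internally...
            self._deterministic = False   # ...so reset the flag afterwards

        def seed(self, *args, **kwargs):
            self._deterministic = True    # explicit seed => reproducible mode
            super().seed(*args, **kwargs)

        def random(self):
            return super().random() if self._deterministic else self._sys.random()

        def getrandbits(self, k):
            return (super().getrandbits(k) if self._deterministic
                    else self._sys.getrandbits(k))

        def getstate(self):
            return (self._deterministic, super().getstate())

        def setstate(self, state):
            self._deterministic, mt_state = state
            super().setstate(mt_state)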

On Fri, Sep 11, 2015 at 11:12 PM, <random832@fastmail.us> wrote:
Calling getstate() means you want to call setstate() at some point in the future, and have deterministic results. Getting the CSRNG state is dangerous (since it would allow replaying), and it's not even useful (since system entropy gets mixed in occasionally). Instead, in this scheme, getstate() should activate the deterministic RNG (seeding it if it's the first use), and return its state. setstate() would then also switch to the Twister, and seed it.

On Fri, Sep 11, 2015, at 17:48, Petr Viktorin wrote:
My thinking was that "CSRNG is enabled" should be regarded as a single state of the "magic switching RNG". The alternative would be that calling getstate on a magic switching RNG that is not already in deterministic mode is an error.

On 9 September 2015 at 20:33, Stefan Krah <skrah@bytereef.org> wrote:
I use a RNG quite often. Typically for simulations (games, die rolls, card draws, that sort of thing). Sometimes for many millions of results (Monte Carlo simulations, for example). I would always just use the default RNG supplied by the stdlib - I view my use case as "normal use" and wouldn't go looking for specialist answers. I'd occasionally look for reproducibility, although it's not often a key requirement for me (I would expect it as an option from the stdlib RNG, though).

Anyone doing crypto who doesn't fully appreciate that it's a specialist subject and that they should be looking for a dedicated RNG suitable for crypto, is probably going to make a lot of *other* mistakes as well. Leading them away from this one probably isn't going to be enough to make their code something I'd want to use...

So as a user, I'm against making a change like this. Let the default RNG in the stdlib be something suitable for simulations, "pick a random question", and similar situations, and provide a crypto-capable RNG for those who need it, but not as the default. (I am, of course, assuming that it's not possible to have a single RNG that is the best option for both uses - nobody on this thread seems to have suggested that I'm wrong in this assumption). Paul

On Wed, Sep 9, 2015 at 9:33 PM, Stefan Krah <skrah@bytereef.org> wrote:
The OpenBSD implementation does not allow any kind of reproducible results. Reading http://www.pcg-random.org/other-rngs.html, I see that what arc4random is not built for is statistical quality and k-dimensional equidistribution, which are properties you might not need for crypto, but do want for simulations. So there are two quite different use cases (plus a lot of grey area where any solution is okay). The current situation may be surprising to people who didn't read the docs. Switching away from MT might be a disservice to users that did read and understand them.

Petr Viktorin <encukou@...> writes:
I can't find much at all when searching for "chacha20 equidistribution". Contrast that with "mersenne twister equidistribution" and it seems that chacha20 hasn't been studied very much in that respect (except for the pcg-random site). So I also think this should preclude us from replacing the current random() functions. Adding an arc4random module with the caveat that its quality will be as good as the current OpenBSD libcrypto/libressl(?) would be okay. Stefan Krah

[Stefan Krah <skrah@bytereef.org>]
Well, most arguments about random functions rely on fantasies ;-) For example, yes, the Twister is provably equidistributed to 32 bits across 623 dimensions, but ... does it make a difference to anything? That's across the Twister's _entire period_, which couldn't actually be generated across the age of the universe.

What may really matter to an application is whether it will see rough equidistribution across the infinitesimally small slice (of the Twister's full period) it actually generates. And you'll find very little about _that_ (last time I searched, I found nothing). For assurances about that, people rely on test suites developed to test generators.

The Twister's provably perfect equidistribution across its whole period also has its scary sides. For example, run random.random() often enough, and it's _guaranteed_ you'll eventually reach a state where the output is exactly 0.0 hundreds of times in a row. That will happen as often as it "should happen" by chance, but that's scant comfort if you happen to hit such a state. Indeed, the Twister was patched relatively early in its life to try to prevent it from _starting_ in such miserable states. Such states are nevertheless still reachable from every starting state. But few people know any of that, so they take "equidistribution" as meaning a necessary thing rather than as an absolute guarantee of eventual disaster ;-)

What may really matter for most simulations is that the Twister never reaches a state where, in low dimensions, k-tuples fall on "just a few" regularly-spaced hyperplanes forever after. That's a systematic problem with old flavors of linear congruential generators. But that problem is _so_ old that no new generator proposed over the last few decades suffers it either.
Adding an arc4random module with the caveat that its quality will be as good as the current OpenBSD libcrypto/libressl(?) would be okay.
Anyone is welcome to supply such a module today, and distribute it via the usual channels. Python already supplies the platform spelling of `urandom`, and a very capable random.SystemRandom class based on `urandom`, for those needing crypto-strength randomness (relying on what their OS believed that means, and supplied via their `urandom`). Good enough for me. But, no, I'm not a security wonk at heart.

On Wed, Sep 9, 2015 at 3:19 PM, Tim Peters <tim.peters@gmail.com> wrote:
Yeah, equidistribution is not a guarantee of anything on its own. For example, an integer counter modulo 2**(623*32) is equidistributed to 32 bits across 623 dimensions, just like the Mersenne Twister. Mostly people talk about equidistribution because for a deterministic RNG, (a) being non-equidistributed would be bad, (b) equidistribution is something you can reasonably hope to prove for simple non-cryptographic generators, and mathematicians like writing proofs. OTOH equidistribution is not even well-defined for the OpenBSD "arc4random" generator, because it is genuinely non-deterministic -- it regularly mixes new entropy into its state as it goes -- and equidistribution by definition requires determinism. So it "fails" this test of "randomness" because it is too random... In practice, the chances that your Monte Carlo simulation is going to give bad results because of systematic biases in arc4random are much, much lower than the chances that it will give bad results because of subtle hardware failures that corrupt your simulation. And hey, if arc4random *does* mess up your simulation, then congratulations, your simulation is publishable as a cryptographic attack and will probably get written up in the NYTimes :-). The real reasons to prefer non-cryptographic RNGs are the auxiliary features like determinism, speed, jumpahead, multi-thread friendliness, etc. But the stdlib random module doesn't really provide any of these (except determinism in strictly limited cases), so I'm not sure it matters much.
This criticism seems a bit unfair though -- even a true stream of random bits (e.g. from a true unbiased quantum source) has this property, and trying to avoid this happening would introduce bias that really could cause problems in practice. A good probabilistic program is one that has a high probability of returning some useful result, but they always have some low probability of returning something weird. So this is just saying that most people don't understand probability. Which is true, but there isn't much that the random module can do about it :-) -n -- Nathaniel J. Smith -- http://vorpus.org

On Wed, Sep 09, 2015 at 04:15:31PM -0700, Nathaniel Smith wrote:
The default MT is certainly deterministic, and although only the output of random() itself is guaranteed to be reproducible, the other methods are *usually* stable in practice. There's a jumpahead method too, and for use with multiple threads, you can (and should) create your own instances that don't share state. I call that "multi-thread friendliness" :-) I think Paul Moore's position is a good one. Anyone writing crypto code without reading the docs and understanding what they are doing are surely making more mistakes than just using the wrong PRNG. There may be a good argument for adding arc4random support to the stdlib, but making it the default (with the disadvantages discussed, breaking backwards compatibility, surprising non-crypto users, etc.) won't fix the broken crypto code. It will just give people a false sense of security and encourage them to ignore the docs and write broken crypto code. -- Steve
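What "create your own instances" looks like in practice (a minimal sketch; seeds and counts are arbitrary):

    import random
    import threading

    def worker(seed, out, idx):
        rng = random.Random(seed)       # private state: no sharing, no locking concerns
        out[idx] = [rng.random() for _ in range(3)]

    results = [None] * 4
    threads = [threading.Thread(target=worker, args=(1000 + i, results, i))
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results)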

[Nathaniel Smith]
[Steven D'Aprano]
Not in Python. There was for the ancient Wichmann-Hill generator, but not for MT. A detailed sketch of ways to implement efficient jumpahead for MT is given here: A Fast Jump Ahead Algorithm for Linear Recurrences in a Polynomial Space http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/jump-seta-lfsr.pdf But because MT isn't a _simple_ recurrence, they're all depressingly complex :-( For Wichmann-Hill it was just a few integer modular exponentiations.
and for use with multiple threads, you can (and should) create your own instances that don't share state. I call that "multi-thread friendliness" :-)
That's what people do, but MT's creators don't recommend it anymore (IIRC, their FAQ did recommend it some years ago). Then they switched to recommending using jumpahead with a large value (to guarantee different instances' states would never overlap). Now (well, last I saw) they recommend a parameterized scheme creating a distinct variant of MT per thread (not just different state, but a different (albeit related) algorithm): http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/DC/dgene.pdf So I'd say it's clear as mud ;-)
...

[Sorry to Tim and Steven if they get multiple copies of this... Gmail recently broke their Android app's handling of from addresses, so resending, sigh] On Sep 9, 2015 7:24 PM, "Tim Peters" <tim.peters@gmail.com> wrote: [...]
Yeah, the independent-seed-for-each-thread approach is an option with any RNG, but just like people feel better if they have a 100% certified guarantee that the RNG output in a single thread will pass through every combination of possible values (if you wait some cosmological time), they also feel better if there is some 100% certified guarantee that the RNG values in two threads will also be uncorrelated with each other.

With something like MT, if two threads did end up with nearby seeds, then that would be bad: each thread individually would see values that looked like high quality randomness, but if you compared across the two threads, they would be identical modulo some lag. So all the nice theoretical analysis of the single threaded stream falls apart. However, for two independently seeded threads to end up anywhere near each other in the MT state space requires that you have picked two numbers between 0 and 2**19937 and gotten values that were "close". Assuming your seeding procedure is functional at all, then this is not a thing that will ever actually happen in this universe.

So AFAICT the rise of explicitly multi-threaded RNG designs is one of those fake problems that exists only so people can write papers about solving it. (Maybe this is uncharitable.) So there exist RNG designs that handle multi-threading explicitly, and it shows up on feature comparison checklists. I don't think it should really affect Python's decisions at all though. -n

[Steven D'Aprano]
[Tim]
Not in Python.
[Steve]
It is there, up to Python 2.7. I hadn't noticed it was gone in Python 3.
Yes, there's something _called_ `jumpahead()`, for backward compatibility with the old Wichmann-Hill generator. But what it does for MT is "eh - no idea what to do, so let's just make stuff up":

    def jumpahead(self, n):
        """Change the internal state to one that is likely far away from
        the current state.  This method will not be in Py3.x, so it is
        better to simply reseed.
        """
        # The super.jumpahead() method uses shuffling to change state,
        # so it needs a large and "interesting" n to work with.  Here,
        # we use hashing to create a large n for the shuffle.
        s = repr(n) + repr(self.getstate())
        n = int(_hashlib.new('sha512', s).hexdigest(), 16)
        super(Random, self).jumpahead(n)

I doubt there's anything that can be proved about the result of doing that - except that it's almost certain it won't bear any relationship to what calling the generator `n` times instead would have done ;-)

On 10/09/15 04:23, Tim Peters wrote:
DCMT uses the same algorithm (Mersenne Twister) but with different polynomials. The choice of polynomial is more or less arbitrary. You can search for a set of N polynomials that are (almost) prime to each other, and thus end up with totally independent sequences. Searching for such a set can take some time, so you need to do that in advance and save the result. But once you have a set, each one of them is just as valid as the vanilla MT. PCG also provides independent streams. Sturla

[Nathaniel Smith <njs@pobox.com>]
The analogy is almost exact. If you view MT's state as a single 19937-bit integer (=623*32 + 1), then MT's state "simply" cycles through a specific permutation of range(1, 2**19937) (with a different orbit consisting solely of 0). That was the "hard part" to prove. Everything about equidistribution was more of an observation following from that. Doesn't say anything about distribution "in the small" (across small slices), alas.
Heh. In the NYT or a security wonk's blog, maybe. But why would a reputable journal believe me? By design, the results of using the OpenBSD arc4random can't be reproduced ;-)
Python's implementation of MT has never changed anything about the sequence produced from a given seed state, and indeed gives the same sequence from the same seed state as every other correct implementation of the same flavor of MT. That is "strictly limited", to perfection ;-) At a higher level, depends on the app. People are quite creative at defeating efforts to be helpful ;-)
This criticism seems a bit unfair though
Those are facts, not criticisms. I like the Twister very much. But those who have no fear of it are dreaming - while those who have significant fear of it are also dreaming. It's my job to ensure nobody is either frightened or complacent ;-)
-- even a true stream of random bits (e.g. from a true unbiased quantum source) has this property,
But good generators with astronomically smaller periods do not. In a sense, MT made it possible to get results "far more random" than any widely used deterministic generator before it. The patch I mentioned above was to fix real problems in real life, where people using simple seeding schemes systematically ended up with results so transparently ludicrous nobody could possibly accept them for any purpose. "The fix" consisted of using scrambling functions to spray the bits in the user-*supplied* seed all over the place, in a pseudo-random way, to probabilistically ensure "the real" state wouldn't end up with "too many" zero bits. "A lot of zero bits" tends to persist across MT state transitions for a long time.
and trying to avoid this happening would introduce bias that really could cause problems in practice.
For example, nobody _needs_ a generator capable of producing hundreds of 0.0 in a row. Use a variant even of MT with a much smaller period, and that problem goes away, with no bad effects on any app. Push that more, and what many Monte Carlo applications _really_ want is "low discrepancy": some randomness is essential, but becomes a waste of cycles and assaults "common sense" if it gets too far from covering the domain more-or-less uniformly. So there are many ways known of generating "quasi-random" sequences instead, melding a notion of randomness with guarantees of relatively low discrepancy (low deviation from uniformity). Nothing works best for all purposes - except Python.
Or nobody does. Probability really is insufferably subtle. Guido should ban it.
Which is true, but there isn't much that the random module can do about it :-)
Python should just remove it. It's an "attractive nuisance" to everyone ;-)

On 2015-09-10 00:15, Nathaniel Smith wrote:
On Wed, Sep 9, 2015 at 3:19 PM, Tim Peters <tim.peters@gmail.com> wrote:
The MT actually does have a problem unique to it (or at least to its family of Generalized Feedback Shift Registers) where a state with a high proportion of 0 bits will get stuck in a region of successive states with high proportions of 0 bits. Other 623-dimensional equidistributed PRNGs will indeed come across the same states with high 0-bit sequences with the frequency that you expect from probability, but they will be surrounded by states with dissimilar 0-bit proportions. This problem isn't *due* to equidistribution per se, but I think Tim's point is that you are inevitably due to hit one such patch if you sample long enough. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

[Robert Kern <robert.kern@gmail.com>]
Thank you for explaining it better than I did. I implied MT's "stuck in zero-land" problems were _due_ to perfect equidistribution, but of course they're not. It's just that MT's specific permutation of the equidistributed-regardless-of-order range(1, 2**19937) systematically puts integers with "lots of 0 bits" next to each other. And there are many such patches. But 2**19937 is so large you need to contrive the state "by hand" to get there at once. For example,
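a reconstruction along these lines (the original demonstration isn't preserved above; the state tuple layout used is CPython's (version, 624 words + index, gauss_next)):

    import random

    # Hand-contrive the "impossible" state: all 624 MT words zero.
    random.setstate((3, (0,) * 624 + (624,), None))
    print([random.random() for _ in range(12)])   # twelve 0.0s in a row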
That's "impossible" ;-) (1 chance in 2***(53*12)) of seeing 0.0 twelve times in a row)

On Thu, Sep 10, 2015 at 9:23 AM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
If arc4random reseeds with entropy periodically, then jumping ahead past such a reseed is simply a matter of performing a reseed, isn't it? ChrisA

[Steven D'Aprano]
[Greg Ewing]
> Another property that's important for some applications is ...
No for "arc4random" based on RC4, yes for "arc4random" based on ChaCha20, "mostly yes" for "arc4random" in the OpenBSD implementation, wholly unknown for whatever functions that will may be_called_ "arc4random" in the future. The fly in the ointment for the OpenBSD version is that it periodically fiddles its internal state with "entropy" obtained from the kernel. It's completely unreproducible for that reason. However, you can still jump ahead in the state. It's just impossible to say that it's the same state you would have arrived at had you invoked the function that many times instead (the kernel could change the state in unpredictable ways any number of times while you were doing that).;

Reading this thread is fun, but it doesn't seem to be getting anywhere - perhaps that's part of the fun ;-)

Realistically, I see two options:

1. Someone goes and implements the OpenBSD random function in C and puts a package up on PyPI, updating it whenever OpenBSD thinks that a new algorithm is needed or a security issue has to be fixed (from my experience with other crypto software like OpenSSL, this should be on the order of every 2-6 months ;-))

2. Ditto, but we put the module in the stdlib and then run around issuing patch level security releases every 2-6 months.

Replacing our deterministic default PRNG with a non-deterministic one doesn't really fly, since we'd break an important feature of random.random(). You may remember that we already ran a similar stunt with the string hash function, with very mixed results.

Calling the result of such a switch-over "secure" is even worse, since it's a promise we cannot keep (probably not even fully define). Better leave the promise at "insecure" - that's something we can promise forever and don't have to define :-)

Regardless of what we end up with, I think Python land can do better than name it "arc4random". We're great at bike shedding, so how about we start the fun with "randomYMMV" :-)

Overall, I think having more options for good PRNGs is great. Whether this "arc4random" is any good remains to be seen, but given that OpenBSD developed it, chances are higher than usual.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 10 2015)

M.-A. Lemburg <mal@...> writes:
The sane option would be to use the OpenBSD libcrypto, which seems to be part of their OpenSSL fork (libressl), just like libcrypto is part of OpenSSL. Then the crypto maintenance would be delegated to the distributions. I would even be interested in writing such a package, but it would be external and non-redistributable for well-known reasons. :) Stefan Krah

On 10.09.2015 15:39, Stefan Krah wrote:
Well, we already link to OpenSSL for SSL and hashes. I guess exposing the OpenSSL RAND interface in a module would be the easiest way to go about this. pyOpenSSL already does this: http://www.egenix.com/products/python/pyOpenSSL/doc/pyopenssl.html/#document... More pointers: https://wiki.openssl.org/index.php/Random_Numbers https://www.openssl.org/docs/manmaster/crypto/rand.html What's nice about the API is that you can add entropy as you find it.
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 10 2015)
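For what it's worth, CPython already exposes a slice of that OpenSSL interface in the ssl module; a quick sketch (behaviour around an unseeded pool is platform/OpenSSL-version dependent):

    import ssl

    print(ssl.RAND_status())     # true once the OpenSSL PRNG has enough entropy

    key = ssl.RAND_bytes(32)     # 32 cryptographically strong bytes

    # "Add entropy as you find it": mix extra bytes into OpenSSL's pool,
    # along with an estimate (in bytes) of how much entropy they carry.
    ssl.RAND_add(b"some hard-to-predict event data", 0.0)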

M.-A. Lemburg <mal@...> writes:
Yes, my suggestion was based on the premise that OpenBSD's libcrypto (which should include the portable arc4(chacha20)random) is more secure, faster, etc. That's a big 'if': their PRNG had a couple of bugs on Linux last year, but OpenSSL also regularly has issues. Stefan Krah

On Thu, 10 Sep 2015 at 01:26 M.-A. Lemburg <mal@egenix.com> wrote:
I see a third: rename random.random() to be something that gets the point across that it is not crypto secure and then stop at that. I don't think the stdlib should get into the game of trying to provide a RNG that we claim is cryptographically secure as that will change suddenly when a weakness is discovered (this is one of the key reasons we chose not to consider adding requests to the stdlib, for instance).

Theo's key issue is misuse of random.random(), not the lack of a crypto-appropriate RNG in the stdlib (that just happens to be his solution because he has an RNG that he is partially in charge of). So that means either we take a "consenting adults" approach and say we can't prevent people from using code without reading the docs, or we try to rename the function. But then again that won't help with all of the other functions in the random module that implicitly use random.random() (if that even matters; not sure if the helper functions in the module have any crypto use that would lead to their misuse).

Oh, and there is always the nuclear 4th option and we just deprecate the random module. ;) -Brett

[Brett Cannon <brett@python.org>]
The most likely "misuses" in idiomatic Python (not mindlessly translated low-level C) involve some spelling of getting or using random integers, like .choice(), .randrange(), .randint(), or even .sample() and .shuffle(). At least in Python 3, those don't normally ever invoke .random() (neither directly nor indirectly) - they normally use the (didn't always exist) "primitive" .getrandbits() instead (indirectly via the private ._randbelow()). So if something here does need to change, it's all or nothing.
Oh, and there is always the nuclear 4th option and we just deprecate the random module. ;)
I already removed it from the repository. Deprecating it would be a security risk, since it would give hackers information about our future actions ;-)
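Tim's point about the Python 3 code path is easy to see by instrumenting a subclass (purely an illustration):

    import random

    class TracingRandom(random.Random):
        """Report which core primitive the high-level methods consume."""
        def random(self):
            print("random() called")
            return super().random()
        def getrandbits(self, k):
            print("getrandbits(%d) called" % k)
            return super().getrandbits(k)

    r = TracingRandom()
    r.choice("abcdefghij")
    # On Python 3 this prints getrandbits(4) -- possibly several times, since
    # _randbelow() rejects out-of-range draws -- and never random().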

My belief is that doing the safe thing by default is a major plus of Python. From this point of view, using a cryptographically secure PRNG for random.random() should be done if possible. That would not change much about people creating insecure software through lack of knowledge (me first among them), but it could help a little.

> I see a third: rename random.random() to be something that gets the point across it is not crypto secure and then stop at that.
This, in my opinion, would not be a good idea. Having a safe default is a major plus of Python; it is not a reason to give up on a safe default just because one thinks it could eventually become insecure. And comparing a cryptographically secure PRNG with OpenSSL on expected security-release cadence is not fair, because the complexity of the two pieces of software is clearly different.

On 10.09.2015 17:46, Brett Cannon wrote:
I think this is the major misunderstanding here: The random module never suggested that it generates pseudo-random data of crypto quality. I'm pretty sure people doing crypto will know and most others simply don't care :-) Evidence: We used a Wichmann-Hill PRNG as default in random for a decade and people still got their work done. Mersenne was added in Python 2.3 and bumped the period from 6,953,607,871,644 (13 digits) to 2**19937-1 (6002 digits).
Why not add ssl.random() et al. (as interface to the OpenSSL rand APIs)? By putting the crypto random stuff into the ssl module, even people who don't know about the difference will recognize that the ssl version must be doing something more related to crypto than the regular random module one, which never promised this.

Some background on why I think deterministic RNGs are more useful to have as default than non-deterministic ones: A common use case for me is to write test data generators for large database systems. For such generators, I don't keep the many GBs of data around, but instead make the generator take a few parameters which then seed the RNGs, the time module and a few other modules via monkey-patching. This allows me to create reproducible test sets in a very efficient way.

The feature to be able to reproduce a set is typically only needed when tracking down a bug in the system, but the whole setup avoids having to keep the whole test sets around on disk.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 10 2015)
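Boiled down, that pattern is just a privately seeded generator (function and field names invented for illustration):

    import random

    def make_test_rows(seed, n):
        """Reproducible pseudo-data: the same (seed, n) always yields the
        same rows, so large generated test sets never need to be stored."""
        rng = random.Random(seed)        # private, deterministic generator
        return [(i, rng.randrange(10**6), rng.choice("ABCD")) for i in range(n)]

    # Same parameters, same data set -- within one Python version at least
    # (cross-version stability of randrange/choice isn't guaranteed, per the
    # reproducibility notes cited earlier).
    assert make_test_rows(42, 1000) == make_test_rows(42, 1000)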

Executive summary: The question is, "what value is there in changing the default to be crypto strong to protect future security-sensitive applications from naive implementers vs. the costs to current users who need to rewrite their applications to explicitly invoke the current default?" M.-A. Lemburg writes:
I'm pretty sure people doing crypto will know and most others simply don't care :-)
Which is why botnets have millions of nodes. People who do web security evidently believe that inappropriate RNGs have something to do with widespread security issues. (That doesn't mean they're right, but it gives me pause for thought -- evidently, Guido thought so too!)
Evidence: We used a Wichmann-Hill PRNG as default in random for a decade and people still got their work done.
The question is not whether people get their work done. People work (unless they're seriously dysfunctional), that's what people do. Especially programmers (cf. GNU Manifesto). The question is whether the work of the *crackers* is made significantly easier by security holes that are opened by inappropriate use of random.random.

I tend to agree with Steven d'A. (and others) that the answer is no: it doesn't matter if the kind of person who leaves a key under the third flowerpot from the left also habitually leaves the door unlocked (especially if "I'm only gonna be gone for 5 minutes"), and I think that's likely. IOW, installing crypto strong RNGs as default is *not* analogous to the changes to SSL support that were so important that they were backported to 2.7 in a late patch release.

OTOH, why default to crypto weak if crypto strong is easily available? You might save a few million Debian users from having to regenerate all their SSH keys.[1] But the people who are "just getting work done" in new programs *won't notice*. I don't think that they care what's under the hood of random.random, as long as (1) the API stays the same, and (2) the documentation clearly indicates where to find PRNGs that support determinism, jumpahead, replicability, and all those other good things, for the needs they don't have now but know they probably will have some day.

The rub is, as usual, existing applications that would have to be changed for no reason that is relevant to them.

Note that arc4random is much simpler to use than random.random. No knobs to tweak or seeds to store for future reference. Seems perfectly suited to "just getting work done" to me. OTOH, if you have an application where you need replicability, jumpahead, etc, you're going to need to read the docs enough to find the APIs for seeding and so on. At design time, I don't see why it would hurt to select an RNG algorithm explicitly as well.
Why not add ssl.random() et al. (as interface to the OpenSSL rand APIs) ?
I like that naming proposal. I'm sure changing the nature of random.random would annoy the heck out of *many* users. An alternative would be to add random.crypto.
If you've gone to that much effort, you evidently have read the docs and it wouldn't have been a huge amount of trouble to use a non-default module with a specified PRNG -- if you were doing it now. But you have every right to be very peeved if you have a bunch of old test runs you want to replicate with a new version of Python, and we've changed the random.random RNG on you.

Footnotes: [1] I hasten to add that a programmer who isn't as smart as he thinks he is "improving" a crypto algorithm is far more likely than an implementer of a crypto suite choosing an RNG that is inappropriate by design. Still, it's a theoretical possibility, and security is about eliminating every theoretical possibility you can think of.

On 11 September 2015 at 12:07, Stephen J. Turnbull <stephen@xemacs.org> wrote:
They're right. I used to be sanguine about this kind of thing because I spent a long time working in the defence sector, and assumed everyone else was as professionally paranoid as we were. I've been out of that world long enough now to realise that that assumption was deeply, and problematically, wrong*.

In that world, you work on the following assumptions:

1) you're an interesting target;
2) the attackers' compute capacity is nigh infinite;
3) any weakness will be found;
4) any weakness will be exploited;
5) "other weaknesses exist" isn't a reason to avoid addressing the weaknesses you know about.

As useful background, there's a recent Ars Technica article on the technical details of cracking the passwords in the Ashley Madison data dump, where security researchers found a NINE order of magnitude speedup due to a vulnerability in another part of the system which let them drastically reduce the search space for passwords: http://arstechnica.com/security/2015/09/once-seen-as-bulletproof-11-million-...

That kind of reduction in search requirements means that searches that *should* have taken almost 3000 years (in the absence of the vulnerability) can instead be completed within a day. Weak random number generators have a similar effect of reducing the search space for attackers - if you know a weakly random source was used, rather than a cryptographically secure one, then you can use what you know about the random number generator to favour inputs it is *likely* to have produced, rather than having to assume equal probability for the entire search space. And if the target was using a deterministic RNG and you're able to figure out the seed that was used? You no longer need to search at all - you can recreate the exact series of numbers the target was using.

Moving the default random source to a CSPRNG, and allowing folks to move to a faster deterministic PRNG for known non-security related use cases, or to the system random number generator for known security-related ones, is likely to prove a good way to provide safer defaults without reducing flexibility or raising barriers to entry too much.

Regards, Nick.

P.S. * As a case in point, it was only a couple of years ago that I realised most developers *haven't* read docs like the NIST crypto usage guidelines or the IEEE 802.11i WPA2 spec, and don't make a habit of even casually following the progress of block cipher and secure hash function design competitions. It's been an interesting exercise for me in learning the true meaning of "expertise is relative" :)

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

[M.-A. Lemburg]
I'm pretty sure people doing crypto will know and most others simply don't care :-)
[Stephen J. Turnbull <stephen@xemacs.org>]
Which is why botnets have millions of nodes.
I'm not a security wonk, but I'll bet a life's salary ;-) we'd have botnets just as pervasive if every non-crypto RNG in the world were banned - or had never existed. To start a botnet, the key skill is social engineering: tricking ordinary users into installing malicious software. So long as end users are allowed to run programs, that problem will never go away. Hell, I get offers to install malware each day on Facebook alone, although they're *spelled* like "Install Flash update to see this shocking video!". Those never end for the same reason I still routinely get Nigerian 419 spam: there are plenty of people gullible enough to fall for them outright. Technical wizardry isn't needed to get in the door on millions of machines. So if RNGs have something to do with security, it's not with botnets; let's not oversell this.
People who do web security evidently believe that inappropriate RNGs have something to do with widespread security issues.
Do they really? I casually follow news of the latest exploits, and I really don't recall any of them pinned on an RNG (as opposed to highly predictable default RNG _seeding_ from several years back). Mostly out-of-bounds crap in C, or exploiting holes in security models, or bugs in the implementations of those models (whether Microsoft's, Java's, Adobe Flash's ...).
(That doesn't mean they're right, but it gives me pause for thought -- evidently, Guido thought so too!)
Or it's that Theo can be very insistent, and Guido is only brusque with the non-Dutch ;-) Not saying switching is bad. Am saying I've seen no compelling justification for causing users (& book & course authors & ....) such pain. If this were Python 0.9.1 at issue, sure - but random.py's basic API really hasn't changed since then.

Tim Peters writes:
I am in violent agreement with you on that point and most others.[1] However, the analogy was not intended to be so direct as to imply that "insecure" RNGs are responsible for botnets, merely that not caring is. I agree I twisted MAL's words a bit -- he meant most people have no technical need for crypto, and so don't care, I suppose. But then, "doing crypto" (== security) is like "speaking prose": a lot of folks doing it don't realize that's what they're doing -- and they don't care, either.
So long as end users are allowed to run programs, that problem will never go away.
s/users/programmers/ and s/run/write/, and we get a different analogy that is literally correct -- but fails in an important dimension. One user's mistake adds one node to the botnet, and that's about as bad as one user's mistake gets in terms of harm to third parties. But one programmer's (or system administrator's) mistake can put many, perhaps millions, at risk.

Personally I doubt that justifies an API break here, even if we can come up with attacks where breaking the PRNG would be cost-effective compared to "social engineering" or theft of physical media. I think it does justify putting quite a bit of thought into ways to make it easier for naive programmers to do the "safe" thing even if they technically don't need it.

I will say that IMO the now-traditional API was a very unfortunate choice. If you have a CSPRNG that just generates "uniform random numbers" and has no user-visible APIs for getting or setting state, it's immediately obvious to the people who know they need access to state what they need to do -- change "RNG" implementation. The most it might cost them is rerunning an expensive base case simulation with a more appropriate implementation that provides the needed APIs.

On the other hand, if you have something like the MT that "shouldn't be allowed anywhere near a password", it's easy to ignore the state access APIs, and call it the same way that you would call a CSPRNG. In fact that's what's documented as correct usage, as Paul Moore points out. Thus, programmers who are using a PRNG whose parameters can be inferred from its output, and should not be doing so, generally won't know it until the (potentially widespread) harm is done. It would be nice if it wasn't so easy for them to use the MT.

Footnotes: [1] I think "agree with Tim" is a pretty safe default.<wink/>

... [Tim]
[Stephen J. Turnbull <stephen@xemacs.org>]
I don't know that it's true, though. Crypto wonks are like lawyers that way, always worrying about the worst possible case "in theory". In my personal life, I've had to tell lawyers "enough already - I'm not paying another N thousand dollars to insert another page about what happens in case of nuclear war". Crypto wonks have no limit on the costs they'd like to impose either - a Security State never does.
So long as end users are allowed to run programs, that problem will never go away.
Not really. The best social engineering is for a bot to rummage through your email address book and send copies of itself to people you know, appearing to be a thoroughly legitimate email from you. Add a teaser to invite the recipient to click on the attachment, and response rate can be terrific. And my ISP (like many others) happens to provide a free industrial-strength virus/malware scanner/cleaner program. I doubt that's because they actually care about me ;-) Seems more likely they don't want to pay the costs of hosting millions of active bots.
But one programmer's (or system administrator's) mistake can put many, perhaps millions, at risk.
What I question is whether this has anything _plausible_ to do with Python's PRNG.
I remain unclear on what "the danger" is thought to be, such that replacing with a CSPRNG could plausibly prevent it. For example, I know how to deduce the MT's internal state from outputs (and posted working code for doing so, requiring a small fraction of a second given 624 consecutive 32-bit outputs). But it's not an easy problem _unless_ you have 624 consecutive 32-bit outputs. It's far beyond the ken of script kiddies. If it's the NSA, they can demand you turn over everything anyway, or just plain steal it ;-) Consider one of these "password" examples: import string, random alphabet = string.ascii_letters + string.digits print(random.choice(alphabet)) Suppose that prints 'c'. What have you learned? Surprisingly, perhaps, very little. You learned that one 32-bit output of MT had 0b000010 as its first 6 bits. You know nothing about its other 26 bits. And you don't know _which_ MT 32-bit output: internally, .choice() consumes as many 32-bit outputs as it needs until it finds one whose first six bits are less than 62 (len(alphabet)). So all you've learned about MT is that, at the time .choice() was called: - 0 or more 32-bit outputs x were such that (x >> 26) >= 62. - Then one 32-bit output x had (x >> 26) == 2. This isn't much to go on. To deduce the whole state easily, you need to know 19,968 consecutive output bits (624*32). Get more and more letters from the password generator, and you learn more and more about the first 6 bits of an unknowable number of MT outputs consumed, but learn nothing whatsoever about any of the lower 26 bits of any of them, and learn nothing but a range constraint on the first 6 bits of an unknowable number of outputs that were skipped over. Sure, every clue reveals _something_. In theory ;-) Note that, as explained earlier in these messages, Python's default _seeding_ of MT is already computationally infeasible to attack (urandom() is already used to set the entire massive internal state). _That_ I did worry about in the past. So in the above I'm not worried at all that an attacker exploited poor default seeding to know there were only a few billion possible MT states _before_ `c` was generated. All MT states are possible, and MT's state space is large beyond comprehension (let alone calculation). Would the user _really_ be better off using .urandom()? I don't know. Since a crypto wonk will rarely recommend doing anything _other_ than using urandom() directly, I bet they'd discourage using .choice() at all, even if it is built on urandom(). Then the user will write their own implementation of .choice(), something like: u = urandom(n) # for some n letter = alphabet[int(u / 2.0**(8*n) * len(alphabet))] If they manage to get that much right, _now_ they've introduced a statistical bias unless len(alphabet) is a power of 2. If we're assuming they're crypto morons, chances are good they're not rock stars at this kind of thing either ;-)
I will say that IMO the now-traditional API was a very unfortunate choice.
Ah, but I remember 1990 ;-) Python's `random` API was far richer than "the norm" in languages like C and FORTRAN at the time. It was a delight! Judging it by standards that didn't become trendy until much later hardly seems fair ;-)
I don't recall any language _at the time_ that did so.
As above, I'm not sure a real crypto wonk would endorse a module that provided any more than a bare-bones API, forcing the user to build everything else out of one or two core primitives. Look at, e.g., how tiny the OpenBSD arc4random API is. I do applaud it for offering

    arc4random_uniform(uint32_t upper_bound)

That's exactly what's needed to, e.g., build a bias-free .choice() (provided you have fewer than 2**32 elements to choose from).
On the other hand, if you have something like the MT that "shouldn't be allowed anywhere near a password",
As above, I think that's a weak claim. The details matter. As an example of a strong claim: you should never, ever use MT to produce the keystream for a stream cipher. But only a crypto wonk would be trying to generate a keystream to begin with. And any user who did use MT for that purpose is probably so clueless they'd forget the xor and just send the plaintext ;-)
And yet nobody so far has produced a single example of any harm done in any of the near-countless languages that supply non-crypto RNGs. I know, my lawyer gets annoyed too when I point out there hasn't been a nuclear war ;-) Anyway, if people want to pursue this, I suggest adding a _new_ module doing exactly whatever it is certified crypto experts say is necessary. We can even give it a name shorter than "random" to encourage its use. That's all most users really care about anyway ;-)
Footnotes: [1] I think "agree with Tim" is a pretty safe default.<wink/>
It's not always optimal, but, yes, you could do worse ;-)

On Sun, Sep 13, 2015 at 10:29 PM, Tim Peters <tim.peters@gmail.com> wrote:
Here you go: https://media.blackhat.com/bh-us-12/Briefings/Argyros/BH_US_12_Argyros_PRNG_...

They present real-world attacks on PHP applications that use something like the "password generation" code we've been talking about as a way to generate cookies and password reset nonces, including in particular the case of applications that use a strongly-seeded Mersenne Twister as their RNG:

"We develop a suite of new techniques and tools to mount attacks against all PRNGs of the PHP core system even when it is hardened with the Suhosin patch [which adds strong seeding] and apply our techniques to create practical exploits for a number of the most popular PHP applications (including Mediawiki, Gallery, osCommerce and Joomla) focusing on the password reset functionality. Our exploits allow an attacker to completely take over arbitrary user accounts."

"Section 5.3: ... In this section we give a description of the Mersenne Twister generator and present an algorithm that allows the recovery of the internal state of the generator even when the output is truncated. Our algorithm also works in the presence of non consecutive outputs ..."

Out of curiosity, I tried searching github for "random cookie language:python". The 5th hit (out of ~100k) was a web project that appears to use this insecure method to generate session cookies: https://github.com/bytasv/bbapi/blob/34e294becb22bae6e685f2e742b7ffdb53a83bc... https://github.com/bytasv/bbapi/blob/34e294becb22bae6e685f2e742b7ffdb53a83bc...

(Fortunately this project doesn't appear to actually have any login or permissions functionality, so I don't think this is an actual CVE-worthy bug, but that's just luck -- I'm sure there are plenty of projects that start out looking like this one and then add security features without revisiting how they generate session ids.)

There's a reason security people are so Manichean about these kinds of things. If something is not intended to be secure or used in security-sensitive ways, then fine, no worries. But if it is, then there's no point in trying to mess around with "probably mostly secure" -- either solve the problem right or don't bother. (See also: the time Python wasted trying to solve hash randomization without actually solving hash randomization [1].) If Tim Peters can get fooled into thinking something like using MT to generate session ids is "probably mostly secure", then what chance do the rest of us have? <wink>

NB this isn't an argument for *whether* we should make random cryptographically strong by default; it's just an argument against wasting time debating whether it's already "secure enough". It's not secure. Maybe that's okay, maybe it's not. For the record though I do tend to agree with the idea that it's not okay, because it's an increasingly hostile world out there, and secure-random-by-default makes whole classes of these issues just disappear. It's not often that you get to fix thousands of bugs with one commit, including at least some with severity level "all your users' private data just got uploaded to bittorrent".

I like Nick's proposal here: https://code.activestate.com/lists/python-ideas/35842/ as probably the most solid strategy for implementing that idea -- the only projects that would be negatively affected are those that are using the seeding functionality of the global random API, which is a tiny fraction, and the effect on those projects is that they get nudged into using the superior object-oriented API.

-n

[1] https://lwn.net/Articles/574761/

-- Nathaniel J. Smith -- http://vorpus.org

On 14.09.2015 08:38, Nathaniel Smith wrote:
I don't think that Tim can get fooled into believing he is a crypto wonk ;-)

The thread reveals another misunderstanding: Broken code doesn't get any better when you change the context in which it is run. By fixing the RNG used in such broken code and making it harder to run attacks, you are only changing the context in which the code is run. The code itself still remains broken. Code which uses the output from an RNG as session id without adding any additional security measures is broken, regardless of what kind of RNG you are using. I bet such code will also take any session id it receives as cookie and trust it without applying extra checks on it.

Rather than trying to fix up the default RNG in Python by replacing it with a crypto RNG, it's better to open bug reports to get the broken software fixed.

Replacing the default Python RNG with a new unstudied crypto one will likely introduce problems into working code which rightly assumes the proven statistical properties of the MT. Just think of the consequences of adding unwanted bias to simulations. This is far more likely to go unnoticed than a session hijack due to a broken system and can easily cost millions (or earn you millions - it's all probability after all :-)).

Now, pointing people who write broken code to a new module which provides a crypto RNG probably isn't much better either. They'd feel instantly secure because it says "crypto" on the box and forget about redesigning their insecure protocol as well. Nothing much you can do about that, I'm afraid. Too easy sometimes is too easy indeed ;-)

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Sep 14 2015)

Quick fix! The problem with MT would be someone having all 624 32-bit numbers from the state. So, every now and then, the random generator should run twice and discard one of the outputs. Do this about 20 times for each 624 calls and no brute force can find the state. Thanks for your attention ;) João Bernardo

On 14 September 2015 at 13:26, João Bernardo <jbvsmo@gmail.com> wrote:
'Every now and then': what's that? Is it a deterministic interval or a random one? If a random one, where does the random number come from: MT? If deterministic, it's trivial to include the effect in your calculations. More generally, what an attacker is doing here is gaining *information* about the state. They don't have to know it perfectly, just reduce the space of possible states. Even if you threw 95% of the results of MT away, each output I do get to see lets me reduce the space of possible states the MT can be in. This is not a fix.

On Mon, Sep 14, 2015 at 09:26:33AM -0300, João Bernardo wrote:
No, that's not good enough. You can skip a few outputs and still recover the internal state. The more outputs are skipped, the harder it becomes, but it is still possible.
Do this about 20 times for each 624 calls and no brute force can find the state. Thanks for your attention ;)
This is not brute force. The recovery attack does not try generating every possible internal state. The MT is a big, complicated equation (technically, it is called a recurrence relation), but being an equation, it is completely deterministic. Given enough values, we can build another equation which can be solved to give the internal state of the MT equation. Are you suggesting that every time you call random.random(), it should secretly generate 20 random numbers and throw away all but the last? (1) That would break backwards compatibility for those who need it. The output of random() is stable across versions: [steve@ando ~]$ for vers in 2.4 2.5 2.6 2.7 3.1 3.2 3.3 3.4; do
(There's a change in the printable output starting in 3.2, but the numbers themselves are the same.) (2) it would make the random number generator twenty times slower than it is now, and MT is already not very fast; (3) most importantly, I don't think that would even solve the problem. I personally don't know how, but I would predict that somebody with more maths skills than me would be able to still recover the internal state. They would just have to collect more values. -- Steve
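(To make point (1) concrete, this is the property at stake - a small sketch; it only demonstrates determinism within one interpreter, while the cross-version stability Steven shows above has to be checked by comparing the printed values across installations:)

    import random

    random.seed(12345)
    first = [random.random() for _ in range(5)]
    random.seed(12345)
    second = [random.random() for _ in range(5)]
    assert first == second    # the same seed must keep reproducing the same sequence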

On Mon, Sep 14, 2015 at 3:37 AM, M.-A. Lemburg <mal@egenix.com> wrote:
As an aphorism this sounds nice, but logically it makes no sense. If the broken thing about your code is that it assumes that the output of the RNG is unguessable, and you change the context by making the output of the RNG unguessable, then the code isn't broken any more. The code would indeed remain broken when run under e.g. older interpreters, but this is not an argument that we should make sure that it stays broken in the future.
Yes, that's... generally the thing you do with session cookies? They're shared secret strings that you use as keys into some sort of server-side session database? What extra checks need to be applied?
I'm afraid you just don't understand what you're talking about here. When it comes to adding bias to simulations, all crypto RNGs have *better* statistical properties than MT. A crypto RNG which was merely as statistically-well-behaved as MT would be considered totally broken, because MT doesn't even pass black-box tests of randomness like TestU01.
Yes, improving the RNG only helps with some problems, not others; it might merely make a system harder to attack, rather than impossible to attack. But giving people unguessable random numbers by default does solve real problems. -n -- Nathaniel J. Smith -- http://vorpus.org

On Mon, Sep 14, 2015 at 10:26 PM, Nathaniel Smith <njs@pobox.com> wrote:
Some systems check to see if the session was created by the same IP address. That can help, but it also annoys legitimate users who change their IP addresses. ChrisA

[This is getting off-topic, so I'll stop after this reply] On 14.09.2015 14:26, Nathaniel Smith wrote:
It's still broken, because it's making wrong assumptions about the documented context - and the fact that it did so in the first place suggests that this is not the only respect in which it is broken (pure speculation, but experience shows that bugs usually run around in groups ;-)).
You will at least want to add checks that the session id string was indeed generated by the server and not some bot trying to find valid session ids, e.g. by signing the session id and checking the sig on incoming requests. Other things you can do: fold timeouts into the id, add IP addresses, browser sigs, request sequence numbers. You also need to make sure that the session ids are drawn from a large enough set that guessing a valid id is highly unlikely even when the number of active sessions is significant compared to the universe of possible ids, e.g. 32-bit ids are great for database indexes, but a pretty bad idea if you have millions of active sessions.
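(A rough illustration of the "sign the id and check the sig" idea - a sketch only; the names and token format are made up for illustration, not a recommendation:)

    import hmac, hashlib, os, binascii

    SERVER_KEY = os.urandom(32)                      # secret, held only by the server

    def new_session_id():
        sid = binascii.hexlify(os.urandom(16)).decode()
        sig = hmac.new(SERVER_KEY, sid.encode(), hashlib.sha256).hexdigest()
        return sid + "." + sig                       # random id plus server signature

    def check_session_id(token):
        try:
            sid, sig = token.split(".", 1)
        except ValueError:
            return False
        expected = hmac.new(SERVER_KEY, sid.encode(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(sig, expected)    # constant-time comparison

    token = new_session_id()
    assert check_session_id(token)
    assert not check_session_id(token + "0")         # tampered signature is rejected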
I am well aware that MT doesn't satisfy all empirical tests and also that it is not a CSPRNG (see the code Tim and I discussed in this thread showing how easy it is to synchronize to an existing MT RNG if you can gain knowledge of 624 output values).

However, it has been extensively studied and it is proven to be equidistributed, which is a key property needed for it to be used as basis for other derived probability distributions (as is done by the random module). For CSPRNGs you can empirically test such properties, but, due to their nature, you cannot prove them - e.g. prove that they are equidistributed - even though they will usually pass standard frequency tests. For real-life purposes, you're probably right with them not being biased. I'm a mathematician, though, so I like provable more than empirical :-)

The main purpose of CSPRNGs is producing output which you cannot guess, not to produce output which has provable distribution qualities. They do this in a more efficient way than having to wait for enough entropy to be collected - basically making true random number generators practically usable.

There's a new field which appears to be popular these days: "Chaotic Pseudo Random Number Generators" (CPRNGs). These are based on chaotic systems and are great for making better use of available entropy. I'm sure we'll have something similar to the MT for these chaotic systems come out of this research in a while and then Python should follow this by implementing it in a new module.

Until then, I think it's worthwhile using the existing rand code in OpenSSL and exposing this through the ssl module: https://www.openssl.org/docs/man1.0.1/crypto/rand.html

It interfaces to platform hardware true RNGs where available, and falls back to an SHA-1-based generator with a 1k entropy pool where needed. It's being used for SSL session keys, key generation, etc., trusted by millions of people and passes the NIST tests. This paper explains the algorithm in more detail: http://webpages.uncc.edu/yonwang/papers/lilesorics.pdf

The downside of the OpenSSL implementation is that it can fail if there isn't enough entropy available. Here's a slightly better algorithm, but it's just one of many which you can find when searching for CPRNGs: https://eprint.iacr.org/2012/471.pdf
Drop the "by default" and I agree, as will probably everyone else in this thread :-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Sep 14 2015)

On Sep 14 2015, "M.-A. Lemburg" <mal-SVD0I98eSHvQT0dZR+AlfA@public.gmane.org> wrote:
The chance of a bot hitting a valid (randomly generated) session key should be just as high as the chance of it generating a correctly signed session key, if I'm not mistaken. (Assuming, of course, that the completely random key has the same number of bits as the other key + signature.) Best, -Nikolaus -- GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F »Time flies like an arrow, fruit flies like a Banana.«

On 14/09/15 19:15, M.-A. Lemburg wrote:
I am well aware that MT doesn't satisfy all empirical tests and also that it is not a CSPRNG
And with this criterion, only MT and certain PCG generators are acceptable. Those are (to my knowledge) the only ones with proven equidistribution. Sturla

On 14/09/15 21:07, Sturla Molden wrote:
Just to explain, for those who do not know...

Equidistribution means that the numbers are "uniformly distributed", or specifically that "the proportion of the sequence that fall within an interval is proportional to the length of the interval". With one-dimensional equidistribution, the deviates are uniformly distributed on a line. With two-dimensional equidistribution, the deviates are uniformly distributed in a square. With three-dimensional equidistribution, the deviates are uniformly distributed in a cube. k-dimensional equidistribution generalizes this up to a k-dimensional space.

Let us say you want to simulate a shooter firing a gun at a target. Every bullet is aimed at the target and hits in a slightly different place. The shooter is unbiased, but there will be some random jitter. The probability of hitting the target should be proportional to its size, right? Perhaps!

Mersenne Twister MT19937 (used in Python) is proven to have 623-dimensional equidistribution. Certain PCG generators are proven to have equidistribution of arbitrary dimensionality. Your simulation of the shooter will agree with common sense if you pick one of these. With other generators, there is no such k-dimensional equidistribution, and your simulation of the shooter will disagree with common sense. Which is correct? Common sense. From a mathematical point of view, this is so important that anything other than the Mersenne Twister or PCG is not worth considering in a Monte Carlo simulation.

Sturla

On 2015-09-14 20:07, Sturla Molden wrote:
Do not confuse k-dimensional equidistribution with "equidistribution". The latter property (how uniformly a single draw is distributed) is the one that the derived probability distributions rely upon, not the former. Funny story: MT is provably *not* strictly equidistributed; it produces exactly 624 fewer 0s than it does any other uint32 if you run it over its entire period. Not that it really matters, practically speaking. FWIW, lots of PRNGs can prove either property.

To Nate's point, I think he is primarily thinking of counter-mode block ciphers when he talks of CSPRNGs, and they are trivially proved to be equidistributed. The counter is obviously equidistributed, and the symmetric encryption function is a bijective function from counter to output. However, not all CSPRNGs are constructed alike. In particular, ChaCha20 is a stream cipher rather than a block cipher, and I think Marc-Andre is right that it would be difficult to prove equidistribution. Proving substantial *non*-equidistribution could eventually happen though, as it did to ARC4, which prompted its replacement with ChaCha20 in OpenBSD, IIRC.

And all that said, provable equidistribution (much less provable k-dimensional equidistribution) doesn't make a good PRNG. A simple counter satisfies equidistribution, but it is a terrible PRNG. The empirical tests are more important IMO.

-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
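(A toy sketch of that counter-mode argument - purely illustrative; the "cipher" below is just a keyed permutation of the 8-bit block space, not a real block cipher. Any bijection applied to a counter visits every block value exactly once per period, hence is exactly equidistributed:)

    import random as _random
    from collections import Counter

    rng = _random.Random(0)                 # only used to build the toy permutation
    perm = list(range(256))
    rng.shuffle(perm)                       # a bijection on 8-bit blocks ("encryption")

    outputs = [perm[counter] for counter in range(256)]   # CTR mode over one full period
    assert sorted(outputs) == list(range(256))            # every value appears exactly once
    print(Counter(outputs).most_common(1))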

On 2015-09-14 21:56, Sturla Molden wrote:
Yeah, but we do that every time we draw k numbers in a simulation at all. And we usually draw millions. In that case, perfect k=623-dimensional equidistribution is not really any better than k=1, provided that the PRNG is otherwise good. The requirement for a good PRNG for simulation work is that it be *well* distributed in reasonable dimensions, not that it be *exactly* equidistributed for some k. And well-distributedness is exactly what is tested in TestU01. It is essentially a collection of simulations designed to expose known statistical flaws in PRNGs. So to your earlier question as to which is more damning, failing TestU01 or not being perfectly 623-dim equidistributed, failing TestU01 is. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

On 14.09.2015 23:19, Robert Kern wrote:
Depends on your use case, but the fact that you can prove it is what really matters - well, at least for me :-)
TestU01 includes tests which PRNGs of the MT type have trouble passing, since they are linear. This makes them poor choices for crypto applications, but does not have much effect on simulations using only a tiny part of the available period. For MT there's an enhanced version called SFMT which performs better in this respect: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/M062821.pdf (the paper also discusses the linear dependencies). See http://www.jstatsoft.org/v50/c01/paper for a discussion of MT vs. SFMT. You can also trick TestU01 into having all tests pass by applying a non-linear transformation (though I don't really see the point in doing this). The WELL family of generators is a newer development, which provides even better characteristics: http://www.iro.umontreal.ca/~lecuyer/myftp/papers/wellrng.pdf

Also note that by seeding the MT in Python with truly random data, the shortcomings of MT w/r/t having problems escaping "zeroland" (many 0 bits in the seed) are mostly avoided.

Anyway, it's been an interesting discussion, but I think it's time to let go :-)

Here's a first cut at an implementation of the idea to use OpenSSL's rand API as basis for an RNG. It even supports monkey patching the random module, though I don't think that's good design.

    """ RNG based on OpenSSL's rand API.

        Marc-Andre Lemburg, 2015. License: MIT
    """
    # Needs OpenSSL installed: pip install egenix-pyopenssl
    from OpenSSL import rand
    import random, struct, binascii

    # Number of bits in an IEEE float
    BITS_IN_FLOAT = 53

    # Scale to apply to RNG output to make uniform
    UNIFORM_SCALING = 2 ** -BITS_IN_FLOAT

    ### Helpers

    # Unpacker
    def str2long(value):
        value_len = len(value)
        if value_len <= 4:
            if value_len < 4:
                value = '\0' * (4 - value_len) + value
            return struct.unpack('>L', value)[0]
        elif value_len <= 8:
            if value_len < 8:
                value = '\0' * (8 - value_len) + value
            return struct.unpack('>Q', value)[0]
        return long(binascii.hexlify(value), 16)

    ###

    class OpenSSLRandom(random.Random):

        """ RNG using the OpenSSL rand API, which provides a
            cross-platform cryptographically secure RNG.
        """
        def random(self):
            """ Return a random float from [0.0, 1.0).
            """
            return (str2long(rand.bytes(7)) >> 3) * UNIFORM_SCALING

        def getrandbits(self, bits):
            """ Return an integer with the given number of random bits.
            """
            if bits <= 0:
                raise ValueError('bits must be >0')
            if bits != int(bits):
                raise TypeError('bits must be an integer')
            # Get enough bytes for the requested number of bits
            numbytes = (bits + 7) // 8
            x = str2long(rand.bytes(numbytes))
            # Truncate bits, if needed
            return x >> (numbytes * 8 - bits)

        def seed(self, value=None):
            """ Feed entropy to the RNG.

                value may be None, an integer or a string. If None,
                2.5k bytes data are read from /dev/urandom and fed
                into the RNG.
            """
            if value is None:
                try:
                    value = random._urandom(2500)
                    entropy = 2500
                except NotImplementedError:
                    return
            if isinstance(value, (int, long)):
                # Feed the integer's hex digits to the pool
                value = '%x' % value
                entropy = len(value)
            else:
                value = str(value)
                entropy = len(value)
            # Let's be conservative regarding the available entropy
            # in value
            rand.add(value, entropy / 2)

        def _notimplemented(self, *args, **kwds):
            raise NotImplementedError(
                'OpenSSL RNG does not implement this method')
        getstate = _notimplemented
        setstate = _notimplemented

    ### Testing

    def install_as_default_rng():
        """ Monkey patch the random module
        """
        _inst = OpenSSLRandom()
        random._inst = _inst
        for attr in ('seed', 'random', 'uniform', 'triangular',
                     'randint', 'choice', 'randrange', 'sample',
                     'shuffle', 'normalvariate', 'lognormvariate',
                     'expovariate', 'vonmisesvariate', 'gammavariate',
                     'gauss', 'betavariate', 'paretovariate',
                     'weibullvariate', 'getstate', 'setstate',
                     'jumpahead', 'getrandbits'):
            setattr(random, attr, getattr(_inst, attr))

    def _test():
        # Install
        install_as_default_rng()
        # Now run the random module tests
        random._test()

    ###

    if __name__ == '__main__':
        _test()

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Sep 14 2015)

On Mon, Sep 14, 2015 at 10:19:09PM +0100, Robert Kern wrote:
I'm confused here. Isn't "well-distributed" a less-strict test than "exactly equidistributed"? MT is (almost) exactly k-equidistributed up to k = 623, correct? So how does it fail the "well-distributed" test? -- Steve

On Mon, Sep 14, 2015 at 8:53 PM, Steven D'Aprano <steve@pearwood.info> wrote:
No, "well-distributed" means "distributed like true random stream", which is a strong and somewhat fuzzy requirement. "k-equidistributed" is one particular way of operationalizing this requirement, but it's not a very good one. The idea of k-equidistribution is historically important because it spurred the development of generators that avoided the really terrible flaws of early designs like Unix rand, but it's not terribly relevant to modern RNG design. Here's how it works. Formally speaking, a randomized algorithm or Monte Carlo simulation or similar can be understood as a function mapping an infinite bitstring to some output value, F : R -> O. If we sample infinite bitstrings R uniformly at random (using a theoretical "true" random number generator), and then apply F to each bitstring, then this produces some probability distribution over output values O. Now given some *pseudo* random number generator, we consider our generator to be successful if that repeatedly running F(sample from this RNG) gives us the same distribution over output values as if we had repeatedly run F(true random sample). So for example, you could have a function F that counts up how many times it sees a zero or a one in the first 20,000 entries in the bitstring, and you expect those numbers to come up at ~10,000 each with some distribution around that. If your RNG is such that you reproduce that distribution, then you pass this function. Note that this is a little counterintuitive: if your RNG is such that over the first 20,000 entries it always produces *exactly* 10,000 zeros and 10,000 ones, then it fails this test. The Mersenne Twister will pass this test. Or you could have a function F that takes the first 19937 bits from the random stream, uses it to construct a model of the internal state of a Mersenne Twister, predicts the next 1000 bits, and returns True if they match and False if they don't match. On a real random stream this function returns True with probability 2^-1000; on a MT random stream it returns True with probability 1. So the MT fails this test. OTOH this is obviously a pretty artificial example. The only case the scientists actually care about is the one where F is "this simulation right here that I'm trying to run before the conference deadline". But since scientists don't really want to design a new RNG for every simulation, we instead try to design our RNGs such that for all "likely" or "reasonable" functions F, they'll probably work ok. In practice this means that we write down a bunch of explicit test functions F inside a test battery like TestU01, run the functions on the RNG stream, and if their output is indistinguishable in distribution from what it would be for a true random stream then we say they pass. And we hope that this will probably generalize to the simulation we actually care about. Cryptographers are worried about the exact same issue -- they want RNGs that have the property that for all functions F, the behavior is indistinguishable from true randomness. But unlike the scientists, they're not content to say "eh, I checked a few functions and it seemed to work on those, probably the ones I actually care about are okay too". 
The cryptographers consider it a failure if an adversary with arbitrary computing power, full knowledge of the RNG algorithm, plus other magic powers like the ability to influence the RNG seeding, can invent *any* function F that acts differently (produces a different distribution over outputs) when fed the input from the RNG as compared to a true random stream. The only rule for them is that the function has to be one that you can actually implement on a computer that masses, like, less than Jupiter, and only has 1000 years to run. And as far as we can tell, modern crypto RNGs succeed at this very well. Obviously the thing the scientists worry about is a *strict* subset of what the cryptographers are worried about. This is why it is silly to worry that a crypto RNG will cause problems for a scientific simulation. The cryptographers take the scientists' real goal -- the correctness of arbitrary programs like e.g. a monte carlo simulation -- *much* more seriously than the scientists themselves do. (This is because scientists need RNGs to do their real work, whereas for cryptographers RNGs are their real work.) Compared to this, k-dimensional equidistribution is a red herring: it requires that you have a RNG that repeats itself after a while, and within each repeat it produces a uniform distribution over bitstrings of some particular length. By contrast, a true random bitstring does not repeat itself, and it gives a uniform distribution over bitstrings of *arbitrary* length. In this regard, crypto RNGs are like true random bitstrings, not like k-equidistributed RNGs. This is a good thing. k-equidistribution doesn't really hurt (theoretically it introduces flaws, but for realistic designs they don't really matter), but if randomness is what you want then crypto RNGs are better. I-should-really-get-a-blog-shouldn't-I'ly-yrs, -n -- Nathaniel J. Smith -- http://vorpus.org
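(A minimal sketch of the first kind of test function F described above - counting ones among the first 20,000 bits and comparing against the binomial expectation. Both MT and a crypto RNG should pass this one; it is the state-reconstruction F that MT fails:)

    import math, random

    def F(getrandbits, nbits=20000):
        ones = bin(getrandbits(nbits)).count("1")
        # Under true randomness, ones ~ Binomial(nbits, 0.5):
        # mean nbits/2, standard deviation sqrt(nbits)/2.
        return (ones - nbits / 2) / (math.sqrt(nbits) / 2)   # ~ standard normal deviate

    print(F(random.getrandbits))                  # Mersenne Twister
    print(F(random.SystemRandom().getrandbits))   # OS-backed CSPRNG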

On 15.09.2015 09:36, Nathaniel Smith wrote:
I think this explains why we cannot make ends meet: A scientist wants to be able to *repeat* a simulation in exactly the same way without having to store GBs of data (or send them to colleagues to have them verify the results). Crypto RNGs cannot provide this feature by design. What people designing PRNGs are after is to improve the statistical properties of these PRNGs while still maintaining the repeatability of the output.
Yes, cryptographers are the better folks, understood. These arguments are not really helpful. They are not even arguments. It's really simple: If you don't care about being able to reproduce your simulation results, you can use a crypto RNG, otherwise you can't.
k-dim equidistribution is a way to measure how well your PRNG behaves, because it describes in analytical terms how far you can get with increasing the linear complexity of your RNG output. The goal is not to design a PRNG with a specific k, but to increase k as far as possible, given the RNG's deterministic limitations. It's also not something you require of a PRNG, it's simply a form of analytical measurement, just like the tests in TestU01 or the NIST test set are statistical measurements for various aspects of RNGs. Those statistical tests are good at detecting flaws of certain kinds, but they are not perfect. If you know the tests, you can work around them and have your RNG appear to pass them, e.g. you can trick a statistical test for linear dependencies by applying a non-linear transform. That doesn't make the RNGs better, but it apparently is a way to convince some people of the quality of your RNG.
If you can come up with a crypto RNG that allows repeating the results, I think you'd have us all convinced, otherwise it doesn't really make sense to compare apples and oranges, and insisting that orange juice is better for you than apple juice ;-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Sep 15 2015)

On Tue, Sep 15, 2015 at 1:45 AM, M.-A. Lemburg <mal@egenix.com> wrote:
Err... I think we're arguing past each other. (Hint: I'm a scientist, not a cryptographer ;-).) My email was *only* trying to clear up the argument that keeps popping up about whether or not a cryptographic RNG could introduce bias in simulations etc., as compared to the allegedly-better-behaved Mersenne Twister. (As in e.g. your comment upthread that "[MT] is proven to be equidistributed which is a key property needed for it to be used as basis for other derived probability distributions".) This argument is incorrect -- equidistribution is not a guarantee that an RNG will produce good results when deriving other probability distributions, and in general cryptographic RNGs will produce as-or-better results than MT in terms of correctness of output. On this particular axis, using a cryptographic RNG is not at all dangerous. Obviously this is only one of the considerations in choosing an RNG; the quality of the randomness is totally orthogonal to considerations like determinism. (Cryptographers also have deterministic RNGs -- they call them "stream ciphers" -- and these will also meet or beat MT in any practically relevant test of correctness for the same reasons I outlined, despite not being provably equidistributed. Of course there are then yet other trade-offs like speed. But that's not really relevant to this thread, because no-one is proposing replacing MT as the standard deterministic RNG in Python; I'm just trying to be clear about how one judges the quality of randomness that an RNG produces.) -n -- Nathaniel J. Smith -- http://vorpus.org

On 15.09.2015 13:41, Nathaniel Smith wrote:
Ok, thanks for the clarification.
You won't get me to agree on "statistical tests are better than mathematical proofs", so let's call it a day :-)
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Sep 15 2015)

On 15/09/15 10:45, M.-A. Lemburg wrote:
Yes and no. Conceptually it means that k subsequent samples will have exactly zero correlation. But any PRNG that produces detectable correlation between samples 623 steps apart is junk anyway. The MT has proven equidistribution for k=623, but many have measured equidistribution for far longer periods than that. Numerical computations are subject to rounding error and truncation error whatever you do. The question is whether the deviation from k-dim equidistribution will show up in your simulation result or drown in the error terms. Sturla

On 15.09.2015 14:09, Sturla Molden wrote:
I guess the answer is: it depends :-) According to the SFMT paper: """ ...it requires 10**28 samples to detect an F2-linear relation with 15 (or more) terms among 521 bits, by a standard statistical test. If the number of bits is increased, the necessary sample size is increased rapidly. Thus, it seems that k(v) of SFMT19937 is sufficiently large, far beyond the level of the observable bias. On the other hand, the speed of the generator is observable. """ http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/M062821.pdf (which again refers to this paper: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/HONGKONG/hong-fin4....) 10**28 is already a lot of data, but YMMV, of course. Here's a quote for the WELL family of PRNGs: """ The WELL generators mentioned in Table IV successfully passed all the statistical tests included ... TestU01 ..., except those that look for linear dependencies in a long sequence of bits, such as the matrix-rank test ... for very large binary matrices and the linear complexity tests ... This is in fact a limitation of all F2-linear generators, including the Mersenne twister, the TT800, etc. Because of their linear nature, the sequences produced by these generators just cannot have the linear complexity of a truly random sequence. This is definitely unacceptable in cryptology, for example, but is quite acceptable for the vast majority of simulation applications if the linear dependencies are of long range and high order. """ http://www.iro.umontreal.ca/~lecuyer/myftp/papers/wellrng.pdf -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Sep 15 2015)

On 15/09/15 16:20, M.-A. Lemburg wrote:
You seem to be confusing the DCMT with the SFMT which is a fast SIMD friendly Mersenne Twister. The DCMT is intended for using the Mersenne Twister in parallel computing (i.e. one Mersenne Twister per processor). It is not a Mersenne Twister accelerated with parallel hardware. That would be the SFMT. http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/DC/dc.html http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/DC/dgene.pdf The period for the DC Mersenne Twisters they report are long enough, e.g. 2^127-1 or 2^521-1, but much shorter than the period of MT19937 (2^19937-1). This does not matter because the period of MT19937 is excessive. In scientific computing, the sequence is long enough for most practical purposes if it is larger than 2^64. 2^127-1 is more than enough, and this is the shortest period DCMT reported in the paper. So do we care? Probably not. Sturla

On 15.09.2015 16:54, Sturla Molden wrote:
I was talking about the SFMT, which is a variant of the MT for processors with SIMD instruction sets (most CPUs have these nowadays) and which has 32-, 64-bit or floating point output: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/index.html But what I really wanted to reference was the discussion in the SFMT paper about the practical effects of the 623-dim equidistribution (see the end of the first paper I quoted; the discussion references the second paper).
Thanks for the pointers. I wasn't aware of a special MT variant for parallel computing. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Sep 15 2015)

M.-A. Lemburg wrote:
According to http://www.pcg-random.org/other-rngs.html This chacha20 implementation is seedable and should be reproducible: https://gist.github.com/orlp/32f5d1b631ab092608b1 ...though I am concerned about the k-dimensional equidistribution as a scientist, and also that if the random number generator is changed without the interface changing, then it may screw up tests and existing codes which rely on a particular sequence of random numbers. J

[M.-A. Lemburg <mal@egenix.com>]
For example, run AES in CTR mode. Remember that we did something related on whatever mailing list it was ;-) discussing the PSF's voting system, to break ties in a reproducible-by-anyone way using some public info ("news") that couldn't be known until after the election ended. My understanding is that ChaCha20 (underlying currently-trendy implementations of arc4random) is not only deterministic, it even _could_ support an efficient jumpahead(n) operation. The specific OpenBSD implementation of arc4random goes beyond just using ChaCha20 by periodically scrambling the state with kernel-obtained "entropy" too, and that makes it impossible to reproduce its sequence. But it would remain a crytpo-strength generator without that extra scrambling step. Note that these _can_ be very simple to program. The "Blum Blum Shub" crypto generator from 30 years ago just iteratively squares a "big integer" modulo a (carefully chosen) constant. Not only deterministic, given any integer `i` it's efficient to directly compute the i'th output. It's an expensive generator, though (typically only 1 output bit is derived from each modular squaring operation).
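(A sketch of the AES-in-CTR-mode idea as a deterministic, seedable, crypto-strength byte stream. This assumes the third-party `cryptography` package; the class name and seed-derivation step are made up for illustration, not a vetted design:)

    import hashlib
    from cryptography.hazmat.backends import default_backend
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    class AESCTRStream:
        def __init__(self, seed):
            # Derive a fixed-size key from an arbitrary seed; same seed -> same stream.
            key = hashlib.sha256(b"seed:" + seed).digest()
            # A fixed nonce is acceptable here because the key is used only to
            # generate one keystream, never to encrypt actual messages.
            cipher = Cipher(algorithms.AES(key), modes.CTR(b"\x00" * 16),
                            backend=default_backend())
            self._enc = cipher.encryptor()

        def bytes(self, n):
            # Encrypting zeros exposes the raw AES-CTR keystream.
            return self._enc.update(b"\x00" * n)

    s1 = AESCTRStream(b"my seed")
    s2 = AESCTRStream(b"my seed")
    assert s1.bytes(16) == s2.bytes(16)     # reproducible, unlike OS entropy sources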

On 15.09.2015 17:46, Tim Peters wrote:
Ah, now we're getting somewhere :-) If we accept that non-guessable, but deterministic is a good compromise, then adding a cipher behind MT sounds like a reasonable way forward, even as default. For full crypto strength, people would still have to rely on solutions like /dev/urandom or the OpenSSL one (or reseed the default RNG every now and then). All others get the benefit of non-guessable, but keep the ability to seed the default RNG in Python. Is there some research on this (MT + cipher or hash) ?
IMO, that's a different discussion and we should rely on existing well tested full entropy mixers (urandom or OpenSSL) until the researchers have come with something like MT for chaotic PRNGs. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Sep 15 2015)

... [Marc-Andre]
I expect the only real reason "new entropy" is periodically mixed in to OpenBSD's arc4random() is to guard against that a weakness in ChaCha20 may be discovered later. If there were any known computationally feasible way whatsoever to distinguish bare-bones ChaCha20's output from a "truly random" sequence, it wouldn't be called "crypto" to begin with. But reseeding MT every now and again is definitely not suitable for crypto purposes. You would need to reseed at least every 624 outputs, and from a crypto-strength seed source. In which case, why bother with MT at all? You could just as well use the crypto source directly.
Is there some research on this (MT + cipher or hash) ?
Oh, sure. MT's creators noted from the start that it would suffice to run MT's outputs through a crypto hash (like your favorite flavor of SHA). That's just as vulnerable to "poor seeding" attacks as plain MT, but it's computationally infeasible to deduce the state from any number of hashed outputs (unless your crypto hash is at least partly invertible, in which case it's not really a crypto hash ;-) ).

For other approaches, search for CryptMT. MT's creators suggested a number of other schemes over the years. The simplest throws away the "tempering" part of MT (the 4 lines that map the raw state word into a mildly scrambled output word - not because it needs to be thrown away, but because they think it would no longer be needed given what follows). Then one byte is obtained via grabbing the next MT 32-bit output, folding it into a persistent 32-bit accumulator via multiplication, and just revealing the top byte:

    accum = some_odd_integer              # any odd 32-bit starting value
    while True:
        accum = (accum * (random.getrandbits(32) | 1)) & 0xFFFFFFFF  # keep 32 bits
        yield accum >> 24                 # reveal only the top byte

I did see one paper suggesting it was possible to distinguish the output of that from a truly random sequence given 2**50 consecutive outputs (but that's all - still no way to deduce the state).

[Tim]
Although what's "computationally feasible" may well have changed since then! These days I expect even a modestly endowed attacker could afford to store an exhaustive table of the 2**32 possible outputs and their corresponding hashes. Then the hashes are 100% invertible via simple lookup, so are no better than not hashing at all.

On 2015-09-16 01:43, Tim Peters wrote:
Periodic reseeding also serves to guard against other leaks of information about the underlying state that don't come from breaking through the cipher. If an attacker manages to deduce the state through side channels, timing attacks on the machine, brief physical access, whatever, then reseeding with new entropy will limit the damage rather than blithely continuing on with a compromised state forever. It's an important feature of a CSPRNG. Using a crypto output function in your PRNG is a necessary but not sufficient condition for security. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

On 16.09.2015 02:43, Tim Peters wrote:
Thanks for the "CryptMT" pointers. I'll do some research after PyCon UK on this. http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/CRYPTMT/index.html A quick glimpse at http://www.ecrypt.eu.org/stream/p3ciphers/cryptmt/cryptmt_p3.pdf suggests that this is a completely new stream cipher, though it uses the typical elements (key + non-linear filter + feedback loop). The approach is interesting, though: they propose an PRNG which can then get used as stream cipher by XOR'ing the PRNG output with the data stream. So the PRNG implies the cipher, not the other way around as many other approaches to CSPRNGs. That's probably also one of its perceived weaknesses: it's different than the common approach. On 16.09.2015 02:55, Tim Peters wrote:> [Tim]
Simply adding a hash doesn't sound like a good idea. My initial thought was using a (well studied) stream cipher on the output, not just a hash on the output. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Sep 16 2015)

On Wed, Sep 16, 2015 at 1:21 AM, M.-A. Lemburg <mal@egenix.com> wrote:
NB that that paper also says that it's patented and requires a license for commercial use.
I think you just described the standard definition of a stream cipher? "Stream cipher" is just the crypto term for a deterministic RNG, that you XOR with data. (However it's a not a CSPRNG, because those require seeding schedules and things like that -- check out e.g. Fortuna.) -n -- Nathaniel J. Smith -- http://vorpus.org

On 16.09.2015 11:02, Nathaniel Smith wrote:
Hmm, you're right: """ If CryptMT is selected as one of the recommendable stream ciphers by eSTREAM, then it is free even for commercial use. """ Hasn't happened yet, but perhaps either eSTREAM or the authors will change their minds. Too bad they haven't yet, since it's a pretty fast stream cipher :-( Anyway, here's a paper on CryptMT: http://cryptography.gmu.edu/~jkaps/download.php?docid=1083
The standard definition I know reads like this: """ Stream ciphers are an important class of encryption algorithms. They encrypt individual characters (usually binary digits) of a plaintext message one at a time, using an encryption transformation which varies with time. """ (taken from Chap 6.1 Introduction of "Handbook of Applied Cryptography"; http://cacr.uwaterloo.ca/hac/about/chap6.pdf) That's a bit more general than what you describe, since the keystream can pretty much be generated in arbitrary ways. What I wanted to emphasize is that a common way of coming up with a stream cipher is to use an existing block cipher which you then transform into a stream cipher. See e.g. https://www.emsec.rub.de/media/crypto/attachments/files/2011/03/hudde.pdf E.g. take AES run in CTR (counter) mode: it applies AES repeatedly to the values of a simple counter as "RNG". Running MT + AES would result in a similar setup, except that the source would have somewhat better qualities and would be based on standard well studied technology, albeit slower than going straight for a native stream cipher. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Sep 16 2015)

On 2015-09-16 12:38, M.-A. Lemburg wrote:
Indeed. DE Shaw has done the analysis for you: https://www.deshawresearch.com/resources_random123.html
Why do you think it would have better qualities? You'll have to redo the analysis that makes MT and AES each so well-studied, and I'm not sure that all of the desirable properties of either will survive the combination. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

On 15/09/15 09:36, Nathaniel Smith wrote:
No. Cryptographers care about predictability, not the exact distribution. Any distribution can be considered randomness with a given entropy, but not any distribution is uniform. Only the uniform distribution is uniform. That is where our needs fail to meet. Cryptographers damn any RNG that allow the internal state to be reconstructed. Scientists damn any RNG that do not produce the distribution of interest. Sturla

On Sep 15, 2015 4:57 AM, "Sturla Molden" <sturla.molden@gmail.com> wrote:
Any distribution can be considered randomness with a given entropy, but not any distribution is uniform. Only the uniform distribution is uniform. That is where our needs fail to meet. Cryptographers damn any RNG that allow the internal state to be reconstructed. Scientists damn any RNG that do not produce the distribution of interest. No, this is simply wrong. I promise! ("Oh, sorry, this is contradictions...") For the output of a cryptographic RNG, any deviation from the uniform distribution is considered a flaw. (And as you know, given uniform variates you can construct any distribution of interest.) If I know that you're using a coin that usually comes up heads to generate your passwords, then this gives me a head start in guessing your passwords, and that's considered unacceptable. Or for further evidence, consider: "Scott Fluhrer and David McGrew also showed such attacks which distinguished the keystream of the RC4 from a random stream given a gigabyte of output." -- https://en.m.wikipedia.org/wiki/RC4#Biased_outputs_of_the_RC4 This result is listed on wikipedia because the existence of a program that can detect a deviation from perfect uniformity given a gigabyte of samples and an arbitrarily complicated test statistic is considered a publishable security flaw (and RC4 is generally deprecated because of this and related issues -- this is why openbsd's "arc4random" function no longer uses (A)RC4). -n

On 15/09/15 14:34, Nathaniel Smith wrote:
The uniform distribution has the highest entropy, yes, but it does not mean that other distributions are unacceptable. The sequence just has to be incredibly hard to predict. A non-uniform distribution will give an adversary a head start, that is true, but if the adversary still cannot complete the brute-force attack before the end of the universe there is little help in knowing this. In scientific computing we do not care about adversaries. We care about the correctness of our numerical result. That means we should be fussy about the distribution, not about the predictability or "randomness" of a sequence, nor about adversaries looking to recover the internal state. MT is proven to be uniform (equidistributed) up to 623 dimensions, but it is incredibly easy to recover the internal state. The latter we do not care about. In fact, we can often do even better with "quasi-random" sequences, e.g. Sobol sequences, which are not constructed to produce "uncorrelated" points, but constructed to produce correlated points that are deliberately more uniform than uncorrelated points. Sturla

Nathaniel Smith <njs@...> writes:
Do you have links to papers analyzing chacha20 w.r.t statistical properties? The only information that I found is http://www.pcg-random.org/other-rngs.html#id11 "Fewer rounds result in poor statistical performance; ChaCha2 fails statistical tests badly, and ChaCha4 passes TestU01 but sophisticated mathematical analysis has shown it to exhibit some bias. ChaCha8 (and higher) are believed to be good. Nevertheless, ChaCha needs to go to more work to achieve satisfactory statistical quality than many other generators. ChaCha20, being newer, has received less scrutiny from the cryptographic community than Arc4." Stefan Krah

On 2015-09-15 04:53, Steven D'Aprano wrote:
k=623 is a tiny number of dimensions for testing "well-distributedness". You should be able to draw millions of values without detecting significant correlations. Perfect k-dim equidistribution is not a particularly useful metric on its own (at least for simulation work). You can't just say "PRNG A has a bigger k than PRNG B therefore PRNG A is better". You need a minimum period to even possibly reach a certain k, and that period goes up exponentially with k. Given two PRNGs that have the same period, but one has a much smaller k than the other, *then* you can start making inferences about relative quality (again for simulation work; ChaCha20 has a long period but no guarantees of k that I am aware of, but its claim to fame is security, not simulation work). Astronomical periods have costs, so you only want to pay for what is actually worth it, so it's certainly a good thing that the MT has a k near its upper bound. PRNGs with shorter, but still roomy periods like 2**128 are not worse because they have necessarily smaller ks. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

On 14 September 2015 at 16:38, Nathaniel Smith <njs@pobox.com> wrote:
I started a new thread breaking that out into more of a proto-PEP (including your reference to the PHP research - thanks for that!). Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

[Tim]
[Nathaniel Smith <njs@pobox.com>]
Here you go: https://media.blackhat.com/bh-us-12/Briefings/Argyros/BH_US_12_Argyros_PRNG_...
The most important question I have about that is from its Appendix D, where they try to give a secure "all-purpose" token generator. If openssl_random_pseudo_bytes is available, they use just that and call it done. Otherwise they go on to all sorts of stuff. My question is, even if /dev/urandom is available, they're _not_ content to use that alone. They continue to mix it up with all other kinds of silly stuff. So why do they trust urandom less than OpenSSL's gimmick? It's important to me because, as far as I'm concerned, os.urandom() is already Python's way to spell "crypto random" (yes, it would be better to have one guaranteed to be available on all Python platforms).
I couldn't help but note ;-) that at least 3 of the apps had bug reports filed against their insecure password-reset schemes at least 5 years ago, and had already attempted to repair them. You can lead a PHP'er to water, but ... ;-)
It's cute, but I doubt anyone but the authors had the patience - or knowledge - to write a solver dealing with tens of thousands of picky equations over about 20000 variables. They don't explain enough about the details for a script kiddie to do it. Even very bright hackers attack what's easiest to topple, like poor seeding - that's just easy brute force.

It remained unclear to me _exactly_ what "in the presence of non consecutive outputs" is supposed to mean. In the only examples, they knew exactly how many times MT was called. "Non consecutive" in all those contexts appeared to mean "but we couldn't observe _any_ output bits in some cases - the ones we could know something about were sometimes non-consecutive". So in the MT output sequence, they had no knowledge of _some_ of the outputs, but they nevertheless knew exactly _which_ of the outputs they were wholly ignorant about. That's no problem for the equation solver. They just skip adding any equations for the affected bits, keep collecting more outputs and potentially "wrap around", probably leading to an overdetermined system in the end.

But Python doesn't work the way PHP does here. As explained in another message, in Python you can have _no idea_ how many MT outputs are consumed by a single .choice() call. In the PHP equivalent, you always consume exactly one MT output. PHP's method suffers statistical bias, but under the covers Python uses an accept/reject method to avoid that. Any number of MT outputs may be (invisibly!) consumed before "accept" is reached, although typically only one or two. You can deduce some of the leading MT output bits from the .choice() result, but _only_ for the single MT output .choice() reveals anything about. About the other MT outputs it may consume, you can't even know that some _were_ skipped over, let alone how many.

Best I can tell, that makes a huge difference to whether their solver is even applicable to cracking idiomatic "password generators" in Python. You can't know which variables correspond to the bits you can deduce. You could split the solver into multiple instances to cover all feasible possibilities (for how many MT outputs may have been invisibly consumed), but the number of solver instances needed then grows exponentially with the number of outputs you do see something about. In the worst case (31 bits are truncated), they need over 19000 outputs to deduce the state. Even a wildly optimistic "well, let's guess no more than 1 MT output was invisibly rejected each time" leads to over 2**19000 solver clones then.

Sure, there's doubtless a far cleverer way to approach that. But unless another group of PhDs looking to level up in Security World burns their grant money to tackle it, that's yet another attack that will never be seen in the real world ;-)
And they all use .choice(), which is overwhelmingly the most natural way to do this kind of thing in Python.
As above, I'm still not much worried about .choice(). Even if I were, I'd be happy to leave it at "use .choice() from a random.SystemRandom instance instead". Unless there's some non-obvious (to me) reason these authors appear to be unhappy with urandom.
_I_ would use SystemRandom. But, as you can tell, I'm extremely paranoid ;-)
Have you released software used by millions of people? If not, you have no idea how ticked off users get by needing to change anything. But Guido does ;-) Why not add a new "saferandom" module and leave it at that? Encourage newbies to use it. Nobody's old code ever breaks. But nobody's old code is saved from problems it likely didn't have anyway ;-)
-n

On Mon, Sep 14, 2015 at 10:49 PM, Tim Peters <tim.peters@gmail.com> wrote:
Who knows why they wrote the code in that exact way. /dev/urandom is fine.
This led me to look at the implementation of Python's choice(), and it's interesting; I hadn't realized that it was using such an inefficient method. (To make a random selection between, say, 36 items, it rounds up to 64 = 2**6, draws a 32-bit sample from MT, discards 26 of the bits (!) to get a number between 0-63, and then repeats until this number happens to fall in the 0-35 range, so it rejects with probability ~0.45. A more efficient algorithm is the one that it uses if getrandbits is not available, where it uses all 32 bits and only rejects with probability (2**32 % 36) / (2**32) = ~1e-9.)

I guess this does add a bit of obfuscation. OTOH the amount of obfuscation is very sensitive to the size of the password alphabet. If I use uppercase + lowercase + digits, that gives me 62 options, so I only reject with probability 1/32, and I can expect that any given 40-character session key will contain zero skips with probability ~0.28, and that reveals 240 bits of seed. I don't have time right now to go look up the MT equations to see how easy it is to make use of such partial information, but there certainly are lots of real-world weaponized exploits that begin with something like "first, gather 10**8 session keys...". I certainly wouldn't trust it.

Also, if I use the base64 or hex alphabets, then the probability of rejection is 0, and I can deterministically read off bits from the underlying MT state. (Alternatively, if someone in the future makes the obvious optimization to choice(), then it will basically stop rejecting in practice, and again it becomes trivial to read off all the bits from the underlying MT state.)

The point of "secure by default" is that you don't have to spend all these paragraphs doing the math to try and guess whether some RNG usage might maybe be secure; it just is secure.
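[For reference, a stripped-down sketch of the accept/reject idiom being described. The stdlib's real version lives on the Random class and carries extra guards, but the shape is the same:]

    import random

    def randbelow(n):
        """Accept/reject: draw n.bit_length() random bits and retry
        until the value falls below n.  Each getrandbits(k <= 32) call
        consumes one 32-bit MT output and keeps only its top k bits."""
        k = n.bit_length()
        r = random.getrandbits(k)
        while r >= n:                 # reject: invisibly eats another MT output
            r = random.getrandbits(k)
        return r

    def choice(seq):
        return seq[randbelow(len(seq))]

    print(choice("abcdefghijklmnopqrstuvwxyz0123456789"))  # 36 symbols -> ~44% rejection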
Your "wildly optimistic" estimate is wildly conservative under realistic conditions. How confident are you that the rest of your analysis is totally free of similar errors? Would you willing to bet, say, the public revelation of every website you've visited in the last 5 years on it?
Grant money is a drop in the bucket of security research funding these days. Criminals and governments have very deep pockets, and it's well documented that there are quite a few people with PhDs who make their living by coming up with exploits and then auctioning them on the black market. BTW, it looks like that PHP paper was an undergraduate project. You don't need a PhD to solve linear equations :-).
No, SystemRandom.choice is certainly fine. But people clearly don't use it, so its fine-ness doesn't matter that much in practice... -n -- Nathaniel J. Smith -- http://vorpus.org

... [Nathaniel Smith <njs@pobox.com>]
Here you go: https://media.blackhat.com/bh-us-12/Briefings/Argyros/BH_US_12_Argyros_PRNG_...
[Tim, on appendix D]
[Nathaniel]
Who knows why they wrote the code in that exact way. /dev/urandom is fine.
Presumably _they_ know, yes? When they're presenting a "gold standard" all-purpose token generator, it would be nice if they had taken care to make every step clear to other crypto wonks. Otherwise the rest of us are left wondering if _anyone_ really knows what they're talking about ;-) [on the MT state solver]
Speed is irrelevant here, but do note that .choice() isn't restricted to picking from less than 2**32 possibilities.
>>> random.choice(range(2**62))
2693408174642551707
Special-casing is done at Python speed, and adding a Python-level branch to ask "is it less than 2**32?" is typically more expensive than calling MT again.
And note that the branch you like better _also_ needs another Python-speed test to raise an exception if the range is "too big". The branch actually used doesn't need that.
To be quite clear, nothing can ever be learned about "the seed". All that can be observed is bits from the outputs, and the goal is to deduce "the state". There is an invertible permutation mapping a state word to an output word, but from partial knowledge of N < 32 bits from a single output word you typically can't deduce N bits of the state word from which the output was derived (for example, if you only know the first bit of an output, you can't deduce from that alone the value of any bit in the corresponding state word). That's why they need such a hairy framework to begin with (in the example, you _can_ add equations tying the known value of the first output bit to linear functions of all bits of the state related to the first output bit). About the "240 bits", they need about 80 times more than that to deduce the state. 0.28 ** 80 is even less than a tenth ;-)
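[To make the "invertible permutation" concrete, here is the standard MT19937 tempering step and one way to invert it; the constants are from the published MT19937 reference, and the fixed-point-iteration inverse is the usual trick state-recovery tools use:]

    def temper(x):
        """MT19937's output tempering: an invertible permutation taking a
        32-bit state word to the 32-bit output word."""
        x ^= x >> 11
        x ^= (x << 7) & 0x9D2C5680
        x ^= (x << 15) & 0xEFC60000
        x ^= x >> 18
        return x & 0xFFFFFFFF

    def _undo_right(y, shift):
        # Invert x ^= x >> shift by fixed-point iteration.
        x = y
        for _ in range(32 // shift + 1):
            x = y ^ (x >> shift)
        return x

    def _undo_left(y, shift, mask):
        # Invert x ^= (x << shift) & mask the same way.
        x = y
        for _ in range(32 // shift + 1):
            x = y ^ ((x << shift) & mask)
        return x

    def untemper(y):
        y = _undo_right(y, 18)
        y = _undo_left(y, 15, 0xEFC60000)
        y = _undo_left(y, 7, 0x9D2C5680)
        y = _undo_right(y, 11)
        return y

    assert all(untemper(temper(v)) == v for v in (0, 1, 0xDEADBEEF, 2**32 - 1))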
I don't have time right now to go look up the MT equations to see how easy it is to make use of such partial information,
It's robust against only knowing a subset of an output's bits (including none). It's not robust against not knowing _which_ output you're staring at. They label the state bits with variables x_0 through x_19936, and generate equations relating specific state bits to the output bits they can deduce. If they don't know how many times MT was invoked between outputs they know were generated, they can't know _which_ state-bit variables to plug into their equations. Is this output related to (e.g.) at least x_0 through x_31, or is it x_32 through x_63? x_64 through x_95?

Take an absurd extreme to illustrate an obvious futility in general: suppose .choice() remembered the size of the last range it was asked to pick from. Then if the next call to .choice() is for the same-sized range, call MT 2**19937-2 times ignoring the outputs, and call it once more. It will get the same result then. "The solver" will deduce exactly the same output bits every time, and will never learn more than that.

Eventually, if they're doing enough sanity checking, the best their solver could do is notice that the derived equations (regardless of how they construct them) are inconsistent. The worst it could do is "deduce" a state that's pure fantasy. They can't know that they are in fact seeing an output derived from exactly the same state every time. Unless they read the source code for .choice() and see it's playing this trick. In that case, they would never add to their collection of equations after the first output was dealt with. _Then_ the equations would faithfully reflect the truth: that they learned a tiny bit at first, but never learn more than just that.
And I'm not asking you to. I wouldn't either. I'm expressing skepticism that the solver in this paper is a slam-dunk proof that all existing idiomatic Python password generators are about to cause the world to end ;-)
I'd readily agree that if .choice(x) is used whenever len(x) is a power of two, then their solver applies directly to such cases. It's in all and only such cases that they can know exactly how many MT outputs were consumed, and so know also which specific state-bit variables to use.
(Alternatively, if someone in the future makes the obvious optimization
As above, if what you like better were obviously faster in reality, it would have been written that way already ;-) Honest, this looks like Raymond Hettinger's code, and he typically obsesses over speed.
to choice(), then it will basically stop rejecting in practice, and again it becomes trivial to read off all the bits from the underlying MT state.)
Not trivial. You still need this hairy solver framework, and best I can tell its code wasn't made available. Note that the URL given in the paper gives a 404 error now. It isn't trivial to code it either.
No argument there! I'm just questioning how worried people "should be" over what actual Python code actually does now. I confess I remain free of outright panic ;-)
Your "wildly optimistic" estimate is wildly conservative under realistic conditions.
Eh.
I couldn't care less if that were revealed. In fact, I'd enjoy the trip down memory lane ;-)
Excellent points! Snideness doesn't always pay off for me ;-)
BTW, it looks like that PHP paper was an undergraduate project. You don't need a PhD to solve linear equations :-).
So give 'em their doctorates! I've seen doctoral theses a hundred times less substantial ;-)
It's just waiting for a real exploit. People writing security papers love to "name & shame". "Gotcha! Gotcha!" Once people see there _is_ "a real problem" (if there is), they'll scramble to avoid being the target of the next name-&-shame campaign. Before then, they're much too busy trying to erase all traces of every website they've visited in the last 5 years ;-)

Tim Peters writes:
Well, my worst possible case "in theory" was that a documented MTA parameter would simply not be implemented and not error when I configured it to a non-default value -- but that's how yours truly ended up running an open relay (Smail 3.1.100 I think it was, and I got it from Debian so it wasn't like I was using alpha code). That's what taught me to do functional tests. :-) So yes, I do think there are a lot of folks out there working with software without realizing that there are any risks involved. Life being life, I'd bet on some of them being programmers working with RNG.
But see, that's my main point. Analogies to *anybody's* personal life are irrelevant when we're talking about a bug that could be fixed *once* and save *millions* of users from being exploited. If the wonks are right, it's a big deal, big enough to balance the low probability of them being right. ;-)
Sure, but that's not what happened at AOL and Yahoo! AFAIK (of course they're pretty cagey about details). It seems that a single leak or a small number of leaks at each company exposed millions of address books. (I hasten to add that I doubt the Mersenne Twister had anything to do with the leaks.)
What I question is whether this has anything _plausible_ to do with Python's PRNG.
Me too. People who claim some expertise think so, though.
That's not unfair, but if they did, I'd go find myself another crypto wonk. But who cares about me? What matters is that Guido would, too.
Judging [the random module] by standards that didn't become trendy until much later is only fair now ;-)
You're not the only one who, when offered a choice between fair and fun, chooses the latter. ;-)
We can even give it a name shorter than "random" to encourage its use. That's all most users really care about anyway ;-)
That's beyond "unfair"!

On Thu, Sep 10, 2015 at 9:07 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
* [ ] DOC: note regarding the 'pseudo' part of pseudorandom and MT
  * https://docs.python.org/2/library/random.html
  * https://docs.python.org/3/library/random.html
* [ ] DOC: upgrade cryptography docs in re: random numbers
  * https://cryptography.io/en/latest/random-numbers/
* [ ] ENH: random_(algo=) (~IRandomSource)
* [ ] ENH: Add arc4random
* [ ] ENH: Add chacha
* [ ] ENH: Add OpenBSD's
* [ ] BUG,SEC: new secure default named random.random
  * must also be stateful / **reproducible** (must support .seed)
  * justified as BUG,SEC because: [secure by default is the answer]
  * https://en.wikipedia.org/wiki/Session_fixation
  * https://cwe.mitre.org/data/definitions/384.html
  * The docs did not say "you should know better."
  * see also: hash collisions: https://bugs.python.org/issue13703
* [ ] REF: random.random -> random.random_old

devguide/documenting.html#security-considerations-and-other-concerns https://docs.python.org/devguide/documenting.html#security-considerations-an... On Fri, Sep 11, 2015 at 11:05 AM, Wes Turner <wes.turner@gmail.com> wrote:

On Wed, Sep 9, 2015 at 4:23 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
The answer to both of these questions is no. For modern cryptographic PRNGs, full determinism is considered a flaw, and determinism is a necessary precondition to supporting jumpahead. The reason is that even if an attacker learns your secret RNG state at time t, then you want this to have a limited impact -- they'll obviously be able to predict your RNG output for a while, but you don't want them to be able to predict it from now until the end of time. So determinism is considered bad, and high-quality CPRNGs automatically reseed themselves with new entropy according to some carefully designed schedule. And OpenBSD's "arc4random" generator is a high-quality CPRNG in this sense. -- Nathaniel J. Smith -- http://vorpus.org
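[A deliberately toy sketch of the "reseed on a schedule" idea described above - not a real CSPRNG design, and every name in it is made up; it only illustrates why capturing the key at time t stops predicting outputs after the next reseed:]

    import hashlib
    import os

    class ToyReseedingRNG:
        """Toy only: hash-based output stream that folds fresh OS entropy
        into its key on a fixed schedule, limiting how long a leaked key
        stays useful to an attacker."""

        def __init__(self, reseed_every=1024):
            self._key = os.urandom(32)
            self._reseed_every = reseed_every
            self._counter = 0

        def getrandbits32(self):
            if self._counter % self._reseed_every == 0:
                # scheduled reseed: mix new OS entropy into the key
                self._key = hashlib.sha256(self._key + os.urandom(32)).digest()
            self._counter += 1
            block = hashlib.sha256(
                self._key + self._counter.to_bytes(8, "big")).digest()
            return int.from_bytes(block[:4], "big")

    rng = ToyReseedingRNG()
    print([rng.getrandbits32() for _ in range(3)])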

On Wed, Sep 9, 2015, at 19:23, Greg Ewing wrote:
Being able to produce multiple independent streams of numbers is the important feature. Doing it by "jumping ahead" seems less so. And the need for doing it "efficiently" isn't as clear either - how many streams do you need?

random832@fastmail.us wrote:
Being able to produce multiple independent streams of numbers is the important feature. Doing it by "jumping ahead" seems less so.
Doing it by jumping ahead isn't strictly necessary; the important thing is to have some way of generating *provably* non-overlapping and independent sequences. Jumping ahead is one obvious way to achieve that. Simply setting the seed of each generator randomly and hoping for the best is not really good enough.
And the need for doing it "efficiently" isn't as clear either
I say that because you can obviously jump ahead N steps in any generator just by running it for N cycles, but that's likely to be unacceptably slow. A more direct way of getting there is desirable. -- Greg
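[Greg's "more direct way" is easiest to see with a toy LCG rather than MT, whose jumpahead is far hairier, as noted elsewhere in the thread: compose the step map with itself by repeated squaring and N steps cost O(log N). Parameters below are purely illustrative:]

    def lcg_step(x, a, c, m):
        return (a * x + c) % m

    def lcg_jump(x, a, c, m, n):
        """Advance x -> (a*x + c) % m by n steps in O(log n) work,
        using square-and-multiply on the affine step map."""
        A, C = 1, 0                              # accumulated map: x -> A*x + C
        while n:
            if n & 1:
                A, C = (a * A) % m, (a * C + c) % m
            a, c = (a * a) % m, (a * c + c) % m  # square the step map
            n >>= 1
        return (A * x + C) % m

    # sanity check against brute force
    a, c, m, x0, N = 1103515245, 12345, 2**31, 42, 100000
    x = x0
    for _ in range(N):
        x = lcg_step(x, a, c, m)
    assert x == lcg_jump(x0, a, c, m, N)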

Greg Ewing writes:
By definition you don't have (stochastic) independence if you're using a PRNG and deterministically jumping ahead. Proving non-overlapping is easy, but I don't even have a definition of "independence" of fixed sequences: equidistribution of pairs? That might make sense if you have a sequence long enough to contain all pairs, but even then you really just have a single sequence with larger support, and I don't see how you can prove that it's a "good" sequence for using in a simulation.
It is not at all obvious to me that jumping ahead is better than randomly seeding separate generators. The latter actually gives stochastic independence (at least if you randomize over all possible seeds).
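[For what it's worth, the "randomly seed separate generators" approach is a one-liner in Python; whether that counts as "independent enough" is exactly what is being debated here. A sketch only, with arbitrary sizes:]

    import os
    import random

    # One independently seeded Twister per worker; each Random instance has
    # its own state, and 32 bytes of OS entropy per seed makes "nearby"
    # starting points in MT's 2**19937-sized state space vanishingly unlikely.
    streams = [random.Random(os.urandom(32)) for _ in range(8)]
    print([rng.random() for rng in streams])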

On Wed, Sep 9, 2015 at 2:07 PM, Steven D'Aprano <steve@pearwood.info> wrote:
This is a good point. Let's remove the ssl library from Python too. Until recently, the most widely used versions of Python were all woefully behind the times and anyone wanting anything relatively up-to-date had to use a third party library. Even so, if you count the distributions of RHEL and other "Long Term Support" operating systems that are running Python 2.7 (pre 2.7.9) and below, most people are operating with barely secure versions of OpenSSL on versions of Python that don't have constants for modern secure communications standards. Clearly, deciding to add the ssl module was a huge mistake because it wasn't forwards compatible with future security standards.

Tim Peters <tim.peters@...> writes:
My intuition is that if someone just uses a random() function without checking if it's cryptographically secure then the application will probably have other holes as well. I mean, for example no one is going to use C's rand() function for crypto. Stefan Krah

On Wed, Sep 9, 2015, at 14:00, Stefan Krah wrote:
Let's turn the question around - what's the _benefit_ of having a random number generator available that _isn't_ cryptographically secure? One possible argument is performance. If that's the issue - what are our performance targets? How can they be measured? Another argument is that some applications really do need deterministic seeding. Is there a reason not to require them to be explicit about it?

<random832@...> writes:
As you say, performance: http://www.pcg-random.org/rng-performance.html Random number generation is a very broad field. I'm not a specialist, so I just entered "Mersenne Twister" into an academic search engine and got many results, but none for arc4random. It's an interesting question you ask. I'd have to do a lot of reading first to get an overview. Stefan Krah

<random832@...> writes:
I know chacha (and most of djb's other works). I thought we were talking about the suitability of cryptographically secure RNGs for traditional scientific applications, in particular whether there are *other* reasons apart from performance not to use them. Stefan Krah

[Stefan Krah <skrah@bytereef.org>]
I read it the same way on all counts.
Yes, if they're not checking the random() docs first, they're a total crypto moron - in which case it's insane to believe they'll do anything else related to crypto-strength requirements right either. It's hard to make something idiot-proof even if your target audience is bona fide crypto experts ;-)

On 09.09.15 19:35, Guido van Rossum wrote:
Entropy is a limited and slowly recovering resource (especially if there is no network activity). If you consume it too quickly (for example in a scientific simulation or in a game), it will not have time to recover, and that will slow down not only your program, but all consumers of entropy. The use of random.SystemRandom by default looks dangerous. It is unlikely that all existing programs will be rewritten to use random.FastInsecureRandom.

On September 9, 2015 at 1:19:34 PM, Serhiy Storchaka (storchaka@gmail.com) wrote:
This isn't exactly true. Hardware entropy is limited and slowly recovering, which is why no sane implementation uses it except to periodically reseed the CSPRNG, which is typically based on ARC4 or ChaCha. The standard CSPRNGs that most platforms use are fast enough for most people's use cases. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On Wed, Sep 9, 2015, at 13:18, Serhiy Storchaka wrote:
http://www.2uo.de/myths-about-urandom/ should be required reading. As far as I know, no-one is actually proposing the use of a method that blocks when there's "not enough entropy", nor does arc4random itself appear to do so.

The commit message changing libc's random functions to use arc4random is as follows:
Perhaps someone could ask them for information about that audit, and how many / what of those pieces of software were actually using these functions in ways which made them insecure, but whose security would be notably improved by a better random implementation (I suspect that the main thrust of the audit, though, was on finding which ones would be broken by taking away the default deterministic seeding). That could tell us how typical it is for people to ignorantly use default random functions for security-critical code with no other flaws.

In a word - No. There is zero reason for people doing crypto to use the random module, therefore we should not change the random module to be cryptographically secure. Don't break things and slow my code down by default for dubious reasons, please. On 9/9/2015 12:35, Guido van Rossum wrote:

[Alexander Walters <tritium-list@sdamon.com>]
Would your answer change if a crypto generator were _faster_ than MT? MT isn't speedy by modern standards, and is cache-hostile (about 2500 bytes of mutable state). Not claiming a crypto hash _would_ be faster. But it is possible.

On Wed, Sep 09, 2015 at 08:55:06PM -0500, Tim Peters wrote:
If the crypto PRNG were comparable in speed to what we have now (not significantly slower), or faster, *and* gave reproducible results with the same seed, *and* had no known/detectable statistical biases, and we could promise that those properties would continue to hold even when the state of the art changed and we got a new default crypto PRNG, then I'd still be -0.5 on the change due to the "false sense of security" factor. As I've already mentioned in another comment, I'm with Paul Moore -- I think anyone foolish/ignorant/lazy/malicious enough to use the default PRNG for crypto is surely making more than one mistake, and fixing that one thing for them will just give people a false sense of security. -- Steve

On 9/9/2015 22:11, Steven D'Aprano wrote: source of cryptographic data, since... Let's be frank about this, Guido is not a security expert. I am not a security expert. Tim, I suspect you are not a security expert. Let's leave actually attempting to be at the cutting edge of cryptographic randomness to modules by security experts. I have far too much use for randomness outside of a cryptographic context to sacrifice the API and feature set we have for, in my opinion, a myopic focus on one, already discouraged, use of the random module.

On 10/09/15 03:55, Tim Peters wrote:
Speed is not the main matter of concern. MT19937 is not very fast, it is very accurate. It is used in scientific computing when we want to simulate sampling from a given distribution as accurately as possible. Its strength is in the distribution of numbers it generates, not in its security or speed. MT19937 allows us to produce a very precise simulation of a stochastic process. The alternatives cannot compare in numerical quality, though they might be faster or more secure, or both. When we use MT19937 in scientific computing we deliberately sacrifice speed for accuracy. A crypto hash might be faster, but will it be more accurate? Accuracy means how well the generated sequence emulates sampling from a perfect uniform distribution. MT19937 does not have any real competition in this game. Sturla


On September 9, 2015 at 1:11:22 PM, Tim Peters (tim.peters@gmail.com) wrote:
This is not really true in the sense that Python would need to do anything if the blessed generator changed. We'd use /dev/urandom, one of the syscalls that do the same thing, or the CryptGen API on Windows. Python should not have its own userland CSPRNG. Then it's up to the platform to follow what generator they are going to provide. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

[Tim]
[Donald Stufft <donald@stufft.io>]
This is not really true in the sense that Python would need to do anything if the blessed generator changed.
I read Guido's message as specifically asking about Theo's "strongly worded recommendation of [Python switching to] the OpenBSD version of arc4random()" as its default generator. In which case, yes, when that specific implementation falls out of favor, Python would need to change.
I read Guido's message as asking whether Python should indeed do just that.

On September 9, 2015 at 1:28:46 PM, Tim Peters (tim.peters@gmail.com) wrote:
arc4random changes as the underlying implementation changes too; the name is a historical accident really. arc4random no longer uses ARC4, it uses ChaCha, and when/if ChaCha needs to be replaced, arc4random will still be the name. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On Wed, Sep 9, 2015, at 13:43, Donald Stufft wrote:
The issue is, what should Python do, if the decision is made to not provide its own RNG [presumably would be a forked copy of OpenBSD's current arc4random] on systems that do not provide a function named arc4random? Use /dev/urandom (or CryptGenRandom) every time [more expensive, performs I/O]? rand48? random? rand? I don't see the issue with Python providing its own implementation. If the state of the art changes, we can have another discussion then.

[random832@fastmail.us]
I don't see the issue with Python providing its own implementation. If the state of the art changes,
It will. Over & over again. That's why it's called "art" ;-)
we can have another discussion then.
Also over & over again. If you volunteer to own responsibility for updating all versions of Python each time it changes (in a crypto context, an advance in the state of the art implies the prior state becomes "a bug"), and post a performance bond sufficient to pay someone else to do it if you vanish, then a major pragmatic objection would go away ;-)

Tim Peters <tim.peters@...> writes:
The OpenBSD devs could also publish arc4random as a library that works everywhere (like OpenSSH). That would be a nicer solution for everyone (except for the devs perhaps :). Stefan Krah

Stefan Krah <skrah@...> writes:
And naturally they're already doing that. I missed this in Theo's first mail: https://github.com/libressl-portable/openbsd/tree/master/src/lib/libcrypto/c... https://github.com/libressl-portable/portable/tree/master/crypto/compat So I guess the whole thing also depends on how popular these libraries will be. Stefan Krah

On Wed, Sep 9, 2015, at 14:31, Tim Peters wrote:
I don't see how "Changing Python's RNG implementation today to arc4random as it exists now" necessarily implies "Making a commitment to guarantee the cryptographic suitability of Python's RNG for all time". Those are two separate things.

[<random832@fastmail.us>]
Disagree. The _only_ point to switching today is "to guarantee the cryptographic suitability of Python's RNG" today. It misses the intent of the switch entirely to give a "but tomorrow? eh - that's a different issue" dodge. No, no rules of formal logic would be violated by separating the two - it would be a violation of the only _sense_ in making a switch at all. If you don't believe me, try asking Theo ;-)

On Wed, Sep 09, 2015 at 02:55:01PM -0400, random832@fastmail.us wrote:
Not really. Look at the subject line. It doesn't say "should we change from MT to arc4random?", it asks if the default random number generator should be secure. The only reason we are considering the change from MT to arc4random is to make the PRNG cryptographically secure. "Secure" is a moving target, what is secure today will not be secure tomorrow. Yes, in principle, we could make the change once, then never again. But why bother? We don't gain anything from changing to arc4random if there is no promise to be secure into the future. Question, aimed at anyone, not necessarily random832 -- one desirable property of PRNGs is that you can repeat a sequence of values if you re-seed with a known value. Does arc4random keep that property? I think that it is important that the default RNG be deterministic when given a known seed. (I'm happy for the default seed to be unpredictable.) -- Steve

[Steven D'Aprano <steve@pearwood.info>]
"arc4random" is ill-defined. From what I gathered, it's the case that "pure chacha" variants can be seeded to get a reproducible sequence "in theory", but that not all implementations support that. Specifically, the OpenBSD implementation being "sold" here does not and cannot: http://www.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man3/arc4random.3 "Does not" because there is no API to either request or set a seed. "Cannot" because: The subsystem is re-seeded from the kernel random number subsystem using getentropy(2) on a regular basis Other variants skip that last part.

On Sep 9, 2015 12:21 PM, "Tim Peters" <tim.peters@gmail.com> wrote:
cannot:
http://www.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man3/arc4random.3
Another reason why it is important *not* to provide a seeding api for a crypto rng is that this means you can later swap out the underlying algorithms easily as the state of the art improves. By contrast, if you have a deterministic seeded mode, then swapping out the algorithm becomes a compatibility break. (You can provide a "mix this extra entropy into the pool" api, which looks rather similar to seeding, but has fundamentally different semantics.)

The only real problem that I see with switching the random module to use a crypto rng is exactly this backwards compatibility issue. For scientific users, reproducibility of output streams is really important. (Ironically, this is a variety of "important" that crypto people are very familiar with: universally acknowledged to be the right thing by everyone who's thought about it, a minority do religiously and rely on, and most people ignore out of ignorance. Education is ongoing...)

OTOH python has never made strong guarantees of output stream reproducibility -- 3.2 broke all seeds by default (you have to add 'version=1' to your seed call to get the same results on post-3.2 pythons -- which of course gives an error on older versions). And 99% of the methods are documented to be unstable across versions -- the only method that's guaranteed to produce reproducible results across versions is random.random(). In practice the other methods usually don't change so people get away with it, but. See: https://docs.python.org/3/library/random.html#notes-on-reproducibility

So in practice the stdlib random module is not super suitable for scientific work anyway. Not that this stops anyone from using it for this purpose... see above. (And to be fair even this limited determinism is still enough to be somewhat useful -- not all projects require reproducibility across years of different python versions.) Plus even a lot of people who know about the importance of seeding don't realize that the stdlib's support has these gotchas. (Numpy unsurprisingly puts higher priority on these issues -- our random module guarantees exact reproducibility of seeded outputs modulo rounding, across versions and systems, except for bugfixes necessary for correctness. This means that we carry around a bunch of old inefficient implementations of the distribution methods, but so be it...)

So, all that considered: I can actually see an argument for removing the seeding methods from the stdlib entirely, and directing those who need reproducibility to a third party library like numpy (/ pygsl / ...). This would be pretty annoying for some cases where someone really does have simple needs and wants just a little determinism without a binary extension, but on net it might even be an improvement, given how easy it is to misread the current api as guaranteeing more than it actually promises. OTOH this would actually break the current promise, weak as it is.

Keeping that promise in mind, an alternative would be to keep both generators around, use the cryptographically secure one by default, and switch to MT when someone calls

    seed(1234, generator="INSECURE LEGACY MT")

But this would justifiably get us crucified by the security community, because the above call would flip the insecure switch for your entire program, including possibly other modules that were depending on random to provide secure bits.
So if we were going to do this then I think it would have to be by switching the global RNG over unconditionally, and to fulfill the promise, provide the MT option as a separate class that the user would have to instantiate explicitly if they wanted it for backcompat. Document that you should replace

    import random
    random.seed(12345)
    if random.whatever(): ...

with

    from random import MTRandom
    random = MTRandom(12345)
    if random.whatever(): ...

As part of this transition I would also suggest making the seed method on non-seedable RNGs raise an error when given an explicit seed, instead of silently doing nothing like the current SystemRandom. -n

On Wed, Sep 9, 2015, at 17:02, Nathaniel Smith wrote:
Ideally, neither the crypto bits nor the science bits of a big program should be using the module-level functions. A small program either hasn't got both kinds of bits, or won't be using them at the same time. And if you've got non-science bits doing stuff with your RNG then your results probably aren't going to be reproducible anyway. Which suggests a solution: How about exposing a way to switch out the Random instance used by the module-level functions? The instance itself exists now as random._inst, but the module just spews its bound methods all over its namespace. (Long-term, it might make sense to deprecate the module-level functions)
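[A sketch of what "switching out the instance" could look like today. It leans on CPython's private random._inst and on the fact that the module exports bound methods - both implementation details, not public API:]

    import random

    def use_instance(inst):
        # Rebind the module-level helpers to a different Random instance.
        random._inst = inst
        for name in ("random", "uniform", "randint", "randrange", "choice",
                     "shuffle", "sample", "getrandbits"):
            setattr(random, name, getattr(inst, name))

    use_instance(random.SystemRandom())   # module-level calls now hit urandom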

On Wed, Sep 9, 2015, at 17:02, Nathaniel Smith wrote:
I just realized, OpenBSD has precisely this functionality, for the rand/random/rand48 functions, in the "_deterministic" versions of their respective seed functions. So that's probably not a terrible path to go down: Make a Random class that uses a CSPRNG and/or os.urandom until/unless it is explicitly seeded. Use that class for the global instance. We could probably skip the "make a separate function name to show you really mean it" because unlike C, Python has never encouraged explicitly seeding with the {time, pid, four bytes from /dev/random} when one doesn't actually want determinism. (The default seed in C for rand/random is *1*; for rand48 it is an implementation-defined, but specified to be constant, value). For completeness, have getstate return a tuple of a boolean (for which mode it is in) and whatever state Random returns. setstate can accept either this tuple, or for compatibility whatever Random uses.

On Fri, Sep 11, 2015 at 11:12 PM, <random832@fastmail.us> wrote:
Calling getstate() means you want to call setstate() at some point in the future, and have deterministic results. Getting the CSRNG state is dangerous (since it would allow replaying), and it's not even useful (since system entropy gets mixed in occasionally). Instead, in this scheme, getstate() should activate the deterministic RNG (seeding it if it's the first use), and return its state. setstate() would then also switch to the Twister, and seed it.
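[A sketch of the switching scheme being discussed; the class and method behavior are invented for illustration, and random832's refinement below would additionally wrap the returned state in a (mode, state) tuple:]

    import random

    class SwitchingRandom:
        """Secure (urandom-backed) until determinism is requested,
        then a seeded Mersenne Twister from that point on."""

        def __init__(self):
            self._rng = random.SystemRandom()
            self._deterministic = False

        def seed(self, a=None):
            self._rng = random.Random(a)   # switch to the Twister
            self._deterministic = True

        def getstate(self):
            if not self._deterministic:
                self.seed()                # first use: seed the Twister now
            return self._rng.getstate()

        def setstate(self, state):
            if not self._deterministic:
                self.seed()
            self._rng.setstate(state)

        def random(self):
            return self._rng.random()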

On Fri, Sep 11, 2015, at 17:48, Petr Viktorin wrote:
My thinking was that "CSRNG is enabled" should be regarded as a single state of the "magic switching RNG". The alternative would be that calling getstate on a magic switching RNG that is not already in deterministic mode is an error.

On 9 September 2015 at 20:33, Stefan Krah <skrah@bytereef.org> wrote:
I use a RNG quite often. Typically for simulations (games, dierolls, card draws, that sort of thing). Sometimes for many millions of results (Monte Carlo simulations, for example). I would always just use the default RNG supplied by the stdlib - I view my use case as "normal use" and wouldn't go looking for specialist answers. I'd occasionally look for reproducibility, although it's not often a key requirement for me (I would expect it as an option from the stdlib RNG, though). Anyone doing crypto who doesn't fully appreciate that it's a specialist subject and that they should be looking for a dedicated RNG suitable for crypto, is probably going to make a lot of *other* mistakes as well. Leading them away from this one probably isn't going to be enough to make their code something I'd want to use... So as a user, I'm against making a change like this. Let the default RNG in the stdlib be something suitable for simulations, "pick a random question", and similar situations, and provide a crypto-capable RNG for those who need it, but not as the default. (I am, of course, assuming that it's not possible to have a single RNG that is the best option for both uses - nobody on this thread seems to have suggested that I'm wrong in this assumption). Paul

On Wed, Sep 9, 2015 at 9:33 PM, Stefan Krah <skrah@bytereef.org> wrote:
The OpenBSD implementation does not allow any kind of reproducible results. Reading http://www.pcg-random.org/other-rngs.html, I see that what arc4random is not built for is statistical quality and k-dimensional equidistribution, which are properties you might not need for crypto, but do want for simulations. So there are two quite different use cases (plus a lot of grey area where any solution is okay). The current situation may be surprising to people who didn't read the docs. Switching away from MT might be a disservice to users that did read and understand them.

Petr Viktorin <encukou@...> writes:
I can't find much at all when searching for "chacha20 equidistribution". Contrast that with "mersenne twister equidistribution" and it seems that chacha20 hasn't been studied very much in that respect (except for the pcg-random site). So I also think this should preclude us from replacing the current random() functions. Adding an arc4random module with the caveat that its quality will be as good as the current OpenBSD libcrypto/libressl(?) would be okay. Stefan Krah

[Stefan Krah <skrah@bytereef.org>]
Well, most arguments about random functions rely on fantasies ;-) For example, yes, the Twister is provably equidistributed to 32 bits across 623 dimensions, but ... does it make a difference to anything? That's across the Twister's _entire period_, which couldn't actually be generated across the age of the universe.

What may really matter to an application is whether it will see rough equidistribution across the infinitesimally small slice (of the Twister's full period) it actually generates. And you'll find very little about _that_ (last time I searched, I found nothing). For assurances about that, people rely on test suites developed to test generators.

The Twister's provably perfect equidistribution across its whole period also has its scary sides. For example, run random.random() often enough, and it's _guaranteed_ you'll eventually reach a state where the output is exactly 0.0 hundreds of times in a row. That will happen as often as it "should happen" by chance, but that's scant comfort if you happen to hit such a state. Indeed, the Twister was patched relatively early in its life to try to prevent it from _starting_ in such miserable states. Such states are nevertheless still reachable from every starting state.

But few people know any of that, so they take "equidistribution" as meaning a necessary thing rather than as an absolute guarantee of eventual disaster ;-)

What may really matter for most simulations is that the Twister never reaches a state where, in low dimensions, k-tuples fall on "just a few" regularly-spaced hyperplanes forever after. That's a systematic problem with old flavors of linear congruential generators. But that problem is _so_ old that no new generator proposed over the last few decades suffers it either.
Adding an arc4random module with the caveat that its quality will be as good as the current OpenBSD libcrypto/libressl(?) would be okay.
Anyone is welcome to supply such a module today, and distribute it via the usual channels. Python already supplies the platform spelling of `urandom`, and a very capable random.SystemRandom class based on `urandom`, for those needing crypto-strength randomness (relying on what their OS believed that means, and supplied via their `urandom`). Good enough for me. But, no, I'm not a security wonk at heart.

On Wed, Sep 9, 2015 at 3:19 PM, Tim Peters <tim.peters@gmail.com> wrote:
Yeah, equidistribution is not a guarantee of anything on its own. For example, an integer counter modulo 2**(623*32) is equidistributed to 32 bits across 623 dimensions, just like the Mersenne Twister.

Mostly people talk about equidistribution because for a deterministic RNG, (a) being non-equidistributed would be bad, (b) equidistribution is something you can reasonably hope to prove for simple non-cryptographic generators, and mathematicians like writing proofs.

OTOH equidistribution is not even well-defined for the OpenBSD "arc4random" generator, because it is genuinely non-deterministic -- it regularly mixes new entropy into its state as it goes -- and equidistribution by definition requires determinism. So it "fails" this test of "randomness" because it is too random...

In practice, the chances that your Monte Carlo simulation is going to give bad results because of systematic biases in arc4random are much, much lower than the chances that it will give bad results because of subtle hardware failures that corrupt your simulation. And hey, if arc4random *does* mess up your simulation, then congratulations, your simulation is publishable as a cryptographic attack and will probably get written up in the NYTimes :-).

The real reasons to prefer non-cryptographic RNGs are the auxiliary features like determinism, speed, jumpahead, multi-thread friendliness, etc. But the stdlib random module doesn't really provide any of these (except determinism in strictly limited cases), so I'm not sure it matters much.
This criticism seems a bit unfair though -- even a true stream of random bits (e.g. from a true unbiased quantum source) has this property, and trying to avoid this happening would introduce bias that really could cause problems in practice. A good probabilistic program is one that has a high probability of returning some useful result, but they always have some low probability of returning something weird. So this is just saying that most people don't understand probability. Which is true, but there isn't much that the random module can do about it :-) -n -- Nathaniel J. Smith -- http://vorpus.org

On Wed, Sep 09, 2015 at 04:15:31PM -0700, Nathaniel Smith wrote:
The default MT is certainly deterministic, and although only the output of random() itself is guaranteed to be reproducible, the other methods are *usually* stable in practice. There's a jumpahead method too, and for use with multiple threads, you can (and should) create your own instances that don't share state. I call that "multi-thread friendliness" :-) I think Paul Moore's position is a good one. Anyone writing crypto code without reading the docs and understanding what they are doing are surely making more mistakes than just using the wrong PRNG. There may be a good argument for adding arc4random support to the stdlib, but making it the default (with the disadvantages discussed, breaking backwards compatibility, surprising non-crypto users, etc.) won't fix the broken crypto code. It will just give people a false sense of security and encourage them to ignore the docs and write broken crypto code. -- Steve

[Nathaniel Smith]
[Steven D'Aprano]
Not in Python. There was for the ancient Wichmann-Hill generator, but not for MT. A detailed sketch of ways to implement efficient jumpahead for MT is given here: A Fast Jump Ahead Algorithm for Linear Recurrences in a Polynomial Space http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/jump-seta-lfsr.pdf But because MT isn't a _simple_ recurrence, they're all depressingly complex :-( For Wichmann-Hill it was just a few integer modular exponentiations.
and for use with multiple threads, you can (and should) create your own instances that don't share state. I call that "multi-thread friendliness" :-)
That's what people do, but MT's creators don't recommend it anymore (IIRC, their FAQ did recommend it some years ago). Then they switched to recommending using jumpahead with a large value (to guarantee different instances' states would never overlap). Now (well, last I saw) they recommend a parameterized scheme creating a distinct variant of MT per thread (not just different state, but a different (albeit related) algorithm): http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/DC/dgene.pdf So I'd say it's clear as mud ;-)
...

[Sorry to Tim and Steven if they get multiple copies of this... Gmail recently broke their Android app's handling of from addresses, so resending, sigh] On Sep 9, 2015 7:24 PM, "Tim Peters" <tim.peters@gmail.com> wrote: [...]
Yeah, the independent-seed-for-each-thread approach is an option with any RNG, but just like people feel better if they have a 100% certified guarantee that the RNG output in a single thread will pass through every combination of possible values (if you wait some cosmological time), they also feel better if there is some 100% certified guarantee that the RNG values in two threads will also be uncorrelated with each other. With something like MT, if two threads did end up with nearby seeds, then that would be bad: each thread individually would see values that looked like high quality randomness, but if you compared across the two threads, they would be identical modulo some lag. So all the nice theoretical analysis of the single threaded stream falls apart. However, for two independently seeded threads to end up anywhere near each other in the MT state space requires that you have picked two numbers between 0 and 2**19937 and gotten values that were "close". Assuming your seeding procedure is functional at all, then this is not a thing that will ever actually happen in this universe. So AFAICT the rise of explicitly multi-threaded RNG designs is one of those fake problems that exists only so people can write papers about solving it. (Maybe this is uncharitable.) So there exist RNG designs that handle multi-threading explicitly, and it shows up on feature comparison checklists. I don't think it should really affect Python's decisions at all though. -n

[Steven D'Aprano]
[Tim]
Not in Python.
[Steve]
It is there, up to Python 2.7. I hadn't noticed it was gone in Python 3.
Yes, there's something _called_ `jumpahead()`, for backward compatibility with the old Wichmann-Hill generator. But what it does for MT is "eh - no idea what to do, so let's just make stuff up":

    def jumpahead(self, n):
        """Change the internal state to one that is likely far away
        from the current state.  This method will not be in Py3.x,
        so it is better to simply reseed.
        """
        # The super.jumpahead() method uses shuffling to change state,
        # so it needs a large and "interesting" n to work with.  Here,
        # we use hashing to create a large n for the shuffle.
        s = repr(n) + repr(self.getstate())
        n = int(_hashlib.new('sha512', s).hexdigest(), 16)
        super(Random, self).jumpahead(n)

I doubt there's anything that can be proved about the result of doing that - except that it's almost certain it won't bear any relationship to what calling the generator `n` times instead would have done ;-)

On 10/09/15 04:23, Tim Peters wrote:
The DCMT uses the same algorithm (Mersenne Twister) but with different polynomials. The choice of polynomial is more or less arbitrary. You can search for a set of N polynomials that are (almost) relatively prime to each other, and thus end up with totally independent sequences. Searching for such a set can take some time, so you need to do that in advance and save the result. But once you have a set, each one of them is just as valid as the vanilla MT. PCG also provides independent streams. Sturla

[Nathaniel Smith <njs@pobox.com>]
The analogy is almost exact. If you view MT's state as a single 19937-bit integer (=623*32 + 1), then MT's state "simply" cycles through a specific permutation of range(1, 2**19937) (with a different orbit consisting solely of 0). That was the "hard part" to prove. Everything about equidistribution was more of an observation following from that. Doesn't say anything about distribution "in the small" (across small slices), alas.
Heh. In the NYT or a security wonk's blog, maybe. But why would a reputable journal believe me? By design, the results of using the OpenBSD arc4random can't be reproduced ;-)
Python's implementation of MT has never changed anything about the sequence produced from a given seed state, and indeed gives the same sequence from the same seed state as every other correct implementation of the same flavor of MT. That is "strictly limited", to perfection ;-) At a higher level, depends on the app. People are quite creative at defeating efforts to be helpful ;-)
This criticism seems a bit unfair though
Those are facts, not criticisms. I like the Twister very much. But those who have no fear of it are dreaming - while those who have significant fear of it are also dreaming. It's my job to ensure nobody is either frightened or complacent ;-)
-- even a true stream of random bits (e.g. from a true unbiased quantum source) has this property,
But good generators with astronomically smaller periods do not. In a sense, MT made it possible to get results "far more random" than any widely used deterministic generator before it. The patch I mentioned above was to fix real problems in real life, where people using simple seeding schemes systematically ended up with results so transparently ludicrous nobody could possibly accept them for any purpose. "The fix" consisted of using scrambling functions to spray the bits in the user-*supplied* seed all over the place, in a pseudo-random way, to probabilistically ensure "the real" state wouldn't end up with "too many" zero bits. "A lot of zero bits" tends to persist across MT state transitions for a long time.
and trying to avoid this happening would introduce bias that really could cause problems in practice.
For example, nobody _needs_ a generator capable of producing hundreds of 0.0 in a row. Use a variant even of MT with a much smaller period, and that problem goes away, with no bad effects on any app. Push that more, and what many Monte Carlo applications _really_ want is "low discrepancy": some randomness is essential, but becomes a waste of cycles and assaults "common sense" if it gets too far from covering the domain more-or-less uniformly. So there are many ways known of generating "quasi-random" sequences instead, melding a notion of randomness with guarantees of relatively low discrepancy (low deviation from uniformity). Nothing works best for all purposes - except Python.
Or nobody does. Probability really is insufferably subtle. Guido should ban it.
Which is true, but there isn't much that the random module can do about it :-)
Python should just remove it. It's an "attractive nuisance" to everyone ;-)

On 2015-09-10 00:15, Nathaniel Smith wrote:
On Wed, Sep 9, 2015 at 3:19 PM, Tim Peters <tim.peters@gmail.com> wrote:
The MT actually does have a problem unique to it (or at least to its family of Generalized Feedback Shift Registers) where a state with a high proportion of 0 bits will get stuck in a region of successive states with high proportions of 0 bits. Other 623-dimensional equidistributed PRNGs will indeed come across the same states with high 0-bit sequences with the frequency that you expect from probability, but they will be surrounded by states with dissimilar 0-bit proportions. This problem isn't *due* to equidistribution per se, but I think Tim's point is that you are inevitably due to hit one such patch if you sample long enough. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

[Robert Kern <robert.kern@gmail.com>]
Thank you for explaining it better than I did. I implied MT's "stuck in zero-land" problems were _due_ to perfect equidistribution, but of course they're not. It's just that MT's specific permutation of the equidistributed-regardless-of-order range(1, 2**19937) systematically puts integers with "lots of 0 bits" next to each other. And there are many such patches. But 2**19937 is so large you need to contrive the state "by hand" to get there at once. For example,
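[A reconstructed sketch rather than the elided example, for Python 3: the all-zero state used below is the degenerate one-element orbit mentioned above, chosen only because it is the easiest "lots of zero bits" state to write down by hand; a nearly-all-zero state on the main orbit behaves similarly for a long stretch.]

    import random

    r = random.Random()
    # 624 zero words plus an index of 624 ("twist on the next call").
    r.setstate((3, (0,) * 624 + (624,), None))
    print([r.random() for _ in range(12)])    # twelve 0.0s in a row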
That's "impossible" ;-) (1 chance in 2***(53*12)) of seeing 0.0 twelve times in a row)

On Thu, Sep 10, 2015 at 9:23 AM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
If arc4random reseeds with entropy periodically, then jumping ahead past such a reseed is simply a matter of performing a reseed, isn't it? ChrisA

[Steven D'Aprano]
[Greg Ewing]
Another property that's important for some applications is ...
No for "arc4random" based on RC4, yes for "arc4random" based on ChaCha20, "mostly yes" for "arc4random" in the OpenBSD implementation, wholly unknown for whatever functions that will may be_called_ "arc4random" in the future. The fly in the ointment for the OpenBSD version is that it periodically fiddles its internal state with "entropy" obtained from the kernel. It's completely unreproducible for that reason. However, you can still jump ahead in the state. It's just impossible to say that it's the same state you would have arrived at had you invoked the function that many times instead (the kernel could change the state in unpredictable ways any number of times while you were doing that).;

Reading this thread is fun, but it doesn't seem to be getting anywhere - perhaps that's part of the fun ;-) Realistically, I see two options:

1. Someone goes and implements the OpenBSD random function in C and puts a package up on PyPI, updating it whenever OpenBSD thinks that a new algorithm is needed or a security issue has to be fixed (from my experience with other crypto software like OpenSSL, this should be on the order of every 2-6 months ;-))

2. Ditto, but we put the module in the stdlib and then run around issuing patch level security releases every 2-6 months.

Replacing our deterministic default PRNG with a non-deterministic one doesn't really fly, since we'd break an important feature of random.random(). You may remember that we already ran a similar stunt with the string hash function, with very mixed results. Calling the result of such a switch-over "secure" is even worse, since it's a promise we cannot keep (probably not even fully define). Better leave the promise at "insecure" - that's something we can promise forever and don't have to define :-)

Regardless of what we end up with, I think Python land can do better than name it "arc4random". We're great at bike shedding, so how about we start the fun with "randomYMMV" :-)

Overall, I think having more options for good PRNGs is great. Whether this "arc4random" is any good remains to be seen, but given that OpenBSD developed it, chances are higher than usual.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 10 2015)

M.-A. Lemburg <mal@...> writes:
The sane option would be to use the OpenBSD libcrypto, which seems to be part of their OpenSSL fork (libressl), just like libcrypto is part of OpenSSL. Then the crypto maintenance would be delegated to the distributions. I would even be interested in writing such a package, but it would be external and non-redistributable for well-known reasons. :) Stefan Krah

On 10.09.2015 15:39, Stefan Krah wrote:
Well, we already link to OpenSSL for SSL and hashes. I guess exposing the OpenSSL RAND interface in a module would be the easiest way to go about this. pyOpenSSL already does this: http://www.egenix.com/products/python/pyOpenSSL/doc/pyopenssl.html/#document... More pointers: https://wiki.openssl.org/index.php/Random_Numbers https://www.openssl.org/docs/manmaster/crypto/rand.html What's nice about the API is that you can add entropy as you find it.
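A minimal sketch of what using that interface could look like via pyOpenSSL's rand wrapper (rand.bytes() and rand.add() are the pyOpenSSL calls; the surrounding glue is only illustrative):

    from OpenSSL import rand
    import os, time

    # Pull 16 cryptographically strong bytes from OpenSSL's pool.
    token = rand.bytes(16)

    # Mix extra entropy into the pool as it becomes available; the second
    # argument is a conservative estimate (in bytes) of how much real
    # entropy the supplied data contains.
    sample = os.urandom(8) + str(time.time()).encode('ascii')
    rand.add(sample, 8)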
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 10 2015)

M.-A. Lemburg <mal@...> writes:
Yes, my suggestion was based on the premise that OpenBSD's libcrypto (which should include the portable arc4(chacha20)random) is more secure, faster, etc. That's a big 'if', their PRNG had a couple of bugs on Linux last year, but OpenSSL also regularly has issues. Stefan Krah

On Thu, 10 Sep 2015 at 01:26 M.-A. Lemburg <mal@egenix.com> wrote:
I see a third: rename random.random() to be something that gets the point across it is not crypto secure and then stop at that. I don't think the stdlib should get into the game of trying to provide a RNG that we claim is cryptographically secure as that will change suddenly when a weakness is discovered (this is one of the key reasons we chose not to consider adding requests to the stdlib, for instance). Theo's key issue is misuse of random.random(), not the lack of a crypto-appropriate RNG in the stdlib (that just happens to be his solution because he has an RNG that he is partially in charge of). So that means either we take a "consenting adults" approach and say we can't prevent people from using code without reading the docs, or we try to rename the function. But then again that won't help with all of the other functions in the random module that implicitly use random.random() (if that even matters; not sure if the helper functions in the module have any crypto use that would lead to their misuse). Oh, and there is always the nuclear 4th option and we just deprecate the random module. ;) -Brett

[Brett Cannon <brett@python.org>]
The most likely "misuses" in idiomatic Python (not mindlessly translated low-level C) involve some spelling of getting or using random integers, like .choice(), .randrange(), .randint(), or even .sample() and .shuffle(). At least in Python 3, those don't normally ever invoke .random() (neither directly nor indirectly) - they normally use the (didn't always exist) "primitive" .getrandbits() instead (indirectly via the private ._randbelow()). So if something here does need to change, it's all or nothing.
Oh, and there is always the nuclear 4th option and we just deprecate the random module. ;)
I already removed it from the repository. Deprecating it would be a security risk, since it would give hackers information about our future actions ;-)

My belief is that doing the safe thing by default is a major plus of Python. From this point of view, using a cryptographically secure PRNG for random.random() should be done if possible. That would not change much about people creating insecure software through lack of knowledge (me first among them), but it could help a little.
I see a third: rename random.random() to be something that gets the point across it is not crypto secure and then stop at that.
This, in my opinion, would not be a good idea. Having a safe default is a major plus of Python; giving up the default just because one thinks it could eventually turn out to be insecure is not better. And comparing a cryptographically secure PRNG with OpenSSL when estimating the expected frequency of security releases is not fair, because the complexity of the two pieces of software is clearly different.

On 10.09.2015 17:46, Brett Cannon wrote:
I think this is the major misunderstanding here: The random module never suggested that it generates pseudo-random data of crypto quality. I'm pretty sure people doing crypto will know and most others simply don't care :-) Evidence: We used a Wichmann-Hill PRNG as default in random for a decade and people still got their work done. Mersenne was added in Python 2.3 and bumped the period from 6,953,607,871,644 (13 digits) to 2**19937-1 (6002 digits).
Why not add ssl.random() et al. (as interface to the OpenSSL rand APIs) ? By putting the crypto random stuff into the ssl module, even people who don't know about the difference, will recognize that the ssl version must be doing something more related to crypto than the regular random module one, which never promised this. Some background on why I think deterministic RNGs are more useful to have as default than non-deterministic ones: A common use case for me is to write test data generators for large database systems. For such generators, I don't keep the many GBs data around, but instead make the generator take a few parameters which then seed the RNGs, the time module and a few other modules via monkey-patching. This allows me to create reproducible test sets in a very efficient way. The feature to be able to reproduce a set is typically only needed when tracking down a bug in the system, but the whole setup avoids having to keep the whole test sets around on disk. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 10 2015)
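A minimal sketch of that pattern (the field names and parameters here are made up purely for illustration): a dedicated, explicitly seeded Random instance means the same inputs always regenerate exactly the same data, so nothing needs to be stored.

    import random

    def make_test_rows(seed, n_rows):
        rng = random.Random(seed)       # independent, reproducible stream
        for i in range(n_rows):
            yield {
                'id': i,
                'amount': round(rng.uniform(0, 1000), 2),
                'flag': rng.random() < 0.1,
            }

    # list(make_test_rows(42, 3)) == list(make_test_rows(42, 3))  -> True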

Executive summary: The question is, "what value is there in changing the default to be crypto strong to protect future security-sensitive applications from naive implementers vs. the costs to current users who need to rewrite their applications to explicitly invoke the current default?" M.-A. Lemburg writes:
I'm pretty sure people doing crypto will know and most others simply don't care :-)
Which is why botnets have millions of nodes. People who do web security evidently believe that inappropriate RNGs have something to do with widespread security issues. (That doesn't mean they're right, but it gives me pause for thought -- evidently, Guido thought so too!)
Evidence: We used a Wichmann-Hill PRNG as default in random for a decade and people still got their work done.
The question is not whether people get their work done. People work (unless they're seriously dysfunctional), that's what people do. Especially programmers (cf. GNU Manifesto). The question is whether the work of the *crackers* is made significantly easier by security holes that are opened by inappropriate use of random.random. I tend to agree with Steven d'A. (and others) that the answer is no: it doesn't matter if the kind of person who leaves a key under the third flowerpot from the left also habitually leaves the door unlocked (especially if "I'm only gonna be gone for 5 minutes"), and I think that's likely. IOW, installing crypto strong RNGs as default is *not* analogous to the changes to SSL support that were so important that they were backported to 2.7 in a late patch release. OTOH, why default to crypto weak if crypto strong is easily available? You might save a few million Debian users from having to regenerate all their SSH keys.[1] But the people who are "just getting work done" in new programs *won't notice*. I don't think that they care what's under the hood of random.random, as long as (1) the API stays the same, and (2) the documentation clearly indicates where to find PRNGs that support determinism, jumpahead, replicability, and all those other good things, for the needs they don't have now but know they probably will have some day. The rub is, as usual, existing applications that would have to be changed for no reason that is relevant to them. Note that arc4random is much simpler to use than random.random. No knobs to tweak or seeds to store for future reference. Seems perfectly suited to "just getting work done" to me. OTOH, if you have an application where you need replicability, jumpahead, etc, you're going to need to read the docs enough to find the APIs for seeding and so on. At design time, I don't see why it would hurt to select an RNG algorithm explicitly as well.
Why not add ssl.random() et al. (as interface to the OpenSSL rand APIs) ?
I like that naming proposal. I'm sure changing the nature of random.random would annoy the heck out of *many* users. An alternative would be to add random.crypto.
If you've gone to that much effort, you evidently have read the docs and it wouldn't have been a huge amount of trouble to use a non-default module with a specified PRNG -- if you were doing it now. But you have every right to be very peeved if you have a bunch of old test runs you want to replicate with a new version of Python, and we've changed the random.random RNG on you. Footnotes: [1] I hasten to add that a programmer who isn't as smart as he thinks he is who "improves" a crypto algorithm is far more likely than that the implementer of a crypto suite would choose an RNG that is inappropriate by design. Still, it's a theoretical possibility, and security is about eliminating every theoretical possibility you can think of.

On 11 September 2015 at 12:07, Stephen J. Turnbull <stephen@xemacs.org> wrote:
They're right. I used to be sanguine about this kind of thing because I spent a long time working in the defence sector, and assumed everyone else was as professionally paranoid as we were. I've been out of that world long enough now to realise that that assumption was deeply, and problematically, wrong*. In that world, you work on the following assumptions: 1) you're an interesting target; 2) the attackers' compute capacity is nigh infinite; 3) any weakness will be found; 4) any weakness will be exploited; 5) "other weaknesses exist" isn't a reason to avoid addressing the weaknesses you know about. As useful background, there's a recent Ars Technica article on the technical details of cracking the passwords in the Ashley Madison data dump, where security researchers found a NINE orders of magnitude speedup due to a vulnerability in another part of the system which let them drastically reduce the search space for passwords: http://arstechnica.com/security/2015/09/once-seen-as-bulletproof-11-million-... That kind of reduction in search requirements means that searches that *should* have taken almost 3000 years (in the absence of the vulnerability) can instead be completed within a day. Weak random number generators have a similar effect of reducing the search space for attackers - if you know a weakly random source was used, rather than a cryptographically secure one, then you can use what you know about the random number generator to favour inputs it is *likely* to have produced, rather than having to assume equal probability for the entire search space. And if the target was using a deterministic RNG and you're able to figure out the seed that was used? You no longer need to search at all - you can recreate the exact series of numbers the target was using. Moving the default random source to a CSPRNG, and allowing folks to move to a faster deterministic PRNG for known non-security related use cases, or to the system random number generator for known security-related ones, is likely to prove a good way to provide safer defaults without reducing flexibility or raising barriers to entry too much. Regards, Nick. P.S. * As a case in point, it was only a couple of years ago that I realised most developers *haven't* read docs like the NIST crypto usage guidelines or the IEEE 802.11i WPA2 spec, and don't make a habit of even casually following the progress of block cipher and secure hash function design competitions. It's been an interesting exercise for me in learning the true meaning of "expertise is relative" :) -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

[M.-A. Lemburg]
I'm pretty sure people doing crypto will know and most others simply don't care :-)
[Stephen J. Turnbull <stephen@xemacs.org>]
Which is why botnets have millions of nodes.
I'm not a security wonk, but I'll bet a life's salary ;-) we'd have botnets just as pervasive if every non-crypto RNG in the world were banned - or had never existed. To start a botnet, the key skill is social engineering: tricking ordinary users into installing malicious software. So long as end users are allowed to run programs, that problem will never go away. Hell, I get offers to install malware each day on Facebook alone, although they're *spelled* like "Install Flash update to see this shocking video!". Those never end for the same reason I still routinely get Nigerian 419 spam: there are plenty of people gullible enough to fall for them outright. Technical wizardry isn't needed to get in the door on millions of machines. So if RNGs have something to do with security, it's not with botnets; let's not oversell this.
People who do web security evidently believe that inappropriate RNGs have something to do with widespread security issues.
Do they really? I casually follow news of the latest exploits, and I really don't recall any of them pinned on an RNG (as opposed to highly predictable default RNG _seeding_ from several years back). Mostly out-of-bounds crap in C, or exploiting holes in security models, or bugs in the implementations of those models (whether Microsoft's, Java's, Adobe Flash's ...).
(That doesn't mean they're right, but it gives me pause for thought -- evidently, Guido thought so too!)
Or it's that Theo can be very insistent, and Guido is only brusque with the non-Dutch ;-) Not saying switching is bad. Am saying I've seen no compelling justification for causing users (& book & course authors & ....) such pain. If this were Python 0.9.1 at issue, sure - but random.py's basic API really hasn't changed since then.

Tim Peters writes:
I am in violent agreement with you on that point and most others.[1] However, the analogy was not intended to be so direct as to imply that "insecure" RNGs are responsible for botnets, merely that not caring is. I agree I twisted MAL's words a bit -- he meant most people have no technical need for crypto, and so don't care, I suppose. But then, "doing crypto" (== security) is like "speaking prose": a lot of folks doing it don't realize that's what they're doing -- and they don't care, either.
So long as end users are allowed to run programs, that problem will never go away.
s/users/programmers/ and s/run/write/, and we get a different analogy that is literally correct -- but fails in an important dimension. One user's mistake adds one node to the botnet, and that's about as bad as one user's mistake gets in terms of harm to third parties. But one programmer's (or system administrator's) mistake can put many, perhaps millions, at risk. Personally I doubt that justifies an API break here, even if we can come up with attacks where breaking the PRNG would be cost-effective compared to "social engineering" or theft of physical media. I think it does justify putting quite a bit of thought into ways to make it easier for naive programmers to do the "safe" thing even if they technically don't need it. I will say that IMO the now-traditional API was a very unfortunate choice. If you have a CSPRNG that just generates "uniform random numbers" and has no user-visible APIs for getting or setting state, it's immediately obvious to the people who know they need access to state what they need to do -- change "RNG" implementation. The most it might cost them is rerunning an expensive base case simulation with a more appropriate implementation that provides the needed APIs. On the other hand, if you have something like the MT that "shouldn't be allowed anywhere near a password", it's easy to ignore the state access APIs, and call it the same way that you would call a CSPRNG. In fact that's what's documented as correct usage, as Paul Moore points out. Thus, programmers who are using a PRNG whose parameters can be inferred from its output, and should not be doing so, generally won't know it until the (potentially widespread) harm is done. It would be nice if it wasn't so easy for them to use the MT. Footnotes: [1] I think "agree with Tim" is a pretty safe default.<wink/>

... [Tim]
[Stephen J. Turnbull <stephen@xemacs.org>]
I don't know that it's true, though. Crypto wonks are like lawyers that way, always worrying about the worst possible case "in theory". In my personal life, I've had to tell lawyers "enough already - I'm not paying another N thousand dollars to insert another page about what happens in case of nuclear war". Crypto wonks have no limit on the costs they'd like to impose either - a Security State never does.
So long as end users are allowed to run programs, that problem will never go away.
Not really. The best social engineering is for a bot to rummage through your email address book and send copies of itself to people you know, appearing to be a thoroughly legitimate email from you. Add a teaser to invite the recipient to click on the attachment, and response rate can be terrific. And my ISP (like many others) happens to provide a free industrial-strength virus/malware scanner/cleaner program. I doubt that's because they actually care about me ;-) Seems more likely they don't want to pay the costs of hosting millions of active bots.
But one programmer's (or system administrator's) mistake can put many, perhaps millions, at risk.
What I question is whether this has anything _plausible_ to do with Python's PRNG.
I remain unclear on what "the danger" is thought to be, such that replacing with a CSPRNG could plausibly prevent it. For example, I know how to deduce the MT's internal state from outputs (and posted working code for doing so, requiring a small fraction of a second given 624 consecutive 32-bit outputs). But it's not an easy problem _unless_ you have 624 consecutive 32-bit outputs. It's far beyond the ken of script kiddies. If it's the NSA, they can demand you turn over everything anyway, or just plain steal it ;-) Consider one of these "password" examples:

    import string, random
    alphabet = string.ascii_letters + string.digits
    print(random.choice(alphabet))

Suppose that prints 'c'. What have you learned? Surprisingly, perhaps, very little. You learned that one 32-bit output of MT had 0b000010 as its first 6 bits. You know nothing about its other 26 bits. And you don't know _which_ MT 32-bit output: internally, .choice() consumes as many 32-bit outputs as it needs until it finds one whose first six bits are less than 62 (len(alphabet)). So all you've learned about MT is that, at the time .choice() was called:

- 0 or more 32-bit outputs x were such that (x >> 26) >= 62.
- Then one 32-bit output x had (x >> 26) == 2.

This isn't much to go on. To deduce the whole state easily, you need to know 19,968 consecutive output bits (624*32). Get more and more letters from the password generator, and you learn more and more about the first 6 bits of an unknowable number of MT outputs consumed, but learn nothing whatsoever about any of the lower 26 bits of any of them, and learn nothing but a range constraint on the first 6 bits of an unknowable number of outputs that were skipped over. Sure, every clue reveals _something_. In theory ;-)

Note that, as explained earlier in these messages, Python's default _seeding_ of MT is already computationally infeasible to attack (urandom() is already used to set the entire massive internal state). _That_ I did worry about in the past. So in the above I'm not worried at all that an attacker exploited poor default seeding to know there were only a few billion possible MT states _before_ `c` was generated. All MT states are possible, and MT's state space is large beyond comprehension (let alone calculation). Would the user _really_ be better off using .urandom()? I don't know. Since a crypto wonk will rarely recommend doing anything _other_ than using urandom() directly, I bet they'd discourage using .choice() at all, even if it is built on urandom(). Then the user will write their own implementation of .choice(), something like:

    u = int.from_bytes(urandom(n), 'big')   # for some n, with urandom from os
    letter = alphabet[int(u / 2.0**(8*n) * len(alphabet))]

If they manage to get that much right, _now_ they've introduced a statistical bias unless len(alphabet) is a power of 2. If we're assuming they're crypto morons, chances are good they're not rock stars at this kind of thing either ;-)
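To make that bias concrete: with a single random byte (n=1 above) and a 62-character alphabet, 256 = 4*62 + 8, so eight of the letters can be produced five ways while the rest can only be produced four ways. A quick count shows it:

    import string
    from collections import Counter

    alphabet = string.ascii_letters + string.digits    # 62 characters

    # Push every possible value of one random byte through the scaling
    # formula above and count how often each letter comes out.
    counts = Counter(alphabet[int(u / 2.0**8 * len(alphabet))]
                     for u in range(256))
    print(sorted(set(counts.values())))   # [4, 5] - not all letters equally likely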
I will say that IMO the now-traditional API was a very unfortunate choice.
Ah, but I remember 1990 ;-) Python's `random` API was far richer than "the norm" in languages like C and FORTRAN at the time. It was a delight! Judging it by standards that didn't become trendy until much later is only fair now ;-)
I don't recall any language _at the time_ that did so.
As above, I'm not sure a real crypto wonk would endorse a module that provided any more than a bare-bones API, forcing the user to build everything else out of one or two core primitives. Look at, e.g., how tiny the OpenBSD arc4random API is. I do applaud it for offering arc4random_uniform(uint32_t upper_bound) That's exactly what's needed to, e.g., build a bias-free .choice() (provided you have fewer than 2**32-1 elements to choose from).
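The rejection idea behind arc4random_uniform() is easy to sketch in Python on top of os.urandom() (the names randbelow/choice here are mine, for illustration only):

    import os

    def randbelow(n):
        # Uniform integer in [0, n), built from random bytes by rejection
        # sampling - the same trick arc4random_uniform() uses to avoid bias.
        k = n.bit_length()                  # bits needed to cover [0, n)
        nbytes = (k + 7) // 8
        while True:
            r = int.from_bytes(os.urandom(nbytes), 'big') >> (nbytes * 8 - k)
            if r < n:                       # throw out-of-range draws away
                return r

    def choice(seq):
        return seq[randbelow(len(seq))]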
On the other hand, if you have something like the MT that "shouldn't be allowed anywhere near a password",
As above, I think that's a weak claim. The details matter. As an example of a strong claim: you should never, ever use MT to produce the keystream for a stream cipher. But only a crypto wonk would be trying to generate a keystream to begin with. Or a user who did use MT for that purpose is probably so clueless they'd forget the xor and just send the plaintext ;-)
And yet nobody so far has produced a single example of any harm done in any of the near-countless languages that supply non-crypto RNGs. I know, my lawyer gets annoyed too when I point out there hasn't been a nuclear war ;-) Anyway, if people want to pursue this, I suggest adding a _new_ module doing exactly whatever it is certified crypto experts say is necessary. We can even give it a name shorter than "random" to encourage its use. That's all most users really care about anyway ;-)
Footnotes: [1] I think "agree with Tim" is a pretty safe default.<wink/>
It's not always optimal, but, yes, you could do worse ;-)

On Sun, Sep 13, 2015 at 10:29 PM, Tim Peters <tim.peters@gmail.com> wrote:
Here you go: https://media.blackhat.com/bh-us-12/Briefings/Argyros/BH_US_12_Argyros_PRNG_... They present real-world attacks on PHP applications that use something like the "password generation" code we've been talking about as a way to generate cookies and password reset nonces, including in particular the case of applications that use a strongly-seeded Mersenne Twister as their RNG: "We develop a suite of new techniques and tools to mount attacks against all PRNGs of the PHP core system even when it is hardened with the Suhosin patch [which adds strong seeding] and apply our techniques to create practical exploits for a number of the most popular PHP applications (including Mediawiki, Gallery, osCommerce and Joomla) focusing on the password reset functionality. Our exploits allow an attacker to completely take over arbitrary user accounts." "Section 5.3: ... In this section we give a description of the Mersenne Twister generator and present an algorithm that allows the recovery of the internal state of the generator even when the output is truncated. Our algorithm also works in the presence of non consecutive outputs ..." Out of curiosity, I tried searching github for "random cookie language:python". The 5th hit (out of ~100k) was a web project that appears to use this insecure method to generate session cookies: https://github.com/bytasv/bbapi/blob/34e294becb22bae6e685f2e742b7ffdb53a83bc... https://github.com/bytasv/bbapi/blob/34e294becb22bae6e685f2e742b7ffdb53a83bc... (Fortunately this project doesn't appear to actually have any login or permissions functionality, so I don't think this is an actual CVE-worthy bug, but that's just luck -- I'm sure there are plenty of projects that start out looking like this one and then add security features without revisiting how they generate session ids.) There's a reason security people are so Manichean about these kinds of things. If something is not intended to be secure or used in security-sensitive ways, then fine, no worries. But if it is, then there's no point in trying to mess around with "probably mostly secure" -- either solve the problem right or don't bother. (See also: the time Python wasted trying to solve hash randomization without actually solving hash randomization [1].) If Tim Peters can get fooled into thinking something like using MT to generate session ids is "probably mostly secure", then what chance do the rest of us have? <wink> NB this isn't an argument for *whether* we should make random cryptographically strong by default; it's just an argument against wasting time debating whether it's already "secure enough". It's not secure. Maybe that's okay, maybe it's not. For the record though I do tend to agree with the idea that it's not okay, because it's an increasingly hostile world out there, and secure-random-by-default makes whole classes of these issues just disappear. It's not often that you get to fix thousands of bugs with one commit, including at least some with severity level "all your users' private data just got uploaded to bittorrent". I like Nick's proposal here: https://code.activestate.com/lists/python-ideas/35842/ as probably the most solid strategy for implementing that idea -- the only projects that would be negatively affected are those that are using the seeding functionality of the global random API, which is a tiny fraction, and the effect on those projects is that they get nudged into using the superior object-oriented API. -n [1] https://lwn.net/Articles/574761/ -- Nathaniel J. Smith -- http://vorpus.org

On 14.09.2015 08:38, Nathaniel Smith wrote:
I don't think that Tim can get fooled into believing he is a crypto wonk ;-) The thread reveals another misunderstanding: Broken code doesn't get any better when you change the context in which it is run. By fixing the RNG used in such broken code and making it harder to run attacks, you are only changing the context in which the code is run. The code itself still remains broken. Code which uses the output from an RNG as a session id without adding any additional security measures is broken, regardless of what kind of RNG you are using. I bet such code will also take any session id it receives as a cookie and trust it without applying extra checks on it. Rather than trying to fix up the default RNG in Python by replacing it with a crypto RNG, it's better to open bug reports to get the broken software fixed. Replacing the default Python RNG with a new unstudied crypto one will likely introduce problems into working code which rightly assumes the proven statistical properties of the MT. Just think of the consequences of adding unwanted bias to simulations. This is far more likely to go unnoticed than a session hijack due to a broken system and can easily cost millions (or earn you millions - it's all probability after all :-)). Now, pointing people who write broken code to a new module which provides a crypto RNG probably isn't much better either. They'd feel instantly secure because it says "crypto" on the box and forget about redesigning their insecure protocol as well. Nothing much you can do about that, I'm afraid. Too easy sometimes is too easy indeed ;-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Sep 14 2015)

Quick fix! The problem with MT would be someone having all 624 32-bit numbers from the state. So, every now and then, the random generator should run twice and discard one of the outputs. Do this about 20 times for each 624 calls and no brute force can find the state. Thanks for your attention ;) João Bernardo

On 14 September 2015 at 13:26, João Bernardo <jbvsmo@gmail.com> wrote:
'Every now and then': what's that? Is it a deterministic interval or a random one? If a random one, where does the random number come from: MT? If deterministic, it's trivial to include the effect in your calculations. More generally, what you're doing here is gaining *information* about the state. You don't have to know it perfectly, just to reduce the space of possible states down. Even if you threw 95% of the results of MT away, every output I do get to watch still lets me reduce the space of possible states the MT is in. This is not a fix.

On Mon, Sep 14, 2015 at 09:26:33AM -0300, João Bernardo wrote:
No, that's not good enough. You can skip a few outputs, and still recover the internal state. The more outputs are skipped, the harder it becomes, but still possible.
Do this about 20 times for each 624 calls and no brute force can find the state. Thanks for your attention ;)
This is not brute force. The recovery attack does not try generating every possible internal state. The MT is a big, complicated equation (technically, it is called a recurrence relation), but being an equation, it is completely deterministic. Given enough values, we can build another equation which can be solved to give the internal state of the MT equation. Are you suggesting that every time you call random.random(), it should secretly generate 20 random numbers and throw away all but the last? (1) That would break backwards compatibility for those who need it. The output of random() is stable across versions: [steve@ando ~]$ for vers in 2.4 2.5 2.6 2.7 3.1 3.2 3.3 3.4; do
(There's a change in the printable output starting in 3.2, but the numbers themselves are the same.) (2) it would make the random number generator twenty times slower than it is now, and MT is already not very fast; (3) most importantly, I don't think that would even solve the problem. I personally don't know how, but I would predict that somebody with more maths skills than me would be able to still recover the internal state. They would just have to collect more values. -- Steve

On Mon, Sep 14, 2015 at 3:37 AM, M.-A. Lemburg <mal@egenix.com> wrote:
As an aphorism this sounds nice, but logically it makes no sense. If the broken thing about your code is that it assumes that the output of the RNG is unguessable, and you change the context by making the output of the RNG unguessable, then the code isn't broken any more. The code would indeed remain broken when run under e.g. older interpreters, but this is not an argument that we should make sure that it stays broken in the future.
Yes, that's... generally the thing you do with session cookies? They're shared secret strings that you use as keys into some sort of server-side session database? What extra checks need to be applied?
I'm afraid you just don't understand what you're talking about here. When it comes to adding bias to simulations, all crypto RNGs have *better* statistical properties than MT. A crypto RNG which was merely as statistically-well-behaved as MT would be considered totally broken, because MT doesn't even pass black-box tests of randomness like TestU01.
Yes, improving the RNG only helps with some problems, not others; it might merely make a system harder to attack, rather than impossible to attack. But giving people unguessable random numbers by default does solve real problems. -n -- Nathaniel J. Smith -- http://vorpus.org

On Mon, Sep 14, 2015 at 10:26 PM, Nathaniel Smith <njs@pobox.com> wrote:
Some systems check to see if the session was created by the same IP address. That can help, but it also annoys legitimate users who change their IP addresses. ChrisA

[This is getting off-topic, so I'll stop after this reply] On 14.09.2015 14:26, Nathaniel Smith wrote:
It's still broken, because it made wrong assumptions about the documented context, and the fact that it did so in the first place suggests that this is not the only respect in which it is broken (pure speculation, but experience shows that bugs usually run around in groups ;-)).
You will at least want to add checks that the session id string was indeed generated by the server and not some bot trying to find valid session ids, e.g. by signing the session id and checking the sig on incoming requests. Other things you can do: fold timeouts into the id, add IP addresses, browser sigs, request sequence numbers. You also need to make sure that the session ids are taken from a large enough set to make it highly unlikely that someone can guess the id simply in case the number of active sessions is significant compared to the universe of possible ids, e.g. 32-bit ids are great for database indexes, but a pretty bad idea if you have millions of active sessions.
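A signed token along those lines can be sketched with the stdlib alone (the key handling and token layout here are illustrative assumptions, not a vetted design):

    import binascii, hashlib, hmac, os

    SERVER_KEY = os.urandom(32)       # kept secret on the server

    def issue_session_id():
        raw = binascii.hexlify(os.urandom(16)).decode('ascii')   # unguessable part
        sig = hmac.new(SERVER_KEY, raw.encode('ascii'), hashlib.sha256).hexdigest()
        return raw + '.' + sig

    def check_session_id(token):
        raw, _, sig = token.partition('.')
        expected = hmac.new(SERVER_KEY, raw.encode('ascii'), hashlib.sha256).hexdigest()
        return hmac.compare_digest(sig, expected)    # constant-time comparison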
I am well aware that MT doesn't satisfy all empirical tests and also that it is not a CSPRNG (see the code Tim and I discussed in this thread showing how easy it is to synchronize to an existing MT RNG if you can gain knowledge of 624 output values). However, it has been extensively studied and it is proven to be equidistributed, which is a key property needed for it to be used as the basis for other derived probability distributions (as is done by the random module). For CSPRNGs you can empirically test properties, but, due to their nature, you cannot prove properties such as equidistribution - even though they usually will pass standard frequency tests. For real-life purposes, you're probably right about them not being biased. I'm a mathematician, though, so I like provable more than empirical :-) The main purpose of CSPRNGs is producing output which you cannot guess, not to produce output which has provable distribution qualities. They do this in a more efficient way than having to wait for enough entropy to be collected - basically making true random number generators practically usable. There's a new field which appears to be popular these days: "Chaotic Pseudo Random Number Generators" (CPRNGs). These are based on chaotic systems and are great for making better use of available entropy. I'm sure we'll have something similar to the MT for these chaotic systems come out of this research in a while and then Python should follow this by implementing it in a new module. Until then, I think it's worthwhile using the existing rand code in OpenSSL and exposing this through the ssl module: https://www.openssl.org/docs/man1.0.1/crypto/rand.html It interfaces to platform hardware true RNGs where available, and falls back to an SHA-1 based generator with a 1k entropy pool where needed. It's being used for SSL session keys, key generation, etc., trusted by millions of people and passes the NIST tests. This paper explains the algorithm in more detail: http://webpages.uncc.edu/yonwang/papers/lilesorics.pdf The downside of the OpenSSL implementation is that it can fail if there isn't enough entropy available. Here's a slightly better algorithm, but it's just one of many which you can find when searching for CPRNGs: https://eprint.iacr.org/2012/471.pdf
Drop the "by default" and I agree, as will probably everyone else in this thread :-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Sep 14 2015)

On Sep 14 2015, "M.-A. Lemburg" <mal-SVD0I98eSHvQT0dZR+AlfA@public.gmane.org> wrote:
The chance of a bot hitting a valid (randomly generated) session key by chance should be just as high as the bot generating a correctly signed session key by chance, if I'm not mistaken. (Assuming, of course, that the completely random key has the same number of bits as the other key + signature.) Best, -Nikolaus -- GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F »Time flies like an arrow, fruit flies like a Banana.«

On 14/09/15 19:15, M.-A. Lemburg wrote:
I am well aware that MT doesn't satisfy all empirical tests and also that it is not a CSPRNG
And with this criterion, only MT and certain PCG generators are acceptable. Those are (to my knowledge) the only ones with proven equidistribution. Sturla

On 14/09/15 21:07, Sturla Molden wrote:
Just to explain, for those who do not know... Equidistribution means that the numbers are "uniformly distributed", or specifically that "the proportion of the sequence that fall within an interval is proportional to the length of the interval". With one-dimensional equidistribution, the deviates are uniformly distributed on a line. With two-dimensional equidistribution, the deviates are uniformly distributed in a square. With three-dimensional equidistribution, the deviates are uniformly distributed in a cube. k-dimensional equidistribution generalizes this up to a k-dimensional space. Let us say you want to simulate a shooter firing a gun at a target. Every bullet is aimed at the target and hits in a slightly different place. The shooter is unbiased, but there will be some random jitter. The probability of hitting the target should be proportional to its size, right? Perhaps! Mersenne Twister MT19937 (used in Python) is proven to have 623-dimensional equidistribution. Certain PCG generators are proven to have equidistribution of arbitrary dimensionality. Your simulation of the shooter will agree with common sense if you pick one of these. With other generators, there is no such k-dimensional equidistribution. Your simulation of the shooter will disagree with common sense. Which is correct? Common sense. From a mathematical point of view, this is so important that anything other than Mersenne Twister or PCG is not worth considering in a Monte Carlo simulation. Sturla
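A toy version of that shooter (my own sketch): with a generator that is well distributed in two dimensions, the hit rate converges to the target's area.

    import random

    def hit_fraction(shots, width=0.3, height=0.2):
        # Each shot is a 2-D point in the unit square; the fraction landing
        # inside the width x height target should approach width * height.
        hits = 0
        for _ in range(shots):
            x, y = random.random(), random.random()
            if x < width and y < height:
                hits += 1
        return hits / shots

    # hit_fraction(1000000) should come out close to 0.06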

On 2015-09-14 20:07, Sturla Molden wrote:
Do not confuse k-dimensional equidistribution with "equidistribution". The latter property (how uniformly a single draw is distributed) is the one that the derived probability distributions rely upon, not the former. Funny story: MT is provably *not* strictly equidistributed; it produces exactly 624 fewer 0s than it does any other uint32 if you run it over its entire period. Not that it really matters, practically speaking. FWIW, for lots of PRNGs either property can be proven. To Nate's point, I think he is primarily thinking of counter-mode block ciphers when he talks of CSPRNGs, and they are trivially proved to be equidistributed. The counter is obviously equidistributed, and the symmetric encryption function is a bijective function from counter to output. However, not all CSPRNGs are constructed alike. In particular, ChaCha20 is a stream cipher rather than a block cipher, and I think Marc-Andre is right that it would be difficult to prove equidistribution. Proving substantial *non*-equidistribution could eventually happen though, as it did to ARC4, which prompted its replacement with ChaCha20 in OpenBSD, IIRC. And all that said, provable equidistribution (much less provable k-dimensional equidistribution) doesn't make a good PRNG. A simple counter satisfies equidistribution, but it is a terrible PRNG. The empirical tests are more important IMO. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
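For reference, that counter-mode construction is only a few lines if you assume the third-party "cryptography" package (a sketch of the idea, not a recommendation):

    import os
    from cryptography.hazmat.backends import default_backend
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    class CounterModeRNG:
        # AES under a fixed key applied to an incrementing counter. Since
        # AES with a fixed key is a bijection on 128-bit blocks, a full pass
        # over the counter hits every block exactly once - the "trivially
        # equidistributed" construction described above.
        def __init__(self, key=None):
            key = key if key is not None else os.urandom(32)
            cipher = Cipher(algorithms.AES(key), modes.ECB(),
                            backend=default_backend())
            self._encryptor = cipher.encryptor()
            self._counter = 0

        def next_block(self):
            block = self._counter.to_bytes(16, 'big')
            self._counter += 1
            return self._encryptor.update(block)   # 16 fresh pseudo-random bytes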

On 2015-09-14 21:56, Sturla Molden wrote:
Yeah, but we do that every time we draw k numbers in a simulation at all. And we usually draw millions. In that case, perfect k=623-dimensional equidistribution is not really any better than k=1, provided that the PRNG is otherwise good. The requirement for a good PRNG for simulation work is that it be *well* distributed in reasonable dimensions, not that it be *exactly* equidistributed for some k. And well-distributedness is exactly what is tested in TestU01. It is essentially a collection of simulations designed to expose known statistical flaws in PRNGs. So to your earlier question as to which is more damning, failing TestU01 or not being perfectly 623-dim equidistributed, failing TestU01 is. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

On 14.09.2015 23:19, Robert Kern wrote:
Depends on your use case, but the fact that you can prove it is what really matters - well, at least for me :-)
TestU01 includes tests which PRNGs of the MT type have trouble passing, since they are linear. This makes them poor choices for crypto applications, but does not have much effect on simulations using only a tiny part of the available period. For MT there's an enhanced version called SFMT which performs better in this respect: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/M062821.pdf (the paper also discusses the linear dependencies) See http://www.jstatsoft.org/v50/c01/paper for a discussion of MT vs. SFMT. You can also trick TestU01 into having all tests pass by applying a non-linear transformation (though I don't really see the point in doing this). The WELL family of generators is a newer development, which provides even better characteristics: http://www.iro.umontreal.ca/~lecuyer/myftp/papers/wellrng.pdf Also note that by seeding the MT in Python with truly random data, the shortcomings of MT with respect to having problems escaping "zeroland" (many 0 bits in the seed) are mostly avoided. Anyway, it's been an interesting discussion, but I think it's time to let go :-) Here's a first cut at an implementation of the idea to use OpenSSL's rand API as the basis for an RNG. It even supports monkey patching the random module, though I don't think that's good design.

"""
RNG based on OpenSSL's rand API.

Marc-Andre Lemburg, 2015. License: MIT
"""
# Needs OpenSSL installed: pip install egenix-pyopenssl
from OpenSSL import rand
import random, struct, binascii

# Number of bits in an IEEE float
BITS_IN_FLOAT = 53

# Scale to apply to RNG output to make uniform
UNIFORM_SCALING = 2 ** -BITS_IN_FLOAT

### Helpers

def str2long(value):
    # Unpack a byte string into an unsigned integer (big-endian)
    value_len = len(value)
    if value_len <= 4:
        if value_len < 4:
            value = '\0' * (4 - value_len) + value
        return struct.unpack('>L', value)[0]
    elif value_len <= 8:
        if value_len < 8:
            value = '\0' * (8 - value_len) + value
        return struct.unpack('>Q', value)[0]
    return long(binascii.hexlify(value), 16)

###

class OpenSSLRandom(random.Random):

    """ RNG using the OpenSSL rand API, which provides a
        cross-platform cryptographically secure RNG.
    """

    def random(self):

        """ Return a random float from [0.0, 1.0).
        """
        return (str2long(rand.bytes(7)) >> 3) * UNIFORM_SCALING

    def getrandbits(self, bits):

        """ Return an integer with the given number of random bits.
        """
        if bits <= 0:
            raise ValueError('bits must be >0')
        if bits != int(bits):
            raise TypeError('bits must be an integer')
        # Get enough bytes for the requested number of bits
        numbytes = (bits + 7) // 8
        x = str2long(rand.bytes(numbytes))
        # Truncate bits, if needed
        return x >> (numbytes * 8 - bits)

    def seed(self, value=None):

        """ Feed entropy to the RNG.

            value may be None, an integer or a string. If None,
            2.5k bytes data are read from /dev/urandom and fed into
            the RNG.
        """
        if value is None:
            try:
                value = random._urandom(2500)
                entropy = 2500
            except NotImplementedError:
                return
        if isinstance(value, (int, long)):
            # Convert to a string, since rand.add() needs one
            value = '%x' % value
            entropy = len(value)
        else:
            value = str(value)
            entropy = len(value)
        # Let's be conservative regarding the available entropy in value
        rand.add(value, entropy / 2)

    def _notimplemented(self, *args, **kwds):
        raise NotImplementedError(
            'OpenSSL RNG does not implement this method')

    getstate = _notimplemented
    setstate = _notimplemented

### Testing

def install_as_default_rng():

    """ Monkey patch the random module
    """
    _inst = OpenSSLRandom()
    random._inst = _inst
    for attr in ('seed', 'random', 'uniform', 'triangular',
                 'randint', 'choice', 'randrange', 'sample', 'shuffle',
                 'normalvariate', 'lognormvariate', 'expovariate',
                 'vonmisesvariate', 'gammavariate', 'gauss',
                 'betavariate', 'paretovariate', 'weibullvariate',
                 'getstate', 'setstate', 'jumpahead', 'getrandbits'):
        setattr(random, attr, getattr(_inst, attr))

def _test():
    # Install
    install_as_default_rng()
    # Now run the random module tests
    random._test()

###

if __name__ == '__main__':
    _test()

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Sep 14 2015)

On Mon, Sep 14, 2015 at 10:19:09PM +0100, Robert Kern wrote:
I'm confused here. Isn't "well-distributed" a less-strict test than "exactly equidistributed"? MT is (almost) exactly k-equidistributed up to k = 623, correct? So how does it fail the "well-distributed" test? -- Steve

On Mon, Sep 14, 2015 at 8:53 PM, Steven D'Aprano <steve@pearwood.info> wrote:
No, "well-distributed" means "distributed like true random stream", which is a strong and somewhat fuzzy requirement. "k-equidistributed" is one particular way of operationalizing this requirement, but it's not a very good one. The idea of k-equidistribution is historically important because it spurred the development of generators that avoided the really terrible flaws of early designs like Unix rand, but it's not terribly relevant to modern RNG design. Here's how it works. Formally speaking, a randomized algorithm or Monte Carlo simulation or similar can be understood as a function mapping an infinite bitstring to some output value, F : R -> O. If we sample infinite bitstrings R uniformly at random (using a theoretical "true" random number generator), and then apply F to each bitstring, then this produces some probability distribution over output values O. Now given some *pseudo* random number generator, we consider our generator to be successful if that repeatedly running F(sample from this RNG) gives us the same distribution over output values as if we had repeatedly run F(true random sample). So for example, you could have a function F that counts up how many times it sees a zero or a one in the first 20,000 entries in the bitstring, and you expect those numbers to come up at ~10,000 each with some distribution around that. If your RNG is such that you reproduce that distribution, then you pass this function. Note that this is a little counterintuitive: if your RNG is such that over the first 20,000 entries it always produces *exactly* 10,000 zeros and 10,000 ones, then it fails this test. The Mersenne Twister will pass this test. Or you could have a function F that takes the first 19937 bits from the random stream, uses it to construct a model of the internal state of a Mersenne Twister, predicts the next 1000 bits, and returns True if they match and False if they don't match. On a real random stream this function returns True with probability 2^-1000; on a MT random stream it returns True with probability 1. So the MT fails this test. OTOH this is obviously a pretty artificial example. The only case the scientists actually care about is the one where F is "this simulation right here that I'm trying to run before the conference deadline". But since scientists don't really want to design a new RNG for every simulation, we instead try to design our RNGs such that for all "likely" or "reasonable" functions F, they'll probably work ok. In practice this means that we write down a bunch of explicit test functions F inside a test battery like TestU01, run the functions on the RNG stream, and if their output is indistinguishable in distribution from what it would be for a true random stream then we say they pass. And we hope that this will probably generalize to the simulation we actually care about. Cryptographers are worried about the exact same issue -- they want RNGs that have the property that for all functions F, the behavior is indistinguishable from true randomness. But unlike the scientists, they're not content to say "eh, I checked a few functions and it seemed to work on those, probably the ones I actually care about are okay too". 
The cryptographers consider it a failure if an adversary with arbitrary computing power, full knowledge of the RNG algorithm, plus other magic powers like the ability to influence the RNG seeding, can invent *any* function F that acts differently (produces a different distribution over outputs) when fed the input from the RNG as compared to a true random stream. The only rule for them is that the function has to be one that you can actually implement on a computer that masses, like, less than Jupiter, and only has 1000 years to run. And as far as we can tell, modern crypto RNGs succeed at this very well. Obviously the thing the scientists worry about is a *strict* subset of what the cryptographers are worried about. This is why it is silly to worry that a crypto RNG will cause problems for a scientific simulation. The cryptographers take the scientists' real goal -- the correctness of arbitrary programs like e.g. a monte carlo simulation -- *much* more seriously than the scientists themselves do. (This is because scientists need RNGs to do their real work, whereas for cryptographers RNGs are their real work.) Compared to this, k-dimensional equidistribution is a red herring: it requires that you have a RNG that repeats itself after a while, and within each repeat it produces a uniform distribution over bitstrings of some particular length. By contrast, a true random bitstring does not repeat itself, and it gives a uniform distribution over bitstrings of *arbitrary* length. In this regard, crypto RNGs are like true random bitstrings, not like k-equidistributed RNGs. This is a good thing. k-equidistribution doesn't really hurt (theoretically it introduces flaws, but for realistic designs they don't really matter), but if randomness is what you want then crypto RNGs are better. I-should-really-get-a-blog-shouldn't-I'ly-yrs, -n -- Nathaniel J. Smith -- http://vorpus.org
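As a toy illustration of such a test function F - a sketch of the zeros-and-ones count described above, not anything taken from TestU01:

    import random
    from collections import Counter

    def F(bits):
        # Toy test statistic: how many of the first 20,000 bits are ones.
        return bin(bits).count('1')

    rng = random.Random()            # MT, seeded from os.urandom by default
    results = Counter(F(rng.getrandbits(20000)) for _ in range(2000))
    # A generator that passes this F yields counts that look like a
    # Binomial(20000, 0.5) sample: centred near 10,000 with spread ~70.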

On 15.09.2015 09:36, Nathaniel Smith wrote:
I think this explains why we cannot make ends meet: A scientist wants to be able to *repeat* a simulation in exactly the same way without having to store GBs of data (or send them to colleagues to have them verify the results). Crypto RNGs cannot provide this feature by design. What people designing PRNGs are after is to improve the statistical properties of these PRNGs while still maintaining the repeatability of the output.
Yes, cryptographers are the better folks, understood. These arguments are not really helpful. They are not even arguments. It's really simple: If you don't care about being able to reproduce your simulation results, you can use a crypto RNG, otherwise you can't.
k-dim equidistribution is a way to measure how well your PRNG behaves, because it describes in analytical terms how far you can get with increasing the linear complexity of your RNG output. The goal is not to design a PRNG with a specific k, but to increase k as far as possible, given the RNG's deterministic limitations. It's also not something you require of a PRNG, it's simply a form of analytical measurement, just like the tests in TestU01 or the NIST test set are statistical measurements for various aspects of RNGs. Those statistical tests are good in detecting flaws of certain kinds, but they are not perfect. If you know the tests, you can work around them and have your RNG appear to pass them, e.g. you can trick a statistical test for linear dependencies by applying a non-linear transform. That doesn't make the RNGs better, but it apparently is a way to convince some people of the quality of your RNG.
If you can come up with a crypto RNG that allows repeating the results, I think you'd have us all convinced; otherwise it doesn't really make sense to compare apples and oranges and insist that orange juice is better for you than apple juice ;-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Sep 15 2015)

On Tue, Sep 15, 2015 at 1:45 AM, M.-A. Lemburg <mal@egenix.com> wrote:
Err... I think we're arguing past each other. (Hint: I'm a scientist, not a cryptographer ;-).) My email was *only* trying to clear up the argument that keeps popping up about whether or not a cryptographic RNG could introduce bias in simulations etc., as compared to the allegedly-better-behaved Mersenne Twister. (As in e.g. your comment upthread that "[MT] is proven to be equidistributed which is a key property needed for it to be used as basis for other derived probability distributions".) This argument is incorrect -- equidistribution is not a guarantee that an RNG will produce good results when deriving other probability distributions, and in general cryptographic RNGs will produce as-or-better results than MT in terms of correctness of output. On this particular axis, using a cryptographic RNG is not at all dangerous. Obviously this is only one of the considerations in choosing an RNG; the quality of the randomness is totally orthogonal to considerations like determinism. (Cryptographers also have deterministic RNGs -- they call them "stream ciphers" -- and these will also meet or beat MT in any practically relevant test of correctness for the same reasons I outlined, despite not being provably equidistributed. Of course there are then yet other trade-offs like speed. But that's not really relevant to this thread, because no-one is proposing replacing MT as the standard deterministic RNG in Python; I'm just trying to be clear about how one judges the quality of randomness that an RNG produces.) -n -- Nathaniel J. Smith -- http://vorpus.org

On 15.09.2015 13:41, Nathaniel Smith wrote:
Ok, thanks for the clarification.
You won't get me to agree on "statistical tests are better than mathematical proofs", so let's call it a day :-)
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Sep 15 2015)

On 15/09/15 10:45, M.-A. Lemburg wrote:
Yes and no. Conceptually it means that k subsequent samples will have exactly zero correlation. But any PRNG that produces detectable correlation between samples 623 steps apart is junk anyway. The MT has proven equidistribution for k=623, but many have measured equidistribution for far longer periods than that. Numerical computations are subject to rounding error and truncation error whatever you do. The question is whether the deviation from k-dim equidistribution will show up in your simulation result or drown in the error terms. Sturla

On 15.09.2015 14:09, Sturla Molden wrote:
I guess the answer is: it depends :-) According to the SFMT paper:

"""
...it requires 10**28 samples to detect an F2-linear relation with 15 (or more) terms among 521 bits, by a standard statistical test. If the number of bits is increased, the necessary sample size is increased rapidly. Thus, it seems that k(v) of SFMT19937 is sufficiently large, far beyond the level of the observable bias. On the other hand, the speed of the generator is observable.
"""

http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/M062821.pdf

(which again refers to this paper: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/HONGKONG/hong-fin4....)

10**28 is already a lot of data, but YMMV, of course. Here's a quote for the WELL family of PRNGs:

"""
The WELL generators mentioned in Table IV successfully passed all the statistical tests included ... TestU01 ..., except those that look for linear dependencies in a long sequence of bits, such as the matrix-rank test ... for very large binary matrices and the linear complexity tests ... This is in fact a limitation of all F2-linear generators, including the Mersenne twister, the TT800, etc. Because of their linear nature, the sequences produced by these generators just cannot have the linear complexity of a truly random sequence. This is definitely unacceptable in cryptology, for example, but is quite acceptable for the vast majority of simulation applications if the linear dependencies are of long range and high order.
"""

http://www.iro.umontreal.ca/~lecuyer/myftp/papers/wellrng.pdf

-- Marc-Andre Lemburg

On 15/09/15 16:20, M.-A. Lemburg wrote:
You seem to be confusing the DCMT with the SFMT, which is a fast SIMD-friendly Mersenne Twister. The DCMT is intended for using the Mersenne Twister in parallel computing (i.e. one Mersenne Twister per processor). It is not a Mersenne Twister accelerated with parallel hardware. That would be the SFMT. http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/DC/dc.html http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/DC/dgene.pdf The periods for the DC Mersenne Twisters they report are long enough, e.g. 2^127-1 or 2^521-1, but much shorter than the period of MT19937 (2^19937-1). This does not matter, because the period of MT19937 is excessive. In scientific computing, the sequence is long enough for most practical purposes if it is larger than 2^64. 2^127-1 is more than enough, and this is the shortest DCMT period reported in the paper. So do we care? Probably not. Sturla

On 15.09.2015 16:54, Sturla Molden wrote:
I was talking about the SFMT, which is a variant of the MT for processors with SIMD instruction sets (most CPUs have these nowadays) and which has 32-, 64-bit or floating point output: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/index.html But what I really wanted to reference was the discussion in the SFMT paper about the practical effects of the 623-dim equidistribution (see the end of the first paper I quoted; the discussion references the second paper).
Thanks for the pointers. I wasn't aware of a special MT variant for parallel computing. -- Marc-Andre Lemburg

M.-A. Lemburg wrote:
According to http://www.pcg-random.org/other-rngs.html, this ChaCha20 implementation is seedable and should be reproducible: https://gist.github.com/orlp/32f5d1b631ab092608b1 ...though as a scientist I am concerned about the k-dimensional equidistribution, and also that if the random number generator is changed without the interface changing, it may screw up tests and existing code which rely on a particular sequence of random numbers. J

[M.-A. Lemburg <mal@egenix.com>]
For example, run AES in CTR mode. Remember that we did something related on whatever mailing list it was ;-) discussing the PSF's voting system, to break ties in a reproducible-by-anyone way using some public info ("news") that couldn't be known until after the election ended. My understanding is that ChaCha20 (underlying currently-trendy implementations of arc4random) is not only deterministic, it even _could_ support an efficient jumpahead(n) operation. The specific OpenBSD implementation of arc4random goes beyond just using ChaCha20 by periodically scrambling the state with kernel-obtained "entropy" too, and that makes it impossible to reproduce its sequence. But it would remain a crypto-strength generator without that extra scrambling step. Note that these _can_ be very simple to program. The "Blum Blum Shub" crypto generator from 30 years ago just iteratively squares a "big integer" modulo a (carefully chosen) constant. Not only is it deterministic, but given any integer `i` it's efficient to directly compute the i'th output. It's an expensive generator, though (typically only 1 output bit is derived from each modular squaring operation).
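Purely to illustrate how simple such generators can be, here is a minimal sketch of the Blum Blum Shub construction mentioned above. The primes and seed below are made-up toy values (both primes congruent to 3 mod 4); a real deployment would need large secret primes, so this is not secure as written.

    # Blum Blum Shub sketch: iterated modular squaring, one low-order bit per step.
    # Toy parameters for illustration only -- NOT secure.
    p, q = 11, 23            # both prime and congruent to 3 mod 4
    n = p * q

    def bbs_bits(seed, count):
        # The seed should be coprime to n.
        x = seed % n
        for _ in range(count):
            x = (x * x) % n      # the expensive modular-squaring step
            yield x & 1          # reveal only the least significant bit

    bits = list(bbs_bits(seed=3, count=16))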

On 15.09.2015 17:46, Tim Peters wrote:
Ah, now we're getting somewhere :-) If we accept that non-guessable but deterministic is a good compromise, then adding a cipher behind MT sounds like a reasonable way forward, even as the default. For full crypto strength, people would still have to rely on solutions like /dev/urandom or the OpenSSL one (or reseed the default RNG every now and then). Everyone else gets the benefit of non-guessability, but keeps the ability to seed the default RNG in Python. Is there some research on this (MT + cipher or hash)?
IMO, that's a different discussion, and we should rely on existing, well-tested full-entropy mixers (urandom or OpenSSL) until the researchers have come up with something like MT for chaotic PRNGs. -- Marc-Andre Lemburg

... [Marc-Andre]
I expect the only real reason "new entropy" is periodically mixed into OpenBSD's arc4random() is to guard against the possibility that a weakness in ChaCha20 is discovered later. If there were any known computationally feasible way whatsoever to distinguish bare-bones ChaCha20's output from a "truly random" sequence, it wouldn't be called "crypto" to begin with. But reseeding MT every now and again is definitely not suitable for crypto purposes. You would need to reseed at least every 624 outputs, and from a crypto-strength seed source. In which case, why bother with MT at all? You could just as well use the crypto source directly.
Is there some research on this (MT + cipher or hash)?
Oh, sure. MT's creators noted from the start that it would suffice to run MT's outputs through a crypto hash (like your favorite flavor of SHA). That's just as vulnerable to "poor seeding" attacks as plain MT, but it's computationally infeasible to deduce the state from any number of hashed outputs (unless your crypto hash is at least partly invertible, in which case it's not really a crypto hash ;-) ). For other approaches, search for CryptMT. MT's creators suggested a number of other schemes over the years. The simplest throws away the "tempering" part of MT (the 4 lines that map the raw state word into a mildly scrambled output word - not because it needs to be thrown away, but because they think it would no longer be needed given what follows). Then one byte is obtained by grabbing the next MT 32-bit output, folding it into a persistent accumulator via multiplication, and just revealing the top byte:

    accum = some_odd_integer
    while True:
        accum *= random.getrandbits(32) | 1
        yield accum >> 24

I did see one paper suggesting it was possible to distinguish the output of that from a truly random sequence given 2**50 consecutive outputs (but that's all - still no way to deduce the state).
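A self-contained version of that sketch, with one assumption made explicit: the accumulator is kept to 32 bits (multiplication mod 2**32) so the product doesn't grow without bound, and the revealed byte is its top 8 bits. This illustrates the idea only; it is not the authors' actual CryptMT code.

    import random

    def folded_bytes(start=0x9E3779B9):
        # Fold successive MT 32-bit outputs into an odd 32-bit accumulator via
        # multiplication, revealing only the accumulator's top byte each step.
        accum = start | 1                                    # keep the accumulator odd
        while True:
            accum = (accum * (random.getrandbits(32) | 1)) & 0xFFFFFFFF
            yield accum >> 24                                # top 8 of 32 bits

    gen = folded_bytes()
    sample = bytes(next(gen) for _ in range(16))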

[Tim]
Although what's "computationally feasible" may well have changed since then! These days I expect even a modestly endowed attacker could afford to store an exhaustive table of the 2**32 possible outputs and their corresponding hashes. Then the hashes are 100% invertible via simple lookup, so are no better than not hashing at all.

On 2015-09-16 01:43, Tim Peters wrote:
Periodic reseeding also serves to guard against other leaks of information about the underlying state that don't come from breaking through the cipher. If an attacker manages to deduce the state through side channels, timing attacks on the machine, brief physical access, whatever, then reseeding with new entropy will limit the damage rather than blithely continuing on with a compromised state forever. It's an important feature of a CSPRNG. Using a crypto output function in your PRNG is a necessary but not sufficient condition for security. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

On 16.09.2015 02:43, Tim Peters wrote:
Thanks for the "CryptMT" pointers. I'll do some research on this after PyCon UK. http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/CRYPTMT/index.html A quick glimpse at http://www.ecrypt.eu.org/stream/p3ciphers/cryptmt/cryptmt_p3.pdf suggests that this is a completely new stream cipher, though it uses the typical elements (key + non-linear filter + feedback loop). The approach is interesting, though: they propose a PRNG which can then be used as a stream cipher by XOR'ing the PRNG output with the data stream. So the PRNG implies the cipher, not the other way around, as with many other approaches to CSPRNGs. That's probably also one of its perceived weaknesses: it's different from the common approach. On 16.09.2015 02:55, Tim Peters wrote: > [Tim]
Simply adding a hash doesn't sound like a good idea. My initial thought was to use a (well-studied) stream cipher on the output, not just a hash. -- Marc-Andre Lemburg

On Wed, Sep 16, 2015 at 1:21 AM, M.-A. Lemburg <mal@egenix.com> wrote:
NB that that paper also says that it's patented and requires a license for commercial use.
I think you just described the standard definition of a stream cipher? "Stream cipher" is just the crypto term for a deterministic RNG that you XOR with data. (However, it's not a CSPRNG, because those require seeding schedules and things like that -- check out e.g. Fortuna.) -n -- Nathaniel J. Smith -- http://vorpus.org

On 16.09.2015 11:02, Nathaniel Smith wrote:
Hmm, you're right: """ If CryptMT is selected as one of the recommendable stream ciphers by eSTREAM, then it is free even for commercial use. """ Hasn't happened yet, but perhaps either eSTREAM or the authors will change their minds. Too bad they haven't yet, since it's a pretty fast stream cipher :-( Anyway, here's a paper on CryptMT: http://cryptography.gmu.edu/~jkaps/download.php?docid=1083
The standard definition I know reads like this:

"""
Stream ciphers are an important class of encryption algorithms. They encrypt individual characters (usually binary digits) of a plaintext message one at a time, using an encryption transformation which varies with time.
"""

(taken from Chap 6.1 Introduction of "Handbook of Applied Cryptography"; http://cacr.uwaterloo.ca/hac/about/chap6.pdf)

That's a bit more general than what you describe, since the keystream can pretty much be generated in arbitrary ways. What I wanted to emphasize is that a common way of coming up with a stream cipher is to take an existing block cipher and transform it into a stream cipher. See e.g. https://www.emsec.rub.de/media/crypto/attachments/files/2011/03/hudde.pdf E.g. take AES run in CTR (counter) mode: it applies AES repeatedly to the values of a simple counter as its "RNG". Running MT + AES would result in a similar setup, except that the source would have somewhat better qualities and would be based on standard, well-studied technology, albeit slower than going straight for a native stream cipher. -- Marc-Andre Lemburg
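As a stdlib-only sketch of the counter-mode construction described here, the block below hashes (key || nonce || counter) for successive counter values -- SHA-256 stands in for AES purely to avoid a third-party dependency -- and XORs the resulting keystream with the data. Applying the same operation again decrypts.

    import hashlib

    def ctr_keystream(key, nonce):
        # Counter mode: derive keystream blocks from successive counter values.
        counter = 0
        while True:
            block = hashlib.sha256(key + nonce + counter.to_bytes(16, "big")).digest()
            yield from block                  # 32 keystream bytes per counter value
            counter += 1

    def xor_stream(data, key, nonce):
        # A keystream becomes a stream cipher by XOR'ing it with the data.
        ks = ctr_keystream(key, nonce)
        return bytes(b ^ next(ks) for b in data)

    msg = b"attack at dawn"
    ct = xor_stream(msg, key=b"k" * 32, nonce=b"n" * 16)
    assert xor_stream(ct, key=b"k" * 32, nonce=b"n" * 16) == msg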

On 2015-09-16 12:38, M.-A. Lemburg wrote:
Indeed. DE Shaw has done the analysis for you: https://www.deshawresearch.com/resources_random123.html
Why do you think it would have better qualities? You'll have to redo the analysis that makes MT and AES each so well-studied, and I'm not sure that all of the desirable properties of either will survive the combination. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

On 15/09/15 09:36, Nathaniel Smith wrote:
No. Cryptographers care about predictability, not the exact distribution. Any distribution can be considered randomness with a given entropy, but not every distribution is uniform. Only the uniform distribution is uniform. That is where our needs diverge. Cryptographers damn any RNG that allows the internal state to be reconstructed. Scientists damn any RNG that does not produce the distribution of interest. Sturla

On Sep 15, 2015 4:57 AM, "Sturla Molden" <sturla.molden@gmail.com> wrote:
Any distribution can be considered randomness with a given entropy, but not every distribution is uniform. Only the uniform distribution is uniform. That is where our needs diverge. Cryptographers damn any RNG that allows the internal state to be reconstructed. Scientists damn any RNG that does not produce the distribution of interest. No, this is simply wrong. I promise! ("Oh, sorry, this is contradictions...") For the output of a cryptographic RNG, any deviation from the uniform distribution is considered a flaw. (And as you know, given uniform variates you can construct any distribution of interest.) If I know that you're using a coin that usually comes up heads to generate your passwords, then this gives me a head start in guessing your passwords, and that's considered unacceptable. Or for further evidence, consider: "Scott Fluhrer and David McGrew also showed such attacks which distinguished the keystream of the RC4 from a random stream given a gigabyte of output." -- https://en.m.wikipedia.org/wiki/RC4#Biased_outputs_of_the_RC4 This result is listed on Wikipedia because the existence of a program that can detect a deviation from perfect uniformity given a gigabyte of samples and an arbitrarily complicated test statistic is considered a publishable security flaw (and RC4 is generally deprecated because of this and related issues -- this is why OpenBSD's "arc4random" function no longer uses (A)RC4). -n

On 15/09/15 14:34, Nathaniel Smith wrote:
The uniform distribution has the highest entropy, yes, but that does not mean that other distributions are unacceptable. The sequence just has to be incredibly hard to predict. A non-uniform distribution will give an adversary a head start, that is true, but if the adversary still cannot complete the brute-force attack before the end of the universe, there is little help in knowing this. In scientific computing we do not care about adversaries. We care about the correctness of our numerical result. That means we should be fussy about the distribution, not about the predictability or "randomness" of a sequence, nor about adversaries looking to recover the internal state. MT is proven to be uniform (equidistributed) up to 623 dimensions, but it is incredibly easy to recover the internal state. The latter we do not care about. In fact, we can often do even better with "quasi-random" sequences, e.g. Sobol sequences, which are not constructed to produce "uncorrelated" points, but constructed to produce correlated points that are deliberately more uniform than uncorrelated points. Sturla

Nathaniel Smith <njs@...> writes:
Do you have links to papers analyzing ChaCha20 w.r.t. statistical properties? The only information that I found is http://www.pcg-random.org/other-rngs.html#id11 "Fewer rounds result in poor statistical performance; ChaCha2 fails statistical tests badly, and ChaCha4 passes TestU01 but sophisticated mathematical analysis has shown it to exhibit some bias. ChaCha8 (and higher) are believed to be good. Nevertheless, ChaCha needs to go to more work to achieve satisfactory statistical quality than many other generators. ChaCha20, being newer, has received less scrutiny from the cryptographic community than Arc4." Stefan Krah

On 2015-09-15 04:53, Steven D'Aprano wrote:
k=623 is a tiny number of dimensions for testing "well-distributedness". You should be able to draw millions of values without detecting significant correlations. Perfect k-dim equidistribution is not a particularly useful metric on its own (at least for simulation work). You can't just say "PRNG A has a bigger k than PRNG B, therefore PRNG A is better". You need a minimum period to even possibly reach a certain k, and that period goes up exponentially with k. Given two PRNGs that have the same period, but one has a much smaller k than the other, *then* you can start making inferences about relative quality (again, for simulation work; ChaCha20 has a long period but no guarantees of k that I am aware of, though its claim to fame is security, not simulation work). Astronomical periods have costs, so you only want to pay for what is actually worth it; given the period the MT does have, it's certainly a good thing that its k is near the upper bound. PRNGs with shorter, but still roomy, periods like 2**128 are not worse just because their k is necessarily smaller. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

On 14 September 2015 at 16:38, Nathaniel Smith <njs@pobox.com> wrote:
I started a new thread breaking that out into more of a proto-PEP (including your reference to the PHP research - thanks for that!). Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

[Tim]
[Nathaniel Smith <njs@pobox.com>]
Here you go: https://media.blackhat.com/bh-us-12/Briefings/Argyros/BH_US_12_Argyros_PRNG_...
The most important question I have about that is from its Appendix D, where they try to give a secure "all-purpose" token generator. If openssl_random_pseudo_bytes is available, they use just that and call it done. Otherwise they go on to all sorts of stuff. My question is, even if /dev/urandom is available, they're _not_ content to use that alone. They continue to mix it up with all other kinds of silly stuff. So why do they trust urandom less than OpenSSL's gimmick? It's important to me because, as far as I'm concerned, os.urandom() is already Python's way to spell "crypto random" (yes, it would be better to have one guaranteed to be available on all Python platforms).
I couldn't help but note ;-) that at least 3 of the apps had already attempted repairs in response to bug reports filed against their insecure password-reset schemes at least 5 years ago. You can lead a PHP'er to water, but ... ;-)
It's cute, but I doubt anyone but the authors had the patience - or knowledge - to write a solver dealing with tens of thousands of picky equations over about 20000 variables. They don't explain enough about the details for a script kiddie to do it. Even very bright hackers attack what's easiest to topple, like poor seeding - that's just easy brute force. It remained unclear to me _exactly_ what "in the presence of non consecutive outputs" is supposed to mean. In the only examples, they knew exactly how many times MT was called. "Non consecutive" in all those contexts appeared to mean "but we couldn't observe _any_ output bits in some cases - the ones we could know something about were sometimes non-consecutive". So in the MT output sequence, they had no knowledge of _some_ of the outputs, but they nevertheless knew exactly _which_ of the outputs they were wholly ignorant about. That's no problem for the equation solver. They just skip adding any equations for the affected bits, keep collecting more outputs, and potentially "wrap around", probably leading to an overdetermined system in the end. But Python doesn't work the way PHP does here. As explained in another message, in Python you can have _no idea_ how many MT outputs are consumed by a single .choice() call. In the PHP equivalent, you always consume exactly one MT output. PHP's method suffers statistical bias, but under the covers Python uses an accept/reject method to avoid that. Any number of MT outputs may be (invisibly!) consumed before "accept" is reached, although typically only one or two. You can deduce some of the leading MT output bits from the .choice() result, but _only_ for the single MT output .choice() reveals anything about. About the other MT outputs it may consume, you can't even know that some _were_ skipped over, let alone how many. Best I can tell, that makes a huge difference to whether their solver is even applicable to cracking idiomatic "password generators" in Python. You can't know which variables correspond to the bits you can deduce. You could split the solver into multiple instances to cover all feasible possibilities (for how many MT outputs may have been invisibly consumed), but the number of solver instances needed then grows exponentially with the number of outputs you do see something about. In the worst case (31 bits are truncated), they need over 19000 outputs to deduce the state. Even a wildly optimistic "well, let's guess no more than 1 MT output was invisibly rejected each time" leads to over 2**19000 solver clones then. Sure, there's doubtless a far cleverer way to approach that. But unless another group of PhDs looking to level up in Security World burns their grant money to tackle it, that's yet another attack that will never be seen in the real world ;-)
And they all use .choice(), which is overwhelmingly the most natural way to do this kind of thing in Python.
As above, I'm still not much worried about .choice(). Even if I were, I'd be happy to leave it at "use .choice() from a random.SystemRandom instance instead". Unless there's some non-obvious (to me) reason these authors appear to be unhappy with urandom.
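For concreteness, the one-line change alluded to above looks like this; the alphabet and token length are arbitrary illustrative choices, not anything taken from the paper.

    import random
    import string

    _sysrand = random.SystemRandom()      # backed by os.urandom; no reproducible state
    ALPHABET = string.ascii_letters + string.digits

    def make_token(length=40):
        # Same idiomatic .choice() pattern, but drawing from the OS CSPRNG.
        return "".join(_sysrand.choice(ALPHABET) for _ in range(length))

    token = make_token()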
_I_ would use SystemRandom. But, as you can tell, I'm extremely paranoid ;-)
Have you released software used by millions of people? If not, you have no idea how ticked off users get by needing to change anything. But Guido does ;-) Why not add a new "saferandom" module and leave it at that? Encourage newbies to use it. Nobody's old code ever breaks. But nobody's old code is saved from problems it likely didn't have anyway ;-)
-n

On Mon, Sep 14, 2015 at 10:49 PM, Tim Peters <tim.peters@gmail.com> wrote:
Who knows why they wrote the code in that exact way. /dev/urandom is fine.
This led me to look at the implementation of Python's choice(), and it's interesting; I hadn't realized that it was using such an inefficient method. (To make a random selection among, say, 36 items, it rounds up to 64 = 2**6, draws a 32-bit sample from MT, discards 26 of the bits (!) to get a number in the range 0-63, and then repeats until this number happens to fall in the 0-35 range, so it rejects with probability ~0.45. A more efficient algorithm is the one that it uses if getrandbits is not available, where it uses all 32 bits and only rejects with probability (2**32 % 36) / (2**32) = ~1e-9.) I guess this does add a bit of obfuscation. OTOH the amount of obfuscation is very sensitive to the size of the password alphabet. If I use uppercase + lowercase + digits, that gives me 62 options, so I only reject with probability 1/32, and I can expect that any given 40-character session key will contain zero skips with probability ~0.28, and that reveals 240 bits of seed. I don't have time right now to go look up the MT equations to see how easy it is to make use of such partial information, but there certainly are lots of real-world weaponized exploits that begin with something like "first, gather 10**8 session keys...". I certainly wouldn't trust it. Also, if I use the base64 or hex alphabets, then the probability of rejection is 0, and I can deterministically read off bits from the underlying MT state. (Alternatively, if someone in the future makes the obvious optimization to choice(), then it will basically stop rejecting in practice, and again it becomes trivial to read off all the bits from the underlying MT state.) The point of "secure by default" is that you don't have to spend all these paragraphs doing the math to try and guess whether some RNG usage might maybe be secure; it just is secure.
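A simplified sketch of the accept/reject step being described here (modeled on the logic of CPython's private Random._randbelow, not copied from it): round the range up to the next power of two, draw that many bits, and retry until the draw falls inside the range.

    import random

    def randbelow_sketch(n, getrandbits=random.getrandbits):
        # For n=36 this draws 6 bits (range 0-63) and rejects with probability 28/64.
        k = n.bit_length()            # bits needed to cover 0 .. n-1
        r = getrandbits(k)
        while r >= n:                 # reject out-of-range draws; avoids modulo bias
            r = getrandbits(k)
        return r

    index = randbelow_sketch(36)      # e.g. picking one of 36 password characters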
Your "wildly optimistic" estimate is wildly conservative under realistic conditions. How confident are you that the rest of your analysis is totally free of similar errors? Would you be willing to bet, say, the public revelation of every website you've visited in the last 5 years on it?
Grant money is a drop in the bucket of security research funding these days. Criminals and governments have very deep pockets, and it's well documented that there are quite a few people with PhDs who make their living by coming up with exploits and then auctioning them on the black market. BTW, it looks like that PHP paper was an undergraduate project. You don't need a PhD to solve linear equations :-).
No, SystemRandom.choice is certainly fine. But people clearly don't use it, so its fine-ness doesn't matter that much in practice... -n -- Nathaniel J. Smith -- http://vorpus.org

... [Nathaniel Smith <njs@pobox.com>]
Here you go: https://media.blackhat.com/bh-us-12/Briefings/Argyros/BH_US_12_Argyros_PRNG_...
[Tim, on appendix D]
[Nathaniel]
Who knows why they wrote the code in that exact way. /dev/urandom is fine.
Presumably _they_ know, yes? When they're presenting a "gold standard" all-purpose token generator, it would be nice if they had taken care to make every step clear to other crypto wonks. Otherwise the rest of us are left wondering if _anyone_ really knows what they're talking about ;-) [on the MT state solver]
Speed is irrelevant here, but do note that .choice() isn't restricted to picking from less than 2**32 possibilities.
>>> random.choice(range(2**62))
2693408174642551707
Special-casing is done at Python speed, and adding a Python-level branch to ask "is it less than 2**32?" is typically more expensive than calling MT again.
And note that the branch you like better _also_ needs another Python-speed test to raise an exception if the range is "too big". The branch actually used doesn't need that.
To be quite clear, nothing can ever be learned about "the seed". All that can be observed is bits from the outputs, and the goal is to deduce "the state". There is an invertible permutation mapping a state word to an output word, but from partial knowledge of N < 32 bits from a single output word you typically can't deduce N bits of the state word from which the output was derived (for example, if you only know the first bit of an output, you can't deduce from that alone the value of any bit in the corresponding state word). That's why they need such a hairy framework to begin with (in the example, you _can_ add equations tying the known value of the first output bit to linear functions of all bits of the state related to the first output bit). About the "240 bits", they need about 80 times more than that to deduce the state. 0.28 ** 80 is even less than a tenth ;-)
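To make the "invertible permutation mapping a state word to an output word" concrete, this is the published MT19937 tempering transform; every 32-bit state word passes through it on the way out, and each step can be undone, which is why seeing a full output word reveals a full state word (but partial bits don't map cleanly back to partial state bits).

    def temper(y):
        # MT19937's output tempering: an invertible scrambling of a 32-bit state word.
        y ^= y >> 11
        y ^= (y << 7) & 0x9D2C5680
        y ^= (y << 15) & 0xEFC60000
        y ^= y >> 18
        return y & 0xFFFFFFFF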
I don't have time right now to go look up the MT equations to see how easy it is to make use of such partial information,
It's robust against only knowing a subset of an output's bits (including none). It's not robust against not knowing _which_ output you're staring at. They label the state bits with variables x_0 through x_19936, and generate equations relating specific state bits to the output bits they can deduce. If they don't know how many times MT was invoked between outputs they know were generated, they can't know _which_ state-bit variables to plug into their equations. Is this output related to (e.g.) at least x_0 through x_31, or is it x_32 through x_63? x_64 through x_95? Take an absurd extreme to illustrate an obvious futility in general: suppose .choice() remembered the size of the last range it was asked to pick from. Then if the next call to .choice() is for the same-sized range, call MT 2**19937-2 times ignoring the outputs, and call it once more. It will get the same result then. "The solver" will deduce exactly the same output bits every time, and will never learn more than that. Eventually, if they're doing enough sanity checking, the best their solver could do is notice that the derived equations (regardless of how they construct them) are inconsistent. The worst it could do is "deduce" a state that's pure fantasy. They can't know that they are in fact seeing an output derived from exactly the same state every time. Unless they read the source code for .choice() and see it's playing this trick. In that case, they would never add to their collection of equations after the first output was dealt with. _Then_ the equations would faithfully reflect the truth: that they learned a tiny bit at first, but will never learn more than just that.
And I'm not asking you to. I wouldn't either. I'm expressing skepticism that the solver in this paper is a slam-dunk proof that all existing idiomatic Python password generators are about to cause the world to end ;-)
I readily agree that if .choice(x) is used with len(x) a power of two, then their solver applies directly to such cases. It's in all and only such cases that they can know exactly how many MT outputs were consumed, and so also which specific state-bit variables to use.
(Alternatively, if someone in the future makes the obvious optimization
As above, if what you like better were obviously faster in reality, it would have been written that way already ;-) Honest, this looks like Raymond Hettinger's code, and he typically obsesses over speed.
to choice(), then it will basically stop rejecting in practice, and again it becomes trivial to read off all the bits from the underlying MT state.)
Not trivial. You still need this hairy solver framework, and best I can tell its code wasn't made available. Note that the URL given in the paper gives a 404 error now. It isn't trivial to code it either.
No argument there! I'm just questioning how worried people "should be" over what actual Python code actually does now. I confess I remain free of outright panic ;-)
Your "wildly optimistic" estimate is wildly conservative under realistic conditions.
Eh.
I couldn't care less if that were revealed. In fact, I'd enjoy the trip down memory lane ;-)
Excellent points! Snideness doesn't always pay off for me ;-)
BTW, it looks like that PHP paper was an undergraduate project. You don't need a PhD to solve linear equations :-).
So give 'em their doctorates! I've seen doctoral theses a hundred times less substantial ;-)
It's just waiting for a real exploit. People writing security papers love to "name & shame". "Gotcha! Gotcha!" Once people see there _is_ "a real problem" (if there is), they'll scramble to avoid being the target of the next name-&-shame campaign. Before then, they're much too busy trying to erase all traces of every website they've visited in the last 5 years ;-)

Tim Peters writes:
Well, my worst possible case "in theory" was that a documented MTA parameter would simply not be implemented and not error when I configured it to a non-default value -- but that's how yours truly ended up running an open relay (Smail 3.1.100 I think it was, and I got it from Debian so it wasn't like I was using alpha code). That's what taught me to do functional tests. :-) So yes, I do think there are a lot of folks out there working with software without realizing that there are any risks involved. Life being life, I'd bet on some of them being programmers working with RNG.
But see, that's my main point. Analogies to *anybody's* personal life are irrelevant when we're talking about a bug that could be fixed *once* and save *millions* of users from being exploited. If the wonks are right, it's a big deal, big enough to balance the low probability of them being right. ;-)
Sure, but that's not what happened at AOL and Yahoo! AFAIK (of course they're pretty cagey about details). It seems that a single leak or a small number of leaks at each company exposed millions of address books. (I hasten to add that I doubt the Mersenne Twister had anything to do with the leaks.)
What I question is whether this has anything _plausible_ to do with Python's PRNG.
Me too. People who claim some expertise think so, though.
That's not unfair, but if they did, I'd go find myself another crypto wonk. But who cares about me? What matters is that Guido would, too.
Judging [the random module] by standards that didn't become trendy until much later is only fair now ;-)
You're not the only one who, when offered a choice between fair and fun, chooses the latter. ;-)
We can even give it a name shorter than "random" to encourage its use. That's all most users really care about anyway ;-)
That's beyond "unfair"!

On Thu, Sep 10, 2015 at 9:07 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
* [ ] DOC: note regarding the 'pseudo' part of pseudorandom and MT
  * https://docs.python.org/2/library/random.html
  * https://docs.python.org/3/library/random.html
* [ ] DOC: upgrade cryptography docs in re: random numbers
  * https://cryptography.io/en/latest/random-numbers/
* [ ] ENH: random_(algo=) (~IRandomSource)
* [ ] ENH: Add arc4random
* [ ] ENH: Add chacha
* [ ] ENH: Add OpenBSD's
* [ ] BUG,SEC: new secure default named random.random
  * must also be stateful / **reproducible** (must support .seed)
  * justified as BUG,SEC because: [secure by default is the answer]
  * https://en.wikipedia.org/wiki/Session_fixation
  * https://cwe.mitre.org/data/definitions/384.html
  * The docs did not say "you should know better."
  * see also: hash collisions: https://bugs.python.org/issue13703
* [ ] REF: random.random -> random.random_old

devguide/documenting.html#security-considerations-and-other-concerns https://docs.python.org/devguide/documenting.html#security-considerations-an... On Fri, Sep 11, 2015 at 11:05 AM, Wes Turner <wes.turner@gmail.com> wrote:

On Wed, Sep 9, 2015 at 4:23 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
The answer to both of these questions is no. For modern cryptographic PRNGs, full determinism is considered a flaw, and determinism is a necessary precondition to supporting jumpahead. The reason is that even if an attacker learns your secret RNG state at time t, then you want this to have a limited impact -- they'll obviously be able to predict your RNG output for a while, but you don't want them to be able to predict it from now until the end of time. So determinism is considered bad, and high-quality CPRNGs automatically reseed themselves with new entropy according to some carefully designed schedule. And OpenBSD's "arc4random" generator is a high-quality CPRNG in this sense. -- Nathaniel J. Smith -- http://vorpus.org

On Wed, Sep 9, 2015, at 19:23, Greg Ewing wrote:
Being able to produce multiple independent streams of numbers is the important feature. Doing it by "jumping ahead" seems less so. And the need for doing it "efficiently" isn't as clear either - how many streams do you need?

random832@fastmail.us wrote:
Being able to produce multiple independent streams of numbers is the important feature. Doing it by "jumping ahead" seems less so.
Doing it by jumping ahead isn't strictly necessary; the important thing is to have some way of generating *provably* non-overlapping and independent sequences. Jumping ahead is one obvious way to achieve that. Simply setting the seed of each generator randomly and hoping for the best is not really good enough.
And the need for doing it "efficiently" isn't as clear either
I say that because you can obviously jump ahead N steps in any generator just by running it for N cycles, but that's likely to be unacceptably slow. A more direct way of getting there is desirable. -- Greg

Greg Ewing writes:
By definition you don't have (stochastic) independence if you're using a PRNG and deterministically jumping ahead. Proving non-overlapping is easy, but I don't even have a definition of "independence" of fixed sequences: equidistribution of pairs? That might make sense if you have a sequence long enough to contain all pairs, but even then you really just have a single sequence with larger support, and I don't see how you can prove that it's a "good" sequence for using in a simulation.
It is not at all obvious to me that jumping ahead is better than randomly seeding separate generators. The latter actually gives stochastic independence (at least if you randomize over all possible seeds).

On Wed, Sep 9, 2015 at 2:07 PM, Steven D'Aprano <steve@pearwood.info> wrote:
This is a good point. Let's remove the ssl library from Python too. Until recently, the most widely used versions of Python were all woefully behind the times and anyone wanting anything relatively up-to-date had to use a third party library. Even so, if you count the distributions of RHEL and other "Long Term Support" operating systems that are running Python 2.7 (pre 2.7.9) and below, most people are operating with barely secure versions of OpenSSL on versions of Python that don't have constants for modern secure communications standards. Clearly, deciding to add the ssl module was a huge mistake because it wasn't forwards compatible with future security standards.

Tim Peters <tim.peters@...> writes:
My intuition is that if someone just uses a random() function without checking if it's cryptographically secure then the application will probably have other holes as well. I mean, for example no one is going to use C's rand() function for crypto. Stefan Krah

On Wed, Sep 9, 2015, at 14:00, Stefan Krah wrote:
Let's turn the question around - what's the _benefit_ of having a random number generator available that _isn't_ cryptographically secure? One possible argument is performance. If that's the issue - what are our performance targets? How can they be measured? Another argument is that some applications really do need deterministic seeding. Is there a reason not to require them to be explicit about it?
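Being explicit is already cheap today: code that needs deterministic sequences can construct its own generator instance instead of relying on the module-level hidden state. A minimal sketch (the seed value is arbitrary):

    import random

    rng = random.Random(12345)                    # explicit, reproducible stream
    samples = [rng.random() for _ in range(5)]

    rng2 = random.Random(12345)                   # same seed -> same sequence
    assert [rng2.random() for _ in range(5)] == samples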

<random832@...> writes:
As you say, performance: http://www.pcg-random.org/rng-performance.html Random number generation is a very broad field. I'm not a specialist, so I just entered "Mersenne Twister" into an academic search engine and got many results, but none for arc4random. It's an interesting question you ask. I'd have to do a lot of reading first to get an overview. Stefan Krah

<random832@...> writes:
I know chacha (and most of djb's other works). I thought we were talking about the suitability of cryptographically secure RNGs for traditional scientific applications, in particular whether there are *other* reasons apart from performance not to use them. Stefan Krah

[Stefan Krah <skrah@bytereef.org>]
I read it the same way on all counts.
Yes, if they're not checking the random() docs first, they're a total crypto moron - in which case it's insane to believe they'll do anything else related to crypto-strength requirements right either. It's hard to make something idiot-proof even if your target audience is bona fide crypto experts ;-)

On 09.09.15 19:35, Guido van Rossum wrote:
Entropy is a limited and slowly recoverable resource (especially if there is no network activity). If you consume it too quickly (for example in a scientific simulation or in a game), it will not have time to recover, and that will slow down not only your program, but all consumers of entropy. The use of random.SystemRandom by default looks dangerous. It is unlikely that all existing programs will be rewritten to use random.FastInsecureRandom.

On September 9, 2015 at 1:19:34 PM, Serhiy Storchaka (storchaka@gmail.com) wrote:
This isn't exactly true. Hardware entropy is limited and slowly recovering, which is why no sane implementation uses it directly except to periodically reseed a CSPRNG, which is typically based on ARC4 or ChaCha. The standard CSPRNGs that most platforms use are fast enough for most people's use cases. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On Wed, Sep 9, 2015, at 13:18, Serhiy Storchaka wrote:
http://www.2uo.de/myths-about-urandom/ should be required reading. As far as I know, no-one is actually proposing the use of a method that blocks when there's "not enough entropy", nor does arc4random itself appear to do so.

The commit message changing libc's random functions to use arc4random is as follows:
Perhaps someone could ask them for information about that audit, and how many (and which) of those pieces of software were actually using these functions in ways which made them insecure, but whose security would be notably improved by a better random implementation (I suspect the main thrust of the audit, though, was on finding which ones would be broken by taking away the default deterministic seeding). That could tell us how typical it is for people to ignorantly use default random functions for security-critical code with no other flaws.

In a word - no. There is zero reason for people doing crypto to use the random module, therefore we should not change the random module to be cryptographically secure. Don't break things and slow my code down by default for dubious reasons, please. On 9/9/2015 12:35, Guido van Rossum wrote:

[Alexander Walters <tritium-list@sdamon.com>]
Would your answer change if a crypto generator were _faster_ than MT? MT isn't speedy by modern standards, and is cache-hostile (about 2500 bytes of mutable state). Not claiming a crypto hash _would_ be faster. But it is possible.

On Wed, Sep 09, 2015 at 08:55:06PM -0500, Tim Peters wrote:
If the crypto PRNG were comparable in speed to what we have now (not significantly slower), or faster, *and* gave reproducible results with the same seed, *and* had no known/detectable statistical biases, and we could promise that those properties would continue to hold even when the state of the art changed and we got a new default crypto PRNG, then I'd still be -0.5 on the change due to the "false sense of security" factor. As I've already mentioned in another comment, I'm with Paul Moore -- I think anyone foolish/ignorant/lazy/malicious enough to use the default PRNG for crypto is surely making more than one mistake, and fixing that one thing for them will just give people a false sense of security. -- Steve

On 9/9/2015 22:11, Steven D'Aprano wrote: > ... source of cryptographic data, since... Let's be frank about this: Guido is not a security expert. I am not a security expert. Tim, I suspect you are not a security expert. Let's leave actually attempting to be at the cutting edge of cryptographic randomness to modules by security experts. I have far too much use for randomness outside of a cryptographic context to sacrifice the API and feature set we have for, in my opinion, a myopic focus on one, already discouraged, use of the random module.

On 10/09/15 03:55, Tim Peters wrote:
Speed is not the main concern. MT19937 is not very fast, but it is very accurate. It is used in scientific computing when we want to simulate sampling from a given distribution as accurately as possible. Its strength is in the distribution of the numbers it generates, not in its security or speed. MT19937 allows us to produce a very precise simulation of a stochastic process. The alternatives cannot compare in numerical quality, though they might be faster or more secure, or both. When we use MT19937 in scientific computing we deliberately sacrifice speed for accuracy. A crypto hash might be faster, but will it be more accurate? Accuracy here means how well the generated sequence emulates sampling from a perfect uniform distribution. MT19937 does not have any real competition in this game. Sturla
participants (28):
- Alexander Walters
- Brett Cannon
- Chris Angelico
- Cory Benfield
- Donald Stufft
- Eric Snow
- Greg Ewing
- Guido van Rossum
- Ian Cordasco
- Jeremy Sanders
- João Bernardo
- M.-A. Lemburg
- Nathaniel Smith
- Nick Coghlan
- Nikolaus Rath
- Paul Moore
- Petr Viktorin
- random832@fastmail.us
- Robert Kern
- Serhiy Storchaka
- Stefan Krah
- Stephen J. Turnbull
- Steven D'Aprano
- Sturla Molden
- Sven R. Kunze
- Tim Peters
- Wes Turner
- Xavier Combelle