[Python-ideas] Should our default random number generator be secure?

Tue Sep 15 07:49:15 CEST 2015

[Tim]
>> And yet nobody so far has a produced a single example of any harm done
>> in any of the near-countless languages that supply non-crypto RNGs.  I
>> know, my lawyer gets annoyed too when I point out there hasn't been a
>> nuclear war ;-)

[Nathaniel Smith <njs at pobox.com>]
> Here you go:
>   https://media.blackhat.com/bh-us-12/Briefings/Argyros/BH_US_12_Argyros_PRNG_WP.pdf

The most important question I have about that is from its Appendix D,
where they try to give a secure "all-purpose" token generator.  If
openssl_random_pseudo_bytes is available, they use just that and call
it done.  Otherwise they go on to all sorts of stuff.

My question is, even if /dev/urandom is available, they're _not_
content to use that.alone.  They continue to mix it up with all other
kinds of silly stuff.  So why do they trust urandom less than
OpenSLL's gimmick?  It's important to me because, far as I'm
concerned, os.urandom() is already Python's way to spell "crypto
random" (yes, it would be better to have one guaranteed to be
available on all Python platforms).

> They present real-world attacks on PHP applications that use something
> like the "password generation" code we've been talking about as a way
> to generate cookies and password reset nonces, including in particular
> the case of applications that use a strongly-seeded Mersenne Twister
> as their RNG:

I couldn't help but note ;-) that at least 3 of the apps had already
attempted to repair bug reports filed against their insecure
password-reset schemes at least 5 years ago.  You can lead a PHP'er to
water, but ... ;-)

> ...
> "Section 5.3: ... In this section we give a description of the
> Mersenne Twister generator and present an algorithm that allows the
> recovery of the internal state of the generator even when the output
> is truncated. Our algorithm also works in the presence of non
> consecutive outputs ..."

It's cute, but I doubt anyone but the authors had the patience - or
knowledge - to write a solver dealing with tens of thousands of picky
equations over about 20000 variables.  They don't explain enough about
the details for a script kiddie to do it.  Even very bright hackers
attack what's easiest to topple, like poor seeding - that's just easy
brute force.

It remained unclear to me _exactly_ what "in the presence of non
consecutive outputs" is supposed to mean.  In the only examples, they
knew exactly how many times MT was called.  "Non consecutive" in all
those contexts appeared to mean "but we couldn't observe_any_ output
bits in some cases - the ones we could know something about were
sometimes non-consecutive".  So in the MT output sequence, they had no
knowledge of _some_ of the outputs, but they nevertheless knew exactly
_which_ of the outputs they were wholly ignorant about.

That's no problem for the equation solver.  They just skip adding any
equations for the affected bits, keep collecting more outputs and
potentially "wrap around", probably leading to an overdetermined
system in the end.

But Python doesn't work the way PHP does here.  As explained in
another message, in Python you can have _no idea_ how many MT outputs
are consumed by a single .choice() call.  In the PHP equivalent, you
always consume exactly one MT output.  PHP's method suffers
statistical bias, but under the covers Python uses an accept/reject
method to avoid that.  Any number of MT outputs may be (invisibly!)
consumed before "accept" is reached, although typically only one or
two.  You can deduce some of the leading MT output bits from the
.choice() result, but _only_ for the single MT output .choice()
reveals anything about.  About the other MT outputs it may consume,
you can't even know that some _were_ skipped over, let alone how many.

Best I can tell, that makes a huge difference to whether their solver
is even applicable to cracking idiomatic "password generators" in
Python.  You can't know which variables correspond to the bits you can
deduce.  You could split the solver into multiple instances to cover
all feasible possibilities (for how many MT outputs may have been
invisibly consumed), but the number of solver instances needed then
grows exponentially with the number of outputs you do see something
about.  In the worst case (31 bits are truncated), they need over
19000 outputs to deduce the state.  Even a wildly optimistic "well,
let's guess no more than 1 MT output was invisibly rejected each time"
leads to over 2**19000 solver clones then.

Sure, there's doubtless a far cleverer way to approach that.  But
unless another group of PhDs looking to level up in Security World
burns their grant money to tackle it, that's yet another attack that
will never be seen in the real world ;-)

> Out of curiosity, I tried searching github for "random cookie
> language:python". The 5th hit (out of ~100k) was a web project that
> appears to use this insecure method to generate session cookies:
>   https://github.com/bytasv/bbapi/blob/34e294becb22bae6e685f2e742b7ffdb53a83bcb/bbapi/utils/cookie.py
>   https://github.com/bytasv/bbapi/blob/34e294becb22bae6e685f2e742b7ffdb53a83bcb/bbapi/api/router.py#L56-L66

And they all use .choice(), which is overwhelmingly the most natural
way to do this kind of thing in Python.

> ...
> There's a reason security people are so Manichean about these kinds of
> things. If something is not intended to be secure or used in
> security-sensitive ways, then fine, no worries. But if it is, then
> there's no point in trying to mess around with "probably mostly
> secure" -- either solve the problem right or don't bother. (See also:
> the time Python wasted trying to solve hash randomization without
> actually solving hash randomization [1].) If Tim Peters can get fooled
> into thinking something like using MT to generate session ids is
> "probably mostly secure", then what chance do the rest of us have?
> <wink>

As above, I'm still not much worried about .choice().  Even if I were,
I'd be happy to leave it at "use .choice() from a random.SystemRandom
instance instead".  Unless there's some non-obvious (to me) reason
these authors appear to be unhappy with urandom.

> NB this isn't an argument for *whether* we should make random
> cryptographically strong by default; it's just an argument against
> wasting time debating whether it's already "secure enough". It's not
> secure. Maybe that's okay, maybe it's not.

_I_ would use SystemRandom.  But, as you can tell, I'm extremely paranoid ;-)

> For the record though I do tend to agree with the idea that it's not
> okay, because it's an increasingly hostile world out there, and
> secure-random-by-default makes whole classes of these issues just
> disappear. It's not often that you get to fix thousands of bugs with
> one commit, including at least some with severity level "all your
> users' private data just got uploaded to bittorrent".
>
> I like Nick's proposal here:
>     https://code.activestate.com/lists/python-ideas/35842/
> as probably the most solid strategy for implementing that idea -- the
> only projects that would be negatively affected are those that are
> using the seeding functionality of the global random API, which is a
> tiny fraction, and the effect on those projects is that they get
> nudged into using the superior object-oriented API.

Have you released software used by millions of people?  If not, you
have no idea how ticked off users get by needing to change anything.
But Guido does ;-)

Why not add a new "saferandom" module and leave it at that?  Encourage
newbies to use it.  Nobody's old code ever breaks.  But nobody's old
code is saved from problems it likely didn't have anyway ;-)

> -n
>
> [1] https://lwn.net/Articles/574761/