[Python-ideas] Should our default random number generator be secure?

Tue Sep 15 10:42:26 CEST 2015

On Mon, Sep 14, 2015 at 10:49 PM, Tim Peters <tim.peters at gmail.com> wrote:
> [Tim]
>>> And yet nobody so far has a produced a single example of any harm done
>>> in any of the near-countless languages that supply non-crypto RNGs.  I
>>> know, my lawyer gets annoyed too when I point out there hasn't been a
>>> nuclear war ;-)
>
> [Nathaniel Smith <njs at pobox.com>]
>> Here you go:
>>   https://media.blackhat.com/bh-us-12/Briefings/Argyros/BH_US_12_Argyros_PRNG_WP.pdf
>
> The most important question I have about that is from its Appendix D,
> where they try to give a secure "all-purpose" token generator.  If
> openssl_random_pseudo_bytes is available, they use just that and call
> it done.  Otherwise they go on to all sorts of stuff.
>
> My question is, even if /dev/urandom is available, they're _not_
> content to use that.alone.  They continue to mix it up with all other
> kinds of silly stuff.  So why do they trust urandom less than
> OpenSLL's gimmick?  It's important to me because, far as I'm
> concerned, os.urandom() is already Python's way to spell "crypto
> random" (yes, it would be better to have one guaranteed to be
> available on all Python platforms).

Who knows why they wrote the code in that exact way. /dev/urandom is fine.

>> They present real-world attacks on PHP applications that use something
>> like the "password generation" code we've been talking about as a way
>> to generate cookies and password reset nonces, including in particular
>> the case of applications that use a strongly-seeded Mersenne Twister
>> as their RNG:
>
> I couldn't help but note ;-) that at least 3 of the apps had already
> attempted to repair bug reports filed against their insecure
> password-reset schemes at least 5 years ago.  You can lead a PHP'er to
> water, but ... ;-)
>
>> ...
>> "Section 5.3: ... In this section we give a description of the
>> Mersenne Twister generator and present an algorithm that allows the
>> recovery of the internal state of the generator even when the output
>> is truncated. Our algorithm also works in the presence of non
>> consecutive outputs ..."
>
> It's cute, but I doubt anyone but the authors had the patience - or
> knowledge - to write a solver dealing with tens of thousands of picky
> equations over about 20000 variables.  They don't explain enough about
> the details for a script kiddie to do it.  Even very bright hackers
> attack what's easiest to topple, like poor seeding - that's just easy
> brute force.
>
> It remained unclear to me _exactly_ what "in the presence of non
> consecutive outputs" is supposed to mean.  In the only examples, they
> knew exactly how many times MT was called.  "Non consecutive" in all
> those contexts appeared to mean "but we couldn't observe_any_ output
> bits in some cases - the ones we could know something about were
> sometimes non-consecutive".  So in the MT output sequence, they had no
> knowledge of _some_ of the outputs, but they nevertheless knew exactly
> _which_ of the outputs they were wholly ignorant about.
>
> That's no problem for the equation solver.  They just skip adding any
> equations for the affected bits, keep collecting more outputs and
> potentially "wrap around", probably leading to an overdetermined
> system in the end.
>
> But Python doesn't work the way PHP does here.  As explained in
> another message, in Python you can have _no idea_ how many MT outputs
> are consumed by a single .choice() call.  In the PHP equivalent, you
> always consume exactly one MT output.  PHP's method suffers
> statistical bias, but under the covers Python uses an accept/reject
> method to avoid that.  Any number of MT outputs may be (invisibly!)
> consumed before "accept" is reached, although typically only one or
> two.  You can deduce some of the leading MT output bits from the
> .choice() result, but _only_ for the single MT output .choice()
> reveals anything about.  About the other MT outputs it may consume,
> you can't even know that some _were_ skipped over, let alone how many.

This led me to look at the implementation of Python's choice(), and
it's interesting; I hadn't realized that it was using such an
inefficient method. (To make a random selection between, say, 36
items, it rounds up to 64 = 2**6, draws a 32-bit sample from MT,
discards 26 of the bits (!) to get a number between 0-63, and then
repeats until this number happens to fall in the 0-35 range, so it
rejects with probability ~0.45. A more efficient algorithm is the one
that it uses if getrandbits is not available, where it uses all 32
bits and only rejects with probability (2**32 % 36) / (2**32) =
~1e-9.) I guess this does add a bit of obfuscation.

OTOH the amount of obfuscation is very sensitive to the size of the
password alphabet. If I use uppercase + lowercase + digits, that gives
me 62 options, so I only reject with probability 1/32, and I can
expect that any given 40-character session key will contain zero skips
with probability ~0.28, and that reveals 240 bits of seed.

I don't have time right now to go look up the MT equations to see how
easy it is to make use of such partial information, but there
certainly are lots of real-world weaponized exploits that begin with
something like "first, gather 10**8 session keys...". I certainly
wouldn't trust it.

Also, if I use the base64 or hex alphabets, then the probability of
rejection is 0, and I can deterministically read off bits from the
underlying MT state. (Alternatively, if someone in the future makes
the obvious optimization to choice(), then it will basically stop
rejecting in practice, and again it becomes trivial to read off all
the bits from the underlying MT state.)

The point of "secure by default" is that you don't have to spend all
these paragraphs doing the math to try and guess whether some RNG
usage might maybe be secure; it just is secure.

> Best I can tell, that makes a huge difference to whether their solver
> is even applicable to cracking idiomatic "password generators" in
> Python.  You can't know which variables correspond to the bits you can
> deduce.  You could split the solver into multiple instances to cover
> all feasible possibilities (for how many MT outputs may have been
> invisibly consumed), but the number of solver instances needed then
> grows exponentially with the number of outputs you do see something
> about.  In the worst case (31 bits are truncated), they need over
> 19000 outputs to deduce the state.  Even a wildly optimistic "well,
> let's guess no more than 1 MT output was invisibly rejected each time"
> leads to over 2**19000 solver clones then.

Your "wildly optimistic" estimate is wildly conservative under
realistic conditions. How confident are you that the rest of your
analysis is totally free of similar errors? Would you willing to bet,
say, the public revelation of every website you've visited in the last
5 years on it?

> Sure, there's doubtless a far cleverer way to approach that.  But
> unless another group of PhDs looking to level up in Security World
> burns their grant money to tackle it, that's yet another attack that
> will never be seen in the real world ;-)

Grant money is a drop in the bucket of security research funding these
days. Criminals and governments have very deep pockets, and it's well
documented that there are quite a few people with PhDs who make their
living by coming up with exploits and then auctioning them on the
black market.

BTW, it looks like that PHP paper was an undergraduate project. You
don't need a PhD to solve linear equations :-).

>> Out of curiosity, I tried searching github for "random cookie
>> language:python". The 5th hit (out of ~100k) was a web project that
>> appears to use this insecure method to generate session cookies:
>>   https://github.com/bytasv/bbapi/blob/34e294becb22bae6e685f2e742b7ffdb53a83bcb/bbapi/utils/cookie.py
>>   https://github.com/bytasv/bbapi/blob/34e294becb22bae6e685f2e742b7ffdb53a83bcb/bbapi/api/router.py#L56-L66
>
> And they all use .choice(), which is overwhelmingly the most natural
> way to do this kind of thing in Python.
>
>> ...
>> There's a reason security people are so Manichean about these kinds of
>> things. If something is not intended to be secure or used in
>> security-sensitive ways, then fine, no worries. But if it is, then
>> there's no point in trying to mess around with "probably mostly
>> secure" -- either solve the problem right or don't bother. (See also:
>> the time Python wasted trying to solve hash randomization without
>> actually solving hash randomization [1].) If Tim Peters can get fooled
>> into thinking something like using MT to generate session ids is
>> "probably mostly secure", then what chance do the rest of us have?
>> <wink>
>
> As above, I'm still not much worried about .choice().  Even if I were,
> I'd be happy to leave it at "use .choice() from a random.SystemRandom
> instance instead".  Unless there's some non-obvious (to me) reason
> these authors appear to be unhappy with urandom.

No, SystemRandom.choice is certainly fine. But people clearly don't
use it, so it's fine-ness doesn't matter that much in practice...

-n

-- 
Nathaniel J. Smith -- http://vorpus.org