[Python-ideas] Should our default random number generator be secure?

Mon Sep 14 07:29:52 CEST 2015

...

[Tim]
>> I'm not a security wonk, but I'll bet a life's salary ;-) we'd have
>> botnets just as pervasive if every non-crypto RNG in the world were
>> banned - or had never existed.

[Stephen J. Turnbull <stephen at xemacs.org>]
> I am in violent agreement with you on that point and most others.[1]
> However, the analogy was not intended to be so direct as to imply that
> "insecure" RNGs are responsible for botnets, merely that not caring
> is.  I agree I twisted MAL's words a bit -- he meant most people have
> no technical need for crypto, and so don't care, I suppose.  But then,
> "doing crypto" (== security) is like "speaking prose": a lot of folks
> doing it don't realize that's what they're doing -- and they don't
> care, either.

I don't know that it's true, though.  Crypto wonks are like lawyers
that way, always worrying about the worst possible case "in theory".
In my personal life, I've had to tell lawyers "enough already - I'm
not paying another N thousand dollars to insert another page about
what happens in case of nuclear war".  Crypto wonks have no limit on
the costs they'd like to impose either - a Security State never does.

>> So long as end users are allowed to run programs, that problem will
>> never go away.

> s/users/programmers/ and s/run/write/, and we get a different analogy
> that is literally correct -- but fails in an important dimension.  One
> user's mistake adds one node to the botnet, and that's about as bad as
> one user's mistake gets in terms of harm to third parties.

Not really.  The best social engineering is for a bot to rummage
through your email address book and send copies of itself to people
you know, appearing to be a thoroughly legitimate email from you.  Add
a teaser to invite the recipient to click on the attachment, and
response rate can be terrific.

And my ISP (like many others) happens to provide a free
industrial-strength virus/malware scanner/cleaner program.  I doubt
that's because they actually care about me ;-)  Seems more likely they
don't want to pay the costs of hosting millions of active bots.

> But one programmer's (or system administrator's) mistake can put many,
> perhaps millions, at risk.

What I question is whether this has anything _plausible_ to do with
Python's PRNG.

> Personally I doubt that justifies an API break here, even if we can
> come up with attacks where breaking the PRNG would be cost-effective
> compared to "social engineering" or theft of physical media.  I think
> it does justify putting quite a bit of thought into ways to make it
> easier for naive programmers to do the "safe" thing even if they
> technically don't need it.

I remain unclear on what "the danger" is thought to be, such that
replacing with a CSPRNG could plausibly prevent it.  For example, I
know how to deduce the MT's internal state from outputs (and posted
working code for doing so, requiring a small fraction of a second
given 624 consecutive 32-bit outputs).  But it's not an easy problem
_unless_ you have 624 consecutive 32-bit outputs.  It's far beyond the
ken of script kiddies.  If it's the NSA, they can demand you turn over
everything anyway, or just plain steal it ;-)

Consider one of these "password" examples:

    import string, random
    alphabet = string.ascii_letters + string.digits
    print(random.choice(alphabet))

Suppose that prints 'c'.  What have you learned?  Surprisingly,
perhaps, very little.  You learned that one 32-bit output of MT had
0b000010 as its first 6 bits.  You know nothing about its other 26
bits.  And you don't know _which_ MT 32-bit output:  internally,
.choice() consumes as many 32-bit outputs as it needs until it finds
one whose first six bits are less than 62 (len(alphabet)).  So all
you've learned about MT is that, at the time .choice() was called:

   - 0 or more 32-bit outputs x were such that (x >> 26) >= 62.
   - Then one 32-bit output x had (x >> 26) == 2.

This isn't much to go on.  To deduce the whole state easily, you need
to know 19,968 consecutive output bits (624*32).  Get more and more
letters from the password generator, and you learn more and more about
the first 6 bits of an unknowable number of MT outputs consumed, but
learn nothing whatsoever about any of the lower 26 bits of any of
them, and learn nothing but a range constraint on the first 6 bits of
an unknowable number of outputs that were skipped over.

Sure, every clue reveals _something_.  In theory ;-)  Note that, as
explained earlier in these messages, Python's default _seeding_ of MT
is already computationally infeasible to attack (urandom() is already
used to set the entire massive internal state).  _That_ I did worry
about in the past.  So in the above I'm not worried at all that an
attacker exploited poor default seeding to know there were only a few
billion possible MT states _before_ `c` was generated.  All MT states
are possible, and MT's state space is large beyond comprehension (let
alone calculation).

Would the user _really_ be better off using .urandom()?  I don't know.
Since a crypto wonk will rarely recommend doing anything _other_ than
using urandom() directly, I bet they'd discourage using .choice() at
all, even if it is built on urandom().  Then the user will write their
own implementation of .choice(), something like:

    u = urandom(n) # for some n
    letter = alphabet[int(u / 2.0**(8*n) * len(alphabet))]

If they manage to get that much right, _now_ they've introduced a
statistical bias unless len(alphabet) is a power of 2.  If we're
assuming they're crypto morons, chances are good they're not rock
stars at this kind of thing either ;-)

> I will say that IMO the now-traditional API was a very unfortunate
> choice.

Ah, but I remember 1990 ;-)  Python's `random` API was far richer than
"the norm" in languages like C and FORTRAN at the time.  It was a
delight!  Judging it by standards that didn't become trendy until much
later is only fair now ;-)

> If you have a CSPRNG that just generates "uniform random
> numbers" and has no user-visible APIs for getting or setting state,
> it's immediately obvious to the people who know they need access to
> state what they need to do -- change "RNG" implementation.

I don't recall any language _at the time_ that did so.

> The most it might cost them is rerunning an expensive base
> case simulation with a more appropriate implementation that
> provides the needed APIs.

As above, I'm not sure a real crypto wonk would endorse a module that
provided any more than a bare-bones API, forcing the user to build
everything else out of one or two core primitives.  Look at, e.g., how
tiny the OpenBSD arc4random API is.  I do applaud it for offering

     arc4random_uniform(uint32_t upper_bound)

That's exactly what's needed to, e.g., build a bias-free .choice()
(provided you have fewer than 2**32-1 elements to choose from).

> On the other hand, if you have something like the MT that "shouldn't
> be allowed anywhere near a password",

As above, I think that's a weak claim.  The details matter.  As an
example of a strong claim: you should never, ever use MT to produce
the keystream for a stream cipher.  But only a crypto wonk would be
trying to generate a keystream to begin with.  Or a user who did use
MT for that purpose is probably so clueless they'd forget the xor and
just send the plaintext ;-)

> it's easy to ignore the state access APIs, and call it the same way
> that you would call a CSPRNG. In fact that's what's documented as
> correct usage, as Paul Moore points out.  Thus, programmers who
> are using a PRNG whose parameters can be inferred from its output,
> and should not be doing so, generally won't know it until the
> (potentially widespread) harm is done.  It would be nice if it wasn't
> so easy for them to use the MT.

And yet nobody so far has a produced a single example of any harm done
in any of the near-countless languages that supply non-crypto RNGs.  I
know, my lawyer gets annoyed too when I point out there hasn't been a
nuclear war ;-)

Anyway, if people want to pursue this, I suggest adding a _new_ module
doing exactly whatever it is certified crypto experts say is
necessary.  We can even give it a name shorter than "random" to
encourage its use.  That's all most users really care about anyway ;-)

> Footnotes:
> [1]  I think "agree with Tim" is a pretty safe default.<wink/>

It's not always optimal, but, yes, you could do worse ;-)