Python's Source of Randomness and the random.py module Redux

Ok, I reached out to Theo de Raadt to talk to him about what he was suggesting without Guido having to play messenger and forward fragments of the email conversation. I'm starting a new thread because this email is rather long, and I'm hoping to divorce it a bit from the back and forth in the other thread about a proposal that wasn't exactly what Theo was suggesting.

Essentially, there are three basic types of uses of random (the concept, not the module):

1. People/use cases who absolutely need deterministic output given a seed and for whom security properties don't matter.
2. People/use cases who absolutely need cryptographically random output and for whom deterministic output is a downside.
3. People/use cases that fall somewhere in between, where the code may or may not be security sensitive, or it may not be known whether it is.

The people in group #1 are currently, in the Python standard library, best served by the MT random source, as it provides exactly the kind of determinism they need. The people in group #2 are currently, in the Python standard library, best served by os.urandom (either directly or via random.SystemRandom).

The third case is the one that Theo's suggestion is attempting to solve. In the current landscape, the security-minded folks will tell these people to use os.urandom/random.SystemRandom, and the performance-minded or otherwise less security-minded folks will likely tell them to just use random.py, leaving them with a source of randomness that is not cryptographically safe.

The question then is: does it matter whether the people in #3 are using a cryptographically safe source of randomness? The answer is obviously that we don't know, and it's possible that the user doesn't know either. In these cases it's typically best to default to the more secure option and expect people to opt in to insecurity. In the case of randomness, a lot of languages (Python included) don't do that; instead they pick the more performant option first, often with the argument (as seen in the other thread) that if people need a cryptographically secure source of randomness they'll know how to look for it, and if they don't know how to look for it, then they likely have some other security problem anyway.

I think (and I believe Theo thinks) this sort of thinking is short sighted. Take a web application: it's going to need session identifiers to put into a cookie. You'll want these to be random, and it's not obvious on the tin to a non-expert that you can't just use the module-level functions in the random module for this. Other examples are generating API keys or passwords. Looking on Google, the first result for "python random password" is a StackOverflow answer which suggests:

    ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(N))

It was later edited to also include, after that:

    ''.join(random.SystemRandom().choice(string.ascii_uppercase + string.digits) for _ in range(N))

So it wasn't obvious to the person who answered that question that the random module's module-scoped functions were not appropriate for this use. It appears that the original answer lasted roughly 4 years before it was corrected, so who knows how many people used it in those 4 years.
The second result has someone asking if there is a better way to generate a random password in Python than:

    import os, random, string

    length = 13
    chars = string.ascii_letters + string.digits + '!@#$%^&*()'
    random.seed = (os.urandom(1024))
    print ''.join(random.choice(chars) for i in range(length))

This person obviously knew that os.urandom existed and that he should use it, but failed to identify that the random module's module-scoped functions were not what he wanted to use here. The third result has this code:

    import string
    import random

    def randompassword():
        chars = string.ascii_uppercase + string.ascii_lowercase + string.digits
        size = 8
        return ''.join(random.choice(chars) for x in range(size, 12))

I'm not going to keep pasting snippets, but going through the results it is clear that in the bulk of cases this search turns up code snippets suggesting there is likely to be a lot of code out there that is unknowingly using the random module in a very insecure way. I think this is a failing of the random.py module to provide an API that guides users to be safe, one that was papered over by adding a warning to the documentation; but as has been said before, you can't solve a UX problem with documentation.

Then we come to why we might not want to provide a safe random by default for the folks in group #3. As we've seen in the other thread, this basically boils down to the fact that a lot of users don't care about the security properties; they just want a fast random-esque value. This case is made stronger by the fact that there is a lot of code out there using Python's random module in a completely safe way that would regress in a meaningful way if the random module slowed down.

The fact that speed is the primary reason not to give the people in #3 a cryptographically secure source of randomness by default is where we come back to the meat of Theo's suggestion. His claim is that invoking os.urandom through any of the interfaces imposes a performance penalty because it has to round trip through the kernel crypto subsystem for every request. His suggestion is essentially that we provide an interface to a modern, good, userland cryptographically secure source of randomness that runs within the same process as Python itself. One such example is the arc4random function (which doesn't actually provide ARC4 on OpenBSD, it provides ChaCha; it's not tied to one specific algorithm), which comes from libc on many platforms. According to Theo, modern userland CSPRNGs can create random bytes faster than memcpy, which eliminates the speed argument against making a CSPRNG the "default" source of randomness.

Thus the proposal is essentially:

* Provide an API to access a modern userland CSPRNG.
* Provide an implementation of random.SomeKindOfRandom that utilizes this.
* Move the MT-based implementation of the random module to random.DeterministicRandom.
* Deprecate the module-scoped functions, instructing people to use the new random.SomeKindOfRandom unless they need deterministic randomness, in which case use random.DeterministicRandom.

This can of course be tweaked one way or the other, but that's the general idea translated into something actionable for Python.
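As a rough illustration, user code under a scheme like this might look something like the following (DeterministicRandom is only the name proposed above, not an existing stdlib class; SystemRandom already exists today):

    import random

    # Hypothetical usage under the proposal; DeterministicRandom is the proposed
    # rename of the MT-backed class, not something you can import today.
    simulation_rng = random.DeterministicRandom(1234)  # explicit opt-in to seeded, reproducible output
    token_rng = random.SystemRandom()                   # explicit opt-in to OS-backed randomness

    print(simulation_rng.random())  # reproducible given the seed 1234
    print(token_rng.random())       # not reproducible; suitable when security might matter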
I'm not sure exactly how I feel about it, but I certainly do think that the current situation is confusing to end users and leaving them in an insecure state, and that at a minimum we should move MT to something like random.DeterministicRandom and deprecate the module-scoped functions, because it seems obvious to me that the idea of a "default" random function that isn't safe is a footgun for users.

As an additional consideration, there are security experts who believe that userland CSPRNGs should not be used at all. One of those is Thomas Ptacek, who wrote a blog post [1] on the subject. In it, Thomas makes the case that a userland CSPRNG pretty much always depends on the cryptographic security of the system random, but may itself be broken, which means you're adding a second single point of failure where a mistake can cause you to get non-random data out of the system. I asked Theo about this, and he stated that he disagreed with Thomas about never using a userland CSPRNG; in his opinion that blog post was mostly warning people away from using something like MT in userland and away from /dev/random (which is often the reason people reach for MT, because /dev/random blocks, which makes programs even slower).

It seems to boil down to the following questions:

* Do we want to try to protect users by default, or at least make it more obvious in the API which one they want to use? (I think yes.)
* If so, is /dev/urandom "fast enough" for most people in group #3?
* If not, do we agree with Theo that a modern userland CSPRNG is safe enough to use, or do we agree with Thomas that it's not?
* If we think it is safe enough, do we use arc4random, and what do we do on systems that don't have a modern userland CSPRNG in their libc?

[1] http://sockpuppet.org/blog/2014/02/25/safely-generate-random-numbers/

----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

Deprecating the module-level functions has one problem for backward compatibility: if you're using random across multiple modules, changing them all from this:

    import random
    ...

to this:

    from random import DeterministicRandom
    random = DeterministicRandom()
    ...

gives a separate MT for each module. You can work around that by, e.g., providing your own myrandom.py that does that and then using "from myrandom import random" everywhere, or by stashing a random_inst inside the random module or builtins or something and only creating it if it doesn't exist, etc., but all of these are things that people will rightly complain about.

One possible solution is to make DeterministicRandom a module instead of a class, and move all the module-level functions there, so people can just change their import to "from random import DeterministicRandom as random". (Or, alternatively, give it classmethods that create a singleton just like the module global.)

For people who decide they want to switch to SystemRandom, I don't think it's as much of a problem, as they probably won't care that they have a separate instance in each module. (And I don't think there's any security problem with using multiple instances, but I haven't thought it through...) So, the change is probably only needed in DeterministicRandom.

There are hopefully better solutions than that. But I think some solution is needed. People who have existing code (or textbooks, etc.) that do things the "wrong" way and get a DeprecationWarning should be able to easily figure out how to make their code correct.

Sent from my iPhone

Andrew Barnert via Python-ideas <python-ideas@python.org> writes:
Of course, this brings to mind the fact that there's *already* an instance stashed inside the random module. At that point, you might as well just keep the module-level functions, and rewrite them to be able to pick up on it if you replace _inst (perhaps suitably renamed as it would be a public variable) with an instance of a different class. Proof-of-concept implementation:

    class _method:
        def __init__(self, name):
            self.__name__ = name
        def __call__(self, *args, **kwargs):
            return getattr(_inst, self.__name__)(*args, **kwargs)
        def __repr__(self):
            return "<random method wrapper " + repr(self.__name__) + ">"

    _inst = Random()
    seed = _method('seed')
    random = _method('random')
    ...etc...

On Sep 9, 2015, at 18:25, Random832 <random832@fastmail.com> wrote:
The whole point is to make people using the top-level functions see a DeprecationWarning that leads them to make a choice between SystemRandom and DeterministicRandom. Just making _inst public (and dynamically switchable) doesn't do that, so it doesn't solve anything. However, it seems like there's a way to extend it to do that: First, rename Random to DeterministicRandom. Then, add a subclass called Random that raises a DeprecationWarning whenever its methods are called. Then preinitialize inst to Random(), just as we already do. Existing code will work, but with a warning. And the text of that warning, or the help it leads to, or the obvious google result or whatever, can just suggest "add random.inst = random.DeterministicRandom() or random.inst = random.SystemRandom() at the start of your program". That has most of the benefit of deprecating the top-level functions, without the cost of the solution being non-obvious (and the most obvious solution being wrong for some use cases). Of course it adds the cost of making the module slower, and also more complex. Maybe a better solution would be to add a random.set_default_instance function that replaced all of the top-level functions with bound methods of the instance (just like what's already done at startup in random.py)? That's simple, and doesn't slow down anything, and it seems like it makes it more clear what you're doing than setting random.inst.
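As a minimal sketch of what that could look like inside random.py (the set_default_instance name comes from the suggestion above; the exact set of re-bound methods is just illustrative):

    import sys

    def set_default_instance(inst):
        # Re-bind the module-level convenience functions to methods of the given
        # instance, mirroring what random.py already does once at import time.
        module = sys.modules[__name__]
        for name in ("seed", "random", "uniform", "randint", "randrange",
                     "choice", "shuffle", "sample", "getrandbits"):
            setattr(module, name, getattr(inst, name))

Porting old code would then be a one-liner at the top of the main module, e.g. random.set_default_instance(random.SystemRandom()).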

On Thu, Sep 10, 2015 at 11:50 AM, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
Of course it adds the cost of making the module slower, and also more complex. Maybe a better solution would be to add a random.set_default_instance function that replaced all of the top-level functions with bound methods of the instance (just like what's already done at startup in random.py)? That's simple, and doesn't slow down anything, and it seems like it makes it more clear what you're doing than setting random.inst.
+1. A single function call that replaces all the methods adds a minuscule constant to code size, run time, etc, and it's no less readable than assignment to a module attribute. (If anything, it makes it more clearly a supported operation - I've seen novices not realize that "module.xyz = foo" is valid, but nobody would misunderstand the validity of a function call.) ChrisA

On Sep 9, 2015, at 23:08, Chris Angelico <rosuav@gmail.com> wrote:
I was only half-serious about this, but now I think I like it: it provides exactly the fix people are hoping to get by deprecating the top-level functions, but with less risk, less user code churn, a smaller patch, and a much easier fix for novice users. (And it's much better than my earlier suggestion, too.) See https://gist.github.com/abarnert/e0fced7569e7d77f7464 for the patch, and a patched copy of random.py. The source comments in the patch should be enough to understand everything that's changed.

A couple of things:

I'm not sure the normal deprecation path makes sense here. For a couple of versions, everything continues to work (because most novices, the people we're trying to help, don't see DeprecationWarnings), and then suddenly their code breaks. Maybe making it a UserWarning makes more sense here?

I made Random a synonym for UnsafeRandom (the class that warns and then passes through to DeterministicRandom). But is that really necessary? Someone who's explicitly using an instance of class Random rather than the top-level functions probably isn't someone who needs this warning, right?

Also, if this is the way we'd want to go, the docs change would be a lot more substantial than the code change. I think the docs should be organized around choosing a random generator and using its methods, and only then mention set_default_instance as being useful for porting old code (and for making it easy for multiple modules to share a single generator, but that shouldn't be a common need for novices).

On Sep 10, 2015, at 01:32, Serhiy Storchaka <storchaka@gmail.com> wrote:
Well, the goal of the deprecation idea was to eventually get people to explicitly use instances, so the fact that it doesn't work out of the box is a good thing, not a problem. But for people just trying to retrofit existing code, all they have to do is call random.set_default_instance at the top of the main module, and all their other modules can just import what they need this way. Which is why it's better than straightforward deprecation.

On Thu, Sep 10, 2015 at 04:08:09PM +1000, Chris Angelico wrote:
Making monkey-patching the official, recommended way to choose a PRNG is a risky solution, to put it mildly. That means that at any time, some other module that is directly or indirectly imported might change the random number generators you are using without your knowledge. You want a crypto PRNG, but some module replaces it with MT. Or vice versa. Technically, it is true that (this being Python) they can do this now, just by assigning to the random module:

    random.random = lambda: 9

but that is clearly abusive, and if you write code to do that, you're asking for whatever trouble you get. There's no official API to screw over other callers of the random module behind their back. You're suggesting that we add one.
(If anything, it makes it more clearly a supported operation
Which is exactly why this is a terrible idea. You're making monkey- patching not only officially supported, but encouraged. That will not end well. -- Steve

On Sep 11, 2015, at 06:49, Steven D'Aprano <steve@pearwood.info> wrote:
But that's not the proposal. The proposal is to make explicitly passing around an instance the official, recommended way to choose a PRNG; monkey-patching is only the official, recommended way to quickly get legacy code working: once you see the warning about the potential problem and decide that the problem doesn't affect you, you write one standard line of code at the top of your main script instead of rewriting all of your modules and patching or updating every third-party module you use. As I said later, I think my later suggestion of just having a singleton DeterministicRandom instance (or even a submodule with the same interface) that you can explicitly import in place of random serves the same needs well enough, and is even simpler, and is more flexible (in particular, it can also be used for novices' "my first game" programs), so I'm no longer suggesting this. But that doesn't mean there's any benefit to mischaracterizing the suggestion (especially if Chris or anyone else still supports it even though I don't).

On September 9, 2015 at 8:01:17 PM, Donald Stufft (donald@stufft.io) wrote:
Ok, I've talked to an honest-to-god cryptographer as well as some other smart folks! Here's the general gist:

Using a userland CSPRNG like arc4random is not advisable for things that you absolutely need cryptographic security for (this is group #2 from my original email). These people should use os.urandom or random.SystemRandom, as they should be doing now. In addition, os.urandom or random.SystemRandom is probably fast enough for most uses of the random.py module, though it is true that using os.urandom/random.SystemRandom would be slower than MT.

It is reasonable to use a userland CSPRNG as a "default" source of randomness, or in cases where people care about speed but maybe not about security and don't need determinism. However, they've said that the primary benefit of using a userland CSPRNG for a faster cryptographically secure source of randomness is if we can make it the default source of randomness, as a "probably safe depending on your app" safety net for people who didn't read or understand the documentation. This would make most uses of random.random and friends secure, but not deterministic.

If we're unwilling to change the default, but we are willing to deprecate the module-scoped functions and force users to make a choice between random.SystemRandom and random.DeterministicRandom, then there is unlikely to be much benefit to also adding a userland CSPRNG into the mix, since there's no longer a class of people using an ambiguous "random" where we don't know whether they need it to be secure or deterministic/fast.

So I guess my suggestion would be: let's deprecate the module-scoped functions and rename random.Random to random.DeterministicRandom. This absolves us of needing to change the behavior of people's existing code (besides deprecating it), and we don't need to decide whether a userland CSPRNG is safe or not, while still moving us to a situation where users are far more likely to do the right thing.

----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On Thu, Sep 10, 2015 at 3:30 AM, Donald Stufft <donald@stufft.io> wrote: [...]
There is one use case that would be hit by that: the kid writing their first rock-paper-scissors game. A beginner who just learned the `if` statement isn't ready for a discussion of cryptography vs. reproducible results, and random.SystemRandom().random() would just become a magic incantation to learn. It would feel like requiring sys.stdout.write() instead of print(). Functions like paretovariate(), getstate(), or seed(), which require some understanding of (pseudo)randomness, can be moved to a specific class, but I don't think deprecating random(), randint(), randrange(), choice(), and shuffle() would be a good idea. Switching them to a cryptographically safe RNG is OK from this perspective, though.

On Sep 10, 2015, at 00:35, Petr Viktorin <encukou@gmail.com> wrote:
Silently switching them could break a lot of code. I don't think there's any way around making them warn the user that they need to do something. I think the patch I just sent is a good way of doing that: the minimum thing they need to do is a one-liner, which is explained in the warning, and it also gives them enough information to check the docs or google the message and get some understanding of the choice if they're at all inclined to do so. (And if they aren't, well, either one works for the use case you're talking about, so let them flip a coin, or call random.choice.;))

Can I just ask what is the actual problem we are trying to solve here? Python has third-party cryptography modules that bring their own sources of randomness (or cryptography libraries that do the same). Python has a good random library for everything other than cryptography. Why in the heck are we trying to make the random module do something for which it is already documented as being a poor choice, when there are already third-party modules that do just this? Who needs cryptographic randomness in the standard library anyway (even though one line of code gives you access to it)? Have we identified even ONE person who does cryptography in Python who is kicking themselves that they can't use the random module as implemented? Is this just indulging a paranoid developer?

On September 10, 2015 at 5:21:29 AM, Alexander Walters (tritium-list@sdamon.com) wrote:
Because there are situations where you need securely generated randomness even when you are *NOT* "doing cryptography". Blaming people for the fact that the random module has a bad UX that naturally leads them to use it when it isn't appropriate is a shitty thing to do. What harm is there in making people explicitly choose between deterministic randomness and secure randomness? Is your use case so much better than theirs that you think you deserve to type a few characters less, to the detriment of people who don't know any better?

----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On 9/10/2015 07:40, Donald Stufft wrote:
API Breakage. This is not worth the break in backwards compatibility. My use case is using the API that has been available for... 20 years? And for what benefit? None, and it can be argued that it would do the opposite of what is intended (false sense of security and all).

On Wed, Sep 09, 2015 at 08:01:16PM -0400, Donald Stufft wrote: [...]
You're worried about attacks on the random number generator that produces the characters in the password? I think I'm going to have to see an attack before I believe that this is meaningful. Excluding PRNGs that are hopelessly biased ("nine, nine, nine, nine...") or predictable, how does knowing the PRNG help in an attack? Here's a password I just generated using your "corrected" version using SystemRandom: 06XW0X0X (Honest, that's exactly what I got on my first try.) Here's one I generated using the "bad" code snippet: V6CFKCF2 How can you tell them apart, or attack one but not the other based on the PRNG?
Shouldn't it be using a single instance of SystemRandom rather than a new instance for each call? [...]
According to Theo, modern userland CSPRNGs can create random bytes faster than memcpy
That is an astonishing claim, and I'd want to see evidence for it before accepting it. -- Steve

Steven D'Aprano <steve@pearwood.info> writes:
Isn't the only difference between generating a password and generating a key the length (and base) of the string? Where is the line?
That is an astonishing claim, and I'd want to see evidence for it before accepting it.
I assume it's comparing a CSPRNG, all of whose state is in cache (or in registers, if a large block of random bytes is requested from the CSPRNG in one go), with memcpy of data which must be retrieved from main memory.

On 10 September 2015 at 01:01, Donald Stufft <donald@stufft.io> wrote:
Wrong. There is a fourth basic type. People (like me!) whose code absolutely doesn't have any security issues, but want a simple, convenient, fast RNG. Determinism is not an absolute requirement, but is very useful (for writing tests, maybe, or for offering a deterministic rerun option to the program). Simulation-style games often provide a way to find the "map seed", which allows users to share interesting maps - this is non-essential but a big quality-of-life benefit in such games. IMO, the current module perfectly serves this fourth group. While I accept your point that far too many people are using insecure RNGs in "generate a random password" scripts, they are *not* the core target audience of the default module-level functions in the random module (did you find any examples of insecure use that *weren't* password generators?). We should educate people that this is bad practice, not change the module. Also, while it may be imperfect, it's still better than what many people *actually* do, which is to use "password" as a password on sensitive systems :-( Maybe what Python *actually* needs is a good-quality "random password generator" module in the stdlib? (Semi-serious suggestion...) Paul

On September 10, 2015 at 4:41:56 AM, Paul Moore (p.f.moore@gmail.com) wrote:
This group is the same as #3 except for the map seed thing which is group #1. In particular, it wouldn’t hurt you if the random you were using was cryptographically secure as long as it was fast and if you needed determinism, it would hurt you to say so. Which is the point that Theo was making.
IMO, the current module perfectly serves this fourth group.
Making the user pick between Deterministic and Secure random would serve this purpose too, especially in a language where "In the face of ambiguity, refuse the temptation to guess" is one of the core tenets. The largest downside would be typing a few extra characters, and Python is not a language that attempts to do things in the fewest number of characters.
You cannot document your way out of a UX problem. The problem isn't people doing this once on the command line to generate a password; the problem is people doing it in applications where they generate an API key, a session identifier, or a random password which they then give to their users. If you give an attacker a way to observe the output of the MT-based random enough times, it can be used to determine what every value it generated was and will be. Here's a game a friend of mine created where the purpose is essentially to unrandomize some random data, which is only possible because it's (purposely) using MT: https://github.com/reaperhulk/dsa-ctf. This is not an ivory-tower paranoia case; it's a real concern that will absolutely fix some insecure software out there, instead of telling them "welp, typing a little bit extra once per import is too much of a burden for me and really it's your own fault anyways".
----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On 10 September 2015 at 12:26, Donald Stufft <donald@stufft.io> wrote:
I don't understand the phrase "if you needed determinism, it would hurt you to say so". Could you clarify?
And yet I know that I would routinely, and (this is the problem) without thinking, choose Deterministic, because I know that my use cases all get a (small) benefit from being able to capture the seed, but I also know I'm not doing security-related stuff. No amount of making me choose is going to help me spot security implications that I've missed. And also, calling the non-crypto choice "Deterministic" is unhelpful, because I *don't* want something deterministic, I want something random (I understand PRNGs aren't truly random, but "good enough for my purposes" is what I want, and "deterministic" reads to me as saying it's *not* good enough...)
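(Concretely, the capture-and-replay pattern I mean is nothing more elaborate than something like this:)

    import random

    seed = 12345                       # e.g. recorded and shown to the user, or taken from a bug report
    rng = random.Random(seed)
    layout = [rng.randint(1, 100) for _ in range(5)]
    # Re-running with the same seed reproduces exactly the same "random" layout.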
What I'm trying to say is that this is an education problem more than a UX problem. Personally, I think I know enough about security for my (not a security specialist) purposes. To that extent, if I'm working on something with security implications, I'm looking for things that say "Crypto" in the name. The rest of the time, I just use non-specialist stuff. It's a similar situation to that of the "statistics" module. If I'm doing "proper" maths, I'd go for numpy/scipy. If I just want some averages and I'm not bothered about numerical stability, rounding behaviour, etc, I'd go for the stdlib statistics package.
To me, that's crypto and I'd look to the cryptography module, or to something in the stdlib that explicitly said it was suitable for crypto. Saying people write bad code isn't enough - how does the current module *encourage* them to write bad code? How much API change must we allow to cater for people who won't read the statement in the docs (in a big red box) "Warning: The pseudo-random generators of this module should not be used for security purposes." (Specifically people writing security related code who won't read the docs).
I don't understand how that game (which is an interesting way of showing people how attacks on crypto work, sure, but that's just education, which you dismissed above) relates to the issue here. And I hope you don't really think that your quote is even remotely what I'm trying to say (I'm not that selfish) - my point is that not everything is security related. Not every application people write, and not every API in the stdlib. You're claiming that the random module is security related. I'm claiming it's not, it's documented as not being, and that's clear to the people who use it for its intended purpose. Telling those people that you want to make a module designed for their use harder to use because people for whom it's not intended can't read the documentation which explicitly states that it's not suitable for them, is doing a disservice to those people who are already using the module correctly for its stated purpose. By the same argument, we should remove the statistics module because it can be used by people with numerically unstable problems. (I doubt you'll find StackOverflow questions along these lines yet, but that's only because (a) the module's pretty new, and (b) it actually works pretty hard to handle the hard corner cases, but I bet they'll start turning up in due course, if only from the people who don't understand floating point...) Paul

On September 10, 2015 at 8:29:16 AM, Paul Moore (p.f.moore@gmail.com) wrote:
I transposed some words; fixed: "If you needed determinism, would it hurt you to say so?" Essentially, other than typing a little bit more, why is:

    import random
    print(random.choice(["a", "b", "c"]))

better than

    import random
    print(random.DeterministicRandom().choice(["a", "b", "c"]))

As far as I can tell, you've made your code, and what properties it has, much clearer to someone reading it, at the cost of 22 characters. If you're going to reuse the DeterministicRandom class you can assign it to a variable and actually end up saving characters, if the variable you save it to can be accessed in fewer than 6 characters.
You're allowed to pick DeterministicRandom; you're even allowed to do it without thinking. This isn't about making it impossible to ever insecurely use random numbers (that's obviously a boil-the-ocean level of problem), it's about trying to make it more likely that someone won't be hit by a fairly easy-to-hit footgun if it does matter for them, even if they don't know it. It's also about making code that is easier to understand on the surface. For example, without using the prior knowledge that it's using MT, tell me how you'd know whether this was safe or not:

    import random
    import string

    password = "".join(random.choice(string.ascii_letters) for _ in range(9))
    print("Your random password is", password)
But you *DO* want something deterministic, the *ONLY* way you can get this small benefit of capturing the seed is if you can put that seed back into the system and get a deterministic result. If the seed didn’t exactly determine the output of the randomness then you wouldn't be able to do that. If you don't need to be able to capture the seed and essentially "replay" the PRNG in a deterministic way then there is exactly zero downsides to using a CSPRNG other than speed, which is why Theo suggested using a very fast, modern CSPRNG to solve the speed issues. Can you point out one use case where cryptographically safe random numbers, assuming we could generate them as quickly as you asked for them, would hurt you unless you needed/wanted to be able to save the seed and thus require or want deterministic results?
Reminder that this warning does not show up (in any color, much less red) if you're using ``help(random)`` or ``dir(random)`` to explore the random module. It also does not show up in code review when you see someone calling random.random(). It encourages you to write bad code, because it has a baked-in assumption that there is a sane default for a random number generator, and it expects people to understand a fairly difficult concept: that not all "random" is equal. For instance, you've already made the mistake of saying you wanted "random", not deterministic, but the two are not mutually exclusive; deterministic is a property that a source of random can have, and one that you need for one of the features you say you like.
I'm claiming that the term random is ambiguously both security related and not security related and we should either get rid of the default and expect people to pick whether or not their use case is security related, or we should assume that it is unless otherwise instructed. I don't particularly care what the exact spelling of this looks like, random.(System|Secure)Random and random.DeterministicRandom is just one option. Another option is to look at something closer to what Go did and deprecate the "random" module and move the MT based thing to ``math.random`` and the CSPRNG can be moved to something like crypto.random.
No. By this argument, the statistics module shouldn't offer a single default "statistic" function, because there is no globally "right" answer for what that default should be. Should it be mean? mode? median? Why is *your* use case the "right" use case for the default option, particularly in a situation where picking the wrong option can be disastrous? ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On 10 September 2015 at 14:10, Donald Stufft <donald@stufft.io> wrote:
Thanks. In one sense, no it wouldn't. Nor would it matter to me if "the default random number generator" was fast and cryptographically secure. What matters is just that I get a load of random (enough) numbers. What hurts somewhat (not enormously, I'll admit) is having to think up front about whether I need to be able to capture a seed and replay it. That's nearly always something I'd think of way down the line, as a "wouldn't it be nice if I could get the user to send me a reproducible test case" or something like that. And of course it's just a matter of switching the underlying RNG at that point. None of this is hard. But once again, I'm currently using the module correctly, as documented.

I've omitted most of the rest of your response, largely because we're probably just going to have to agree to differ. I'm probably too worn out being annoyed at the way that everything ends up needing to be security related, and the needs of people who won't read the docs determine API design, to respond clearly and rationally :-( Paul

On Thu, Sep 10, 2015 at 8:44 AM, Paul Moore <p.f.moore@gmail.com> wrote:
No one in this thread is accusing everyone of using the module incorrectly. The fact that you do use it correctly is a testament to the fact that you read the docs carefully and have some level of experience with the module to know that you're using it correctly.
I think the people Theo, Donald, and others (including myself) are worried about are the people who have used some book or online tutorial to write games in Python and have seen random.random() or random.choice() used. Later on they start working on something else (including, but not limited to, the examples Donald has otherwise pointed out). They also have enough experience with the random module to know it produces randomness (what kind, they don't know... in fact they probably don't know there are different kinds yet) and they use what they know, because Python has batteries included and they're awesome and easy to use. The reality is that past experiences bias current decisions.

If that person went and read the docs, they probably won't know whether what they're doing warrants using a CSPRNG instead of the default Python one. If they're not willing to learn, or read enough (and I stress enough) about the topic before making a decision (or just really don't have the time because this is a side project), they'll say "Well the module level functions seemed random enough to me, so I'll just use those". That could end up being rather awful for them.

The reality is that your past experiences (and other people's past experiences, especially those who refuse to do some research and are demanding others prove that these are insecure with examples) are biasing this discussion, because they fail to empathize with new users whose past experiences are coloring their decisions. People choose Python for a variety of reasons, and one of those reasons is that in their past experience it was "fast enough" to be an acceptable choice. This is how most people behave. Being angry at people for not reading a two-sentence-long warning in the middle of the docs isn't helping anyone, nor is it an argument against the validity of this discussion.

On September 10, 2015 at 9:44:13 AM, Paul Moore (p.f.moore@gmail.com) wrote:
This is actually exactly why Theo suggested using a modern, userland CSPRNG: it can generate random numbers faster than /dev/urandom can, and, unless you need deterministic results, there's little downside to doing so.

There are really two possible ideas here, depending on what sort of balance we'd want to strike. We can make a default "I don't want to think about it" implementation of random that is both *generally* secure and fast, although it won't be deterministic and you won't be able to explicitly seed it. This would be a backwards-compatible change [1] for people who are simply calling these functions [2]:

    random.getrandbits
    random.randrange
    random.randint
    random.choice
    random.shuffle
    random.sample
    random.random
    random.uniform
    random.triangular
    random.betavariate
    random.expovariate
    random.gammavariate
    random.gauss
    random.lognormvariate
    random.normalvariate
    random.vonmisesvariate
    random.paretovariate
    random.weibullvariate

If these were all that the top-level functions in random.py provided, we could simply replace the default and people wouldn't notice; they'd just automatically get safer randomness whether that's actually useful for their use case or not. However, random.py also has these functions:

    random.seed
    random.getstate
    random.setstate
    random.jumpahead

and these functions are where the problem comes in. They only really make sense for deterministic sources of randomness, which are not "safe" for use in security-sensitive applications. So, pretending for a moment that we've already decided to do "something" about this, the question boils down to what we do about these 4 functions. Either we change the default to a secure CSPRNG and break these functions (and the people using them), which is easily fixed by changing ``import random`` to ``import random; random = random.DeterministicRandom()``, or we deprecate the top-level functions and try to guide people to choose up front what kind of random they need.

Either of these solutions will end up with people being safer, and, if we pretend we've agreed to do "something", it comes down to whether we'd prefer breaking compatibility for some people while keeping a default random generator that is probably good enough for most people, or whether we'd prefer not to break compatibility and to push people to always decide what kind of random they want.

Of course, we still haven't decided that we should do "something". I think that we should, because I think that secure by default (or at least, not insecure by default) is a good situation to be in. Over the history of computing it's been shown time and time again that trying to document or educate users is error prone and doesn't scale, whereas designing APIs so that the "right" thing is the obvious default, and specialist [3] cases which require some particular property have to be opted into, works much better.

[1] Assuming Theo's claim about the speed of the ChaCha-based arc4random function is accurate, which I haven't tested, but I assume he's smart enough to know what he's talking about WRT its speed.
[2] I believe anyway; I don't think any of these rely on the properties of MT or a deterministic source of randomness, just a source of randomness.
[3] In this case, there are two specialist use cases: those that require deterministic results, and those that require specific security properties that are not satisfied by a userland CSPRNG, because a userland CSPRNG is not as secure as /dev/urandom but is able to be much faster.
----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On 10 September 2015 at 15:21, Donald Stufft <donald@stufft.io> wrote:
Switching (somewhat hypocritically :-)) from an "I'm a naive user" stance, to talking about deeper issues as if I knew what I was talking about, this change results in each module getting a separate instance of the generator. That has implications on the risks of correlated results. It's unlikely to cause issues in real life, conceded. Paul

On Thu, 10 Sep 2015 at 07:22 Donald Stufft <donald@stufft.io> wrote:
+1 for deprecating module-level functions and putting everything into classes to force a choice

+0 for deprecating the seed-related functions, saying "the stdlib uses what it uses as an RNG and you have to live with it if you don't make your own choice", and switching to a crypto-secure RNG.

-0 for leaving it as-is

-Brett

On 11 September 2015 at 02:05, Brett Cannon <brett@python.org> wrote:
+1 for deprecating module-level functions and putting everything into classes to force a choice
-1000, as this would be a *huge* regression in Python's usability for educational use cases. (Think 7-8 year olds that are still learning to read, not teenagers or adults with more fully developed vocabularies) A reasonable "Hello world!" equivalent for introducing randomness to students is rolling a 6-sided die, as that relates to a real world object they'll often be familiar with. At the moment that reads as follows:
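    # (Illustrative reconstruction of the kind of beginner snippet being referred to here:)
    import random
    print("You rolled:", random.randint(1, 6))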
Another popular educational exercise is the "Guess a number" game, where the program chooses a random number from 1-100, and the person playing the game has to guess what it is. Again, randint() works fine here. Shuffling decks of cards, flipping coins, these are all things used to introduce learners to modelling random events in the real world in software, and we absolutely do *not* want to invalidate the extensive body of educational material that assumes the current module level API for the random module.
However, this I'm +1 on. People *do* use the module level APIs inappropriately, and we can get them to a much safer place, while nudging folks that genuinely need deterministic randomness towards an alternative API. The key for me is that folks that actually *need* deterministic randomness *will* be calling the stateful module level APIs. This means we can put the deprecation warnings on *those* methods, and leave them out for the others.

In terms of practical suggestions, rather than DeterministicRandom and NonDeterministicRandom, I'd actually go with the simpler terms SeededRandom and SeedlessRandom (there's a case to be made that those are misnomers, but I'll go into that more below):

    SeededRandom: Mersenne Twister
    SeedlessRandom: new CSPRNG
    SystemRandom: os.urandom()

Phase one of transition:

* add SeedlessRandom
* rename Random to SeededRandom
* Random becomes a subclass of SeededRandom that deprecates all methods not shared with SeedlessRandom
* this will also effectively deprecate the corresponding module level functions
* any SystemRandom methods that are no-ops (like seed()) are deprecated

Phase two of transition:

* Random becomes an alias for SeedlessRandom
* deprecated methods are removed from SystemRandom
* deprecated module level functions are removed

As far as the proposed Seeded/Seedless naming goes, that deliberately glosses over the fact that "seed" gets used to refer to two different things - seeding a PRNG with entropy, and seeding a deterministic PRNG with a particular seed value. The key is that "SeedlessRandom" won't have a "seed()" *method*, and that's the single most salient fact about it from a user experience perspective: you can't get the same output by providing the same seed value, because we wouldn't let you provide a seed value at all.

Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Fri, Sep 11, 2015 at 3:00 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Aside from sounding like varieties of grapes in a grocery, those names seem just fine. From the POV of someone with a bit of comprehension of crypto (as in, "use /dev/urandom rather than a PRNG", but not enough knowledge to actually build or verify these things), the distinction is precise: with SeededRandom, I can give it a seed and get back a predictable sequence of numbers, but with SeedlessRandom, I can't. I'm not sure what the difference is between "seeding a PRNG with entropy" and "seeding a deterministic PRNG with a particular seed value", though; aside from the fact that one of them uses a known value and the other doesn't, of course. Back in my BASIC programming days, we used to use "RANDOMIZE TIMER" to seed the RNG with time-of-day, or "RANDOMIZE 12345" (or other value) to seed with a particular value; they're the same operation, but one's considered random and the other's considered predictable. (Of course, bytes from /dev/urandom will be a lot more entropic than "number of centiseconds since midnight", but for a single-player game that wants to provide a different starting layout every time you play, the latter is sufficient.) ChrisA

On 11 September 2015 at 03:11, Chris Angelico <rosuav@gmail.com> wrote:
Actually, that was just a mistake on my part - they're really the same thing, and the only distinction is the one you mention: setting the seed to a known value. Thus the main seed-related difference between something like arc4random and other random APIs is the same one I'm proposing to make here: it's seedless at the API level because it takes care of collecting its own initial entropy from the operating system's random number API. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Chris Angelico wrote:
I think the only other difference is that the Linux kernel is continually re-seeding its generator whenever more unpredictable bits become available. It's not something you need to explicitly do yourself, as in your BASIC example. -- Greg

Nick Coghlan <ncoghlan@...> writes:
Fully agreed with Nick. That this is being seriously considered shows a massive disregard for usability. Python is not C++; it places convenience first. Besides, a deterministic RNG is a feature: you can reproduce exactly a random sequence by re-using the same seed, which helps fix rare input-dependent failures (we actually have a good example of that in CPython development with `regrtest -r`). Good luck debugging such issues when using an RNG which reseeds itself in a random (!) way. Lastly, the premise of this discussion is idealistic in the first place. If someone doesn't realize their code is security-sensitive, there are other mistakes they will make beyond simply choosing the wrong RNG. If you want to help people generate secure passwords, it would perhaps be best to write a password-generating (or more generally secret-generating, for different kinds of secrets: passwords, session ids, etc.) library. Regards Antoine.

On 14 September 2015 at 13:59, Antoine Pitrou <antoine@python.org> wrote:
Is your argument that there are lots of ways to get security wrong, and for that reason we shouldn't try to fix any of them? After all, I could have made this argument against PEP 466, or against the deprecation of SHA1 in TLS certificates, or against any security improvement ever made that simply changed defaults. The fact that there are secure options available is not a good excuse for leaving the insecure ones as the defaults.

And let's be clear, this is not a theoretical error that people don't hit in real life. Investigating your last comment, Antoine, I googled "python password generator". The results:

- The first one is a StackOverflow question which incorrectly uses random.choice (though seeded from os.urandom, which is an improvement). The answer to that says to just use os.urandom everywhere, but does not provide sample code. Only the third answer gets so far as to provide sample code, and it's way overkill.
- The second option, entitled "A Better Password Generator", incorrectly uses random.randrange. This code is *aimed at beginners*, and is kindly handing them a gun to point at their own foot.
- The third one uses urandom, which is fine.
- The fourth, an XKCD-based password generator, uses SystemRandom *if available* but then falls back to the MT approach, which is an unexpected decision, but there we go.
- The fifth, from "pythonforbeginners.com", incorrectly uses random.choice.
- The sixth goes into an intensive discussion about 'password strength', including a discussion about the 'bit strength' of the password, despite the fact that it uses random.randint, which means the analysis of bit strength is totally flawed.
- For the seventh we get a security.stackexchange question with the first answer saying not to use Random, though the questioner does use it and no sample code is provided.
- The eighth is a library that "generates randomized strings of characters". It attempts to use SystemRandom but falls back silently if it's unavailable.

At this point I gave up. Of that list of 8 responses, three are completely wrong, two provide sample code that is wrong with no correct sample code to be found on the page, two attempt to do the right thing but will fall into a silent failure mode if they can't, and only one is unambiguously correct.

Similarly, a quick search of GitHub for Python repositories that contain random.choice and the string 'password' returns 40,000 results.[0] Even if 95% of them are safe, that leaves 2000 people who wrote wrong code and uploaded it to GitHub.

It is disingenuous to say that only people who know enough write security-critical code. They don't. The reason for this is that most people don't know they don't know enough. And for those people, Python's default approach screws them over, and then they write blog posts which screw over more people. If the Python standard library would like to keep the insecure default of random.random that's totally fine, but we shouldn't pretend that the resulting security failures aren't our fault: they absolutely are.

[0]: https://github.com/search?l=python&q=random.choice+password&ref=searchresults&type=Code&utf8=%E2%9C%93

On 14 September 2015 at 14:29, Cory Benfield <cory@lukasa.co.uk> wrote:
Is your argument that there are lots of ways to get security wrong, and for that reason we shouldn't try to fix any of them?
This debate seems to repeatedly degenerate into this type of accusation. Why is backward compatibility not being taken into account here? To be clear, the proposed change *breaks backward compatibility* and while that's allowed in 3.6, just because it is allowed, doesn't mean we have free rein to break compatibility - any change needs a good justification. The arguments presented here are valid up to a point, but every time anyone tries to suggest a weak area in the argument, the "we should fix security issues" trump card gets pulled out. For example, as this is a compatibility break, it'll only be allowed into 3.6+ (I've not seen anyone suggest that this is sufficiently serious to warrant breaking compatibility on older versions). Almost all of those SO questions, and google hits, are probably going to be referenced by people who are using 2.7, or maybe some version of 3.x earlier than 3.6 (at what stage do we allow for the possibility of 3.x users who are *not* on the latest release?) So is a solution which won't impact most of the people making the mistake, worth it? I fully expect the response to this to be "just because it'll take time, doesn't mean we should do nothing". Or "even if it just fixes it for one or two people, it's still worth it". But *that's* the argument I don't find compelling - not that a fix won't help some situations, but that because it's security, (a) all the usual trade-off calculations are irrelevant, and (b) other proposed solutions (such as education, adding specialised modules like a "shared secret" library, etc) are off the table. Honestly, this type of debate doesn't do the security community much good - there's too little willingness to compromise, and as a result the more neutral participants (which, frankly, is pretty much anyone who doesn't have a security agenda to promote) end up pushed into a "reject everything" stance simply as a reaction to the black and white argument style. Paul

On September 14, 2015 at 11:01:36 AM, Paul Moore (p.f.moore@gmail.com) wrote:
How has it not been taken into account? The current proposal (best summed up by Nick in the other thread) will not break compatibility for anyone except those calling the functions that are specifically about setting a seed or getting/setting the current state. In looking around I don't see a lot of people using those particular functions, so most people likely won't notice the change at all, and for those who do, there is a very trivial change they can make to their code to cope with it.
We can't go back in time and fix those versions that is true. However, one of the biggest groups of people who are most likely to be helped by this change is new and inexperienced developers who don't fully grasp the security sensitive nature of whatever they are doing with random. That group of people are also more likely to be using Python 3.x than experienced programmers.
If I/we were not willing to compromise, I'd be pushing for it to use SystemRandom everywhere, because that removes all of the possibly problematic parts of using a user-space CSPRNG like the one being proposed. However, I/we are willing to compromise by sacrificing possible security in order not to regress things where we can; in particular, a user-space CSPRNG is being proposed over SystemRandom because it will provide you with random numbers almost as fast as MT will. However, when proposing this possible compromise, we are met with people refusing to meet us in the middle.

There are some folks who are trying to propose other middle grounds, and there will undoubtedly be some discussion around which ones are the best. We've gone from suggesting replacing the default random with SystemRandom (a lot slower than MT), to removing the default altogether, to deprecating the default and replacing it with a fast user-space CSPRNG. However, folks who don't want to see it change at all have thus far been unwilling to compromise at all. I'm confused how you're saying that the security-minded folks have been unwilling to compromise when we've done that repeatedly in this thread, whereas the backwards-compat-minded folks have consistently said "No, it would break compatibility" or "We don't need to change" or "They are probably insecure anyways".

Can you explain what compromise you're willing to accept here? If it doesn't involve breaking at least a little compatibility then it's not a compromise, it's you demanding that your opinion is the correct one (which isn't wrong; we're also asserting that our opinion is the correct one, we've just been willing to move the goal posts to try and limit the damage while still getting most of the benefit).

----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

Donald Stufft <donald@...> writes:
That's a pretty big "except". Paul's and my concern is about compatibility breakage, saying "it doesn't break compatibility except..." sounds like a lot of empty rhetoric.
In looking around I don't see a lot of people using those particular functions
Given that when you "look around" you only end up looking around amongst the Web developer crowd, I may not be surprised. You know, when I "look around" I don't see a lot of people using the random module to generate passwords. Your anecdote would be more valuable than other people's?
Yes, because generating passwords is a common and reasonable task for new and inexperienced developers? Really? Again, why don't you propose a dedicated API for that? That's what we did for constant-time comparisons. That's what people did for password hashing. That's what other people did for cryptography. I haven't seen a reasonable rebuttal to this. Why would generating passwords be any different from all those use cases? After all, if you provide a convenient API people should flock to it, instead of cumbersomely reinventing the wheel... That's what libraries are for.
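
As a minimal sketch of the kind of dedicated API being suggested, assuming nothing beyond today's standard library (the helper name and module layout here are hypothetical):

    import string
    import random

    _sysrand = random.SystemRandom()  # draws from os.urandom under the hood

    def generate_password(length=16,
                          alphabet=string.ascii_letters + string.digits):
        # A purpose-built helper: callers never have to know which RNG is safe.
        return ''.join(_sysrand.choice(alphabet) for _ in range(length))

    print(generate_password())
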
Really, it's not so much a performance issue as a compatibility issue. The random module provides, by default, a *deterministic* stream of random numbers. That os.urandom() may be a tad slower isn't very important when you're generating one number at a time and processing it with a slow interpreter (besides, MT itself is hardly the fastest PRNG out there). That os.urandom() doesn't give you a way to seed it once and get predictable results is a big *regression* if made the default RNG in the random module. And the same can be said for a user-space CSPRNG, as far as I understand the explanations here.
However, when proposing this possible compromise, we are met with people refusing to meet us in the middle.
See, people are fed up with the incompatibilities arising "in the name of the public good" in each new feature release of Python. When the "middle" doesn't sound much more desirable than the "extreme", I don't see why I should call it a "compromise". Some people have to support code in 4 different Python versions and further gratuitous breakage in the stdlib doesn't help. Yes, they can change their code. Yes, they can use the "six" module, the "future" module or whatever new bandaid exists on PyPI. Still they must change their code in one way or another because it was deemed "necessary" to break compatibility to solve a concern that doesn't seem grounded in any reasonable analysis. Python 3 was there to break compatibility. Not Python 3.4. Not Python 3.5. Not Python 3.6. (in case you're wondering, trying to make all published code on the Internet secure by appropriately changing the interpreter's "behaviour" to match erroneous expectations - even *documented* as erroneous - is *not* reasonable - no matter how hard you try, there will always be occurrences of broken code that people copy and paste around)
Can you explain what compromise you're willing to accept here?
Let's rephrase this: are *you* willing to accept an admittedly "insecure by default" compromise? No you aren't, evidently. There's no evidence that you would agree to leave the top-level random functions intact, even if a new UserSpaceSecureRandom class was added to the module, right? So why would we accept a compatibility-breaking compromise? Because we are more "reasonable" than you? (which in this context really reads: more willing to quit the discussion because of boredom, exhaustion, lack of time or any other quite human reason; which, btw, sums up a significant part of what the dynamics of python-ideas have become: "victory of the most obstinate") Yeah, that's always what you are betting on, because it's not like *you* will ever be reasonable except if it's the last resort for getting something accepted. And that's why every discussion about security with security-minded (read: "obsessed") people is a massive annoyance, even if at the end it succeeds in reaching a "compromise", after 500+ excruciating backs and forths on a mailing-list. Regards Antoine.

On 14 September 2015 at 16:55, Antoine Pitrou <antoine@python.org> wrote:
Python 3 was there to break compatibility. Not Python 3.4. Not Python 3.5. Not Python 3.6.
To clarify: your position is that we cannot break backward compatibility in Python 3.6?

Cory Benfield <cory@...> writes:
It is. Not breaking backward compatibility in feature releases (except 3.0, which was a deliberate special case) is a very long-standing policy, and it is so because users have a much better time with such a policy, especially when people have to maintain code that's compatible across multiple versions (again, the 2->3 transition is a special case, which justifies the existence of tools such as "six", and has incidentally created a lot of turmoil in the community that has only recently begun to recede). Of course, fixing a bug is not necessarily breaking compatibility (although sometimes we may even refuse to fix a bug because the impact on working code would be too large). But changing or removing a documented behaviour that people rely on definitely is. We do break feature compatibility, from time to time, in exceptional and generally discussed-at-length cases, but there is a sad pressure recently to push for more compatibility breakage - and, strangely, always in the name of "security". (also note that some library modules such as asyncio are or were temporarily exempted from the compatibility requirements, because they are in very active development; the random module evidently isn't part of them) Regards Antoine.

On Mon, Sep 14, 2015 at 10:01 AM, Paul Moore <p.f.moore@gmail.com> wrote:
So people who are arguing that the defaults shouldn't be fixed on Python 2.7 are likely the same people who also argued that PEP 466 was a terrible, awful, end-of-the-world type change. Yes it broke things (like eventlet) but the net benefit for users who can get onto Python 2.7.9 (and later) is immense. Now I'm not arguing that we should do the same to the random module, but a backport (that is part of the stdlib) would probably be a good idea under the same idea of allowing users to opt into security early.
They're not irrelevant. I personally think they're of lower impact to the discussion, but the reality is that the people who are educating others are few and far between. If there are public domain works, free tutorials, etc. that all advocate using a module in the standard library and no one can update those, they still exist and are still recommendations. People prefer free over correct when possible, because there's nothing free out there to correct them (until they get hacked or worse). Do we have a team in the Python community that goes out to educate people for free on security-related best practices? I haven't seen them. The best we have is a few people on crufty mailing lists like this one trying to make an impact, because education is a much larger and harder-to-solve problem than making something secure by default. Perhaps instead of bickering like fools on a mailing list, we could all be spending our time better educating others. That said, I can't make that decision for you just like you can't make that for me.
Except you seem to have missed many of the compromises being discussed and conceded by the security-minded folks. Personally, names that describe the outputs of the algorithms make much more sense to me than "Seedless" and "Seeded", but no one has really bothered to shave that yak further, out of a desire to compromise and make things better as a whole. Much of the lack of gradation has come from the opponents to this change, who seem to think of security as a step function where a subjective measurement of "good enough for me" counts as secure.

On 14 September 2015 at 16:32, Ian Cordasco <graffatcolmingov@gmail.com> wrote:
You may well be right. Personally, I'm pretty sick of the way all of these debates degenerate into content-free reiteration of the same old points, and unwillingness to hear other people's views. Here's a point - it seems likely that the people arguing for this change are of the opinion that I'm not appreciating their position. (For the record, I'm not being deliberately obstructive in case anyone thought otherwise. In my view at least, I don't understand the security guys' position). Assuming that's the case, then I'm probably one of the people who needs educating. But I don't feel like anyone's trying to educate me, just that I'm being browbeaten until I give in. Education != indoctrination.
That said, I can't make that decision for you just like you can't make that for me.
Indeed. Personally, I spend quite a lot of time in my day job (closed source corporate environment) trying to educate people in sane security practices, usually ones I have learned from people in communities like this one. One of the biggest challenges I have is stopping people from viewing security as "an annoying set of rules that get in the way of what I'm trying to do". But you would not believe the sorts of things I see routinely - I'm not willing to give examples or even outlines on a public mailing list because I can't assess whether such information could be turned into an exploit. I can say, though, that crypto-safe RNGs are *not* a relevant factor :-) At its best, good security practice should *help* people write reliable, easy-to-use systems. Or at a minimum, not get in the way. But the PR message needs always to be "I understand the constraints you're dealing with", not "you must do this for your own good". Otherwise the "follow the rules until the auditors go away" attitude just gets reinforced. Hence my focus on seeing proof that breakages are justified *in the context of the target audience I am responsible for*. Conversely, you're right that I can't force anyone else to try to educate people in good security practices, however much better than me at it I might think they are. In actual fact, though, I think a lot of people do a lot of good work educating others - as I say, most of what I've learned has been from lists like these.
OK, you have a point - there have been changes to the proposals. But there are fundamental points that have (as far as I can see) never been acknowledged. As a result, the changes feel less like compromises based on understanding each other's viewpoints, and more like repeated attempts to push something through, even if it's not what was originally proposed. (I *know* this is an emotional position - please understand I'm fed up and not always managing to word things objectively). Specifically, I have been told that I can't argue my "convenience" over the weight of all the other people who could fall into security traps with the current API. Let's review that, shall we? * My argument is that breaking backward compatibility needs to be justified. People have different priorities. "Security risks should be fixed" isn't (IMO) a free pass. Why should it be? "Windows compatibility issues should be fixed" isn't a free pass. "PyPy/Jython compatibility issues should be fixed" isn't a free pass. Forcing me to adjust my priorities so that I care about security when I don't want (or IMO need) to isn't acceptable. * The security arguments seem to be largely in the context of web application development (cookies, passwords, shared secrets, ...) That's not the only context that matters. * As I said above, in my experience, a compatibility break "to make things more secure" is seen as equating security with inconvenience, and can actually harm attempts to educate users in better security practices. * In many environments, reproducibility of random streams is important. I'm not an expert on those fields, although I've hit some situations where seeding is a requirement. As far as I am aware, most of those situations have no security implications. So for them, the PEP is all cost, no benefit. Sure the cost is small, but it's non-zero. How come the web application development community is the only one whose voice gets heard? Is it because the fact that they *are* public-facing, and frequently open-source, means that data is available? So "back it up with facts or we won't believe you" becomes a debating stance? I'm not arguing that everyone should be allowed to climb up on their soapbox and rant - but I would like to think that bringing a different perspective to the table could be treated with respect and genuine attempts to understand. And "in my experience" is viewed as an offer of information, not as an attempt to bluff on a worthless hand. Just to be clear, I think the current proposal (Nick's pre-PEP) is relatively unobtrusive, and unlikely to cause serious compatibility issues. I'm uncomfortable with the fact that it feels like yet another "imposition in the name of security", and while I'm only one person I feel that I'm not alone. I'm concerned that the people pushing security seem unable to recognise that people becoming sick of such changes is a PR problem they need to address, but that's their issue not mine. So I'm unlikely to vote against the proposal, but I'll feel sad if it's accepted without a more balanced discussion than we've currently had. On the meta-issue of how debates like this are conducted, I think people probably need to listen more than they talk. I'm as guilty as anyone else here. But in particular, when multiple people all end up responding to rebut *every* counter-argument, essentially with the same response, maybe it's time to think "we're in the majority here, let's stop talking so much and see if we're missing anything from what the people with other views are saying". 
He who shouts loudest isn't always right. Not necessarily wrong, either, but sometimes it's bloody hard to tell one way or the other, if they won't shut up long enough to analyze the objections.
I'm frankly long past caring. I think we'll end up with whatever was on the table when people got too tired to argue any more.
Wait, what? It's *me* that's claiming that security is a yes/no thing??? When all I'm hearing is "education isn't sufficient", "dedicated libraries aren't sufficient", "keeping a deterministic RNG as default isn't an option"? And when I'm suggesting that fixing the PRNG use in code that misuses a PRNG may not be the only security issue with that code? I knew the two sides weren't communicating, but this statement staggers me. We have clearly misunderstood each other even more fundamentally than I had thought possible :-( Thinking hard about the implications of what you said there, I start to see why you might have misinterpreted my stance as the black and white one. But I have absolutely no idea how to explain to you that I find your stance equally (and before I took the time to think through what your statement implied, even more) so. There's little more I can say. I'm going to take my own advice now, and stop talking. I'll keep listening, in the hope that either this post or something else will somehow break the logjam, but right now I'm not sure I have much hope of that. Paul

On Mon, Sep 14, 2015, at 15:14, Paul Moore wrote:
* My argument is that breaking backward compatibility needs to be justified.
I don't think it does. I think that there needs to be a long roadmap of deprecation and provided workarounds for *almost any* backwards-compatibility-breaking change, but that special justification beyond "is this a good feature" is only needed for ignoring that roadmap, not for deprecating/replacing a feature in line with it. No-one, as far as I have seen in this thread to date, has actually put a timeline on this change. No-one's talking about getting rid of the global functions in 3.5.1, or in 3.6, or in 3.7. So with that in mind I can only conclude that the people against making the change are against *ever* making it *at all* - and certainly a lot of the arguments they're making have to do with nebulous educational use-cases (class instances are hard, let's use mutable global state) rather than backwards compatibility. Would you likewise have been against every single thing that Python 3 did?

On September 14, 2015 at 3:14:45 PM, Paul Moore (p.f.moore@gmail.com) wrote:
For the record, I'm not sure what part you don't understand. I'm happy to try and explain it, but I think I'm misunderstanding what you're not understanding or something, because I personally feel like I did explain what I think you're misunderstanding. Part of the problem (probably) here is that there isn't an exact person we're trying to protect here. The general gist is that if you use the deterministic APIs in a security-sensitive situation, then you may be vulnerable depending on exactly what you're doing. We think that in particular, the API of the random module will lead inexperienced or un(der)informed developers to use the API in situations where it's not appropriate and, from that, have an insecure piece of software they wrote. We're people who think that the defaults of the software should be "generally" secure (as much so as is reasonable) and that if you want to do something that isn't safe then you should explicitly opt in to that (the flipside is, things shouldn't be so locked down as to be unusable without having to turn off all of the security knobs, this is where the "generally" in generally secure comes into play). A particularly nasty side effect of this is that it's almost never the people who wrote this software who are harmed by it being broken, and it's almost always their users who didn't have anything to do with it. So essentially the goal is to try and make it harder for people to accidentally misuse the random module. If that doesn't answer your confusion, and if you can try to reword it to get it through my thick skull better, I'm happy to continue to try and answer it (on or off list).
Right, and this is actually trying to do that. By removing a possibly dangerous default and making the default safer. Defaults matter a lot in security (and sadly, a lot of software doesn't have safe defaults) because a lot of software will never use anything but the defaults.
I think part of this is that a lot of the folks proposing these changes are also sensitive to the backwards compatibility needs and have already baked that into their thoughts. We don't generally come into these with "scorched earth" suggestions of fixing some situation where security could be improved, but instead try and figure out a decent balance of security and not breaking things to try and cover most of the ground with as little cost as possible. My very first email in this particular thread (that started this thread) was the first one I had with a fully solid proposal in it. The last paragraph in that proposal asked the question "Do we want to protect users by default?" My next email presented two possible options depending on which we considered to be "less" breaking, either deprecating the module-scoped functions completely or changing their defaults to something secure, and mentioned that if we can't change the default, the user-land CSPRNG probably isn't a useful addition because its benefit is primarily in being able to make it the default option. I don't see anyone who is talking about making a change not also talking about what areas of backwards compatibility it would actually break. I think part of this too is that security is a bit weird, it's not a boolean property but there are particular bars you need to pass before it's an actual solution to the problem. So for a lot of us, we'll figure out that bar and draw a line in the sand and say "If this proposal crosses this line, then doing nothing is better than doing something", because it'd just be churn for churn's sake at that point. That's why you'll see particular points that we essentially won't give up, because if they are given up we might as well do nothing. In this particular instance, the point is that the API of the random module leads people to use it incorrectly, so unless we address that, we might as well just leave it alone.
I think I was the one who said that to you, and I'd like to explain why I said it (beyond the fact I was riled up). Essentially I had in my mind something like what Nick has proposed, which you've said later on you think is relatively unobtrusive and unlikely to cause serious compatibility issues, which I agree with. Then I saw you arguing against what I felt was a pretty mundane API break that was fairly trivial to work around, and it signaled to me that you were saying that having to type a few extra letters was a bridge too far. This reads to me like someone saying "Well I know how to use it correctly, it's their own fault if others don't". I'm not saying that's what you actually think, but that's how it read to me.
The justification is essentially that it will protect some people with minimal impact to others. The main impact will be people who actually needed a deterministic RNG will need to use something like ``random.seeded_random`` instead of just ``random`` and importantly, this will break in a fairly obvious manner instead of the silently wrong situation for people who are currently using the top level API incorrectly. As a bit of a divergence, the "silently wrong" part is why defaults tend to matter a lot in security. Unless you're well versed in it, most people don't think about it and since it "works" they don't inquire further. Something that is security sensitive that always "works" (as in, doesn't raise an error) is broken which is the inverse of how most people think about software. To put it another way, it's the job of security sensitive APIs to break things, ideally only in cases where it's important to break, but unless you're actually testing that it breaks in those attack scenarios, secure and insecure looks exactly the same.
You're right that it's not the only context that matters; however, it's often brought up for a few reasons: * Security largely doesn't matter for software that doesn't accept or send input from some untrusted source, which narrows security concerns down to mostly network-based applications. * The HTTP protocol is "eating the world" and we're seeing more and more things using it as their communication protocol (even for things that are not traditional browser-based applications). * Traditional Web Applications/Sites are a pretty large target audience for Python, and in particular a lot of the security folks come from that world because the web is a hostile place. But you can replace web application with anything that an untrusted user can interact with over any protocol and the argument is basically the same.
Sadly, I don't think this is fully resolvable :( It is the nature of security that its purpose is to take something that otherwise "works" and make it no longer work, because it doesn't satisfy the constraints of the security system.
Right, and I don't think anyone is saying this isn't an important use case, just that if you need a deterministic RNG and you don't get one, that is a fairly obvious problem but if you need a CSPRNG and you don't get one, that is not obvious. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On Tue, Sep 15, 2015 at 6:23 AM, Donald Stufft <donald@stufft.io> wrote:
To add to that: Web application development is a *huge* area (every man and his dog wants a web site, and more than half of them want logins and users and so on), which means that the number of non-experts writing security-sensitive code is higher there than in a lot of places. The only other area I can think of that would be comparably popular would be mobile app development - and a lot of the security concerns there are going to be in a web context anyway. Is it fundamentally insecure to receive passwords over an encrypted HTTP connection and use those to verify user identities? I don't think so (although I'm no expert) - it's what you do with them afterward that matters (improperly hashing - or, worse, using a reversible transformation). Why are so many people advised not to do user authentication at all, but to tie in with one of the auth APIs like Google's or Facebook's? Because it's way easier to explain how to get that right than to explain how to get security/encryption right. How bad is it, really, to tell everyone "use random.SystemRandom for anything sensitive", and leave it at that? ChrisA

On 15 September 2015 at 18:58, Chris Angelico <rosuav@gmail.com> wrote:
How bad is it, really, to tell everyone "use random.SystemRandom for anything sensitive", and leave it at that?
That's the status quo, and has been for a long time. If it was ever going to work in terms of discouraging folks from using the module-level functions for security-sensitive tasks, it would have worked by now. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 15 September 2015 at 01:32, Ian Cordasco <graffatcolmingov@gmail.com> wrote:
They don't even have to get onto 2.7.9 per se - the RHEL 7.2 beta just shipped with Robert Kuska's backport of those changes (minus the eventlet breaking internal API change), so it will also filter out through the RHEL/CentOS ecosystem via 7.x and SCLs. (We also looked at a Python 2.6 backport, but decided it was too much work for not enough benefit - folks really need to just upgrade to RHEL/CentOS 7 already, or at least switch to using Software Collections for their Python runtime needs). Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 14 September 2015 at 16:01, Paul Moore <p.f.moore@gmail.com> wrote:
What makes you think that I didn't take it into account? I did: and then rejected it. On a personal level, I believe that defaulting to more secure is worth backward compatibility breaks. I believe that a major reason for the overwhelming prevalence of security vulnerabilities in modern software is that we are overly attached to making people's lives *easy* at the expense of making them *safe*. I believe that software communities in general are too concerned about keeping the stuff that people used around for far too long, and not concerned enough about pushing users to make good choices. The best example of this is OpenSSL. When compiled from source naively (e.g. ./config && make && make install), OpenSSL includes support for SSLv3, SSL Compression, and SSLv2, all of which are known-broken options. To clarify, SSLv2 has been deprecated for security reasons since 1996, but a version of OpenSSL 1.0.2d you build today will happily enable *and use* it. Hell, OpenSSL's own build instructions include this note[0]:
Why is it that users who do not read the wiki (most of them) get an insecure build? Backwards compatibility is why. This is necessarily a reductio ad absurdum type of argument, because I'm trying to make a rhetorical point: I believe that sacrificing security on the altar of backwards compatibility is a bad idea in the long term, and I want to discourage it as best I can. I appreciate your desire to maintain backward compatibility, Paul, I really do. And I think it is probably for the best that people like you work on projects like CPython, while people like me work outside the standard library. However, that won't stop me trying to drag the stdlib towards more secure defaults: it just might make it futile.

On September 14, 2015 at 11:08:39 AM, Cory Benfield (cory@lukasa.co.uk) wrote:
So I will counter this with what I am fully expecting to be the response: People use distributions that compile and configure OpenSSL for them, e.g., `apt-get install openssl` (perhaps not exactly the command that works, but you get the idea). That said, last year, Debian, Ubuntu, Fedora, and other distributions all started compiling openssl without SSLv3 as an available symbol, which broke backwards compatibility and TONS of python projects (eventlet, urllib3, requests, etc.). Why did it break backwards compatibility? Because they knew that they were responsible for the security of their users and expecting users to recompile OpenSSL themselves with the correct flags was unrealistic. Their users come from a wide range of people: - System administrators - Desktop users (if you believe anyone actually uses linux on the desktop ;)) - Researchers - Developers - etc.
That said, I’d also like to combat the idea that security experts won’t use random. Currently Helios, which is a piece of voting software (that anyone can deploy), uses the random module (https://github.com/benadida/helios-server/blob/b07c43dee5f51ce489b6fcb7b7194...) They use it to generate passwords: https://github.com/benadida/helios-server/blob/b07c43dee5f51ce489b6fcb7b7194... https://github.com/benadida/helios-server/blob/b07c43dee5f51ce489b6fcb7b7194... Ben Adida is a security professional who has written papers on creating secure voting systems, but even he uses the random module arguably incorrectly in what should be secure software. Arguing that anyone who knows they need secure random functions will use them is clearly invalidated. Not everyone who knows they should be generating securely random things is aware that the random module is insufficient for their needs. Perhaps that code was written before the big red box was added to the documentation and so it was ineffective. Perhaps Ben googled and found that everyone else was using random for passwords (as people have shown is easy to find in this discussion several times). That said, your arguments are easily reduced to “No language should protect its users from themselves”, which is equivalent to Python’s “we’re all consenting adults” philosophy. In that case, we’re absolutely safe from any blame for the horrible problems that users inflict on themselves. Anyone that used urllib2/httplib/etc. from the standard library to talk to a site over HTTPS (prior to PEP 466) is to blame because they didn’t read the source and know that their sensitive information was easily intercepted by anyone on their network. Clearly, that’s their fault. This makes core language development so much easier, doesn’t it? Place all the blame on the users for the sake of X (where in this discussion X is the holy grail of backwards compatibility).

On 14 September 2015 at 17:00, Cory Benfield <cory@lukasa.co.uk> wrote:
OK. In *my* experience, systems with appallingly bad security practices run for many years with no sign of an exploit. The vulnerabilities described in this thread pale into insignificance compared to many I have seen. On the other hand, I regularly see systems not being upgraded because the cost of confirming that there are no regressions (much less the cost of making fixes for deliberate incompatibilities) is deemed too high. I'm not trying to justify those things, nor am I trying to say that my experience is in any way "worth more" than yours. These aren't all Python systems. But the culture where such things occur is real, and I have no reason to believe that I'm the only person in this position. (But as it's in-house closed-source, it's essentially impossible to get any good view of how common it is). Paul

On September 14, 2015 at 3:27:22 PM, Paul Moore (p.f.moore@gmail.com) wrote:
What does "no sign of an exploit" mean? Does it mean that if there was an exploit that the attackers didn't put metaphorical giant signs up to say that "Zero Cool" was here? Or is there an active security team running IDS software to ensure that there wasn't a breach? I ask because in my experience, "no sign of an exploit" is often synonymous with "we've never really looked to see if we were exploited, but we haven't noticed anything". This is a dangerous way to look at it, because a lot of exploitation is being done by organized crime where they don't want you to notice that you were exploited because they want to make you part of a botnet or to silently steal data or whatever you have. For these, if they get detected that is a bad thing because they lose that node in their botnet (or whatever). It's a very rare exploit that gets publically exposed like the Ashley Madison hacks, they are jsut the ones that get the most attention because they are bombastic and public.
Absolutely! However, I think these systems largely don't upgrade *at all* and are still on whatever version of $LANG they originally wrote the software for. These systems tend to be so regression-averse that they don't even risk bug fixes, because that might cause a regression. For these people, it doesn't really matter what we do because they aren't going to upgrade anyways, and they keep Red Hat in business by paying them for Python 2.4 until the heat death of the universe. I think the more likely case for concern is people who do upgrade and are willing to tolerate some regression in order to stay somewhat current. These people will push back against *massive* breakage (as seen with the Python 3.x migration taking forever) but are often perfectly fine dealing with small breakages. As someone who does write software that supports a lot of versions (currently, 6-7 versions of CPython alone is my standard, depending on whether you count pre-releases or not) having to tweak import statements doesn't even really register on my "give a damn" meter, nor did it for the folks I know who are in similar situations (though this is admittedly a biased and small sample).
I think maybe a problem here is a difference in how we look at the data. It seems that you might focus on the probability of you personally (or the things you work on) getting attacked and thus benefiting from these changes, whereas I, and I suspect the others like me, think about the probability of *anyone* being attacked. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

(The rest of your emails, I'm going to read fully and digest before responding. Might take a day or so.) On 14 September 2015 at 21:36, Donald Stufft <donald@stufft.io> wrote:
This may be true, in some sense. But I'm not willing to accept that you are thinking about everyone, but I'm somehow selfishly only thinking of myself. If that's what you were implying, then frankly it's a pretty offensive way of disregarding my viewpoint. Knowing you, I'm sure that's *not* how you meant it - but do you see how easy it is for the way you word something to make it nearly impossible for me to see past your wording to get to the actual meaning of what you're trying to say? I didn't even consciously notice the implication myself, at first. I simply started writing a pretty argumentative rebuttal, because I felt that somehow I needed to correct what you said, but I couldn't quite say why. Looking at the reality of what I focus on, I'd say it's more like this. I mistrust arguments that work on the basis that "someone, somewhere, might do X bad thing, therefore we must all pay cost Y". The reasons are complex (and I don't know that I fully understand all of my thought processes here) but some aspects that immediately strike me are: * The probability of X isn't really quantified. I may win the lottery, but I don't quit my job - the probability is low. The probability of X matters. * My experience of the probability of X happening varies wildly from that of whoever's making the point. Who is right? Why must one of us "win" and be right? Can't it simply be that my data implies that over the full data set, the actual probability of X is lower than you thought? * The people paying cost Y are not the cause of, nor are they impacted by, X (except in an abstract "we all suffer if bad things happen" sense). I believe in the general principle of "you pay for what you use", so to me you're arguing for the wrong people to be made to pay. Hopefully, those are relatively objective measures. More subjectively, * It's way too easy to say "if X happens once, we have a problem". If you take the stance that we have to prevent X from *ever* happening, you allow yourself the freedom to argue with vague phrases like "might", while leaving the burden of absolute proofs on me. (In the context of RNG proposals, this is where arguments like "let's implement a secure secret library" get dismissed - they still leave open the possibility of *someone* using an inappropriate RNG, so "they don't solve the issue" - even if they reduce the chance of that happening by a certain amount - and neither you nor I can put a figure on how much, so let's not try). * There's little evidence that I can see of preventative security measures having improved things. Maybe this is because it's an "arms race" situation, and keeping up is all we can hope for. Maybe it's because it's hard to demonstrate a lack of evidence, so the demand for evidence is unreasonable. I don't know. * For many years I ran my PC with no anti-virus software. I never got a virus. Does that prove anything? Probably not. The anti-virus software on my work PC is the source of *far* more issues than I have ever seen caused by a virus. Does *that* prove anything? Again, probably not. But my experience with at least *that* class of pressure to implement security is that the cure is worse than the disease. Where does that leave the burden of proof? Again, I don't know, but my experience should at least be considered as relevant data. * Everyone I have ever encountered in a work context (as opposed to in open-source communities) seems to me to be in a similar situation to mine. 
I believe I'm speaking for them, but because it's a closed-source in house environment, I've got no public data to back my comments. And totally subjective, * I'm extremely tired of the relentless pressure of "we need to do X, because security". While the various examples of X may all have ended up being essentially of no disadvantage to me, feeling obliged to read, understand, and comment on the arguments presented every time, gets pretty wearing. * I can't think of a single occasion where we *don't* do X. That may well be confirmation bias, but again subjectively, it feels like nobody's listening to the objections. I get that the original proposals get modified, but if never once has the result been "you're right, the cost is too high, we'll not do X" then that puts security-related proposals in a pretty unique position. Finally, in relation to that last point, and one thing I think is a key difference in our thinking. I do *not* believe that security proposals (as opposed to security bug fixes) are different from any other type of proposal. I believe that they should be subject to all the same criteria for acceptance that anything else is. I suspect that you don't agree with that stance, and believe that security proposals should be held to different standards (e.g., a demonstrated *probability* of benefit is sufficient, rather than evidence of actual benefit being needed). But please speak for yourself on this - I'm not trying to put words into your mouth, it's just my impression. All of which is completely unrelated to either the default RNG for the Python stdlib, or whether I understand and/or accept the security arguments presented here (for clarity, I believe I understand them, I just don't accept them). Paul

On 15 September 2015 at 08:50, Emile van Sebille <emile@fenx.com> wrote:
Historically, yes, but relying solely on perimeter defence is becoming less and less viable as the workforce decentralises, and we see more people using personal devices and untrusted networks to connect to work systems (whether that's their home network or the local coffee shop), as well as relying on public web services rather than internal applications. Enterprise IT is simply *wrong* in the way we currently go about a lot of things, and the public web service sector is showing us all how to do it right. Facilitating that transition is a key part of my day job in Red Hat's Developer Experience team (it's getting a bit off topic, but for a high level company perspective on that: http://www.redhat-cloudstrategy.com/towards-a-frictionless-it-whether-you-li...). And for folks tempted to think "this is just about the web", for a non-web related example of what we as an industry have unleashed through our historical "security is optional" mindset: http://www.wired.com/2015/07/hackers-remotely-kill-jeep-highway/ That's an article on remotely hacking the UConnect system in a Jeep Cherokee to control all sorts of systems that had no business being connected to the internet in the first place. The number of SCADA industrial control systems accessible through the internet is frankly terrifying - one of the reasons we can comfortably assume most humans are either nice or lazy is because we *don't* see most of the vulnerabilities that are lying around being exploited. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On September 14, 2015 at 6:39:28 PM, Paul Moore (p.f.moore@gmail.com) wrote:
No, I don’t mean it in the way of you being selfish. I'm not quite sure of the right wording here; essentially it's the probability of an event happening to a particular individual vs the probability of an event occurring at all. To use your lottery example, I *think*, and perhaps I'm wrong, that you're looking at it in terms of: the chance of any particular person participating in the lottery winning the lottery is low, so why should each of these people, as an individual, make plans for how to get the money when they win the lottery, because as individuals they are unlikely to win. Whereas I flip it around and think that someone, somewhere is likely going to win the lottery, so the lottery system should make plans for how to get them the money when they win. I'm not sure of the right "name" for each type, and I don't want to continue to try and hamfist it, because I don't mean it in an offensive or an "I'm better than you" way and I fear putting my foot in my mouth again :(
Just to be clear, I don’t think that "If X happens once, it's a problem" is a reasonable belief, and I don't personally have that belief. It's a sliding scale where we need to figure out where the right solution for Python is for each particular problem. I certainly wouldn't want to use a language that took the approach that if X can ever happen, we need to prevent X. I have seen a number of users incorrectly use the random.py module, to the point where I think that the danger is "real". I also think that, if this were a brand new module, it would be a no-brainer (but perhaps I'm wrong) for the default, module-level API to be safe by default. Going off that assumption, I think the question is really just "Is it worth it?" not "does this make more sense than the current one?".
By preventive security measures, do you mean things like PEP 466? I don't quite know how to accurately state it, but I'm certain that PEP 466 directly improved the security of the entire internet (and continues to do so as it propagates).
Antivirus is a particularly bad example of security software :/ It's a massive failing of the security industry that they exist in the state they do. There's a certain bias here though, because it is the job of security-sensitive code to "break" things (as in, take otherwise valid input and make it not work). In an ideal world, security software just sits there doing "nothing" from the POV of someone who isn't a security engineer and then will, often through no fault of their own, pop up and make things go kabloom because it detected something insecure happening. This means that for most people, the only interaction they have with something designed to protect them is when it steps in to make things stop working. It is relevant data, but I think it goes back to the different way of looking at things (what is the individual chance of an event happening, vs the chance of an event happening across the entire population). This might also be why you'll see the backwards compat folks focus more on experience-driven data and security folks focus more on hypotheticals about what could happen.
I'm not sure what to do about this :( On one side, you're not obligated to read, understand, and comment on everything that's raised, but I totally understand why you do, because I do too. I'm not sure how to help this without saying that people who care about security just shouldn't bring it up, either?
Off the top of my head I remember the on-by-default hash randomization for Python 2.x (or the actual secure hash randomization, since 2.x still has the one where it is trivial to recover the original seed). I don't actually remember that many cases where python-dev chose to break backwards compatibility for security. The only ones I can think of are: * The hash randomization on Python 3.x (sort of? Only if you depended on dict ordering, which wasn't a guarantee anyways). * The HTTPS improvements where we switched Python to default to verifying certificates. * The backports of several security features to 2.7 (backport of 3.4's ssl module, hmac.compare_digest, os.urandom's persistent FD, hashlib.pbkdf2_hmac, hashlib.algorithms_guaranteed, hashlib.algorithms_available). There are probably things that I'm not thinking of, but the hash randomization only broke things if you were depending on dict/set having ordering, which isn't a promised property of dict/set. The backports of security features were done in a pretty minimally invasive way where they would (ideally) only break things if you relied on those names *not* existing on Python 2.7 (which was a nonzero but small set). The HTTPS verification is the main thing I can think of where python-dev actually broke backwards compatibility in an obvious way for people relying on something that was documented to work a particular way. Are there examples I'm not remembering (probably!)? Two sort-of backwards-incompatible changes and one clearly backwards-incompatible change in the lifetime of Python doesn't really feel like that much to me. Is there some crossover with distutils-sig maybe? I've focused a lot more on pushing security on that side of things both because it personally affects me more and because I think insecure defaults there are a lot worse than insecure defaults in any particular module in the Python standard library.
Well, I think that all proposals are based on what the probability is that it's going to help some particular percentage of people, and whether it's going to help enough people to be worth the cost. What I think is special about security is the cost of *not* doing something. Security "fails open" in that if someone does something insecure, it's not going to raise an exception or give different results or something like that. It's going to appear to "work" (in that you get the results you expect) while the user is silently insecure. Compare this to, well, let's pretend that there was never a deterministic RNG in the standard library. If a scientist or a game designer inappropriately used random.py, they'd pretty quickly learn that they couldn't give the RNG a seed, and even if it was a CSPRNG with an "add_seed" method that might confuse them, it'd be pretty obvious on the second execution of their program that it's giving them a different result. I think that the bar *should* be lower for something that just silently or subtly does the "wrong" thing vs something that obviously and loudly does the wrong thing. Particularly when the downside of doing the "wrong" thing is as potentially disastrous as it is with security. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
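
A minimal sketch of that loud-versus-silent failure point, using only what the random module already provides: a seeded Mersenne Twister is reproducible across runs, while SystemRandom quietly ignores seeding and has no state to replay, so someone who actually needed determinism finds out on the second execution:

    import random

    mt = random.Random(42)
    print([mt.random() for _ in range(3)])    # identical on every run

    sr = random.SystemRandom()
    sr.seed(42)                               # accepted but ignored
    print([sr.random() for _ in range(3)])    # different on every run

    try:
        sr.getstate()                         # no reproducible state to save
    except NotImplementedError:
        print("SystemRandom cannot be seeded or replayed")
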

On September 14, 2015 at 8:14:33 PM, Donald Stufft (donald@stufft.io) wrote:
This should read: Security "fails open" in that if someone uses an API that allows something insecure to happen (like not validating HTTPS) it's not going to raise an exception or give different results or something like that. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On 15 September 2015 at 08:39, Paul Moore <p.f.moore@gmail.com> wrote:
Most of the time, when the cost of change is clearly too high, we simply *don't ask*. hmac.compare_digest() is an example of that, where having a time-constant comparison operation readily available in the standard library is important from a security perspective, but having standard equality comparisons be as fast as possible is obviously more important from a language design perspective. Historically, it was taken for granted that backwards compatibility concerns would always take precedence over improving security defaults, but the never-ending cascade of data breaches involving personally identifiable information is proving that we, as a collective industry, are *doing something wrong*: http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-bre... A lot of the problems we need to address are operational ones as we upgrade the industry from a "perimeter defence" mindset to a "defence in depth" mindset, and hence we have things like continuous integration, continuous deployment, application and service sandboxing, containerisation, infrastructure-as-code, immutable infrastructure, etc, etc, etc. That side of things is mostly being driven by infrastructure software vendors (whether established ones or startups), where we have the fortunate situation that the security benefits are tied in together with a range of operational efficiency and capability benefits [1]. However, there's also increasing recognition that some of the problems are due to the default behaviours of the programming languages we use to *create* applications, and in particular the fact that many security issues involve silent failure modes. Sometimes the right answer to those is to turn the silent failure into a noisy failure (as with certificate verification in PEP 476), other times it is about turning the silent failure into a silent success (as is being proposed for the random module API), and yet other times it is simply about lowering the barriers to someone doing the right thing once they're alerted to the problem (as with the introduction of hmac.compare_digest() and ssl.create_default_context(), and their backports to the Python 2.7 series). At a lower level, languages like Go and Rust are challenging some of the assumptions in the still-dominant C-based memory management model for systems programming. Rust in particular is interesting in that it has a much richer compile-time-enforced concept of memory ownership than C does, while still aiming to keep the necessary runtime support very light. Regards, Nick. [1] For folks wanting more background on some of the factors behind this shift, I highly recommend Google's "BeyondCorp" research paper: http://research.google.com/pubs/pub43231.html -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
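
As a small illustration of the hmac.compare_digest() case (the key, message, and helper below are made up for the example): ordinary '==' stays as fast as possible for general-purpose equality, while the dedicated helper compares secrets in time independent of where they differ:

    import hashlib
    import hmac

    key = b'server-side-secret'
    message = b'user=alice&expires=1442300000'
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()

    def verify(signature_from_client):
        # '==' can leak timing information about how many leading characters
        # match; compare_digest takes the same time regardless of where they differ.
        return hmac.compare_digest(expected, signature_from_client)

    print(verify(expected))   # True
    print(verify('0' * 64))   # False
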

On 14 September 2015 at 23:39, Paul Moore <p.f.moore@gmail.com> wrote:
(The rest of your emails, I'm going to read fully and digest before responding. Might take a day or so.)
Point-by-point responses exhaust and frustrate me, and don't really serve much purpose other than to perpetuate the debate. So I'm going to make some final points, and then stop. This is based on having read the various emails responding to my earlier comments. If it looks like I haven't read something, please assume I have but either you didn't get your point across, or maybe I simply don't agree with you.

Why now?
--------
First of all, the big question for me is why now? The random module has been around in its current form for many, many years. Security issues are not new, maybe they are slowly increasing, but there's been no step change. The only thing that seems to have changed is that someone (Theo) has drawn attention to the random module. So I feel that the onus is on the people proposing change to address that. Show me the evidence that we've had an actual problem for many years, and demonstrate that it's a good job we spotted it at last, and now have a chance to fix it. Explain to me what has been going wrong all these years that I'd never even noticed. Arguments that people are misusing the module aren't sufficient in themselves - they've (presumably) been doing that for years. In all that time, who was hacked? Who lost data? As a result of random.random being a PRNG rather than being crypto-secure? I'm not asking for an unassailable argument, just acknowledgement that it's *your* job to address that question, and not mine to persuade you that "we've been alright so far" is a compelling reason to reject your proposal.

Incorrect code on SO etc
------------------------
As regards people picking up insecure code snippets from the internet and using them, there's no news there. I can look round and find hundreds of bits of incorrect code in any area you want. People copy/paste garbage code all the time. To my embarrassment, I've done it myself in the past :-( But I'm reminded of https://xkcd.com/386/ - "somebody is wrong on the internet!" This proposal, and in particular the suggestion that we need to retrospectively make the code snippets quoted here secure, strikes me as a huge exercise in trying to correct all the people who are wrong on the internet. There's certainly value in "safe by default" APIs, I don't disagree with that, but I honestly fail to see how quoting incorrect code off the internet is a compelling argument for anything.

Millions of users are affected
------------------------------
The numbers game is also a frustrating exercise here. We keep hearing that "millions of users are affected by bad code", that scans of Google almost immediately find sites with vulnerabilities. But I don't see anyone pointing at a single documented case of an actual exploit caused by Python's random module. There's no bug report. There's no security alert notification. How are those millions of users affected? Their level of risk is increased? Who can tell that? Are any of the sites identified holding personal data? Not all websites on the internet are *worth* hacking. And I feel that expressing that view is somehow frowned on. That "it doesn't matter" is an unacceptable view to hold. And so, the responses to my questions feel personal, they feel like criticisms of me personally, that I'm being unprofessional. I don't want to make this a big deal, but the code of conduct says "we're tactful when approaching differing views", and it really doesn't feel like that. I understand that the whole security thing is a numbers game. And that it's about assessing risk.
But what risk is enough to trigger a response? A 10% increased chance of any given website being hacked? 5%? 1%? Again, I'm not asking to use the information to veto a change. I'm asking to *understand your position*. To better assess your arguments, so that I can be open to persuasion, and to *agree* with you, if your arguments are sound. Furthermore, should we not take into account other languages and approaches at this point? Isn't PHP a well-known "soft target"? Isn't phishing and social engineering the best approach to hacking these days, rather than cracking RNGs? I don't know, and I look to security experts for advice here. So please explain for me, how are you assessing the risks, and why do you judge this specific risk high enough to warrant a response? The impression I get is that the security view is that *any* risk, no matter how small, once identified, warrants a response. "Do nothing" is never an option. If that's your position, then I'm sorry, but I simply don't agree with you. I don't want to live in a world that paranoid, and I'm unsure how to get past this point to have a meaningful dialog.

History, and security's "bad rep"
---------------------------------
Donald asked if I was experiencing some level of spill-over from distutils-sig, where there has *also* been a lot of security churn (far more than here). Yes, I am. No doubt about that. On distutils-sig, and pip in particular, it's clear to see a lot of frustration from users with the long-running series of security changes. The tone of bug reports is frustrated and annoyed. Users want a break from being forced to make changes. Outside of Python, and speaking purely from my own experience in the corporate world, security is pretty uniformly seen as an annoying overhead, and a block on actually getting the job done. You can dismiss that as misguided, but it's a fact. "We need to do this for security" is a direct challenge to people to dismiss it as unnecessary, and often to immediately start looking for ways to bypass the requirement "so that it doesn't get in the way". I try not to take that attitude in this sort of debate, but at the same time, I do try to *represent* that view and ask for help in addressing it. The level of change in core Python is far less than on distutils-sig, and has been relatively isolated from "non-web" areas. People understand (and are grateful for) increases in "secure by default" behaviour in code like urllib and ssl. They know that these are places where security is important, where getting it right is harder than you'd think, and where trusting experts to do the hard thinking for you is important. But things like hash randomisation and the random module are less obviously security related. The feedback from hash randomisation focused on "why did you break my code?". It wasn't a big deal, people were relying on undocumented behaviour and accepted that, but they did see it as a breakage from a security fix. I expect the same to be true with the random module, but with the added dimension that we're proposing changing documented behaviour this time. As a result of similar arguments applying to every security change, and those arguments never *really* seeming to satisfy people, there's a lot of reiterated debate. And that's driving interested but non-expert people away from contributing to the discussion. So we end up with a lack of checks and balances because people without a vested interest in tightening security "tune out" of the debates. I see that as a problem.
But ultimately, if we can't find a better way of running these discussions, I don't know how we fix it. I certainly can't continue being devil's advocate every time. Anyway, that's me done on this thread. I hope I've added more benefit than cost to the discussion. Thanks to everyone for responding to my questions - even if we all felt like we were both just repeating the same thing, it's a lot of effort doing so and I appreciate your time. Paul

Paul Moore <p.f.moore@...> writes: [snip well-reasoned paragraphs] I want to add that the dichotomy between "security-minded" and "non-security-minded" that has been used for rhetorical purposes has no basis in reality. Several "non-security-minded" devs (of the kind who have *actually* contributed a lot of code to CPython) have a pretty good grasp of cryptography and just don't like security theater. Stefan Krah

On 15 September 2015 at 13:08, Stefan Krah <skrah@bytereef.org> wrote:
Agreed, and every time I ended up looking for words for the two "sides", I ended up feeling uncomfortable. There are no "sides" here, just a variety of people with a variety of experiences, who want to feel assured that their voices are being heard. Paul

On September 15, 2015 at 7:04:52 AM, Paul Moore (p.f.moore@gmail.com) wrote:
The answer to "Why Now?"" is basically because someone brought it up. I realize that's a pretty arbitrary thing but I'm not sure what answer would even be acceptable here. When is an OK time to do it in your eye? Is it only after there is a public, known attack against the RNG? Is it only when the module is first being added? The sad state of affairs is that it's only been relatively recently that our industry as a whole has really taken security seriously so there is a lot of things out there that are not well designed from a security POV. We can't go back in time and change the original mistake, but we can repair it going into the future.
The argument is basically that security is an important part of API design, and that if you look at what people are doing in practice, it gives you an idea of how people think they should use the API. It's kind of like looking at a situation like this: https://i.imgur.com/0gnb7Us.jpg and concluding that maybe we should pave that worn down footpath, because people are going to use it anyways.
So a big part of this is certainly preventative. It's a relatively recent development that hacking went from individuals or small teams going after big targets to being a business in its own right. There are literally giant office complexes in places like Russia and China filled with employees in cubicles, but they aren't writing software like at a normal company, they are just trawling around the internet, looking for targets, trying to expand botnets, looking for anything and everything they can get their hands on. It's also true that there isn't going to be a big fanfare for *most* actual hacked computers/sites. Most of the time the people running the site simply won't ever know; they'll just be silently hosting malware or having their users' passwords fed into other sites. Very few exploits actually get noticed, and when noticed it's unlikely they get public attention. I'd also suggest that for changes like these, if someone was exploited by this they'd probably look at the documentation for random.py, see that they were accidentally using the module wrong, and then blame themselves and not ever bother to file a bug report. It is my opinion that it's not really their fault that the API led them to believe that what they were doing was right.
Actually, all sites on the internet *are* worth hacking, depending on what you call hacking. Malware is constantly being hosted on tiny sites that most wouldn't call "worth" hacking, but malware authors were able to break in somehow and upload their malware there. If there are user logins, it's likely that people reused usernames and passwords, so if you can get the passwords from one smaller site, it's possible you can use that as a door into a larger, more important site. Plus, there's also the desire for botnets to add more and more nodes into their swarm; they don't care what site you're hosting, they just want the machine. One key problem for the security of the internet as a whole is that there are a lot of small sites without dedicated security teams, or anyone who really knows security at all. These are easy targets, and most languages and libraries make it far too easy for people to do the wrong thing.
It's basically a gut feeling, since we can't get any hard data here. Things like being able to look online and find code in the wild that does this wrong within minutes give us an idea of how likely it is, as does reasoning about what people who don't know the difference between ``random.random()`` and ``random.SystemRandom().random()`` are likely to do, plus a little bit of guessing based on experience with similar situations. Another input into this equation is how likely it is that this change would break someone and, once broken, how easy it will be to fix things. I sadly can't give anything more specific than that here, because it's a bit of an art form crossed with personal biases :(
Doing nothing is absolutely an option, but most security focused folks don't take a scorched earth view of security, so we often don't bother to even mention a possible change unless we think that doing nothing is the wrong answer. An example, going back to PEP 476 where we enabled TLS verification by default on HTTPS: we limited it to *only* HTTPS, even though TLS is used by many other protocols, because it was our opinion that doing nothing for those protocols was the right call. Those protocols are still insecure by default, but doing something about that by default would break too much for us to be willing to even suggest it. On top of that, we tend to prioritize the things we do try to have happen, so we focus on things with the smallest fallout or the biggest upsides and we ignore other things until later. This is probably why there's some bias making it look like doing nothing is never an option: we already self-select what we choose to push forward, because we *do* care about backwards compatibility too.
I think a lot of these changes are paying down the technical debt of two decades of (industry standard) lack of focus on security. It sucks, but when we come out the other side (because hopefully, new APIs and modules will be better designed with security in mind given our new landscape) we should hopefully be in a much better situation. On the distutils-sig side, I think that PEP 470 was the last breaking change that I can think of that we'll need to do in the name of security; we've paid down that particular bit of technical debt, and once that lands we'll have a pretty decent story. We still have other kinds of technical debt to pay down though :(
Things don't really satisfy people because they often fundamentally don't care about security. That is perfectly reasonable, and I don't expect everyone to care about security, but they simply don't. However, in my opinion we have a moral obligation to try and do what we reasonably can to protect people. It's a bit like social safety nets: one person might ask why they are being asked to pay taxes when, after all, they never needed government assistance, but by asking every citizen to pay in, we can try to keep people from falling through the cracks. This isn't a social safety net, it's a security safety net.
----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On 15 September 2015 at 01:01, Paul Moore <p.f.moore@gmail.com> wrote:
This may be at the core of the disagreement, as we're not talking "one or two people", we're talking tens of millions. While wearing my "PSF Director" hat, I spend a lot of time talking to professional educators, and recently organised the first "Python in Education" miniconf at PyCon Australia. If you look at the inroads we're making across primary, secondary and tertiary education, as well as through workshops like Software Carpentry and DjangoGirls, a *lot* of people around the world are going to be introduced to text based programming over the coming decades by way of Python. That level of success brings with it a commensurate level of responsibility: if we're setting those students up for future security failures, that's *on us* as language designers, not on them for failing to learn to avoid traps we've accidentally laid for them (because *we* previously didn't know any better). Switching back to my "security wonk" hat, the historical approach to computer security has been "secure settings are opt in, so only qualified experts should be allowed to write security sensitive software". What we've learned as an industry (the hard way) is that this approach *doesn't work*. The main reason it doesn't work is the one that was part of the rationale for the HTTPS changes in PEP 476: when security failures are silent by default, you generally don't find out that you forgot to flip the "I need this to be secure" switch until *after* the system you're responsible for has been compromised (with whatever consequences that may have for your users). The law of large numbers then tells us that even if (for example) only 1 in 1000 people forget to flip the "be secure" switch when they needed it (or don't even know that the switch *exists*), it's a practical certainty that when you have millions of programmers using your language (and you don't climb to near the top of the IEEE rankings without that), you're going to be hitting that failure mode regularly as a collective group. We have the power to mitigate that harm permanently *just by changing the default behaviour of the random module*. However, that has a cost: it causes problems for some current users for the sake of better serving future users. That's what transition strategy design is about, and I'll take that up in the other thread. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On September 10, 2015 at 10:21:11 AM, Donald Stufft (donald@stufft.io) wrote:
I wanted to try and test this. These are not super scientific since I just ran them on a single computer once (but 10 million iterations each) but I think it can probably give us an indication of the differences? I put the code up at https://github.com/dstufft/randtest but it's a pretty simple module. I'm not sure if (double)arc4random() / UINT_MAX is a reasonable way to get a double out of arc4random (which returns a uint) that is between 0.0 and 1.0, but I assume it's fine at least for this test. Here's the results from running the test on my personal computer which is running the OSX El Capitan public Beta:

$ python test.py
Number of Calls: 10000000
+---------------+--------------------+
|    method     |   usecs per call   |
+---------------+--------------------+
| deterministic | 0.0586802460020408 |
| system        | 1.6681434757076203 |
| userland      | 0.1534261149005033 |
+---------------+--------------------+

I'll try it against OpenBSD later to see if their implementation of arc4random is faster than OSX.

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
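[Editorial note: for readers who want a rough feel for this comparison without building the C extension in the linked repo, here is a minimal, stdlib-only timing sketch. It only covers the "deterministic" and "system" rows, since a userland CSPRNG like arc4random isn't exposed by the standard library, and the absolute numbers will of course vary by machine.]

import random
import timeit

N = 1000000  # fewer iterations than the 10 million above, for a quick run

det = random.Random()            # MT-based "deterministic" generator
sysrng = random.SystemRandom()   # os.urandom-backed "system" generator

det_usecs = timeit.timeit(det.random, number=N) / N * 1e6
sys_usecs = timeit.timeit(sysrng.random, number=N) / N * 1e6

print("deterministic: %.4f usecs per call" % det_usecs)
print("system:        %.4f usecs per call" % sys_usecs)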

[Donald Stufft <donald@stufft.io>, on arc4random speed]
arc4random() specifically returns uint32_t, which is 21 bits shy of what's needed to generate a reasonable random double. Our MT wrapping internally generates two 32-bit uint32_t thingies, and pastes them together like so (Python's C code here):

"""
/* random_random is the function named genrand_res53 in the original code;
 * generates a random number on [0,1) with 53-bit resolution; note that
 * 9007199254740992 == 2**53; I assume they're spelling "/2**53" as
 * multiply-by-reciprocal in the (likely vain) hope that the compiler will
 * optimize the division away at compile-time.  67108864 is 2**26.  In
 * effect, a contains 27 random bits shifted left 26, and b fills in the
 * lower 26 bits of the 53-bit numerator.
 * The original code credited Isaku Wada for this algorithm, 2002/01/09.
 */
static PyObject *
random_random(RandomObject *self)
{
    PY_UINT32_T a=genrand_int32(self)>>5, b=genrand_int32(self)>>6;
    return PyFloat_FromDouble((a*67108864.0+b)*(1.0/9007199254740992.0));
}
"""

So now you know how to make it more directly comparable. The high-order bit is that it requires 2 calls to the 32-bit uint integer primitive to get a double, and that can indeed be significant.
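[Editorial note: here is the same 53-bit construction rendered in Python for readers who don't want to parse the C, using getrandbits(32) as a stand-in for genrand_int32. It's an illustrative sketch, not CPython's actual implementation.]

import random

def mt_style_double(rng=random):
    # Same construction as genrand_res53 above: 27 high bits plus 26 low bits
    # pasted into a 53-bit numerator, then scaled by 1/2**53.
    a = rng.getrandbits(32) >> 5   # keep 27 bits
    b = rng.getrandbits(32) >> 6   # keep 26 bits
    return (a * 67108864.0 + b) * (1.0 / 9007199254740992.0)

print(mt_style_double())  # a float in [0.0, 1.0), built from two 32-bit draws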
Just noting that most people timing the OpenBSD version seem to comment out the "get stuff from the kernel periodically" part first, in order to time the algorithm instead of the kernel ;-) In real life, though, they both count, so I like what you're doing better.

On September 10, 2015 at 1:24:05 PM, Tim Peters (tim.peters@gmail.com) wrote:
It didn’t change the results really though:

My OSX El Capitan machine:

Number of Calls: 10000000
+---------------+---------------------+
|    method     |   usecs per call    |
+---------------+---------------------+
| deterministic | 0.05792283279588446 |
| system        | 1.7192466521984897  |
| userland      | 0.17901834140066059 |
+---------------+---------------------+

An OpenBSD 5.7 VM:

Number of Calls: 10000000
+---------------+---------------------+
|    method     |   usecs per call    |
+---------------+---------------------+
| deterministic | 0.06555143180000868 |
| system        | 0.8929547749999983  |
| userland      | 0.16291017429998647 |
+---------------+---------------------+

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On Sep 10, 2015, at 07:21, Donald Stufft <donald@stufft.io> wrote:
But that isn't a fix, unless all your code is in a single module. If I call random.seed in game.py and then call random.choice in aiplayer.py, I'll get different results after your fix than I did before. What I'd need to do instead is create a separate myrandom.py that does this and then exports all of the bound methods of random as top-level functions, and then make game.py, aiplayer.py, etc. all import myrandom as random. Which is, while not exactly hard, certainly harder, and much less obvious, than the incorrect fix that you've suggested, and it may not be immediately obvious that it's wrong until someone files a bug three versions later claiming that when he reloads a game the AI cheats and you have to track through the problem. That's why I suggested the set_default_instance function, which makes this problem trivial to solve in a correct way instead of in an incorrect way.
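[Editorial note: to make the workaround Andrew describes concrete, here is a minimal sketch of what such a myrandom.py might look like; the module name and the exact set of re-exported functions are just illustrative. game.py and aiplayer.py would then both do "import myrandom as random", so a seed set in one module is visible to the other.]

# myrandom.py -- hypothetical shared-instance workaround (illustrative only)
import random as _random

_inst = _random.Random()  # one explicitly managed instance shared by the whole program

# Re-export the bound methods as module-level names so callers can keep
# writing "random.choice(...)" after "import myrandom as random".
seed = _inst.seed
random = _inst.random
randint = _inst.randint
choice = _inst.choice
shuffle = _inst.shuffle
getstate = _inst.getstate
setstate = _inst.setstate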

On Sep 10, 2015, at 15:46, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
Actually, I just thought of an even simpler solution: Add a deterministic_singleton member to random (which is just initialized to DeterministicRandom() at startup). Now, the user fix is just to change "import random" to "from random import deterministic_singleton as random".

On 11 September 2015 at 08:54, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
Actually, I just thought of an even simpler solution:
Add a deterministic_singleton member to random (which is just initialized to DeterministicRandom() at startup). Now, the user fix is just to change "import random" to "from random import deterministic_singleton as random".
Change the spelling to "import random.seeded_random as random" and the user fix is even shorter. I do agree with the idea of continuing to provide a process global instance of the current PRNG for ease of migration - changing a single import is a good way to be able to address a deprecation, and looking for the use of seeded_random in a security sensitive context would still be fairly straightforward. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sep 10, 2015, at 19:48, Nick Coghlan <ncoghlan@gmail.com> wrote:
OK, sure; I don't care much about the spelling. I think neither name will be unduly confusing to novices, and anyone who actually wants to understand what the choice means will use help or the docs or a Google search and find out in a few seconds.
Personally, I think we're done with that change. Deprecation of the names random.Random, random.random(), etc. is sufficient to prevent people from making mistakes without realizing it. Having a good workaround to prevent code churn for the thousands of affected apps means the cost doesn't outweigh the benefits. So, the problem Theo raised is solved.[1] Which means the more radical solution he offered is unnecessary. Unless we're seriously worried that some people who aren't sure if they need Seeded or System may incorrectly choose Seeded just because of performance, there's no need to add a ChaCha choice alongside them. Put it on PyPI, maybe with a link from the SystemRandom docs, and see how things go from there. [1] Well, it's not quite solved, because someone has to figure out how to organize things in the docs, which obviously need to change. Do we tell people how to choose between creating a SeededRandom or SystemRandom instance, then describe their interface, and then include a brief note "... but for porting old code, or when you explicitly need a globally shared Seeded instance, use seeded_random"? Or do we present all three as equally valid choices, and try to explain why you might want the singleton seeded_random vs. constructing and managing an instance or instances?

On 11 September 2015 at 13:18, Andrew Barnert <abarnert@yahoo.com> wrote:
Personally, I think we're done with that change. Deprecation of the names random.Random, random.random(), etc. is sufficient to prevent people from making mistakes without realizing it.
Implementing dice rolling or number guessing for a game as "from random import randint" is *not* a mistake, and I'm adamantly opposed to any proposal that makes it one - the cost imposed on educational use cases would be far too high. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan writes:
Implementing dice rolling or number guessing for a game as "from random import randint" is *not* a mistake,
Turning the number guessing game into a text CAPTCHA might be one, though. That randint may as well be crypto strong, modulo the problem that people who use an explicit seed get punished for knowing what they're doing. I suppose it would be too magic to have the seed method substitute the traditional PRNG for the default, while an implicitly seeded RNG defaults to a crypto strong algorithm? Steve

On Fri, Sep 11, 2015 at 2:44 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Ooh. Actually, I rather like that idea. If you don't seed the RNG, its output will be unpredictable; it doesn't matter whether it's a PRNG seeded by an unknown number, a PRNG seeded by /dev/urandom, a CSRNG, or just reading from /dev/urandom every time. Until you explicitly request determinism, you don't have it. If Python changes its RNG algorithm and you haven't been seeding it, would you even know? Could it ever matter to you? It would require a bit of an internals change; is it possible that code depends on random.seed and random.randint are bound methods of the same object? To implement what you describe, they'd probably have to not be. ChrisA

2015-09-11 6:54 GMT+02:00 Chris Angelico <rosuav@gmail.com>:
I have thought of this idea and was quite seduced by it. However, in this case, on a non-seeded generator, getstate/setstate would be meaningless. I also wonder what pickling generators does.

On Fri, Sep 11, 2015 at 6:54 AM, Chris Angelico <rosuav@gmail.com> wrote:
I've also thought about this idea. The problem with it is that seed() and friends affect a global instance of Random. If, after this change, there was a library that used random.random() for crypto, calling seed() in the main program (or any other library) would make it insecure. So we'd still be in a situation where nobody should use random() for crypto.

On Fri, Sep 11, 2015 at 6:08 PM, Petr Viktorin <encukou@gmail.com> wrote:
So library functions shouldn't use random.random() for anything they know needs security. If you write a function generate_password(), the responsibility is yours to ensure that it's entropic rather than deterministic. That's no different from the current situation (seeding the RNG makes it deterministic) except that the unseeded RNG is not just harder to predict, it's actually entropic. In some cases, having the 99% by default is a barrier to people who need the 100%. (Conflating UCS-2 with Unicode deceives people into thinking their program works just fine, and then it fails on astral characters.) But in this case, there's no perfect-by-default solution, so IMO the best two solutions are: Be great, but vulnerable to an external seed(), until someone chooses; or have no random number generation until someone chooses. We know that the latter is a terrible option for learning, so vulnerability to someone else calling random.seed() is a small price to pay. ChrisA

On Fri, Sep 11, 2015, at 00:54, Chris Angelico wrote:
That's a ridiculous thing to depend on.
To implement what you describe, they'd probably have to not be.
You could implement one class that calls either a SystemRandom instance or an instance of another class depending on which mode it is in.
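[Editorial note: a minimal sketch of that dispatching class, with an invented name; this illustrates the mode-switching idea discussed above rather than proposing an actual implementation.]

import random

class AutoRandom:
    """Crypto-strong until seed() is explicitly called, deterministic MT after."""

    def __init__(self):
        self._impl = random.SystemRandom()   # unseeded: unpredictable output

    def seed(self, a=None):
        self._impl = random.Random(a)        # explicit seed: reproducible MT stream

    def __getattr__(self, name):
        # Delegate random(), randint(), choice(), etc. to whichever mode is active.
        return getattr(self._impl, name)

rng = AutoRandom()
print(rng.random())   # unpredictable
rng.seed(42)
print(rng.random())   # reproducible given seed 42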

On 11 September 2015 at 05:44, Stephen J. Turnbull <stephen@xemacs.org> wrote:
One issue with that - often, programs simply use a RNG for their own purposes, but offer a means of getting the seed after the fact for reproducibility reasons (the "map seed" case, for example). Pseudo-code:

if <user supplied a "seed">:
    state = <user-supplied value>
    random.setstate(state)
else:
    state = random.getstate()

... do the program's main job, never calling seed/setstate ...

if <user requests the "seed">:
    print state

So getstate (and setstate) would also need to switch to a PRNG. There are actually very few cases I can think of where I'd need seed() (as opposed to setstate()). Maybe if I let the user *choose* a seed - some games do this. Paul

On Fri, Sep 11, 2015 at 1:02 AM, Paul Moore <p.f.moore@gmail.com> wrote:
You don't really want to use the full 4992 byte state for a "map seed" application anyway (type 'random.getstate()' in a REPL and watch your terminal scroll down multiple pages...). No game actually uses map seeds that look anything like that. I'm 99% sure that real applications in this category are actually using logic like:

if <user supplied a "seed">:
    seed = user_seed()
else:
    # use some RNG that was seeded with real entropy
    seed = random_short_printable_string()
r = random.Random(seed)
# now use 'r' to generate the map

-n

--
Nathaniel J. Smith -- http://vorpus.org
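[Editorial note: a runnable rendition of the pattern Nathaniel sketches; the helper name and the 8-character seed format are invented for illustration.]

import random
import string

def make_map_rng(user_seed=None):
    # If the player didn't supply a seed, mint a short shareable one from a
    # generator seeded with real entropy, then build the map RNG from it.
    if user_seed is None:
        sysrng = random.SystemRandom()
        alphabet = string.ascii_uppercase + string.digits
        user_seed = ''.join(sysrng.choice(alphabet) for _ in range(8))
    return user_seed, random.Random(user_seed)

seed, rng = make_map_rng()
print("map seed: %s" % seed)           # short and publishable, e.g. "7QX2B9KD"
print("first roll: %d" % rng.randint(0, 99))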

On 11 September 2015 at 10:52, Nathaniel Smith <njs@pobox.com> wrote:
Yeah, good point. As I say, I don't actually *use* this in the example program I'm thinking of, I just know it's a feature I need to add in due course. So when I do, I'll have to look into how to best implement it. (And I'll probably nick the approach you show above, thanks ;-)) Paul

On 11 September 2015 at 11:07, Andrew Barnert <abarnert@yahoo.com> wrote:
But games do store the entire map state with saved games if they want repeatable saves (e.g., to prevent players from defeating the RNG by save scumming).
So far off-topic it's not true, but a number of games I know of (e.g., Factorio, Minecraft) include a means to get a map seed (a simple text string) which you can publish, that allows other users to (in effect) play on the same map as you. That's different from saves. Paul

On Fri, Sep 11, 2015, at 06:10, Paul Moore wrote:
Of course, Minecraft doesn't actually use the seed in such a simple way as seeding a single-sequence random number generator. If it did, the map would depend on what order you visited regions in. (This is less of an issue for games with finite worlds)

On 10 September 2015 at 23:46, Andrew Barnert <abarnert@yahoo.com> wrote:
Note that this is another case of wanting "correct by default". Requiring the user to pass around a RNG object makes it easy to do the wrong thing - because (as above) people can too easily create multiple independent RNGs by mistake, which means your numbers don't necessarily satisfy the randomness criteria any more. "Secure by default" isn't (and shouldn't be) the only example of "correct by default" that matters here. Whether "secure" is more important than "gives the right results" is a matter of opinion, and application dependent. Password generators have more need to be secure than to be mathematically random, Monte Carlo simulations (and to a lesser extent games) the other way around. Many things care about neither. If we can't manage "correct and secure by default", someone (and it won't be me) has to decide which end of the scale gets preference. Paul.

On Fri, Sep 11, 2015 at 1:11 AM, Paul Moore <p.f.moore@gmail.com> wrote:
Accidentally creating multiple independent RNGs is not going to cause any problems with respect to randomness. It only creates a problem with respect to determinism/reproducibility. Beyond that I just find your message a bit baffling. I guess I believe you that you find passing around RNG objects to make it easy to do the wrong thing, but it's exactly the opposite of my experience: when writing code that cares about determinism/reproducibility, then for me, passing around RNG objects makes it way *easier* to get things right. It makes it much more obvious what kinds of refactoring will break reproducibility, and it enables all kinds of useful tricks. E.g., keeping to the example of games and "aiplayer.py", a common thing game designers want to do is to record playthroughs so they can be replayed again as demos or whatever. And a common way to do that is to (1) record the player's inputs, (2) make sure that the way the game state evolves through time is deterministic given the players inputs. (This isn't necessarily the *best* strategy, but it is a common one.) Now suppose we're writing a game like this, and we have a bunch of "enemies", each of whose behavior is partially random. So on each "tick" we have to iterate through each enemy and update its state. If we are using a single global RNG, then for correctness it becomes crucial that we always iterate over all enemies in exactly the same order. Which is a mess. A better strategy is, keep one global RNG for the level, but then when each new enemy is spawned, assign it its own RNG that will be used to determine its actions, and seed this RNG using a value sampled from the global RNG (!). Now the overall pattern of the game will be just as random, still be deterministic, and -- crucially -- it no longer matters what order we iterate over the enemies in. I particularly would not want to use the global RNG in any program that was complicated enough to involve multiple modules. Passing state between inter-module calls using a global variable is pretty much always a bad plan, and that's exactly what you're talking about here. Non-deterministic global RNGs are fine, b/c they're semantically stateless; it's exactly the cases where you care about the determinism of the RNG state that you want to *stop* using the global RNG. -n -- Nathaniel J. Smith -- http://vorpus.org
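[Editorial note: a toy sketch of the spawn-time seeding trick Nathaniel describes; the class and method names are invented for illustration.]

import random

class Enemy:
    def __init__(self, level_rng):
        # Give each enemy its own RNG, seeded from the level RNG at spawn time,
        # so behaviour stays deterministic no matter what order enemies update in.
        self.rng = random.Random(level_rng.getrandbits(64))

    def tick(self):
        return self.rng.choice(["attack", "wander", "idle"])

level_rng = random.Random("map-seed-1234")    # one seeded RNG per level
enemies = [Enemy(level_rng) for _ in range(3)]
# Iterating in reverse (or any order) gives the same per-enemy decisions for a given seed.
for enemy in reversed(enemies):
    print(enemy.tick())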

On Fri, Sep 11, 2015 at 8:26 PM, Nathaniel Smith <njs@pobox.com> wrote:
As long as the order you seed their RNGs is deterministic. And if you can do that, can't you iterate over them in a deterministic order too? ChrisA

On Thu, Sep 10, 2015 at 09:10:09AM -0400, Donald Stufft wrote:
Ironically, the spelling mistake in your example is a good example of how this is worse. Another reason why it's worse is that if you create a new instance every single time you need a random number, as you do above, performance is definitely going to suffer. By my timings, creating a new SystemRandom instance each time is around two times slower; creating a new DeterministicRandom (i.e. the current MT default) instance each time is over 100 times slower. Hypothetically, it may even hurt your randomness: it may be that some future (or current) (C)PRNG's quality will be "less random" (biased, predictable, or correlated) because you keep using a fresh instance rather than the same one. TL;DR: Yes, calling `random.choice` is *significantly better* than calling `random.SomethingRandom().choice`. It's better for beginners, it's even better for expert users whose random needs are small, and those whose needs are greater shouldn't be using the latter anyway.
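[Editorial note: a quick way to sanity-check the instance-creation overhead Steven mentions, using only the stdlib; the exact ratios will differ between machines and Python versions.]

import random
import timeit

N = 100000
shared = random.SystemRandom()

reuse   = timeit.timeit(lambda: shared.choice("ABCDEFGH"), number=N)
per_sys = timeit.timeit(lambda: random.SystemRandom().choice("ABCDEFGH"), number=N)
per_mt  = timeit.timeit(lambda: random.Random().choice("ABCDEFGH"), number=N)

print("reuse one SystemRandom:    %.3f s" % reuse)
print("new SystemRandom per call: %.3f s" % per_sys)
print("new Random (MT) per call:  %.3f s" % per_mt)  # fresh MT seeding dominates here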
Is this a trick question? In the absense of a keylogger and screen reader monitoring my system while I run that code snippet, of course it is safe. In the absence of any credible attack on the password based on how it was generated, of course it is safe.
Nobody is saying that To put that question another way: "If you exclude the case where crypto would
This might be acceptable, although I wouldn't necessarily deprecate the random module.

On 11 September 2015 at 14:36, Steven D'Aprano <steve@pearwood.info> wrote:
I feel like I must have misunderstood you Steven. Didn't you just exclude the attack vector that we're discussing here? What we are saying is that a deterministic PRNG definitionally allows attacks on the password based on how it was generated. The very nature of a deterministic PRNG is that it is possible to predict subsequent outputs based on previous ones, or at least to dramatically constrain the search space. This is not a hypothetical attack, and it's not even a very complicated one. Now, it's possible that the way the system is constructed precludes this attack, but let me tell you that vastly more engineers think that about their systems than are actually right about it. Generally, if the word 'password' appears anywhere near something, you want to keep a Mersenne Twister as far away from it as possible. The concern being highlighted in this thread is that users who don't know what I just said (the vast majority) are at risk of writing deeply insecure code. We think the default should be changed.

On Sat, Sep 12, 2015 at 12:28 AM, Cory Benfield <cory@lukasa.co.uk> wrote:
Only if an attacker can access many passwords generated from the same MT stream, right? If the entire program is as was posted (importing random and using random.choice(), then terminating), then an attack would have to be based on the seeding of the RNG, not on the RNG itself. There simply isn't enough content being generated for you to be able to learn the internal state, and even if you did, the next run of the program will be freshly seeded anyway. ChrisA

On 11 September 2015 at 15:33, Chris Angelico <rosuav@gmail.com> wrote:
Sure, if the entire program is as posted, but we should probably assume it isn't. Some programs definitely are, but I'm not worried about them: I'm worried about the ones that aren't.

On September 11, 2015 at 10:33:55 AM, Chris Angelico (rosuav@gmail.com) wrote:
This is silly: take that code, stick it in a web application and have it generate API keys or session identifiers instead of passwords, or hell, even passwords or random tokens to reset passwords or any other such thing. Suddenly you have a case where you have a persistent process, so there isn't a new seed, and the attacker can more or less request an unlimited number of outputs. This isn't some mind-bogglingly uncommon case. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

Ah crap. Sorry folks, this post was *not supposed to go to the list* in this state. I'm having some trouble with my mail client (mutt) not saving drafts, so I intended to email it to myself for later editing, and didn't notice that the list was CCed. On Fri, Sep 11, 2015 at 11:36:13PM +1000, Steven D'Aprano wrote: [...] -- Steve

On Thu, Sep 10, 2015, at 08:29, Paul Moore wrote:
I don't understand why. What other word would you use to describe a generator that can be given a specific set of inputs to generate the same exact sequence of numbers every single time? If you want that feature, then you're not going to think "deterministic" means "not good enough". And if you don't want it, you, well, don't want it, so there's really no harm in the fact that you don't choose it. Personally, though, I don't see why we're not talking about calling it MersenneTwister.

On Thu, Sep 10, 2015 at 8:13 AM, <random832@fastmail.us> wrote:
Because while we want to reduce foot guns, we don't want to reduce usability. DeterministicRandom is fairly easy for anyone to understand. I would venture a guess that most people looking for that wouldn't know (or care) what the backing algorithm is. Further, if we stop using Mersenne Twister in the future, we would have to remove that class name. DeterministicRandom can be agnostic of the underlying algorithm and is friendlier to people who don't need to know or care about what algorithm is generating the numbers; they only need to understand the properties of that generator.

On Thu, Sep 10, 2015, at 09:44, Ian Cordasco wrote:
If we're serious about being deterministic, then we should keep that class under that name and provide a new class for the new algorithm. What's the point of having a deterministic algorithm if you can't reproduce your results in the new version because the algorithm was deleted?

On Thu, Sep 10, 2015 at 8:55 AM, <random832@fastmail.us> wrote:
This is totally off topic. That said as a counter-point: What's the point of carrying around code you don't want people to use if they're just going to use it anyway?

On Sep 10, 2015 5:29 AM, "Paul Moore" <p.f.moore@gmail.com> wrote: [...]
Regarding the "harder to use" point (which is obviously just one of many considerations in this while debate): I trained myself a few years ago to stop using the global random functions and instead always pass around an explicit RNG object, and my experience is that once I got into the habit it gave me a strict improvement in code quality. Suddenly many more of my functions are deterministic ... well ... functions ... of their inputs, and suddenly it's clearly marked in the source which ones have randomness in their semantics, and suddenly it's much easier to do things like refactor the code while preserving the output for a given seed. (This is tricky because just changing the order in which you do things can break your code. I wince in sympathy at people who have to maintain code like your map-generation-from-a-seed example and *aren't* using RNG objects explicitly.) The implicit global RNG is a piece of global state, like global variables, and causes similar unpleasantness. Now that I don't use it, I look back and it's like "huh, why did I always used to hit myself in the face like that? That wasn't very pleasant." So this is what I teach my collaborators and students now. Most of them just use the global state by default because they don't even know about the OO option. YMMV but that's my experience FWIW. -n

Donald Stufft <donald@stufft.io> writes: ...
"security minded folks" [1] recommend "always use os.urandom()" and advise against *random* module [2,3] despite being aware of random.SystemRandom() [4] i.e., if they are right then *random* module probably only need to care about group #1 and avoid creating the false sense of security in group #3. [1] https://github.com/pyca/cryptography/blob/92d8bd12609586bfa53cf8c7a691e37474... [2] https://cryptography.io/en/latest/random-numbers/ [3] https://github.com/pyca/cryptography/blob/92d8bd12609586bfa53cf8c7a691e37474... [4] https://github.com/pyca/cryptography/issues/2278

On September 10, 2015 at 2:08:46 PM, Akira Li (4kir4.1i@gmail.com) wrote:
Maybe you didn't notice you’re talking to the third name in the list of authors that you linked to, but that documentation is there primarily because the random module's API is problematic and it's easier to recommend people to not use it than to try and explain how to use it safely. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On Thu, Sep 10, 2015 at 9:19 PM, Donald Stufft <donald@stufft.io> wrote:
Obviously, I've noticed it but I didn't want to call you out. but that documentation is there primarily because the
"it's easier to recommend people to not use it than to try and explain how to use it safely." that is exactly the point if random.SystemRandom() is not safe to use while being based on "secure" os.urandom() then providing the same API based on (possibly less secure) arc4random() won't be any safer.

On September 10, 2015 at 2:40:54 PM, Akira Li (4kir4.1i@gmail.com) wrote:
"If the mountain won't come to Muhammad then Muhammad must go to the mountain." In other words, we can write all the documentation in the world we want, and it doesn't change the simple fact that by choosing a default, there is going to be some people who will use it when it's inappropiate due to the fact that it is the default. The pratical effect of changing the default will be that some cases are broken, but in a way that is obvious and trivial to fix, some cases won't have any pratical effect at all, and finally, for some people it's going to take code that was previously completely insecure and make it either secure or harder to exploit for people who are incorrectly using the API. I wouldn't expect the documentation in pyca/cryptography to change, it'd still recommend people to use os.urandom directly and we'd still recommend that people should use SystemRandom/os.urandom in the random.py docs for things that need to be cryptographically secure, this is just a safety net for people who don't know or didn't listen. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

Deprecating the module-level functions has one problem for backward compatibility: if you're using random across multiple modules, changing them all from this: import random ... to this: from random import DeterministicRandom random = DeterministicRandom() ... gives a separate MT for each module. You can work around that by, e.g., providing your own myrandom.py that does that and then using "from myrandom import random" everywhere, or by stashing a random_inst inside the random module or builtins or something and only creating it if it doesn't exist, etc., but all of these are things that people will rightly complain about. One possible solution is to make DeterministicRandom a module instead of a class, and move all the module-level functions there, so people can just change their import to "from random import DeterministicRandom as random". (Or, alternatively, give it classmethods that create a singleton just like the module global.) For people who decide they want to switch to SystemRandom, I don't think it's as much of a problem, as they probably won't care that they have a separate instance in each module. (And I don't think there's any security problem with using multiple instances, but I haven't thought it through...) So, the change is probably only needed in DeterministicRandom. There are hopefully better solutions than that. But I think some solution is needed. People who have existing code (or textbooks, etc.) that do things the "wrong" way and get a DeprecationWarning should be able to easily figure out how to make their code correct. Sent from my iPhone

Andrew Barnert via Python-ideas <python-ideas@python.org> writes:
Of course, this brings to mind the fact that there's *already* an instance stashed inside the random module. At that point, you might as well just keep the module-level functions, and rewrite them to be able to pick up on it if you replace _inst (perhaps suitably renamed as it would be a public variable) with an instance of a different class. Proof-of-concept implementation:

class _method:
    def __init__(self, name):
        self.__name__ = name
    def __call__(self, *args, **kwargs):
        return getattr(_inst, self.__name__)(*args, **kwargs)
    def __repr__(self):
        return "<random method wrapper " + repr(self.__name__) + ">"

_inst = Random()
seed = _method('seed')
random = _method('random')
...etc...

On Sep 9, 2015, at 18:25, Random832 <random832@fastmail.com> wrote:
The whole point is to make people using the top-level functions see a DeprecationWarning that leads them to make a choice between SystemRandom and DeterministicRandom. Just making inst public (and dynamically switchable) doesn't do that, so it doesn't solve anything. However, it seems like there's a way to extend it to do that: First, rename Random to DeterministicRandom. Then, add a subclass called Random that raises a DeprecationWarning whenever its methods are called. Then preinitialize inst to Random(), just as we already do. Existing code will work, but with a warning. And the text of that warning or the help it leads to or the obvious google result or whatever can just suggest "add random.inst = random.DeterministicRandom() or random.inst = random.SystemRandom() at the start of your program". That has most of the benefit of deprecating the top-level functions, without the cost of the solution being non-obvious (and the most obvious solution being wrong for some use cases). Of course it adds the cost of making the module slower, and also more complex. Maybe a better solution would be to add a random.set_default_instance function that replaced all of the top-level functions with bound methods of the instance (just like what's already done at startup in random.py)? That's simple, and doesn't slow down anything, and it seems like it makes it more clear what you're doing than setting random.inst.
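[Editorial note: a rough sketch of what the set_default_instance idea could look like, done from outside the module purely for illustration; the real patch linked later in the thread does this inside random.py, and the list of rebound function names here is just an example.]

import random

def set_default_instance(inst):
    # Rebind the module-level convenience functions to bound methods of the
    # chosen instance, mirroring what random.py already does at import time.
    for name in ("seed", "random", "uniform", "randint", "randrange",
                 "choice", "shuffle", "sample", "getrandbits",
                 "getstate", "setstate"):
        setattr(random, name, getattr(inst, name))

# Opting the whole program in to a crypto-strong default:
set_default_instance(random.SystemRandom())
print(random.choice("ABCDEF"))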

On Thu, Sep 10, 2015 at 11:50 AM, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
Of course it adds the cost of making the module slower, and also more complex. Maybe a better solution would be to add a random.set_default_instance function that replaced all of the top-level functions with bound methods of the instance (just like what's already done at startup in random.py)? That's simple, and doesn't slow down anything, and it seems like it makes it more clear what you're doing than setting random.inst.
+1. A single function call that replaces all the methods adds a minuscule constant to code size, run time, etc, and it's no less readable than assignment to a module attribute. (If anything, it makes it more clearly a supported operation - I've seen novices not realize that "module.xyz = foo" is valid, but nobody would misunderstand the validity of a function call.) ChrisA

On Sep 9, 2015, at 23:08, Chris Angelico <rosuav@gmail.com> wrote:
I was only half-serious about this, but now I think I like it: it provides exactly the fix people are hoping to achieve by deprecating the top-level functions, but with less risk, less user code churn, a smaller patch, and a much easier fix for novice users. (And it's much better than my earlier suggestion, too.) See https://gist.github.com/abarnert/e0fced7569e7d77f7464 for the patch, and a patched copy of random.py. The source comments in the patch should be enough to understand everything that's changed. A couple things: I'm not sure the normal deprecation path makes sense here. For a couple versions, everything continues to work (because most novices, the people we're trying to help, don't see DeprecationWarnings), and then suddenly their code breaks. Maybe making it a UserWarning makes more sense here? I made Random a synonym for UnsafeRandom (the class that warns and then passes through to DeterministicRandom). But is that really necessary? Someone who's explicitly using an instance of class Random rather than the top-level functions probably isn't someone who needs this warning, right? Also, if this is the way we'd want to go, the docs change would be a lot more substantial than the code change. I think the docs should be organized around choosing a random generator and using its methods, and only then mention set_default_instance as being useful for porting old code (and for making it easy for multiple modules to share a single generator, but that shouldn't be a common need for novices).

On Sep 10, 2015, at 01:32, Serhiy Storchaka <storchaka@gmail.com> wrote:
Well, the goal of the deprecation idea was to eventually get people to explicitly use instances, so the fact that doesn't work out of the box is a good thing, not a problem. But for people just trying to retrofit existing code, all they have to do is call random.set_default_instance at the top of the main module, and all their other modules can just import what they need this way. Which is why it's better than straightforward deprecation.

On Thu, Sep 10, 2015 at 04:08:09PM +1000, Chris Angelico wrote:
Making monkey-patching the official, recommended way to choose a PRNG is a risky solution, to put it mildly. That means that at any time, some other module that is directly or indirectly imported might change the random number generators you are using without your knowledge. You want a crypto PRNG, but some module replaces it with MT. Or vice versa. Technically, it is true that (this being Python) they can do this now, just by assigning to the random module: random.random = lambda: 9 but that is clearly abusive, and if you write code to do that, you're asking for whatever trouble you get. There's no official API to screw over other callers of the random module behind their back. You're suggesting that we add one.
(If anything, it makes it more clearly a supported operation
Which is exactly why this is a terrible idea. You're making monkey- patching not only officially supported, but encouraged. That will not end well. -- Steve

On Sep 11, 2015, at 06:49, Steven D'Aprano <steve@pearwood.info> wrote:
But that's not the proposal. The proposal is to make explicitly passing around an instance the official, recommended way to choose a PRNG; monkey-patching is only the official, recommended way to quickly get legacy code working: once you see the warning about the potential problem and decide that the problem doesn't affect you, you write one standard line of code at the top of your main script instead of rewriting all of your modules and patching or updating every third-party module you use. As I said later, I think my later suggestion of just having a singleton DeterministicRandom instance (or even a submodule with the same interface) that you can explicitly import in place of random serves the same needs well enough, and is even simpler, and is more flexible (in particular, it can also be used for novices' "my first game" programs), so I'm no longer suggesting this. But that doesn't mean there's any benefit to mischaracterizing the suggestion (especially if Chris or anyone else still supports it even though I don't).

On September 9, 2015 at 8:01:17 PM, Donald Stufft (donald@stufft.io) wrote:
Ok, I've talked to an honest to god cryptographer as well as some other smart folks! Here's the general gist: Using a userland CSPRNG like arc4random is not advisable for things that you absolutely need cryptographic security for (this is group #2 from my original email). These people should use os.urandom or random.SystemRandom as they should be doing now. In addition os.urandom or random.SystemRandom is probably fast enough for most use cases of the random.py module, however it is true that using os.urandom/random.SystemRandom would be slower than MT. It is reasonable to use a userland CSPRNG as a "default" source of randomness or in cases where people care about speed but maybe not about security and don't need determinism. However, they've said that the primary benefit in using a userland CSPRNG for a faster cryptographically secure source of randomness is if we can make it the default source of randomness for a "probably safe depending on your app" safety net for people who didn't read or understand the documentation. This would make most uses of random.random and friends secure but not deterministic. If we're unwilling to change the default, but we are willing to deprecate the module scoped functions and force users to make a choice between random.SystemRandom and random.DeterministicRandom then there is unlikely to be much benefit to also adding a userland CSPRNG into the mix since there's no class of people who are using an ambiguous "random" that we don't know if they need it to be secure or deterministic/fast. So I guess my suggestion would be, let's deprecate the module scope functions and rename random.Random to random.DeterministicRandom. This absolves us of needing to change the behavior of people's existing code (besides deprecating it) and we don't need to decide if a userland CSPRNG is safe or not while still moving us to a situation that is far more likely to have users doing the right thing. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On Thu, Sep 10, 2015 at 3:30 AM, Donald Stufft <donald@stufft.io> wrote: [...]
There is one use case that would be hit by that: the kid writing their first rock-paper-scissors game. A beginner who just learned the `if` statement isn't ready for a discussion of cryptography vs. reproducible results, and random.SystemRandom().random() would just become a magic incantation to learn. It would feel like requiring sys.stdout.write() instead of print(). Functions like paretovariate(), getstate(), or seed(), which require some understanding of (pseudo)randomness, can be moved to a specific class, but I don't think deprecating random(), randint(), randrange(), choice(), and shuffle() would be a good idea. Switching them to a cryptographically safe RNG is OK from this perspective, though.

On Sep 10, 2015, at 00:35, Petr Viktorin <encukou@gmail.com> wrote:
Silently switching them could break a lot of code. I don't think there's any way around making them warn the user that they need to do something. I think the patch I just sent is a good way of doing that: the minimum thing they need to do is a one-liner, which is explained in the warning, and it also gives them enough information to check the docs or google the message and get some understanding of the choice if they're at all inclined to do so. (And if they aren't, well, either one works for the use case you're talking about, so let them flip a coin, or call random.choice.;))

Can I just ask what is the actual problem we are trying to solve here? Python has third-party cryptography modules that bring their own sources of randomness (or cryptography libraries that do the same). Python has a good random library for everything other than cryptography. Why in the heck are we trying to make the random module do something it is already documented as being a poor choice for, when there are already third-party modules that do just this? Who needs cryptographic randomness in the standard library anyway (even though one line of code gives you access to it)? Have we identified even ONE person who does cryptography in Python who is kicking themselves that they can't use the random module as implemented? Is this just indulging a paranoid developer?

On September 10, 2015 at 5:21:29 AM, Alexander Walters (tritium-list@sdamon.com) wrote:
Because there are situations where you need securely generated randomness where you are *NOT* "doing cryptography". Blaming people for the fact that the random module has a bad UX that naturally leads them to use it when it isn't appropriate is a shitty thing to do. What harm is there in making people explicitly choose between deterministic randomness and secure randomness? Is your use case so much better than theirs that you think you deserve to type a few characters less to the detriment of people who don't know any better? ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On 9/10/2015 07:40, Donald Stufft wrote:
API Breakage. This is not worth the break in backwards compatibility. My use case is using the API that has been available for... 20 years? And for what benefit? None, and it can be argued that it would do the opposite of what is intended (false sense of security and all).

On Wed, Sep 09, 2015 at 08:01:16PM -0400, Donald Stufft wrote: [...]
You're worried about attacks on the random number generator that produces the characters in the password? I think I'm going to have to see an attack before I believe that this is meaningful. Excluding PRNGs that are hopelessly biased ("nine, nine, nine, nine...") or predictable, how does knowing the PRNG help in an attack? Here's a password I just generated using your "corrected" version using SystemRandom: 06XW0X0X (Honest, that's exactly what I got on my first try.) Here's one I generated using the "bad" code snippet: V6CFKCF2 How can you tell them apart, or attack one but not the other based on the PRNG?
Shouldn't it be using a single instance of SystemRandom rather than a new instance for each call? [...]
According to Theo, modern userland CSPRNGs can create random bytes faster than memcpy
That is an astonishing claim, and I'd want to see evidence for it before accepting it. -- Steve

Steven D'Aprano <steve@pearwood.info> writes:
Isn't the only difference between generating a password and generating a key the length (and base) of the string? Where is the line?
That is an astonishing claim, and I'd want to see evidence for it before accepting it.
I assume it's comparing a CSPRNG all of whose state is in cache (or registers, if a large block of random bytes is requested from the CSPRNG in one go), with memcpy of data which must be retrieved from main memory.

On 10 September 2015 at 01:01, Donald Stufft <donald@stufft.io> wrote:
Wrong. There is a fourth basic type. People (like me!) whose code absolutely doesn't have any security issues, but want a simple, convenient, fast RNG. Determinism is not an absolute requirement, but is very useful (for writing tests, maybe, or for offering a deterministic rerun option to the program). Simulation-style games often provide a way to find the "map seed", which allows users to share interesting maps - this is non-essential but a big quality-of-life benefit in such games. IMO, the current module perfectly serves this fourth group. While I accept your point that far too many people are using insecure RNGs in "generate a random password" scripts, they are *not* the core target audience of the default module-level functions in the random module (did you find any examples of insecure use that *weren't* password generators?). We should educate people that this is bad practice, not change the module. Also, while it may be imperfect, it's still better than what many people *actually* do, which is to use "password" as a password on sensitive systems :-( Maybe what Python *actually* needs is a good-quality "random password generator" module in the stdlib? (Semi-serious suggestion...) Paul

On September 10, 2015 at 4:41:56 AM, Paul Moore (p.f.moore@gmail.com) wrote:
This group is the same as #3 except for the map seed thing which is group #1. In particular, it wouldn’t hurt you if the random you were using was cryptographically secure as long as it was fast and if you needed determinism, it would hurt you to say so. Which is the point that Theo was making.
IMO, the current module perfectly serves this fourth group.
Making the user pick between Deterministic and Secure random would serve this purpose too, especially in a language where "In the face of ambiguity, refuse the temptation to guess" is one of the core tenets. The largest downside would be typing a few extra characters, and Python is not a language that attempts to do things in the fewest number of characters.
You cannot document your way out of a UX problem. The problem isn’t people doing this once on the command line to generate a password, the problem is people doing it in applications where they generate an API key, a session identifier, or a random password which they then give to their users. If you give a way to get the output of the MT-based random enough times, it can be used to determine what every random value it generated was and will be. Here’s a game a friend of mine created where the purpose of the game is to essentially unrandomize some random data, which is only possible because it’s (purposely) using MT to make it possible: https://github.com/reaperhulk/dsa-ctf. This is not an ivory tower paranoia case, it’s a real concern that will absolutely fix some insecure software out there instead of telling them “welp typing a little bit extra once an import is too much of a burden for me and really it’s your own fault anyways”.
----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On 10 September 2015 at 12:26, Donald Stufft <donald@stufft.io> wrote:
I don't understand the phrase "if you needed determinism, it would hurt you to say so". Could you clarify?
And yet I know that I would routinely, and (this is the problem) without thinking, choose Deterministic, because I know that my use cases all get a (small) benefit from being able to capture the seed, but I also know I'm not doing security-related stuff. No amount of making me choose is going to help me spot security implications that I've missed. And also, calling the non-crypto choice "Deterministic" is unhelpful, because I *don't* want something deterministic, I want something random (I understand PRNGs aren't truly random, but "good enough for my purposes" is what I want, and "deterministic" reads to me as saying it's *not* good enough...)
What I'm trying to say is that this is an education problem more than a UX problem. Personally, I think I know enough about security for my (not a security specialist) purposes. To that extent, if I'm working on something with security implications, I'm looking for things that say "Crypto" in the name. The rest of the time, I just use non-specialist stuff. It's a similar situation to that of the "statistics" module. If I'm doing "proper" maths, I'd go for numpy/scipy. If I just want some averages and I'm not bothered about numerical stability, rounding behaviour, etc, I'd go for the stdlib statistics package.
To me, that's crypto and I'd look to the cryptography module, or to something in the stdlib that explicitly said it was suitable for crypto. Saying people write bad code isn't enough - how does the current module *encourage* them to write bad code? How much API change must we allow to cater for people who won't read the statement in the docs (in a big red box) "Warning: The pseudo-random generators of this module should not be used for security purposes." (Specifically people writing security related code who won't read the docs).
I don't understand how that game (which is an interesting way of showing people how attacks on crypto work, sure, but that's just education, which you dismissed above) relates to the issue here. And I hope you don't really think that your quote is even remotely what I'm trying to say (I'm not that selfish) - my point is that not everything is security related. Not every application people write, and not every API in the stdlib. You're claiming that the random module is security related. I'm claiming it's not, it's documented as not being, and that's clear to the people who use it for its intended purpose. Telling those people that you want to make a module designed for their use harder to use because people for whom it's not intended can't read the documentation which explicitly states that it's not suitable for them, is doing a disservice to those people who are already using the module correctly for its stated purpose. By the same argument, we should remove the statistics module because it can be used by people with numerically unstable problems. (I doubt you'll find StackOverflow questions along these lines yet, but that's only because (a) the module's pretty new, and (b) it actually works pretty hard to handle the hard corner cases, but I bet they'll start turning up in due course, if only from the people who don't understand floating point...) Paul

On September 10, 2015 at 8:29:16 AM, Paul Moore (p.f.moore@gmail.com) wrote:
I transposed some words, fixed: "If you needed determinism, would it hurt you to say so?" Essentially, other than typing a little bit more, why is: import random; print(random.choice(["a", "b", "c"])) better than import random; print(random.DeterministicRandom().choice(["a", "b", "c"]))? As far as I can tell, you've made your code and what properties it has much clearer to someone reading it at the cost of 22 characters. If you're going to reuse the DeterministicRandom class you can assign it to a variable and actually end up saving characters if the variable you save it to can be accessed in fewer than 6 characters.
You're allowed to pick DeterministicRandom, you're even allowed to do it without thinking. This isn't about making it impossible to ever insecurely use random numbers, that's obviously a boil-the-ocean level of problem; this is about trying to make it more likely that someone won't be hit by a fairly easy-to-hit footgun if it does matter for them, even if they don't know it. It's also about making code that is easier to understand on the surface. For example, without using the prior knowledge that it's using MT, tell me how you'd know if this was safe or not: import string; import random; password = "".join(random.choice(string.ascii_letters) for _ in range(9)); print("Your random password is", password)
But you *DO* want something deterministic: the *ONLY* way you can get this small benefit of capturing the seed is if you can put that seed back into the system and get a deterministic result. If the seed didn’t exactly determine the output of the randomness then you wouldn't be able to do that. If you don't need to be able to capture the seed and essentially "replay" the PRNG in a deterministic way then there are exactly zero downsides to using a CSPRNG other than speed, which is why Theo suggested using a very fast, modern CSPRNG to solve the speed issues. Can you point out one use case where cryptographically safe random numbers, assuming we could generate them as quickly as you asked for them, would hurt you unless you needed/wanted to be able to save the seed and thus require or want deterministic results?
Reminder that this warning does not show up (in any color, much less red) if you’re using ``help(random)`` or ``dir(random)`` to explore the random module. It also does not show up in code review when you see someone doing random.random(). It encourages you to write bad code, because it has a baked-in assumption that there is a sane default for a random number generator and expects people to understand a fairly difficult concept, which is that not all "random" is equal. For instance, you've already made the mistake of saying you wanted "random" not deterministic, but the two are not mutually exclusive: deterministic is a property that a source of random can have, and one that you need for one of the features you say you like.
I'm claiming that the term random is ambiguously both security related and not security related and we should either get rid of the default and expect people to pick whether or not their use case is security related, or we should assume that it is unless otherwise instructed. I don't particularly care what the exact spelling of this looks like, random.(System|Secure)Random and random.DeterministicRandom is just one option. Another option is to look at something closer to what Go did and deprecate the "random" module and move the MT based thing to ``math.random`` and the CSPRNG can be moved to something like crypto.random.
No, by this argument we shouldn't have a function called statistics in the statistics module because there is no globally "right" answer for what the default should be. Should it be mean? Median? Mode? Why is *your* use case the "right" use case for the default option, particularly in a situation where picking the wrong option can be disastrous?
----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
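For contrast with the password snippet quoted a little earlier in this message, a minimal sketch of the same nine-character password written against random.SystemRandom, which exists in the stdlib today; here the security property is visible at the call site instead of hidden behind a module-level default:

    # Same password snippet, but the os.urandom-backed generator is explicit.
    import string
    import random

    secure = random.SystemRandom()
    password = "".join(secure.choice(string.ascii_letters) for _ in range(9))
    print("Your random password is", password)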

On 10 September 2015 at 14:10, Donald Stufft <donald@stufft.io> wrote:
Thanks. In one sense, no it wouldn't. Nor would it matter to me if "the default random number generator" was fast and cryptographically secure. What matters is just that I get a load of random (enough) numbers. What hurts somewhat (not enormously, I'll admit) is having to think up front about whether I need to be able to capture a seed and replay it. That's nearly always something I'd think of way down the line, as a "wouldn't it be nice if I could get the user to send me a reproducible test case" or something like that. And of course it's just a matter of switching the underlying RNG at that point. None of this is hard. But once again, I'm currently using the module correctly, as documented. I've omitted most of the rest of your response largely because we're probably just going to have to agree to differ. I'm probably too worn out being annoyed at the way that everything ends up needing to be security related, and the needs of people who won't read the docs determine API design, to respond clearly and rationally :-( Paul

On Thu, Sep 10, 2015 at 8:44 AM, Paul Moore <p.f.moore@gmail.com> wrote:
No one in this thread is accusing everyone of using the module incorrectly. The fact that you do use it correctly is a testament to the fact that you read the docs carefully and have some level of experience with the module to know that you're using it correctly.
I think the people Theo, Donald, and others (including myself) are worried about are the people who have used some book or online tutorial to write games in Python and have seen random.random() or random.choice() used. Later on they start working on something else (including but not limited to the examples Donald has otherwise pointed out). They also have enough experience with the random module to know it produces randomness (what kind, they don't know... in fact they probably don't know there are different kinds yet) and they use what they know, because Python has batteries included and they're awesome and easy to use. The reality is that past experiences bias current decisions. If that person went and read the docs, they probably won't know if what they're doing warrants using a CSPRNG instead of the default Python one. If they're not willing to learn, or read enough (and I stress enough) about the topic before making a decision (or just really don't have the time because this is a side project), they'll say "Well the module level functions seemed random enough to me, so I'll just use those". That could end up being rather awful for them. The reality is that your past experiences (and other people's past experiences, especially those who refuse to do some research and are demanding others prove that these are insecure with examples) are biasing this discussion, because they fail to empathize with new users whose past experiences are coloring their decisions. People choose Python for a variety of reasons, and one of those reasons is that in their past experience it was "fast enough" to be an acceptable choice. This is how most people behave. Being angry at people for not reading a two-sentence warning in the middle of the docs isn't helping anyone or arguing the validity of this discussion.

On September 10, 2015 at 9:44:13 AM, Paul Moore (p.f.moore@gmail.com) wrote:
This is actually exactly why Theo suggested using a modern, userland CSPRNG: it can generate random numbers faster than /dev/urandom can, and, unless you need deterministic results, there's little downside to doing so. There are really two possible ideas here, depending on what sort of balance we'd want to strike.

We can make a default "I don't want to think about it" implementation of random that is both *generally* secure and fast; however, it won't be deterministic and you won't be able to explicitly seed it. This would be a backwards compatible change [1] for people who are simply calling these functions [2]: random.getrandbits, random.randrange, random.randint, random.choice, random.shuffle, random.sample, random.random, random.uniform, random.triangular, random.betavariate, random.expovariate, random.gammavariate, random.gauss, random.lognormvariate, random.normalvariate, random.vonmisesvariate, random.paretovariate, random.weibullvariate. If this were all that the top level functions in random.py provided, we could simply replace the default and people wouldn't notice; they'd just automatically get safer randomness whether that's actually useful for their use case or not.

However, random.py also has these functions: random.seed, random.getstate, random.setstate, random.jumpahead, and these functions are where the problem comes in. They only really make sense for deterministic sources of random, which are not "safe" for use in security sensitive applications. So pretending for a moment that we've already decided to do "something" about this, the question boils down to what we do about these 4 functions. Either we can change the default to a secure CSPRNG and break these functions (and the people using them), which is however easily fixed by changing ``import random`` to ``import random; random = random.DeterministicRandom()``, or we can deprecate the top level functions and try to guide people to choose up front what kind of random they need. Either of these solutions will end up with people being safer and, if we pretend we've agreed to do "something", it comes down to whether we'd prefer breaking compatibility for some people while keeping a default random generator that is probably good enough for most people, or whether we'd prefer to not break compatibility and try to push people to always decide what kind of random they want.

Of course, we still haven't decided that we should do "something". I think that we should, because I think that secure by default (or at least, not insecure by default) is a good situation to be in. Over the history of computing it's been shown time and time again that trying to document or educate users is error prone and doesn't scale, but that designing APIs to make the "right" thing the obvious default, and requiring people to opt in to specialist [3] cases which need some particular property, does.

[1] Assuming Theo's claim about the speed of the ChaCha-based arc4random function is accurate, which I haven't tested, but I assume he's smart enough to know what he's talking about WRT its speed.
[2] I believe anyway; I don't think that any of these rely on the properties of MT or a deterministic source of random, just a source of random.
[3] In this case, there are two specialist use cases: those that require deterministic results, and those that require specific security properties that are not satisfied by a userland CSPRNG, because a userland CSPRNG is not as secure as /dev/urandom but is able to be much faster.
----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
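A sketch of the trivial migration being described for code that genuinely needs seeding and state capture, using only the random.Random class that exists today (random.DeterministicRandom is a name from the proposal, not a real class, so random.Random stands in for it here):

    # Code that needs reproducibility opts in to a deterministic generator
    # explicitly, instead of relying on the module-level functions.
    import random

    rng = random.Random(12345)         # explicit, seedable Mersenne Twister
    layout = [rng.randrange(10) for _ in range(5)]
    state = rng.getstate()             # capture state for later replay

    replay = random.Random()
    replay.setstate(state)             # resume from the captured state
    assert [replay.randrange(10) for _ in range(3)] == [rng.randrange(10) for _ in range(3)]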

On 10 September 2015 at 15:21, Donald Stufft <donald@stufft.io> wrote:
Switching (somewhat hypocritically :-)) from an "I'm a naive user" stance to talking about deeper issues as if I knew what I was talking about: this change results in each module getting a separate instance of the generator. That has implications for the risks of correlated results. It's unlikely to cause issues in real life, conceded. Paul

On Thu, 10 Sep 2015 at 07:22 Donald Stufft <donald@stufft.io> wrote:
+1 for deprecating module-level functions and putting everything into classes to force a choice
+0 for deprecating the seed-related functions, saying "the stdlib uses what it uses as an RNG and you have to live with it if you don't make your own choice", and switching to a crypto-secure RNG
-0 for leaving it as-is
-Brett

On 11 September 2015 at 02:05, Brett Cannon <brett@python.org> wrote:
+1 for deprecating module-level functions and putting everything into classes to force a choice
-1000, as this would be a *huge* regression in Python's usability for educational use cases. (Think 7-8 year olds that are still learning to read, not teenagers or adults with more fully developed vocabularies.) A reasonable "Hello world!" equivalent for introducing randomness to students is rolling a 6-sided die, as that relates to a real-world object they'll often be familiar with. At the moment that reads as follows:
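A minimal version of that dice-rolling introduction, using nothing beyond the module-level API that exists today:

    # Roll a six-sided die, the way a beginner tutorial would show it.
    import random

    print("You rolled a", random.randint(1, 6))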
Another popular educational exercise is the "Guess a number" game, where the program chooses a random number from 1-100, and the person playing the game has to guess what it is. Again, randint() works fine here. Shuffling decks of cards, flipping coins, these are all things used to introduce learners to modelling random events in the real world in software, and we absolutely do *not* want to invalidate the extensive body of educational material that assumes the current module level API for the random module.
However, this I'm +1 on. People *do* use the module level APIs inappropriately, and we can get them to a much safer place, while nudging folks that genuinely need deterministic randomness towards an alternative API. The key for me is that folks that actually *need* deterministic randomness *will* be calling the stateful module level APIs. This means we can put the deprecation warnings on *those* methods, and leave them out for the others.

In terms of practical suggestions, rather than DeterministicRandom and NonDeterministicRandom, I'd actually go with the simpler terms SeededRandom and SeedlessRandom (there's a case to be made that those are misnomers, but I'll go into that more below):

* SeededRandom: Mersenne Twister
* SeedlessRandom: new CSPRNG
* SystemRandom: os.urandom()

Phase one of transition:

* add SeedlessRandom
* rename Random to SeededRandom
* Random becomes a subclass of SeededRandom that deprecates all methods not shared with SeedlessRandom
* this will also effectively deprecate the corresponding module level functions
* any SystemRandom methods that are no-ops (like seed()) are deprecated

Phase two of transition:

* Random becomes an alias for SeedlessRandom
* deprecated methods are removed from SystemRandom
* deprecated module level functions are removed

As far as the proposed Seeded/Seedless naming goes, that deliberately glosses over the fact that "seed" gets used to refer to two different things - seeding a PRNG with entropy, and seeding a deterministic PRNG with a particular seed value. The key is that "SeedlessRandom" won't have a "seed()" *method*, and that's the single most salient fact about it from a user experience perspective: you can't get the same output by providing the same seed value, because we wouldn't let you provide a seed value at all. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
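A sketch of how the proposed naming might look to user code; SeededRandom and SeedlessRandom do not exist in the stdlib, so they are modelled here as thin stand-ins over the classes that do exist (Random and SystemRandom), purely to show the intended split:

    # Stand-ins for the proposed classes, built on what exists today.
    import random

    class SeededRandom(random.Random):
        """Proposed deterministic, seedable generator (Mersenne Twister today)."""

    class SeedlessRandom(random.SystemRandom):
        """Proposed seedless CSPRNG; modelled here on os.urandom, though the
        proposal is a faster userland generator."""

    map_rng = SeededRandom(12345)       # reproducible: share the seed, share the map
    print(map_rng.randint(1, 100))

    token_rng = SeedlessRandom()        # no usable seed() in the proposed API
    print(token_rng.getrandbits(128))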

On Fri, Sep 11, 2015 at 3:00 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Aside from sounding like varieties of grapes in a grocery, those names seem just fine. From the POV of someone with a bit of comprehension of crypto (as in, "use /dev/urandom rather than a PRNG", but not enough knowledge to actually build or verify these things), the distinction is precise: with SeededRandom, I can give it a seed and get back a predictable sequence of numbers, but with SeedlessRandom, I can't. I'm not sure what the difference is between "seeding a PRNG with entropy" and "seeding a deterministic PRNG with a particular seed value", though; aside from the fact that one of them uses a known value and the other doesn't, of course. Back in my BASIC programming days, we used to use "RANDOMIZE TIMER" to seed the RNG with time-of-day, or "RANDOMIZE 12345" (or other value) to seed with a particular value; they're the same operation, but one's considered random and the other's considered predictable. (Of course, bytes from /dev/urandom will be a lot more entropic than "number of centiseconds since midnight", but for a single-player game that wants to provide a different starting layout every time you play, the latter is sufficient.) ChrisA

On 11 September 2015 at 03:11, Chris Angelico <rosuav@gmail.com> wrote:
Actually, that was just a mistake on my part - they're really the same thing, and the only distinction is the one you mention: setting the seed to a known value. Thus the main seed-related difference between something like arc4random and other random APIs is the same one I'm proposing to make here: it's seedless at the API level because it takes care of collecting its own initial entropy from the operating system's random number API. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Chris Angelico wrote:
I think the only other difference is that the Linux kernel is continually re-seeding its generator whenever more unpredictable bits become available. It's not something you need to explicitly do yourself, as in your BASIC example. -- Greg

Nick Coghlan <ncoghlan@...> writes:
Fully agreed with Nick. That this is being seriously considered shows a massive disregard for usability. Python is not C++; it places convenience first. Besides, a deterministic RNG is a feature: you can reproduce exactly a random sequence by re-using the same seed, which helps fix rare input-dependent failures (we actually have a good example of that in CPython development with `regrtest -r`). Good luck debugging such issues when using an RNG which reseeds itself in a random (!) way. Lastly, the premise of this discussion is idealistic in the first place. If someone doesn't realize their code is security-sensitive, there are other mistakes they will make than simply choosing the wrong RNG. If you want to help people generate secure passwords, perhaps the best approach would be to write a password-generating (or more generally secret-generating, for different kinds of secrets: passwords, session ids, etc.) library. Regards Antoine.

On 14 September 2015 at 13:59, Antoine Pitrou <antoine@python.org> wrote:
Is your argument that there are lots of ways to get security wrong, and for that reason we shouldn't try to fix any of them? After all, I could have made this argument against PEP 466, or against the deprecation of SHA1 in TLS certificates, or against any security improvement ever made that simply changed defaults. The fact that there are secure options available is not a good excuse for leaving the insecure ones as the defaults. And let's be clear, this is not a theoretical error that people don't hit in real life. Investigating your last comment, Antoine, I googled "python password generator". The results:

- The first one is a StackOverflow question which incorrectly uses random.choice (though seeded from os.urandom, which is an improvement). The answer to that says to just use os.urandom everywhere, but does not provide sample code. Only the third answer gets so far as to provide sample code, and it's way overkill.
- The second option, entitled "A Better Password Generator", incorrectly uses random.randrange. This code is *aimed at beginners*, and is kindly handing them a gun to point at their own foot.
- The third one uses urandom, which is fine.
- The fourth, an XKCD-based password generator, uses SystemRandom *if available* but then falls back to the MT approach, which is an unexpected decision, but there we go.
- The fifth, from "pythonforbeginners.com", incorrectly uses random.choice.
- The sixth goes into an intensive discussion about 'password strength', including a discussion about the 'bit strength' of the password, despite the fact that they use random.randint, which means that the analysis about bit strength is totally flawed.
- For the seventh we get a security.stackexchange question with the first answer saying not to use Random, though the questioner does use it and no sample code is provided.
- The eighth is a library that "generates randomized strings of characters". It attempts to use SystemRandom but falls back silently if it's unavailable.

At this point I gave up. Of that list of 8 responses, three are completely wrong, two provide sample code that is wrong with no correct sample code to be found on the page, two attempt to do the right thing but will fall into a silent failure mode if they can't, and only one is unambiguously correct. Similarly, a quick search of GitHub for Python repositories that contain random.choice and the string 'password' returns 40,000 results.[0] Even if 95% of them are safe, that leaves 2000 people who wrote wrong code and uploaded it to GitHub. It is disingenuous to say that only people who know enough write security-critical code. They don't. The reason for this is that most people don't know they don't know enough. And for those people, Python's default approach screws them over, and then they write blog posts which screw over more people. If the Python standard library would like to keep the insecure default of random.random that's totally fine, but we shouldn't pretend that the resulting security failures aren't our fault: they absolutely are. [0]: https://github.com/search?l=python&q=random.choice+password&ref=searchresults&type=Code&utf8=%E2%9C%93

On 14 September 2015 at 14:29, Cory Benfield <cory@lukasa.co.uk> wrote:
Is your argument that there are lots of ways to get security wrong, and for that reason we shouldn't try to fix any of them?
This debate seems to repeatedly degenerate into this type of accusation. Why is backward compatibility not being taken into account here? To be clear, the proposed change *breaks backward compatibility* and while that's allowed in 3.6, just because it is allowed, doesn't mean we have free rein to break compatibility - any change needs a good justification. The arguments presented here are valid up to a point, but every time anyone tries to suggest a weak area in the argument, the "we should fix security issues" trump card gets pulled out. For example, as this is a compatibility break, it'll only be allowed into 3.6+ (I've not seen anyone suggest that this is sufficiently serious to warrant breaking compatibility on older versions). Almost all of those SO questions, and google hits, are probably going to be referenced by people who are using 2.7, or maybe some version of 3.x earlier than 3.6 (at what stage do we allow for the possibility of 3.x users who are *not* on the latest release?) So is a solution which won't impact most of the people making the mistake, worth it? I fully expect the response to this to be "just because it'll take time, doesn't mean we should do nothing". Or "even if it just fixes it for one or two people, it's still worth it". But *that's* the argument I don't find compelling - not that a fix won't help some situations, but that because it's security, (a) all the usual trade-off calculations are irrelevant, and (b) other proposed solutions (such as education, adding specialised modules like a "shared secret" library, etc) are off the table. Honestly, this type of debate doesn't do the security community much good - there's too little willingness to compromise, and as a result the more neutral participants (which, frankly, is pretty much anyone who doesn't have a security agenda to promote) end up pushed into a "reject everything" stance simply as a reaction to the black and white argument style. Paul

On September 14, 2015 at 11:01:36 AM, Paul Moore (p.f.moore@gmail.com) wrote:
How has it not been taken into account? The current proposal (best summed up by Nick in the other thread) will not break compatibility for anyone except those calling the functions that are specifically about setting a seed or getting/setting the current state. In looking around I don't see a lot of people using those particular functions, so most people likely won't notice the change at all, and for those who do, there is a very trivial change they can make to their code to cope with it.
We can't go back in time and fix those versions that is true. However, one of the biggest groups of people who are most likely to be helped by this change is new and inexperienced developers who don't fully grasp the security sensitive nature of whatever they are doing with random. That group of people are also more likely to be using Python 3.x than experienced programmers.
If I/we were not willing to compromise, I'd be pushing for it to use SystemRandom everywhere, because that removes all of the possibly problematic parts of using a user-space CSPRNG like the one being proposed. However, I/we are willing to compromise by sacrificing possible security in order to not regress things where we can; in particular, a user-space CSPRNG is being proposed over SystemRandom because it will provide you with random numbers almost as fast as MT will. However, when proposing this possible compromise, we are met with people refusing to meet us in the middle. There are some folks who are trying to propose other middle grounds, and there will undoubtedly be some discussion around which ones are the best. We've gone from suggesting replacing the default random with SystemRandom (a lot slower than MT), to removing the default altogether, to deprecating the default and replacing it with a fast user-space CSPRNG. However, folks who don't want to see it change at all have thus far been unwilling to compromise at all. I'm confused how you're saying that the security minded folks have been unwilling to compromise when we've done that repeatedly in this thread, whereas the backwards compat minded folks have consistently said "No, it would break compatibility" or "We don't need to change" or "They are probably insecure anyways". Can you explain what compromise you're willing to accept here? If it doesn't involve breaking at least a little compatibility then it's not a compromise, it's you demanding that your opinion is the correct one (which isn't wrong, we're also asserting that our opinion is the correct one, we've just been willing to move the goal posts to try and limit the damage while still getting most of the benefit).
----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

Donald Stufft <donald@...> writes:
That's a pretty big "except". Paul's and my concern is about compatibility breakage; saying "it doesn't break compatibility except..." sounds like a lot of empty rhetoric.
In looking around I don't see a lot of people using those particular functions
Given that when you "look around" you only end up looking around amongst the Web developer crowd, I may not be surprised. You know, when I "look around" I don't see a lot of people using the random module to generate passwords. Your anecdote would be more valuable than other people's?
Yes, because generating passwords is a common and reasonable task for new and inexperienced developers? Really? Again, why don't you propose a dedicated API for that? That's what we did for constant-time comparisons. That's what people did for password hashing. That's what other people did for cryptography. I haven't seen a reasonable rebuttal to this. Why would generating passwords be any different from all those use cases? After all, if you provide a convenient API people should flock to it, instead of cumbersomely reinventing the wheel... That's what libraries are for.
Really, it's not so much a performance issue as a compatibility issue. The random module provides, by default, a *deterministic* stream of random numbers. That os.urandom() may be a tad slower isn't very important when you're generating one number at a time and processing it with a slow interpreter (besides, MT itself is hardly the fastest PRNG out there). That os.urandom() doesn't give you a way to seed it once and get predictable results is a big *regression* if made the default RNG in the random module. And the same can be said for a user-space CSRNG, as far as I understand the explanations here.
However, when proposing this possible compromise, we are met with people refusing to meet us in the middle.
See, people are fed up with the incompatibilities arising "in the name of the public good" in each new feature release of Python. When the "middle" doesn't sound much more desirable than the "extreme", I don't see why I should call it a "compromise". Some people have to support code across 4 different Python versions, and further gratuitous breakage in the stdlib doesn't help. Yes, they can change their code. Yes, they can use the "six" module, the "future" module or whatever new bandaid exists on PyPI. Still they must change their code in one way or another because it was deemed "necessary" to break compatibility to solve a concern that doesn't seem grounded in any reasonable analysis. Python 3 was there to break compatibility. Not Python 3.4. Not Python 3.5. Not Python 3.6. (In case you're wondering, trying to make all published code on the Internet secure by appropriately changing the interpreter's "behaviour" to match erroneous expectations - even expectations *documented* as erroneous - is *not* reasonable - no matter how hard you try, there will always be occurrences of broken code that people copy and paste around.)
Can you explain what compromise you're willing to accept here?
Let's rephrase this: are *you* willing to accept an admittedly "insecure by default" compromise? No you aren't, evidently. There's no evidence that you would accept to leave the top-level random functions intact, even if a new UserSpaceSecureRandom class was added to the module, right? So why would we accept a compatibility-breaking compromise? Because we are more "reasonable" than you? (Which in this context really reads: more willing to quit the discussion because of boredom, exhaustion, lack of time or any other quite human reason; which, btw, sums up a significant part of what the dynamics of python-ideas have become: "victory of the most obstinate".) Yeah, that's always what you are betting on, because it's not like *you* will ever be reasonable except if it's the last resort for getting something accepted. And that's why every discussion about security with security-minded (read: "obsessed") people is a massive annoyance, even if at the end it succeeds in reaching a "compromise", after 500+ excruciating backs and forths on a mailing list. Regards Antoine.
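For what it's worth, the dedicated secret-generating API being pointed at here does not need much machinery. A minimal sketch built on random.SystemRandom, which exists today; the function name and defaults are illustrative only, not a proposed stdlib interface:

    # Minimal password/token helper on top of the os.urandom-backed generator.
    import string
    from random import SystemRandom

    _rng = SystemRandom()

    def generate_password(length=16, alphabet=string.ascii_letters + string.digits):
        """Return a secret string drawn from a cryptographically secure source."""
        return "".join(_rng.choice(alphabet) for _ in range(length))

    print(generate_password())
    print(generate_password(length=32))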

On 14 September 2015 at 16:55, Antoine Pitrou <antoine@python.org> wrote:
Python 3 was there to break compatibility. Not Python 3.4. Not Python 3.5. Not Python 3.6.
To clarify: your position is that we cannot break backward compatibility in Python 3.6?

Cory Benfield <cory@...> writes:
It is. Not breaking backward compatibility in feature releases (except 3.0, which was a deliberate special case) is a very long-standing policy, and it is so because users have a much better time with such a policy, especially when people have to maintain code that's compatible across multiple versions (again, the 2->3 transition is a special case, which justifies the existence of tools such as "six", and has incidentally created a lot of turmoil in the community that has only recently begun to recede). Of course, fixing a bug is not necessarily breaking compatibility (although sometimes we may even refuse to fix a bug because the impact on working code would be too large). But changing or removing a documented behaviour that people rely on definitely is. We do break feature compatibility, from time to time, in exceptional and generally discussed-at-length cases, but there is a sad pressure recently to push for more compatibility breakage - and, strangely, always in the name of "security". (Also note that some library modules such as asyncio are or were temporarily exempted from the compatibility requirements, because they are in very active development; the random module evidently isn't one of them.) Regards Antoine.

On Mon, Sep 14, 2015 at 10:01 AM, Paul Moore <p.f.moore@gmail.com> wrote:
So people who are arguing that the defaults shouldn't be fixed on Python 2.7 are likely the same people who also argued that PEP 466 was a terrible, awful, end-of-the-world type change. Yes it broke things (like eventlet) but the net benefit for users who can get onto Python 2.7.9 (and later) is immense. Now I'm not arguing that we should do the same to the random module, but a backport (that is part of the stdlib) would probably be a good idea under the same idea of allowing users to opt into security early.
They're not irrelevant. I personally think they're of a lower impact to the discussion, but the reality is that the people who are educating others are few and far between. If there are public domain works, free tutorials, etc. that all advocate using a module in the standard library and no one can update those, they still exist and are still recommendations. People prefer free to correct when possible because there's nothing free to correct them (until they get hacked or worse). Do we have a team in the Python community that goes out to educate for free people on security related best practices? I haven't seen them. The best we have is a few people on crufty mailing lists like this one trying to make an impact because education is a much larger and harder to solve problem than making something secure by default. Perhaps instead of bickering like fools on a mailing list, we could all be spending our time better educating others. That said, I can't make that decision for you just like you can't make that for me.
Except you seem to have missed many of the compromises being discussed and conceded by the security-minded folks. Personally, names that describe the outputs of the algorithms make much more sense to me than "Seedless" and "Seeded", but no one has really bothered to shave that yak further, out of a desire to compromise and make things better as a whole. Much of the lack of gradation has come from the opponents of this change, who seem to think of security as a step function where a subjective measurement of "good enough for me" counts as secure.

On 14 September 2015 at 16:32, Ian Cordasco <graffatcolmingov@gmail.com> wrote:
You may well be right. Personally, I'm pretty sick of the way all of these debates degenerate into content-free reiteration of the same old points, and unwillingness to hear other people's views. Here's a point - it seems likely that the people arguing for this change are of the opinion that I'm not appreciating their position. (For the record, I'm not being deliberately obstructive in case anyone thought otherwise. In my view at least, I don't understand the security guys' position). Assuming that's the case, then I'm probably one of the people who needs educating. But I don't feel like anyone's trying to educate me, just that I'm being browbeaten until I give in. Education != indoctrination.
That said, I can't make that decision for you just like you can't make that for me.
Indeed. Personally, I spend quite a lot of time in my day job (closed source corporate environment) trying to educate people in sane security practices, usually ones I have learned from people in communities like this one. One of the biggest challenges I have is stopping people from viewing security as "an annoying set of rules that get in the way of what I'm trying to do". But you would not believe the sorts of things I see routinely - I'm not willing to give examples or even outlines on a public mailing list because I can't assess whether such information could be turned into an exploit. I can say, though, that crypto-safe RNGs is *not* a relevant factor :-) At its best, good security practice should *help* people write reliable, easy to use systems. Or at a minimum, not get in the way. But the PR message needs always to be "I understand the constraints you're dealing with", not "you must do this for your own good". Otherwise the "follow the rules until the auditors go away" attitude just gets reinforced. Hence my focus on seeing proof that breakages are justified *in the context of the target audience I am responsible for*. Conversely, you're right that I can't force anyone else to try to educate people in good security practices, however much better than me at it I might think they are. In actual fact, though, I think a lot of people do a lot of good work educating others - as I say, most of what I've learned has been from lists like these.
OK, you have a point - there have been changes to the proposals. But there are fundamental points that have (as far as I can see) never been acknowledged. As a result, the changes feel less like compromises based on understanding each other's viewpoints, and more like repeated attempts to push something through, even if it's not what was originally proposed. (I *know* this is an emotional position - please understand I'm fed up and not always managing to word things objectively.) Specifically, I have been told that I can't argue my "convenience" over the weight of all the other people who could fall into security traps with the current API. Let's review that, shall we?

* My argument is that breaking backward compatibility needs to be justified. People have different priorities. "Security risks should be fixed" isn't (IMO) a free pass. Why should it be? "Windows compatibility issues should be fixed" isn't a free pass. "PyPy/Jython compatibility issues should be fixed" isn't a free pass. Forcing me to adjust my priorities so that I care about security when I don't want (or IMO need) to isn't acceptable.
* The security arguments seem to be largely in the context of web application development (cookies, passwords, shared secrets, ...). That's not the only context that matters.
* As I said above, in my experience, a compatibility break "to make things more secure" is seen as equating security with inconvenience, and can actually harm attempts to educate users in better security practices.
* In many environments, reproducibility of random streams is important. I'm not an expert on those fields, although I've hit some situations where seeding is a requirement. As far as I am aware, most of those situations have no security implications. So for them, the PEP is all cost, no benefit. Sure the cost is small, but it's non-zero.

How come the web application development community is the only one whose voice gets heard? Is it because the fact that they *are* public-facing, and frequently open-source, means that data is available? So "back it up with facts or we won't believe you" becomes a debating stance? I'm not arguing that everyone should be allowed to climb up on their soapbox and rant - but I would like to think that bringing a different perspective to the table could be treated with respect and genuine attempts to understand, and that "in my experience" could be viewed as an offer of information, not as an attempt to bluff on a worthless hand. Just to be clear, I think the current proposal (Nick's pre-PEP) is relatively unobtrusive, and unlikely to cause serious compatibility issues. I'm uncomfortable with the fact that it feels like yet another "imposition in the name of security", and while I'm only one person I feel that I'm not alone. I'm concerned that the people pushing security seem unable to recognise that people becoming sick of such changes is a PR problem they need to address, but that's their issue not mine. So I'm unlikely to vote against the proposal, but I'll feel sad if it's accepted without a more balanced discussion than we've currently had. On the meta-issue of how debates like this are conducted, I think people probably need to listen more than they talk. I'm as guilty as anyone else here. But in particular, when multiple people all end up responding to rebut *every* counter-argument, essentially with the same response, maybe it's time to think "we're in the majority here, let's stop talking so much and see if we're missing anything from what the people with other views are saying".
He who shouts loudest isn't always right. Not necessarily wrong, either, but sometimes it's bloody hard to tell one way or the other, if they won't shut up long enough to analyze the objections.
I'm frankly long past caring. I think we'll end up with whatever was on the table when people got too tired to argue any more.
Wait, what? It's *me* that's claiming that security is a yes/no thing??? When all I'm hearing is "education isn't sufficient", "dedicated libraries aren't sufficient", "keeping a deterministic RNG as default isn't an option"? And when I'm suggesting that fixing the PRNG use in code that misuses a PRNG may not be the only security issue with that code? I knew the two sides weren't communicating, but this statement staggers me. We have clearly misunderstood each other even more fundamentally than I had thought possible :-( Thinking hard about the implications of what you said there, I start to see why you might have misinterpreted my stance as the black and white one. But I have absolutely no idea how to explain to you that I find your stance equally (and before I took the time to think through what your statement implied, even more) so. There's little more I can say. I'm going to take my own advice now, and stop talking. I'll keep listening, in the hope that either this post or something else will somehow break the logjam, but right now I'm not sure I have much hope of that. Paul

On Mon, Sep 14, 2015, at 15:14, Paul Moore wrote:
* My argument is that breaking backward compatibility needs to be justified.
I don't think it does. I think that there needs to be a long roadmap of deprecation and provided workarounds for *almost any* backwards-compatibility-breaking change, but that special justification beyond "is this a good feature" is only needed for ignoring that roadmap, not for deprecating/replacing a feature in line with it. No-one, as far as I have seen in this thread to date, has actually put a timeline on this change. No-one's talking about getting rid of the global functions in 3.5.1, or in 3.6, or in 3.7. So with that in mind I can only conclude that the people against making the change are against *ever* making it *at all* - and certainly a lot of the arguments they're making have to do with nebulous educational use-cases (class instances are hard, let's use mutable global state) rather than backwards compatibility. Would you likewise have been against every single thing that Python 3 did?

On September 14, 2015 at 3:14:45 PM, Paul Moore (p.f.moore@gmail.com) wrote:
For the record, I'm not sure what part you don't understand. I'm happy to try and explain it, but I think I'm misunderstanding what you're not understanding or something, because I personally feel like I did explain what I think you're misunderstanding. Part of the problem (probably) here is that there isn't an exact person we're trying to protect. The general gist is that if you use the deterministic APIs in a security sensitive situation, then you may be vulnerable depending on exactly what you're doing. We think that, in particular, the API of the random module will lead inexperienced or un(der)informed developers to use the API in situations where it's not appropriate and, from that, end up with an insecure piece of software. We're people who think that the defaults of the software should be "generally" secure (as much so as is reasonable) and that if you want to do something that isn't safe then you should explicitly opt in to that (the flipside is, things shouldn't be so locked down as to be unusable without having to turn off all of the security knobs; this is where the "generally" in generally secure comes into play). A particularly nasty side effect of this is that it's almost never the people who wrote this software who are harmed by it being broken, and it's almost always their users who didn't have anything to do with it. So essentially the goal is to try and make it harder for people to accidentally misuse the random module. If that doesn't answer your confusion, and you can try to reword it to get it through my thick skull better, I'm happy to continue to try and answer it (on or off list).
Right, and this is actually trying to do that. By removing a possibly dangerous default and making the default safer. Defaults matter a lot in security (and sadly, a lot of software doesn't have safe defaults) because a lot of software will never use anything but the defaults.
I think part of this is that a lot of the folks proposing these changes are also sensitive to the backwards compatibility needs and have already baked that into their thoughts. We don't generally come into these with "scorched earth" suggestions for fixing some situation where security could be improved, but instead try and figure out a decent balance of security and not breaking things, to try and cover most of the ground with as little cost as possible. My very first email in this particular thread (that started this thread) was the first one I had with a fully solid proposal in it. The last paragraph in that proposal asked the question "Do we want to protect users by default?" My next email presented two possible options depending on which we considered to be "less" breaking: either deprecating the module scoped functions completely, or changing their defaults to something secure, and mentioned that if we can't change the default, the user-land CSPRNG probably isn't a useful addition because its benefit is primarily in being able to make it the default option. I don't see anyone who is talking about making a change not also talking about what areas of backwards compatibility it would actually break. I think part of this too is that security is a bit weird: it's not a boolean property, but there are particular bars you need to pass before it's an actual solution to the problem. So for a lot of us, we'll figure out that bar and draw a line in the sand and say "If this proposal crosses this line, then doing nothing is better than doing something", because it'd just be churn for churn's sake at that point. That's why you'll see particular points that we essentially won't give up, because if they are given up we might as well do nothing. In this particular instance, the point is that the API of the random module leads people to use it incorrectly, so unless we address that, we might as well just leave it alone.
I think I was the one who said that to you, and I'd like to explain why I said it (beyond the fact I was riled up). Essentially I had in my mind something like what Nick has proposed, which you've said later on you think is relatively unobtrusive and unlikely to cause serious compatibility issues, which I agree with. Then I saw you arguing against what I felt was a pretty mundane API break that was fairly trivial to work around, and it signaled to me that you were saying that having to type a few extra letters was a bridge too far. This reads to me like someone saying "Well I know how to use it correctly, it's their own fault if others don't". I'm not saying that's what you actually think, but that's how it read to me.
The justification is essentially that it will protect some people with minimal impact to others. The main impact will be that people who actually need a deterministic RNG will have to use something like ``random.seeded_random`` instead of just ``random``, and importantly, this will break in a fairly obvious manner, instead of the silently wrong situation for people who are currently using the top level API incorrectly. As a bit of a divergence, the "silently wrong" part is why defaults tend to matter a lot in security. Unless you're well versed in it, most people don't think about it, and since it "works" they don't inquire further. Something that is security sensitive and always "works" (as in, doesn't raise an error) is broken, which is the inverse of how most people think about software. To put it another way, it's the job of security sensitive APIs to break things, ideally only in cases where it's important to break, but unless you're actually testing that it breaks in those attack scenarios, secure and insecure look exactly the same.
You're right that it's not the only context that matters; however, it's often brought up for a few reasons:

* Security largely doesn't matter for software that doesn't accept or send input from some untrusted source, which narrows security down to mostly network based applications.
* The HTTP protocol is "eating the world" and we're seeing more and more things using it as their communication protocol (even for things that are not traditional browser based applications).
* Traditional Web Applications/Sites are a pretty large target audience for Python, and in particular a lot of the security folks come from that world because the web is a hostile place.

But you can replace web application with anything that an untrusted user can interact with over any protocol and the argument is basically the same.
Sadly, I don't think this is fully resolvable :( It is the nature of security that its purpose is to take something that otherwise "works" and make it no longer work, because it doesn't satisfy the constraints of the security system.
Right, and I don't think anyone is saying this isn't an important use case, just that if you need a deterministic RNG and you don't get one, that is a fairly obvious problem, but if you need a CSPRNG and you don't get one, that is not obvious.
----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On Tue, Sep 15, 2015 at 6:23 AM, Donald Stufft <donald@stufft.io> wrote:
To add to that: Web application development is a *huge* area (every man and his dog wants a web site, and more than half of them want logins and users and so on), which means that the number of non-experts writing security-sensitive code is higher there than in a lot of places. The only other area I can think of that would be comparably popular would be mobile app development - and a lot of the security concerns there are going to be in a web context anyway. Is it fundamentally insecure to receive passwords over an encrypted HTTP connection and use those to verify user identities? I don't think so (although I'm no expert) - it's what you do with them afterward that matters (improperly hashing - or, worse, using a reversible transformation). Why are so many people advised not to do user authentication at all, but to tie in with one of the auth APIs like Google's or Facebook's? Because it's way easier to explain how to get that right than to explain how to get security/encryption right. How bad is it, really, to tell everyone "use random.SystemRandom for anything sensitive", and leave it at that? ChrisA

On 15 September 2015 at 18:58, Chris Angelico <rosuav@gmail.com> wrote:
How bad is it, really, to tell everyone "use random.SystemRandom for anything sensitive", and leave it at that?
That's the status quo, and has been for a long time. If it was ever going to work in terms of discouraging folks from using the module level functions for security sensitive tasks, it would have worked by now. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 15 September 2015 at 01:32, Ian Cordasco <graffatcolmingov@gmail.com> wrote:
They don't even have to get onto 2.7.9 per se - the RHEL 7.2 beta just shipped with Robert Kuska's backport of those changes (minus the eventlet breaking internal API change), so it will also filter out through the RHEL/CentOS ecosystem via 7.x and SCLs. (We also looked at a Python 2.6 backport, but decided it was too much work for not enough benefit - folks really need to just upgrade to RHEL/CentOS 7 already, or at least switch to using Software Collections for their Python runtime needs). Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 14 September 2015 at 16:01, Paul Moore <p.f.moore@gmail.com> wrote:
What makes you think that I didn't take it into account? I did: and then rejected it. On a personal level, I believe that defaulting to more secure is worth backward compatibility breaks. I believe that a major reason for the overwhelming prevalence of security vulnerabilities in modern software is that we are overly attached to making people's lives *easy* at the expense of making them *safe*. I believe that software communities in general are too concerned about keeping the stuff that people used around for far too long, and not concerned enough about pushing users to make good choices. The best example of this is OpenSSL. When compiled from source naively (e.g. ./config && make && make install), OpenSSL includes support for SSLv3, SSL Compression, and SSLv2, all of which are known-broken options. To clarify, SSLv2 has been deprecated for security reasons since 1996, but a version of OpenSSL 1.0.2d you build today will happily enable *and use* it. Hell, OpenSSL's own build instructions include this note[0]:
Why is it that users who do not read the wiki (most of them) get an insecure build? Backwards compatibility is why. This is necessarily a reductio ad absurdum type of argument, because I'm trying to make a rhetorical point: I believe that sacrificing security on the altar of backwards compatibility is a bad idea in the long term, and I want to discourage it as best I can. I appreciate your desire to maintain backward compatibility, Paul, I really do. And I think it is probably for the best that people like you work on projects like CPython, while people like me work outside the standard library. However, that won't stop me trying to drag the stdlib towards more secure defaults: it just might make it futile.

On September 14, 2015 at 11:08:39 AM, Cory Benfield (cory@lukasa.co.uk) wrote:
So I will counter this with what I am fully expecting to be the response: people use distributions that compile and configure OpenSSL for them, e.g., `apt-get install openssl` (not obviously the example that works, but you get the idea). That said, last year, Debian, Ubuntu, Fedora, and other distributions all started compiling OpenSSL without SSLv3 as an available symbol, which broke backwards compatibility and TONS of Python projects (eventlet, urllib3, requests, etc.). Why did it break backwards compatibility? Because they knew that they were responsible for the security of their users, and expecting users to recompile OpenSSL themselves with the correct flags was unrealistic. Their users come from a wide range of people:

- System administrators
- Desktop users (if you believe anyone actually uses linux on the desktop ;))
- Researchers
- Developers
- etc.
That said, I'd also like to combat the idea that security experts won't use random. Currently Helios, which is voting software (that anyone can deploy), uses the random module (https://github.com/benadida/helios-server/blob/b07c43dee5f51ce489b6fcb7b7194...) They use it to generate passwords: https://github.com/benadida/helios-server/blob/b07c43dee5f51ce489b6fcb7b7194... https://github.com/benadida/helios-server/blob/b07c43dee5f51ce489b6fcb7b7194... Ben Adida is a security professional who has written papers on creating secure voting systems, but even he uses the random module arguably incorrectly in what should be secure software. The argument that anyone who knows they need secure random functions will use them is clearly invalidated. Not everyone who knows they should be generating securely random things is aware that the random module is insufficient for their needs. Perhaps that code was written before the big red box was added to the documentation and so it was ineffective. Perhaps Ben googled and found that everyone else was using random for passwords (as people have shown is easy to find in this discussion several times). That said, your arguments are easily reduced to "No language should protect its users from themselves", which is equivalent to Python's "we're all consenting adults" philosophy. In that case, we're absolutely safe from any blame for the horrible problems that users inflict on themselves. Anyone who used urllib2/httplib/etc. from the standard library to talk to a site over HTTPS (prior to PEP 466) is to blame because they didn't read the source and know that their sensitive information was easily intercepted by anyone on their network. Clearly, that's their fault. This makes core language development so much easier, doesn't it? Place all the blame on the users for the sake of X (where in this discussion X is the holy grail of backwards compatibility).

On 14 September 2015 at 17:00, Cory Benfield <cory@lukasa.co.uk> wrote:
OK. In *my* experience, systems with appallingly bad security practices run for many years with no sign of an exploit. The vulnerabilities described in this thread pale into insignificance compared to many I have seen. On the other hand, I regularly see systems not being upgraded because the cost of confirming that there are no regressions (much less the cost of making fixes for deliberate incompatibilities) is deemed too high. I'm not trying to justify those things, nor am I trying to say that my experience is in any way "worth more" than yours. These aren't all Python systems. But the culture where such things occur is real, and I have no reason to believe that I'm the only person in this position. (But as it's in-house closed-source, it's essentially impossible to get any good view of how common it is). Paul

On September 14, 2015 at 3:27:22 PM, Paul Moore (p.f.moore@gmail.com) wrote:
What does "no sign of an exploit" mean? Does it mean that if there was an exploit, the attackers didn't put metaphorical giant signs up to say that "Zero Cool" was here? Or is there an active security team running IDS software to ensure that there wasn't a breach? I ask because in my experience, "no sign of an exploit" is often synonymous with "we've never really looked to see if we were exploited, but we haven't noticed anything". This is a dangerous way to look at it, because a lot of exploitation is being done by organized crime where they don't want you to notice that you were exploited because they want to make you part of a botnet or to silently steal data or whatever you have. For these, if they get detected that is a bad thing because they lose that node in their botnet (or whatever). It's a very rare exploit that gets publicly exposed like the Ashley Madison hacks; they are just the ones that get the most attention because they are bombastic and public.
Absolutely! However, I think these systems largely don't upgrade *at all* and are still on whatever version of $LANG they originally wrote the software for. These systems tend to be so regression averse that they don't even risk bug fixes because that might cause a regression. For these people, it doesn't really matter what we do because they aren't going to upgrade anyways, and they keep Red Hat in business by paying them for Python 2.4 until the heat death of the universe. I think the more likely case for concern is people who do upgrade and are willing to tolerate some regression in order to stay somewhat current. These people will push back against *massive* breakage (as seen with the Python 3.x migration taking forever) but are often perfectly fine dealing with small breakages. As someone who does write software that supports a lot of versions (currently, 6-7 versions of CPython alone is my standard depending on whether you count pre-releases or not), having to tweak import statements doesn't even really register on my "give a damn" meter, nor did it for the folks I know who are in similar situations (though this is admittedly a biased and small sample).
I think maybe a problem here is a difference in how we look at the data. It seems that you might focus on the probability of you personally (or the things you work on) getting attacked and thus benefiting from these changes, whereas I, and I suspect the others like me, think about the probability of *anyone* being attacked. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

(The rest of your emails, I'm going to read fully and digest before responding. Might take a day or so.) On 14 September 2015 at 21:36, Donald Stufft <donald@stufft.io> wrote:
This may be true, in some sense. But I'm not willing to accept that you are thinking about everyone, but I'm somehow selfishly only thinking of myself. If that's what you were implying, then frankly it's a pretty offensive way of disregarding my viewpoint. Knowing you, I'm sure that's *not* how you meant it - but do you see how easy it is for the way you word something to make it nearly impossible for me to see past your wording to get to the actual meaning of what you're trying to say? I didn't even consciously notice the implication myself, at first. I simply started writing a pretty argumentative rebuttal, because I felt that somehow I needed to correct what you said, but I couldn't quite say why. Looking at the reality of what I focus on, I'd say it's more like this. I mistrust arguments that work on the basis that "someone, somewhere, might do X bad thing, therefore we must all pay cost Y". The reasons are complex (and I don't know that I fully understand all of my thought processes here) but some aspects that immediately strike me are: * The probability of X isn't really quantified. I may win the lottery, but I don't quit my job - the probability is low. The probability of X matters. * My experience of the probability of X happening varies wildly from that of whoever's making the point. Who is right? Why must one of us "win" and be right? Can't it simply be that my data implies that over the full data set, the actual probability of X is lower than you thought? * The people paying cost Y are not the cause of, nor are they impacted by, X (except in an abstract "we all suffer if bad things happen" sense). I believe in the general principle of "you pay for what you use", so to me you're arguing for the wrong people to be made to pay. Hopefully, those are relatively objective measures. More subjectively, * It's way too easy to say "if X happens once, we have a problem". If you take the stance that we have to prevent X from *ever* happening, you allow yourself the freedom to argue with vague phrases like "might", while leaving the burden of absolute proofs on me. (In the context of RNG proposals, this is where arguments like "let's implement a secure secret library" get dismissed - they still leave open the possibility of *someone* using an inappropriate RNG, so "they don't solve the issue" - even if they reduce the chance of that happening by a certain amount - and neither you nor I can put a figure on how much, so let's not try). * There's little evidence that I can see of preventative security measures having improved things. Maybe this is because it's an "arms race" situation, and keeping up is all we can hope for. Maybe it's because it's hard to demonstrate a lack of evidence, so the demand for evidence is unreasonable. I don't know. * For many years I ran my PC with no anti-virus software. I never got a virus. Does that prove anything? Probably not. The anti-virus software on my work PC is the source of *far* more issues than I have ever seen caused by a virus. Does *that* prove anything? Again, probably not. But my experience with at least *that* class of pressure to implement security is that the cure is worse than the disease. Where does that leave the burden of proof? Again, I don't know, but my experience should at least be considered as relevant data. * Everyone I have ever encountered in a work context (as opposed to in open-source communities) seems to me to be in a similar situation to mine. 
I believe I'm speaking for them, but because it's a closed-source in house environment, I've got no public data to back my comments. And totally subjective, * I'm extremely tired of the relentless pressure of "we need to do X, because security". While the various examples of X may all have ended up being essentially of no disadvantage to me, feeling obliged to read, understand, and comment on the arguments presented every time, gets pretty wearing. * I can't think of a single occasion where we *don't* do X. That may well be confirmation bias, but again subjectively, it feels like nobody's listening to the objections. I get that the original proposals get modified, but if never once has the result been "you're right, the cost is too high, we'll not do X" then that puts security-related proposals in a pretty unique position. Finally, in relation to that last point, and one thing I think is a key difference in our thinking. I do *not* believe that security proposals (as opposed to security bug fixes) are different from any other type of proposal. I believe that they should be subject to all the same criteria for acceptance that anything else is. I suspect that you don't agree with that stance, and believe that security proposals should be held to different standards (e.g., a demonstrated *probability* of benefit is sufficient, rather than evidence of actual benefit being needed). But please speak for yourself on this - I'm not trying to put words into your mouth, it's just my impression. All of which is completely unrelated to either the default RNG for the Python stdlib, or whether I understand and/or accept the security arguments presented here (for clarity, I believe I understand them, I just don't accept them). Paul

On 15 September 2015 at 08:50, Emile van Sebille <emile@fenx.com> wrote:
Historically, yes, but relying solely on perimeter defence is becoming less and less viable as the workforce decentralises, and we see more people using personal devices and untrusted networks to connect to work systems (whether that's their home network or the local coffee shop), as well as relying on public web services rather than internal applications. Enterprise IT is simply *wrong* in the way we currently go about a lot of things, and the public web service sector is showing us all how to do it right. Facilitating that transition is a key part of my day job in Red Hat's Developer Experience team (it's getting a bit off topic, but for a high level company perspective on that: http://www.redhat-cloudstrategy.com/towards-a-frictionless-it-whether-you-li...). And for folks tempted to think "this is just about the web", for a non-web related example of what we as an industry have unleashed through our historical "security is optional" mindset: http://www.wired.com/2015/07/hackers-remotely-kill-jeep-highway/ That's an article on remotely hacking the UConnect system in a Jeep Cherokee to control all sorts of systems that had no business being connected to the internet in the first place. The number of SCADA industrial control systems accessible through the internet is frankly terrifying - one of the reasons we can comfortably assume most humans are either nice or lazy is because we *don't* see most of the vulnerabilities that are lying around being exploited. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On September 14, 2015 at 6:39:28 PM, Paul Moore (p.f.moore@gmail.com) wrote:
No, I don't mean it in the way of you being selfish. I'm not quite sure of the right wording here; essentially it's the probability of an event happening to a particular individual vs the probability of an event occurring at all. To use your lottery example, I *think*, and perhaps I'm wrong, that you're looking at it in terms of: the chance of any particular person participating in the lottery winning the lottery is low, so why should each of these people, as an individual, make plans for how to get the money when they win the lottery, because as individuals they are unlikely to win. Whereas I flip it around and think that someone, somewhere is likely going to win the lottery, so the lottery system should make plans for how to get them the money when they win. I'm not sure of the right "name" for each type, and I don't want to continue to try and ham-fist it, because I don't mean it in an offensive or an "I'm better than you" way and I fear putting my foot in my mouth again :(
Just to be clear, I don't think that "If X happens once, it's a problem" is a reasonable belief and I don't personally hold that belief. It's a sliding scale where we need to figure out what the right solution for Python is for each particular problem. I certainly wouldn't want to use a language that took the approach that if X can ever happen, we need to prevent X. I have seen enough users incorrectly use the random.py module that I think the danger is "real". I also think that, if this were a brand new module, it would be a no-brainer (but perhaps I'm wrong) for the default, module-level API to be safe by default. Going off that assumption, I think the question is really just "Is it worth it?" not "does this make more sense than the current?".
By preventive security measures, do you mean things like PEP 466? I don't quite know how to accurately state it, but I'm certain that PEP 466 directly improved the security of the entire internet (and continues to do so as it propagates).
Antivirus is a particularly bad example of security software :/ It's a massive failing of the security industry that they exist in the state they do. There's a certain bias here though, because it is the job of security sensitive code to "break" things (as in, take otherwise valid input and make it not work). In an ideal world, security software just sits there doing "nothing" from the POV of someone who isn't a security engineer and then will, often through no fault of their own, pop up and make things go kabloom because it detected something insecure happening. This means that for most people, the only interaction they have with something designed to protect them is when it steps in to make things stop working. It is relevant data, but I think it goes back to the different ways of looking at things (what is the individual chance of an event happening, vs the chance of an event happening across the entire population). This might also be why you'll see the backwards-compat folks focus more on experience-driven data and security folks focus more on hypotheticals about what could happen.
I'm not sure what to do about this :( On one side, you're not obligated to read, understand, and comment on every thing that's raised but I totally understand why you do, because I do too, but I'm not sure how to help this without saying that people who care about security shouldn't bring it up either?
Off the top of my head I remember the on-by-default hash randomization for Python 2.x (or the actually secure hash randomization, since 2.x still has the one where it is trivial to recover the original seed). I don't actually remember that many cases where python-dev chose to break backwards compatibility for security. The only ones I can think of are:

* The hash randomization on Python 3.x (sort of? Only if you depended on dict ordering, which wasn't a guarantee anyways).
* The HTTPS improvements where we switched Python to default to verifying certificates.
* The backports of several security features to 2.7 (backport of 3.4's ssl module, hmac.compare_digest, os.urandom's persistent FD, hashlib.pbkdf2_hmac, hashlib.algorithms_guaranteed, hashlib.algorithms_available); one of these is sketched below.

There are probably things that I'm not thinking of, but the hash randomization only broke things if you were depending on dict/set having ordering, which isn't a promised property of dict/set. The backports of security features were done in a pretty minimally invasive way where they would (ideally) only break things if you relied on those names *not* existing on Python 2.7 (which was a nonzero but small set). The HTTPS verification is the main thing I can think of where python-dev actually broke backwards compatibility in an obvious way for people relying on something that was documented to work a particular way. Are there examples I'm not remembering (probably!)? It doesn't feel like 2 sort-of backwards incompatible changes and 1 backwards incompatible change in the lifetime of Python is really that much to me? Is there some crossover between distutils-sig maybe? I've focused a lot more on pushing security on that side of things, both because it personally affects me more and because I think insecure defaults there are a lot worse than insecure defaults in any particular module in the Python standard library.
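As a concrete illustration of one of the 2.7 backports listed above, here is roughly what hashlib.pbkdf2_hmac looks like in use (the password, salt size, and iteration count here are made-up illustrative values, not recommendations):

    import hashlib
    import os

    salt = os.urandom(16)
    # Derive a key from a password using PBKDF2 with HMAC-SHA256.
    key = hashlib.pbkdf2_hmac("sha256", b"correct horse battery staple", salt, 100000)
    print(key.hex())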
Well, I think that all proposals are based on what the probability is that it's going to help some particular percentage of people, and whether it's going to help enough people to be worth the cost. What I think is special about security is the cost of *not* doing something. Security "fails open" in that if someone does something insecure, it's not going to raise an exception or give different results or something like that. It's going to appear to "work" (in that you get the results you expect) while the user is silently insecure. Compare this to, well, let's pretend that there was never a deterministic RNG in the standard library. If a scientist or a game designer inappropriately used random.py they'd pretty quickly learn that they couldn't give the RNG a seed, and even if it was a CSPRNG that had an "add_seed" method that might confuse them, it'd be pretty obvious on the second execution of their program that it's giving them a different result. I think that the bar *should* be lower for something that just silently or subtly does the "wrong" thing vs something that obviously and loudly does the wrong thing. Particularly when the downside of doing the "wrong" thing is as potentially disastrous as it is with security. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On September 14, 2015 at 8:14:33 PM, Donald Stufft (donald@stufft.io) wrote:
This should read: Security "fails open" in that if someone uses an API that allows something insecure to happen (like not validating HTTPS) it's not going to raise an exception or give different results or something like that. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On 15 September 2015 at 08:39, Paul Moore <p.f.moore@gmail.com> wrote:
Most of the time, when the cost of change is clearly too high, we simply *don't ask*. hmac.compare_digest() is an example of that, where having a time-constant comparison operation readily available in the standard library is important from a security perspective, but having standard equality comparisons be as fast as possible is obviously more important from a language design perspective. Historically, it was taken for granted that backwards compatibility concerns would always take precedence over improving security defaults, but the never-ending cascade of data breaches involving personally identifiable information is proving that we, as a collective industry, are *doing something wrong*: http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-bre... A lot of the problems we need to address are operational ones as we upgrade the industry from a "perimeter defence" mindset to a "defence in depth" mindset, and hence we have things like continuous integration, continuous deployment, application and service sandboxing, containerisation, infrastructure-as-code, immutable infrastructure, etc, etc, etc. That side of things is mostly being driven by infrastructure software vendors (whether established ones or startups), where we have the fortunate situation that the security benefits are tied in together with a range of operational efficiency and capability benefits [1]. However, there's also increasing recognition that some of the problems are due to the default behaviours of the programming languages we use to *create* applications, and in particular the fact that many security issues involve silent failure modes. Sometimes the right answer to those is to turn the silent failure into a noisy failure (as with certificate verification in PEP 476), other times it is about turning the silent failure into a silent success (as is being proposed for the random module API), and yet other times it is simply about lowering the barriers to someone doing the right thing once they're alerted to the problem (as with the introduction of hmac.compare_digest() and ssl.create_default_context(), and their backports to the Python 2.7 series; both are illustrated below). At a lower level, languages like Go and Rust are challenging some of the assumptions in the still dominant C-based memory management model for systems programming. Rust in particular is interesting in that it has a much richer compile-time enforced concept of memory ownership than C does, while still aiming to keep the necessary runtime support very light. Regards, Nick. [1] For folks wanting more background on some of the factors driving this shift, I highly recommend Google's "BeyondCorp" research paper: http://research.google.com/pubs/pub43231.html -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
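For readers unfamiliar with those two helpers, here is roughly what using them looks like (the token values below are made up for illustration):

    import hmac
    import ssl

    expected = b"token-from-the-server"
    supplied = b"token-from-the-client"
    # Constant-time comparison: unlike ==, the run time does not leak
    # where the first mismatching byte is.
    print(hmac.compare_digest(expected, supplied))   # False

    # A TLS context with certificate verification and hostname checking enabled.
    ctx = ssl.create_default_context()
    print(ctx.verify_mode == ssl.CERT_REQUIRED)      # True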

On 14 September 2015 at 23:39, Paul Moore <p.f.moore@gmail.com> wrote:
(The rest of your emails, I'm going to read fully and digest before responding. Might take a day or so.)
Point by point responses exhaust and frustrate me, and don't really serve much purpose other than to perpetuate the debate. So I'm going to make some final points, and then stop. This is based on having read the various emails responding to my earlier comments. If it looks like I haven't read something, please assume I have but either you didn't get your point across, or maybe I simply don't agree with you.

Why now?
--------
First of all, the big question for me is why now? The random module has been around in its current form for many, many years. Security issues are not new, maybe they are slowly increasing, but there's been no step change. The only thing that seems to have changed is that someone (Theo) has drawn attention to the random module. So I feel that the onus is on the people proposing change to address that. Show me the evidence that we've had an actual problem for many years, and demonstrate that it's a good job we spotted it at last, and now have a chance to fix it. Explain to me what has been going wrong all these years that I'd never even noticed. Arguments that people are misusing the module aren't sufficient in themselves - they've (presumably) been doing that for years. In all that time, who was hacked? Who lost data? As a result of random.random being a PRNG rather than being crypto-secure? I'm not asking for an unassailable argument, just acknowledgement that it's *your* job to address that question, and not mine to persuade you that "we've been alright so far" is a compelling reason to reject your proposal.

Incorrect code on SO etc
------------------------
As regards people picking up insecure code snippets from the internet and using them, there's no news there. I can look round and find hundreds of bits of incorrect code in any area you want. People copy/paste garbage code all the time. To my embarrassment, I've done it myself in the past :-( But I'm reminded of https://xkcd.com/386/ - "somebody is wrong on the internet!" This proposal, and in particular the suggestion that we need to retrospectively make the code snippets quoted here secure, strikes me as a huge exercise in trying to correct all the people who are wrong on the internet. There's certainly value in "safe by default" APIs, I don't disagree with that, but I honestly fail to see how quoting incorrect code off the internet is a compelling argument for anything.

Millions of users are affected
------------------------------
The numbers game is also a frustrating exercise here. We keep hearing that "millions of users are affected by bad code", that scans of Google almost immediately find sites with vulnerabilities. But I don't see anyone pointing at a single documented case of an actual exploit caused by Python's random module. There's no bug report. There's no security alert notification. How are those millions of users affected? Their level of risk is increased? Who can tell that? Are any of the sites identified holding personal data? Not all websites on the internet are *worth* hacking. And I feel that expressing that view is somehow frowned on. That "it doesn't matter" is an unacceptable view to hold. And so, the responses to my questions feel personal, they feel like criticisms of me personally, that I'm being unprofessional. I don't want to make this a big deal, but the code of conduct says "we're tactful when approaching differing views", and it really doesn't feel like that. I understand that the whole security thing is a numbers game. And that it's about assessing risk.
But what risk is enough to trigger a response? A 10% increased chance of any given website being hacked? 5%? 1%? Again, I'm not asking to use the information to veto a change. I'm asking to *understand your position*. To better assess your arguments, so that I can be open to persuasion, and to *agree* with you, if your arguments are sound. Furthermore, should we not take into account other languages and approaches at this point? Isn't PHP a well-known "soft target"? Isn't phishing and social engineering the best approach to hacking these days, rather than cracking RNGs? I don't know, and I look to security experts for advice here. So please explain for me, how are you assessing the risks, and why do you judge this specific risk high enough to warrant a response? The impression I get is that the security view is that *any* risk, no matter how small, once identified, warrants a response. "Do nothing" is never an option. If that's your position, then I'm sorry, but I simply don't agree with you. I don't want to live in a world that paranoid, and I'm unsure how to get past this point to have a meaningful dialog.

History, and security's "bad rep"
---------------------------------
Donald asked if I was experiencing some level of spill-over from distutils-sig, where there has *also* been a lot of security churn (far more than here). Yes, I am. No doubt about that. On distutils-sig, and pip in particular, it's clear to see a lot of frustration from users with the long-running series of security changes. The tone of bug reports is frustrated and annoyed. Users want a break from being forced to make changes. Outside of Python, and speaking purely from my own experience in the corporate world, security is pretty uniformly seen as an annoying overhead, and a block on actually getting the job done. You can dismiss that as misguided, but it's a fact. "We need to do this for security" is a direct challenge to people to dismiss it as unnecessary, and often to immediately start looking for ways to bypass the requirement "so that it doesn't get in the way". I try not to take that attitude in this sort of debate, but at the same time, I do try to *represent* that view and ask for help in addressing it. The level of change in core Python is far less than on distutils-sig, and has been relatively isolated from "non-web" areas. People understand (and are grateful for) increases in "secure by default" behaviour in code like urllib and ssl. They know that these are places where security is important, where getting it right is harder than you'd think, and where trusting experts to do the hard thinking for you is important. But things like hash randomisation and the random module are less obviously security related. The feedback from hash randomisation focused on "why did you break my code?". It wasn't a big deal, people were relying on undocumented behaviour and accepted that, but they did see it as a breakage from a security fix. I expect the same to be true with the random module, but with the added dimension that we're proposing changing documented behaviour this time. As a result of similar arguments applying to every security change, and those arguments never *really* seeming to satisfy people, there's a lot of reiterated debate. And that's driving interested but non-expert people away from contributing to the discussion. So we end up with a lack of checks and balances because people without a vested interest in tightening security "tune out" of the debates. I see that as a problem.
But ultimately, if we can't find a better way of running these discussions, I don't know how we fix it. I certainly can't continue being devil's advocate every time. Anyway, that's me done on this thread. I hope I've added more benefit than cost to the discussion. Thanks to everyone for responding to my questions - even if we all felt like we were both just repeating the same thing, it's a lot of effort doing so and I appreciate your time. Paul

Paul Moore <p.f.moore@...> writes: [snip well-reasoned paragraphs] I want to add that the dichotomy between "security-minded" and "non-security-minded" that has been used for rhetorical purposes has no basis in reality. Several "non-security-minded" devs (of the kind who have *actually* contributed a lot of code to CPython) have a pretty good grasp of cryptography and just don't like security theater. Stefan Krah

On 15 September 2015 at 13:08, Stefan Krah <skrah@bytereef.org> wrote:
Agreed, and every time I ended up looking for words for the two "sides", I ended up feeling uncomfortable. There are no "sides" here, just a variety of people with a variety of experiences, who want to feel assured that their voices are being heard. Paul

On September 15, 2015 at 7:04:52 AM, Paul Moore (p.f.moore@gmail.com) wrote:
The answer to "Why now?" is basically: because someone brought it up. I realize that's a pretty arbitrary thing, but I'm not sure what answer would even be acceptable here. When is an OK time to do it in your eyes? Is it only after there is a public, known attack against the RNG? Is it only when the module is first being added? The sad state of affairs is that it's only been relatively recently that our industry as a whole has really taken security seriously, so there are a lot of things out there that are not well designed from a security POV. We can't go back in time and change the original mistake, but we can repair it going into the future.
The argument is basically that security is an important part of API design, and that if you look at what people are doing in practice, it gives you an idea of how people think they should use the API. It's kind of like looking at a situation like this: https://i.imgur.com/0gnb7Us.jpg and concluding that maybe we should pave that worn down footpath, because people are going to use it anyways.
So a big part of this is certainly preventative. It's a relatively recent development that hacking went from individuals or small teams going after big targets to being a business in its own right. There are literally giant office complexes in places like Russia and China filled with employees in cubicles, but they aren't writing software like at a normal company, they are just trawling around the internet, looking for targets, trying to expand botnets, looking for anything and everything they can get their hands on. It's also true that there isn't going to be a big fanfare for *most* actual hacked computers/sites. Most of the time the people running the site simply won't ever know, they'll just be silently hosting malware or having their users' passwords fed into other sites. Very few exploits actually get noticed, and when noticed it's unlikely they get public attention. I'd also suggest that for changes like these, if someone was exploited by this they'd probably look at the documentation for random.py, see that they were accidentally using the module wrong, and then blame themselves and not ever bother to file a bug report. It is my opinion that it's not really their fault that the API led them to believe that what they were doing was right.
Actually, all sites on the internet *are* worth hacking, depending on what you call hacking. Malware is constantly being hosted on tiny sites that most wouldn't call "worth" hacking, but malware authors were able to hack in some way and then they uploaded their malware there. If there are user logins it's likely that people reused username and passwords, so if you can get the passwords from one smaller site, it's possible you can use that as a door into a larger, more important site. Plus, there's also the desire for botnets to add more and more nodes into their swarm, they don't care what site you're hosting, they just want the machine. One key problem to the security of the internet as a whole is that there are a lot of small sites without dedicated security teams, or anyone who really knows security at all. These are easy targets for people and most languages and libraries make it far too easy for people to do the wrong thing.
It's basically a gut feeling, since we can't get any hard data here. Things like being able to look online and find code in the wild that does this wrong within minutes give us an idea of how likely it is, as does reasoning about what people who don't know the difference between ``random.random()`` and ``random.SystemRandom().random()`` are likely to do, along with just a little bit of guessing based on experience with similar situations. Another input into this equation is how likely it is that this change would break someone and, once broken, how easy it will be to fix things. I sadly can't give anything more specific than that here, because it's a bit of an art form crossed with personal biases :(
Do nothing is absolutely an option, but most security focused folks don't take a scorched earth view of security, so we often don't bother to even mention a possible change unless we think that doing nothing is the wrong answer. An example: going back to PEP 476, where we enabled TLS verification by default for HTTPS, we limited it to *only* HTTPS even though TLS is used by many other protocols, because it was our opinion that doing nothing for those protocols was the right call. Those protocols are still insecure by default, but doing something about that by default would break too much for us to be willing to even suggest it. On top of that, we tend to prioritize the things we do try to have happen, so we focus on things with the smallest fallout or the biggest upsides and we ignore other things until later. This is probably why there's some bias that makes it look like doing nothing is never an option: we already self-select what we choose to push forward, because we *do* care about backwards compatibility too.
I think a lot of these changes are paying down the technical debt of two decades of (industry standard) lack of focus on security. It sucks, but when we come out the other side (because hopefully new APIs and modules will be better designed with security in mind given our new landscape) we should hopefully be in a much better situation. On the distutils-sig side, I think that PEP 470 is the last breaking change that I can think of that we'll need to make in the name of security; we've paid down that particular bit of technical debt, and once that lands we'll have a pretty decent story. We still have other kinds of technical debt to pay down though :(
Things don't really satisfy people because oftentimes they fundamentally don't care about security. That is perfectly reasonable, and I don't expect everyone to care about security, but they simply don't. However, in my opinion we have a moral obligation to try and do what we reasonably can to protect people. It's a bit like social safety nets: one person might ask why they are being asked to pay taxes, after all they never needed government assistance, but by asking every citizen to pay in, they can try to keep people from falling through the cracks. This isn't a social safety net, it's a security safety net.
----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On 15 September 2015 at 01:01, Paul Moore <p.f.moore@gmail.com> wrote:
This may be at the core of the disagreement, as we're not talking "one or two people", we're talking tens of millions. While wearing my "PSF Director" hat, I spend a lot of time talking to professional educators, and recently organised the first "Python in Education" miniconf at PyCon Australia. If you look at the inroads we're making across primary, secondary and tertiary education, as well as through workshops like Software Carpentry and DjangoGirls, a *lot* of people around the world are going to be introduced to text based programming over the coming decades by way of Python. That level of success brings with it a commensurate level of responsibility: if we're setting those students up for future security failures, that's *on us* as language designers, not on them for failing to learn to avoid traps we've accidentally laid for them (because *we* previously didn't know any better). Switching back to my "security wonk" hat, the historical approach to computer security has been "secure settings are opt in, so only qualified experts should be allowed to write security sensitive software". What we've learned as an industry (the hard way) is that this approach *doesn't work*. The main reason it doesn't work is the one that was part of the rationale for the HTTPS changes in PEP 476: when security failures are silent by default, you generally don't find out that you forgot to flip the "I need this to be secure" switch until *after* the system you're responsible for has been compromised (with whatever consequences that may have for your users). The law of large numbers then tells us that even if (for example) only 1 in 1000 people forget to flip the "be secure" switch when they needed it (or don't even know that the switch *exists*), it's a practical certainty that when you have millions of programmers using your language (and you don't climb to near the top of the IEEE rankings without that), you're going to be hitting that failure mode regularly as a collective group. We have the power to mitigate that harm permanently *just by changing the default behaviour of the random module*. However, that has a cost: it causes problems for some current users for the sake of better serving future users. That's what transition strategy design is about, and I'll take that up in the other thread. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
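A back-of-the-envelope version of that argument (both figures here are purely illustrative, not measurements):

    p_forget = 1.0 / 1000    # fraction of users who miss the "be secure" switch
    programmers = 1000000    # rough order of magnitude of the user base
    print(p_forget * programmers)   # 1000.0 -- at scale, the failure mode is a certainty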

On September 10, 2015 at 10:21:11 AM, Donald Stufft (donald@stufft.io) wrote:
I wanted to try and test this. These are not super scientific since I just ran them on a single computer once (but 10 million iterations each) but I think it can probably give us an indication of the differences? I put the code up at https://github.com/dstufft/randtest but it's a pretty simple module. I'm not sure if (double)arc4random() / UINT_MAX is a reasonable way to get a double out of arc4random (which returns a uint) that is between 0.0 and 1.0, but I assume it's fine at least for this test. Here's the results from running the test on my personal computer which is running the OSX El Capitan public Beta:

$ python test.py
Number of Calls: 10000000
+---------------+--------------------+
|    method     |   usecs per call   |
+---------------+--------------------+
| deterministic | 0.0586802460020408 |
| system        | 1.6681434757076203 |
| userland      | 0.1534261149005033 |
+---------------+--------------------+

I'll try it against OpenBSD later to see if their implementation of arc4random is faster than OSX. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

[Donald Stufft <donald@stufft.io>, on arc4random speed]
arc4random() specifically returns uint32_t, which is 21 bits shy of what's needed to generate a reasonable random double. Our MT wrapping internally generates two 32-bit uint32_t thingies, and pastes them together like so (Python's C code here): """ /* random_random is the function named genrand_res53 in the original code; * generates a random number on [0,1) with 53-bit resolution; note that * 9007199254740992 == 2**53; I assume they're spelling "/2**53" as * multiply-by-reciprocal in the (likely vain) hope that the compiler will * optimize the division away at compile-time. 67108864 is 2**26. In * effect, a contains 27 random bits shifted left 26, and b fills in the * lower 26 bits of the 53-bit numerator. * The orginal code credited Isaku Wada for this algorithm, 2002/01/09. */ static PyObject * random_random(RandomObject *self) { PY_UINT32_T a=genrand_int32(self)>>5, b=genrand_int32(self)>>6; return PyFloat_FromDouble((a*67108864.0+b)*(1.0/9007199254740992.0)); } """ So now you know how to make it more directly comparable. The high-order bit is that it requires 2 calls to the 32-bit uint integer primitive to get a double, and that can indeed be significant.
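For reference, here is a quick Python sketch of the "directly comparable" construction Tim describes, applied to any 32-bit source (rand32 below is a stand-in for something like a ctypes binding to arc4random; that binding is an assumption on my part, not part of the benchmark code):

    def double53_from_32bit(rand32):
        # Paste two 32-bit draws together the way genrand_res53 does:
        # 27 high bits shifted left 26, plus 26 low bits, all over 2**53.
        a = rand32() >> 5
        b = rand32() >> 6
        return (a * 67108864.0 + b) * (1.0 / 9007199254740992.0)

    # Example with a stand-in 32-bit source:
    import random
    print(double53_from_32bit(lambda: random.getrandbits(32)))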
Just noting that most people timing the OpenBSD version seem to comment out the "get stuff from the kernel periodically" part first, in order to time the algorithm instead of the kernel ;-) In real life, though, they both count, so I like what you're doing better.

On September 10, 2015 at 1:24:05 PM, Tim Peters (tim.peters@gmail.com) wrote:
It didn't change the results really though:

My OSX El Capitan machine:
Number of Calls: 10000000
+---------------+---------------------+
|    method     |   usecs per call    |
+---------------+---------------------+
| deterministic | 0.05792283279588446 |
| system        | 1.7192466521984897  |
| userland      | 0.17901834140066059 |
+---------------+---------------------+

An OpenBSD 5.7 VM:
Number of Calls: 10000000
+---------------+---------------------+
|    method     |   usecs per call    |
+---------------+---------------------+
| deterministic | 0.06555143180000868 |
| system        | 0.8929547749999983  |
| userland      | 0.16291017429998647 |
+---------------+---------------------+

----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On Sep 10, 2015, at 07:21, Donald Stufft <donald@stufft.io> wrote:
But that isn't a fix, unless all your code is in a single module. If I call random.seed in game.py and then call random.choice in aiplayer.py, I'll get different results after your fix than I did before. What I'd need to do instead is create a separate myrandom.py that does this and then exports all of the bound methods of random as top-level functions, and then make game.py, aiplayer.py, etc. all import myrandom as random. Which is, while not exactly hard, certainly harder, and much less obvious, than the incorrect fix that you've suggested, and it may not be immediately obvious that it's wrong until someone files a bug three versions later claiming that when he reloads a game the AI cheats and you have to track through the problem. That's why I suggested the set_default_instance function, which makes this problem trivial to solve in a correct way instead of in an incorrect way.
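For concreteness, a rough sketch of the myrandom.py workaround being described (the file name and DeterministicRandom are illustrative; with today's stdlib the instance would just be random.Random()):

    # myrandom.py -- one shared, seedable instance re-exported as
    # module-level functions, so game.py and aiplayer.py share one stream.
    import random as _random

    _inst = _random.Random()   # hypothetically random.DeterministicRandom()

    seed = _inst.seed
    random = _inst.random
    choice = _inst.choice
    randint = _inst.randint
    shuffle = _inst.shuffle

game.py and aiplayer.py would then both do "import myrandom as random" instead of "import random".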

On Sep 10, 2015, at 15:46, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
Actually, I just thought of an even simpler solution: Add a deterministic_singleton member to random (which is just initialized to DeterministicRandom() at startup). Now, the user fix is just to change "import random" to "from random import deterministic_singleton as random".

On 11 September 2015 at 08:54, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
Actually, I just thought of an even simpler solution:
Add a deterministic_singleton member to random (which is just initialized to DeterministicRandom() at startup). Now, the user fix is just to change "import random" to "from random import deterministic_singleton as random".
Change the spelling to "import random.seeded_random as random" and the user fix is even shorter. I do agree with the idea of continuing to provide a process global instance of the current PRNG for ease of migration - changing a single import is a good way to be able to address a deprecation, and looking for the use of seeded_random in a security sensitive context would still be fairly straightforward. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sep 10, 2015, at 19:48, Nick Coghlan <ncoghlan@gmail.com> wrote:
OK, sure; I don't care much about the spelling. I think neither name will be unduly confusing to novices, and anyone who actually wants to understand what the choice means will use help or the docs or a Google search and find out in a few seconds.
Personally, I think we're done with that change. Deprecation of the names random.Random, random.random(), etc. is sufficient to prevent people from making mistakes without realizing it. Having a good workaround to prevent code churn for the thousands of affected apps means the cost doesn't outweigh the benefits. So, the problem Theo raised is solved.[1] Which means the more radical solution he offered is unnecessary. Unless we're seriously worried that some people who aren't sure if they need Seeded or System may incorrectly choose Seeded just because of performance, there's no need to add a Chacha choice alongside them. Put it on PyPI, maybe with a link from the SystemRandom docs, and see how things go from there. [1] Well, it's not quite solved, because someone has to figure out how to organize things in the docs, which obviously need to change. Do we tell people how to choose between creating a SeededRandom or SystemRandom instance, then describe their interface, and then include a brief note "... but for porting old code, or when you explicitly need a globally shared Seeded instance, use seeded_random"? Or do we present all three as equally valid choices, and try to explain why you might want the singleton seeded_random vs. constructing and managing an instance or instances?

On 11 September 2015 at 13:18, Andrew Barnert <abarnert@yahoo.com> wrote:
Personally, I think we're done with that change. Deprecation of the names random.Random, random.random(), etc. is sufficient to prevent people from making mistakes without realizing it.
Implementing dice rolling or number guessing for a game as "from random import randint" is *not* a mistake, and I'm adamantly opposed to any proposal that makes it one - the cost imposed on educational use cases would be far too high. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan writes:
Implementing dice rolling or number guessing for a game as "from random import randint" is *not* a mistake,
Turning the number guessing game into a text CAPTCHA might be one, though. That randint may as well be crypto strong, modulo the problem that people who use an explicit seed get punished for knowing what they're doing. I suppose it would be too magic to have the seed method substitute the traditional PRNG for the default, while an implicitly seeded RNG defaults to a crypto strong algorithm? Steve

On Fri, Sep 11, 2015 at 2:44 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Ooh. Actually, I rather like that idea. If you don't seed the RNG, its output will be unpredictable; it doesn't matter whether it's a PRNG seeded by an unknown number, a PRNG seeded by /dev/urandom, a CSRNG, or just reading from /dev/urandom every time. Until you explicitly request determinism, you don't have it. If Python changes its RNG algorithm and you haven't been seeding it, would you even know? Could it ever matter to you? It would require a bit of an internals change; is it possible that code depends on random.seed and random.randint being bound methods of the same object? To implement what you describe, they'd probably have to not be. ChrisA

2015-09-11 6:54 GMT+02:00 Chris Angelico <rosuav@gmail.com>:
I have thought of this idea and was quite seduced by it. However, in this case, on a non-seeded generator, getstate/setstate would be meaningless. I also wonder what pickling generators does.

On Fri, Sep 11, 2015 at 6:54 AM, Chris Angelico <rosuav@gmail.com> wrote:
I've also thought about this idea. The problem with it is that seed() and friends affect a global instance of Random. If, after this change, there was a library that used random.random() for crypto, calling seed() in the main program (or any other library) would make it insecure. So we'd still be in a situation where nobody should use random() for crypto.

On Fri, Sep 11, 2015 at 6:08 PM, Petr Viktorin <encukou@gmail.com> wrote:
So library functions shouldn't use random.random() for anything they know needs security. If you write a function generate_password(), the responsibility is yours to ensure that it's entropic rather than deterministic. That's no different from the current situation (seeding the RNG makes it deterministic) except that the unseeded RNG is not just harder to predict, it's actually entropic. In some cases, having the 99% by default is a barrier to people who need the 100%. (Conflating UCS-2 with Unicode deceives people into thinking their program works just fine, and then it fails on astral characters.) But in this case, there's no perfect-by-default solution, so IMO the best two solutions are: Be great, but vulnerable to an external seed(), until someone chooses; or have no random number generation until someone chooses. We know that the latter is a terrible option for learning, so vulnerability to someone else calling random.seed() is a small price to pay. ChrisA

On Fri, Sep 11, 2015, at 00:54, Chris Angelico wrote:
That's a ridiculous thing to depend on.
To implement what you describe, they'd probably have to not be.
You could implement one class that calls either a SystemRandom instance or an instance of another class depending on which mode it is in.
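A minimal sketch of that kind of mode-switching class (all names are illustrative, not a proposed API): unseeded calls are served by SystemRandom, and an explicit seed() swaps in the deterministic Mersenne Twister.

    import random

    class AutoSeededRandom:
        def __init__(self):
            self._impl = random.SystemRandom()

        def seed(self, a=None):
            # Explicit request for determinism: switch implementations.
            self._impl = random.Random(a)

        def __getattr__(self, name):
            # Delegate random(), choice(), randint(), and the rest to
            # whichever implementation is currently active.
            return getattr(self._impl, name)

    rng = AutoSeededRandom()
    print(rng.random())   # unpredictable, backed by os.urandom
    rng.seed(42)
    print(rng.random())   # reproducible from here on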

On 11 September 2015 at 05:44, Stephen J. Turnbull <stephen@xemacs.org> wrote:
One issue with that - often, programs simply use a RNG for their own purposes, but offer a means of getting the seed after the fact for reproducibility reasons (the "map seed" case, for example). Pseudo-code:

    if <user supplied a "seed">:
        state = <user-supplied value>
        random.setstate(state)
    else:
        state = random.getstate()
    ... do the program's main job, never calling seed/setstate
    if <user requests the "seed">:
        print state

So getstate (and setstate) would also need to switch to a PRNG. There are actually very few cases I can think of where I'd need seed() (as opposed to setstate()). Maybe if I let the user *choose* a seed? Some games do this. Paul

On Fri, Sep 11, 2015 at 1:02 AM, Paul Moore <p.f.moore@gmail.com> wrote:
You don't really want to use the full 4992 byte state for a "map seed" application anyway (type 'random.getstate()' in a REPL and watch your terminal scroll down multiple pages...). No game actually uses map seeds that look anything like that. I'm 99% sure that real applications in this category are actually using logic like:

    if <user supplied a "seed">:
        seed = user_seed()
    else:
        # use some RNG that was seeded with real entropy
        seed = random_short_printable_string()
    r = random.Random(seed)
    # now use 'r' to generate the map

-n -- Nathaniel J. Smith -- http://vorpus.org

On 11 September 2015 at 10:52, Nathaniel Smith <njs@pobox.com> wrote:
Yeah, good point. As I say, I don't actually *use* this in the example program I'm thinking of, I just know it's a feature I need to add in due course. So when I do, I'll have to look into how to best implement it. (And I'll probably nick the approach you show above, thanks ;-)) Paul

On 11 September 2015 at 11:07, Andrew Barnert <abarnert@yahoo.com> wrote:
But games do store the entire map state with saved games if they want repeatable saves (e.g., to prevent players from defeating the RNG by save scumming).
So far off-topic it's not funny, but a number of games I know of (e.g., Factorio, Minecraft) include a means to get a map seed (a simple text string) which you can publish, and which allows other users to (in effect) play on the same map as you. That's different from saves. Paul

On Fri, Sep 11, 2015, at 06:10, Paul Moore wrote:
Of course, Minecraft doesn't actually use the seed in such a simple way as seeding a single-sequence random number generator. If it did, the map would depend on what order you visited regions in. (This is less of an issue for games with finite worlds)

On 10 September 2015 at 23:46, Andrew Barnert <abarnert@yahoo.com> wrote:
Note that this is another case of wanting "correct by default". Requiring the user to pass around a RNG object makes it easy to do the wrong thing - because (as above) people can too easily create multiple independent RNGs by mistake, which means your numbers don't necessarily satisfy the randomness criteria any more. "Secure by default" isn't (and shouldn't be) the only example of "correct by default" that matters here. Whether "secure" is more important than "gives the right results" is a matter of opinion, and application dependent. Password generators have more need to be secure than to be mathematically random, Monte Carlo simulations (and to a lesser extent games) the other way around. Many things care about neither. If we can't manage "correct and secure by default", someone (and it won't be me) has to decide which end of the scale gets preference. Paul.

On Fri, Sep 11, 2015 at 1:11 AM, Paul Moore <p.f.moore@gmail.com> wrote:
Accidentally creating multiple independent RNGs is not going to cause any problems with respect to randomness. It only creates a problem with respect to determinism/reproducibility. Beyond that I just find your message a bit baffling. I guess I believe you that you find passing around RNG objects to make it easy to do the wrong thing, but it's exactly the opposite of my experience: when writing code that cares about determinism/reproducibility, then for me, passing around RNG objects makes it way *easier* to get things right. It makes it much more obvious what kinds of refactoring will break reproducibility, and it enables all kinds of useful tricks. E.g., keeping to the example of games and "aiplayer.py", a common thing game designers want to do is to record playthroughs so they can be replayed again as demos or whatever. And a common way to do that is to (1) record the player's inputs, (2) make sure that the way the game state evolves through time is deterministic given the players inputs. (This isn't necessarily the *best* strategy, but it is a common one.) Now suppose we're writing a game like this, and we have a bunch of "enemies", each of whose behavior is partially random. So on each "tick" we have to iterate through each enemy and update its state. If we are using a single global RNG, then for correctness it becomes crucial that we always iterate over all enemies in exactly the same order. Which is a mess. A better strategy is, keep one global RNG for the level, but then when each new enemy is spawned, assign it its own RNG that will be used to determine its actions, and seed this RNG using a value sampled from the global RNG (!). Now the overall pattern of the game will be just as random, still be deterministic, and -- crucially -- it no longer matters what order we iterate over the enemies in. I particularly would not want to use the global RNG in any program that was complicated enough to involve multiple modules. Passing state between inter-module calls using a global variable is pretty much always a bad plan, and that's exactly what you're talking about here. Non-deterministic global RNGs are fine, b/c they're semantically stateless; it's exactly the cases where you care about the determinism of the RNG state that you want to *stop* using the global RNG. -n -- Nathaniel J. Smith -- http://vorpus.org
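A minimal sketch of that per-enemy RNG pattern (the class, the actions, and the 64-bit seed size are just illustrative choices):

    import random

    class Enemy:
        def __init__(self, level_rng):
            # Each enemy gets its own stream, seeded from the level RNG, so
            # the iteration order over enemies no longer affects replays.
            self.rng = random.Random(level_rng.getrandbits(64))

        def act(self):
            return self.rng.choice(["attack", "wander", "wait"])

    level_rng = random.Random("player-visible map seed")
    enemies = [Enemy(level_rng) for _ in range(5)]
    print([e.act() for e in enemies])   # identical output for a given map seed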

On Fri, Sep 11, 2015 at 8:26 PM, Nathaniel Smith <njs@pobox.com> wrote:
As long as the order you seed their RNGs is deterministic. And if you can do that, can't you iterate over them in a deterministic order too? ChrisA

On Thu, Sep 10, 2015 at 09:10:09AM -0400, Donald Stufft wrote:
Ironically, the spelling mistake in your example is a good example of how this is worse. Another reason why it's worse is that if you create a new instance every single time you need a random number, as you do above, performance is definitely going to suffer. By my timings, creating a new SystemRandom instance each time is around two times slower; creating a new DeterministicRandom (i.e. the current MT default) instance each time is over 100 times slower. Hypothetically, it may even hurt your randomness: it may be that some future (or current) (C)PRNG's quality will be "less random" (biased, predictable, or correlated) because you keep using a fresh instance rather than the same one. TL;DR: Yes, calling `random.choice` is *significantly better* than calling `random.SomethingRandom().choice`. It's better for beginners, it's even better for expert users whose random needs are small, and those whose needs are greater shouldn't be using the latter anyway.
Is this a trick question? In the absence of a keylogger and screen reader monitoring my system while I run that code snippet, of course it is safe. In the absence of any credible attack on the password based on how it was generated, of course it is safe.
Nobody is saying that To put that question another way: "If you exclude the case where crypto would
This might be acceptable, although I wouldn't necessarily deprecate the random module.
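
For anyone who wants to reproduce the timing claims above, here is a rough sketch using the standard timeit module; the exact ratios will vary by platform and Python version, and the "two times" and "100 times" figures are the poster's own measurements:

    import timeit

    setup = "import random, string; chars = string.ascii_letters + string.digits"

    # Reusing the module-level (MT-backed) functions.
    print(timeit.timeit("random.choice(chars)", setup=setup, number=100000))

    # Reusing a single SystemRandom instance.
    print(timeit.timeit("sr.choice(chars)",
                        setup=setup + "; sr = random.SystemRandom()",
                        number=100000))

    # Creating a fresh instance for every value -- the pattern being criticised.
    print(timeit.timeit("random.SystemRandom().choice(chars)", setup=setup,
                        number=100000))
    print(timeit.timeit("random.Random().choice(chars)", setup=setup,
                        number=100000))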

On 11 September 2015 at 14:36, Steven D'Aprano <steve@pearwood.info> wrote:
I feel like I must have misunderstood you Steven. Didn't you just exclude the attack vector that we're discussing here? What we are saying is that a deterministic PRNG definitionally allows attacks on the password based on how it was generated. The very nature of a deterministic PRNG is that it is possible to predict subsequent outputs based on previous ones, or at least to dramatically constrain the search space. This is not a hypothetical attack, and it's not even a very complicated one. Now, it's possible that the way the system is constructed precludes this attack, but let me tell you that vastly more engineers think that about their systems than are actually right about it. Generally, if the word 'password' appears anywhere near something, you want to keep a Mersenne Twister as far away from it as possible. The concern being highlighted in this thread is that users who don't know what I just said (the vast majority) are at risk of writing deeply insecure code. We think the default should be changed.
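
For the curious, the prediction Cory describes is a standard textbook exercise against MT19937. Below is a minimal illustrative sketch, assuming the attacker observes 624 consecutive 32-bit outputs that line up with one regeneration block (e.g. from a freshly seeded generator); it is not code from this thread, and handling misaligned observations takes a little more bookkeeping:

    import random

    def _invert_right(y, shift):
        # Invert y = x ^ (x >> shift) by fixed-point iteration (top bits first).
        x = y
        for _ in range(32 // shift + 1):
            x = y ^ (x >> shift)
        return x

    def _invert_left(y, shift, mask):
        # Invert y = x ^ ((x << shift) & mask) by fixed-point iteration (low bits first).
        x = y
        for _ in range(32 // shift + 1):
            x = y ^ ((x << shift) & mask)
        return x & 0xFFFFFFFF

    def untemper(y):
        # Undo MT19937's output tempering to recover one word of internal state.
        y = _invert_right(y, 18)
        y = _invert_left(y, 15, 0xEFC60000)
        y = _invert_left(y, 7, 0x9D2C5680)
        y = _invert_right(y, 11)
        return y

    # "Victim": a freshly seeded generator, so the observed outputs line up with
    # one full 624-word state block.
    victim = random.Random(1234)
    observed = [victim.getrandbits(32) for _ in range(624)]

    # Rebuild the internal state from the observed outputs and load it into a clone.
    state = [untemper(word) for word in observed]
    clone = random.Random()
    clone.setstate((3, tuple(state) + (624,), None))

    # The clone now predicts the victim's future output exactly.
    assert [clone.getrandbits(32) for _ in range(10)] == \
           [victim.getrandbits(32) for _ in range(10)]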

On Sat, Sep 12, 2015 at 12:28 AM, Cory Benfield <cory@lukasa.co.uk> wrote:
Only if an attacker can access many passwords generated from the same MT stream, right? If the entire program is as was posted (importing random and using random.choice(), then terminating), then an attack would have to be based on the seeding of the RNG, not on the RNG itself. There simply isn't enough content being generated for you to be able to learn the internal state, and even if you did, the next run of the program will be freshly seeded anyway. ChrisA

On 11 September 2015 at 15:33, Chris Angelico <rosuav@gmail.com> wrote:
Sure, if the entire program is as posted, but we should probably assume it isn't. Some programs definitely are, but I'm not worried about them: I'm worried about the ones that aren't.

On September 11, 2015 at 10:33:55 AM, Chris Angelico (rosuav@gmail.com) wrote:
This is silly: take that code, stick it in a web application and have it generating API keys or session identifiers instead of passwords, or hell, even passwords or password-reset tokens or any other such thing. Suddenly you have a case where you have a persistent process, so there isn't a new seed, and the attacker can more or less request an unlimited number of outputs. This isn't some mind-bogglingly uncommon case. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
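
A minimal sketch of the kind of persistent-process code being described, with a hypothetical token generator alongside a SystemRandom-based counterpart (the names are illustrative only):

    import random
    import string

    ALPHABET = string.ascii_letters + string.digits

    # In a long-lived web process the module-level MT generator is seeded once at
    # startup and then feeds every request, handing an attacker exactly the kind
    # of long output stream that makes state recovery practical.
    def insecure_api_key(length=32):
        return ''.join(random.choice(ALPHABET) for _ in range(length))

    # Safer: draw from the operating system's CSPRNG instead.
    _sysrandom = random.SystemRandom()

    def api_key(length=32):
        return ''.join(_sysrandom.choice(ALPHABET) for _ in range(length))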

Ah crap. Sorry folks, this post was *not supposed to go to the list* in this state. I'm having some trouble with my mail client (mutt) not saving drafts, so I intended to email it to myself for later editing, and didn't notice that the list was CCed. On Fri, Sep 11, 2015 at 11:36:13PM +1000, Steven D'Aprano wrote: [...] -- Steve

On Thu, Sep 10, 2015, at 08:29, Paul Moore wrote:
I don't understand why. What other word would you use to describe a generator that can be given a specific set of inputs to generate the same exact sequence of numbers every single time? If you want that feature, then you're not going to think "deterministic" means "not good enough". And if you don't want it, you, well, don't want it, so there's really no harm in the fact that you don't choose it. Personally, though, I don't see why we're not talking about calling it MersenneTwister.

On Thu, Sep 10, 2015 at 8:13 AM, <random832@fastmail.us> wrote:
Because while we want to reduce foot guns, we don't want to reduce usability. DeterministicRandom is fairly easy for anyone to understand. I would venture a guess that most people looking for that wouldn't know (or care) what the backing algorithm is. Further, if we stop using the Mersenne Twister in the future, we would have to remove that class name. DeterministicRandom can be agnostic of the underlying algorithm and is friendlier to people who don't need to know or care about what algorithm is generating the numbers; they only need to understand the properties of that generator.

On Thu, Sep 10, 2015, at 09:44, Ian Cordasco wrote:
If we're serious about being deterministic, then we should keep that class under that name and provide a new class for the new algorithm. What's the point of having a deterministic algorithm if you can't reproduce your results in the new version because the algorithm was deleted?

On Thu, Sep 10, 2015 at 8:55 AM, <random832@fastmail.us> wrote:
This is totally off topic. That said, as a counter-point: what's the point of carrying around code you don't want people to use if they're just going to use it anyway?

On Sep 10, 2015 5:29 AM, "Paul Moore" <p.f.moore@gmail.com> wrote: [...]
Regarding the "harder to use" point (which is obviously just one of many considerations in this while debate): I trained myself a few years ago to stop using the global random functions and instead always pass around an explicit RNG object, and my experience is that once I got into the habit it gave me a strict improvement in code quality. Suddenly many more of my functions are deterministic ... well ... functions ... of their inputs, and suddenly it's clearly marked in the source which ones have randomness in their semantics, and suddenly it's much easier to do things like refactor the code while preserving the output for a given seed. (This is tricky because just changing the order in which you do things can break your code. I wince in sympathy at people who have to maintain code like your map-generation-from-a-seed example and *aren't* using RNG objects explicitly.) The implicit global RNG is a piece of global state, like global variables, and causes similar unpleasantness. Now that I don't use it, I look back and it's like "huh, why did I always used to hit myself in the face like that? That wasn't very pleasant." So this is what I teach my collaborators and students now. Most of them just use the global state by default because they don't even know about the OO option. YMMV but that's my experience FWIW. -n

Donald Stufft <donald@stufft.io> writes: ...
"security minded folks" [1] recommend "always use os.urandom()" and advise against *random* module [2,3] despite being aware of random.SystemRandom() [4] i.e., if they are right then *random* module probably only need to care about group #1 and avoid creating the false sense of security in group #3. [1] https://github.com/pyca/cryptography/blob/92d8bd12609586bfa53cf8c7a691e37474... [2] https://cryptography.io/en/latest/random-numbers/ [3] https://github.com/pyca/cryptography/blob/92d8bd12609586bfa53cf8c7a691e37474... [4] https://github.com/pyca/cryptography/issues/2278

On September 10, 2015 at 2:08:46 PM, Akira Li (4kir4.1i@gmail.com) wrote:
Maybe you didn't notice you’re talking to the third name in the list of authors that you linked to, but that documentation is there primarily because the random module's API is problematic and it's easier to recommend people to not use it than to try and explain how to use it safely. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On Thu, Sep 10, 2015 at 9:19 PM, Donald Stufft <donald@stufft.io> wrote:
Obviously, I've noticed it but I didn't want to call you out.

> that documentation is there primarily because the random module's API is problematic and it's easier to recommend people to not use it than to try and explain how to use it safely.

That is exactly the point: if random.SystemRandom() is not safe to use while being based on "secure" os.urandom(), then providing the same API based on (possibly less secure) arc4random() won't be any safer.

On September 10, 2015 at 2:40:54 PM, Akira Li (4kir4.1i@gmail.com) wrote:
"If the mountain won't come to Muhammad then Muhammad must go to the mountain." In other words, we can write all the documentation in the world we want, and it doesn't change the simple fact that by choosing a default, there is going to be some people who will use it when it's inappropiate due to the fact that it is the default. The pratical effect of changing the default will be that some cases are broken, but in a way that is obvious and trivial to fix, some cases won't have any pratical effect at all, and finally, for some people it's going to take code that was previously completely insecure and make it either secure or harder to exploit for people who are incorrectly using the API. I wouldn't expect the documentation in pyca/cryptography to change, it'd still recommend people to use os.urandom directly and we'd still recommend that people should use SystemRandom/os.urandom in the random.py docs for things that need to be cryptographically secure, this is just a safety net for people who don't know or didn't listen. ----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
participants (24)
- Akira Li
- Alexander Walters
- Andrew Barnert
- Antoine Pitrou
- Brett Cannon
- Chris Angelico
- Cory Benfield
- Donald Stufft
- Emile van Sebille
- Georg Brandl
- Greg Ewing
- Ian Cordasco
- Nathaniel Smith
- Nick Coghlan
- Paul Moore
- Petr Viktorin
- Random832
- random832@fastmail.us
- Serhiy Storchaka
- Stefan Krah
- Stephen J. Turnbull
- Steven D'Aprano
- Tim Peters
- Xavier Combelle