[Python-Dev] BDFL ruling request: should we block forever waiting for high-quality random bits?

Wed Jun 15 16:01:27 EDT 2016

[whew, actually read the whole thread]

On 11 June 2016 at 10:28, Terry Reedy <tjreedy at udel.edu> wrote:
> On 6/11/2016 11:34 AM, Guido van Rossum wrote:
>>
>> In terms of API design, I'd prefer a flag to os.urandom() indicating a
>> preference for
>> - blocking
>> - raising an exception
>> - weaker random bits
>
>
> +100 ;-)
>
> I proposed exactly this 2 days ago, 5 hours after Larry's initial post.

No, this is a bad idea. Asking novice developers to make security
decisions they're not yet qualified to make when it's genuinely
possible for us to do the right thing by default is the antithesis of
good security API design, and os.urandom() *is* a security API
(whether we like it or not - third party documentation written by the
cryptographic software development community has made it so, since
it's part of their guidelines for writing security sensitive code in
pure Python).

Adding *new* APIs is also a bad idea, since "os.urandom() is the right
answer on every OS except Linux, and also the best currently available
answer on Linux" has been the standard security advice for generating
cryptographic secrets in pure Python code for years now, so we should
only change that guidance if we have extraordinarily compelling
reasons to do so, and we don't. Instead, we have Ted T'so himself
chiming in to say: "My preference would be that os.[u]random should
block, because the odds that people would be trying to generate
long-term cryptographic secrets within seconds after boot is very
small, and if you *do* block for a second or two, it's not the end of
the world."

The *actual bug* that triggered this latest firestorm of commentary
(from experts and non-experts alike) had *nothing* to do with user
code calling os.urandom, and instead was a combination of:

- CPython startup requesting cryptographically secure randomness when
it didn't need it
- a systemd init script written in Python running before the kernel
RNG was fully initialised

That created a deadlock between CPython startup and the rest of the
Linux init process, so the latter only continued when the systemd
watchdog timed out and killed the offending script. As others have
noted, this kind of deadlock scenario is generally impossible on other
operating systems, as the operating system doesn't provide a way to
run Python code before the random number generator is ready.

The change Victor made in 3.5.2 to fall back to reading /dev/urandom
directly if the getrandom() syscall returns EAGAIN (effectively
reverting to the Python 3.4 behaviour) was the simplest possible fix
for that problem (and an approach I thoroughly endorse, both for 3.5.2
and for the life of the 3.5 series), but that doesn't make it the
right answer for 3.6+.

To repeat: the problem encountered was NOT due to user code calling
os.urandom(), but rather due to the way CPython initialises its own
internal hash algorithm at interpreter startup. However, due to the
way CPython is currently implemented, fixing the regression in that
not only changed the behaviour of CPython startup, it *also* changed
the behaviour of every call to os.urandom() in Python 3.5.2+.

For 3.6+, we can instead make it so that the only things that actually
rely on cryptographic quality randomness being available are:

- calling a secrets module API
- calling a random.SystemRandom method
- calling os.urandom directly

These are all APIs that were either created specifically for use in
security sensitive situations (secrets module), or have long been
documented (both within our own documentation, and in third party
documentation, books and Q&A sites) as being an appropriate choice for
use in security sensitive situations (os.urandom and
random.SystemRandom).

However, we don't need to make those block waiting for randomness to
be available - we can update them to raise BlockingIOError instead
(which makes it trivial for people to decide for themselves how they
want to handle that case).

Along with that change, we can make it so that starting the
interpreter will never block waiting for cryptographic randomness to
be available (since it doesn't need it), and importing the random
module won't block waiting for it either.

To the best of our knowledge, on all operating systems other than
Linux, encountering the new exception will still be impossible in
practice, as there is no known opportunity to run Python code before
the kernel random number generator is ready.

On Linux, init scripts may still run before the kernel random number
generator is ready, but will now throw an immediate BlockingIOError if
they access an API that relies on crytographic randomness being
available, rather than potentially deadlocking the init process. Folks
encountering that situation will then need to make an explicit
decision:

- loop until the exception is no longer thrown
- switch to reading from /dev/urandom directly instead of calling os.urandom()
- switch to using a cross-platform non-cryptographic API (probably the
random module)

Victor has some additional technical details written up at
http://haypo-notes.readthedocs.io/pep_random.html and I'd be happy to
formalise this proposed approach as a PEP (the current reference is
http://bugs.python.org/issue27282 )

Regards,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia