Proposals from Nick and Nathaniel from the Py-Dev thread:
Nick expressed:
The *actual bug* that triggered this latest firestorm of commentary (from experts and non-experts alike) had *nothing* to do with user code calling os.urandom, and instead was a combination of:
- CPython startup requesting cryptographically secure randomness when it didn't need it
- a systemd init script written in Python running before the kernel RNG was fully initialised
That created a deadlock between CPython startup and the rest of the Linux init process, so the latter only continued when the systemd watchdog timed out and killed the offending script. As others have noted, this kind of deadlock scenario is generally impossible on other operating systems, as the operating system doesn't provide a way to run Python code before the random number generator is ready.
The change Victor made in 3.5.2 to fall back to reading /dev/urandom directly if the getrandom() syscall returns EAGAIN (effectively reverting to the Python 3.4 behaviour) was the simplest possible fix for that problem (and an approach I thoroughly endorse, both for 3.5.2 and for the life of the 3.5 series), but that doesn't make it the right answer for 3.6+.
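For illustration, a rough Python-level sketch of that fallback behaviour (the actual change is in CPython's C code; os.getrandom() and os.GRND_NONBLOCK only exist on Python 3.6+ under Linux, and the helper name here is made up):

    import errno
    import os

    def best_effort_urandom(n):
        # Try the getrandom() syscall without blocking; if the kernel
        # entropy pool isn't initialised yet it fails with EAGAIN.
        try:
            return os.getrandom(n, os.GRND_NONBLOCK)
        except OSError as exc:
            if exc.errno != errno.EAGAIN:
                raise
        # Fall back to reading /dev/urandom directly, as Python 3.4 did.
        with open("/dev/urandom", "rb") as f:
            return f.read(n)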
To repeat: the problem encountered was NOT due to user code calling os.urandom(), but rather due to the way CPython initialises its own internal hash algorithm at interpreter startup. However, due to the way CPython is currently implemented, fixing that regression not only changed the behaviour of CPython startup, it *also* changed the behaviour of every call to os.urandom() in Python 3.5.2+.
For 3.6+, we can instead make it so that the only things that actually rely on cryptographic quality randomness being available are:
- calling a secrets module API
- calling a random.SystemRandom method
- calling os.urandom directly
These are all APIs that were either created specifically for use in security sensitive situations (secrets module), or have long been documented (both within our own documentation, and in third party documentation, books and Q&A sites) as being an appropriate choice for use in security sensitive situations (os.urandom and random.SystemRandom).
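For reference, those calls look like this in user code (secrets is new in 3.6; the other two have been documented for security-sensitive use for years):

    import os
    import random
    import secrets

    token = secrets.token_bytes(16)           # secrets module API
    draw = random.SystemRandom().random()     # random.SystemRandom method
    raw = os.urandom(16)                      # os.urandom directly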
However, we don't need to make those block waiting for randomness to be available - we can update them to raise BlockingIOError instead (which makes it trivial for people to decide for themselves how they want to handle that case).
Along with that change, we can make it so that starting the interpreter will never block waiting for cryptographic randomness to be available (since it doesn't need it), and importing the random module won't block waiting for it either.
To the best of our knowledge, on all operating systems other than Linux, encountering the new exception will still be impossible in practice, as there is no known opportunity to run Python code before the kernel random number generator is ready.
On Linux, init scripts may still run before the kernel random number generator is ready, but will now throw an immediate BlockingIOError if they access an API that relies on cryptographic randomness being available, rather than potentially deadlocking the init process. Folks encountering that situation will then need to make an explicit decision:
- loop until the exception is no longer thrown
- switch to reading from /dev/urandom directly instead of calling os.urandom()
- switch to using a cross-platform non-cryptographic API (probably the random module)
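As a rough sketch of the first two options (BlockingIOError from os.urandom() is the proposed 3.6 behaviour, not something current releases raise, and both helper names are invented for illustration):

    import os
    import time

    def urandom_blocking(n, retry_delay=0.1):
        # Option 1: loop until the kernel RNG is ready and the
        # exception is no longer thrown.
        while True:
            try:
                return os.urandom(n)
            except BlockingIOError:
                time.sleep(retry_delay)

    def urandom_best_effort(n):
        # Option 2: read /dev/urandom directly, accepting that its
        # output may be predictable before the pool is initialised.
        with open("/dev/urandom", "rb") as f:
            return f.read(n)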
Victor has some additional technical details written up at http://haypo-notes.readthedocs.io/pep_random.html and I'd be happy to formalise this proposed approach as a PEP (the current reference is http://bugs.python.org/issue27282).
and Nathaniel added:
I'd make two additional suggestions:
- one person did chime in on the thread to say that they've used os.urandom for non-security-sensitive purposes, simply because it provided a convenient "give me a random byte-string" API that is missing from random. I think we should go ahead and add a .randbytes method to random.Random that simply returns a random bytestring using the regular RNG, to give these users a nice drop-in replacement for os.urandom.
Rationale: I don't think the existence of these users should block making os.urandom appropriate for generating secrets, because (1) a glance at github shows that this is very unusual -- if you skim through this search you get page after page of functions with names like "generate_secret_key"
https://github.com/search?l=python&p=2&q=urandom&ref=searchresults&type=Code&utf8=%E2%9C%93
and (2) for the minority of people who are using os.urandom for non-security-sensitive purposes, if they find os.urandom raising an error, then this is just a regular bug that they will notice immediately and fix, and anyway it's basically never going to happen. (As far as we can tell, this has never yet happened in the wild, even once.) OTOH if os.urandom is allowed to fail silently, then people who are using it to generate secrets will get silent catastrophic failures, plus those users can't assume it will never happen because they have to worry about active attackers trying to drive systems into unusual states. So I'd much rather ask the non-security-sensitive users to switch to using something in random, than force the cryptographic users to switch to using secrets. But it does seem like it would be good to give those non-security-sensitive users something to switch to.
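Until such a method exists, that use case can already be covered with the regular RNG along these lines (the helper name simply mirrors the proposed .randbytes; this is a sketch, not the eventual API):

    import random

    def randbytes(n, rng=random):
        # Non-cryptographic stand-in for os.urandom(n): builds a random
        # bytestring from the regular Mersenne Twister generator.
        if n <= 0:
            return b""
        return rng.getrandbits(8 * n).to_bytes(n, "little")

That gives non-security-sensitive callers a one-line replacement for os.urandom(n).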
- It's not exactly true that the Python interpreter doesn't need cryptographic randomness to initialize SipHash -- it's more that *some* Python invocations need unguessable randomness (to first approximation: all those which are exposed to hostile input), and some don't. And since the Python interpreter has no idea which case it's in, and since it's unacceptable for it to break invocations that don't need unguessable hashes, then it has to err on the side of continuing without randomness. All that's fine.
But, given that the interpreter doesn't know which state it's in, there's also the possibility that this invocation *will* be exposed to hostile input, and the 3.5.2+ behavior gives absolutely no warning that this is what's happening. So instead of letting this potential error pass silently, I propose that if SipHash fails to acquire real randomness at startup, then it should issue a warning. In practice, this will almost never happen. But in the rare cases it does, it at least gives the user a fighting chance to realize that their system is in a potentially dangerous state. And by using the warnings module, we automatically get quite a bit of flexibility. If some particular invocation (e.g. systemd-cron) has audited their code and decided that they don't care about this issue, they can make the message go away:
PYTHONWARNINGS=ignore::NoEntropyAtStartupWarning
OTOH if some particular invocation knows that they do process potentially hostile input early on (e.g. cloud-init, maybe?), then they can explicitly promote the warning to an error:
PYTHONWARNINGS=error::NoEntropyAtStartupWarning
(I guess the way to implement this would be for the SipHash initialization code -- which runs very early -- to set some flag, and then we expose that flag in sys._something, and later in the startup sequence check for it after the warnings module is functional. Exposing the flag at the Python level would also make it possible for code like cloud-init to do its own explicit check and respond appropriately.)
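To make the shape of that concrete, here is a minimal sketch of the Python-level half, assuming a hypothetical flag set by the C-level SipHash code (the name sys._siphash_had_entropy is invented here; the email only says sys._something) and the warning class used in the examples above:

    import sys
    import warnings

    class NoEntropyAtStartupWarning(RuntimeWarning):
        """Hash randomisation was seeded without real OS randomness."""

    def check_startup_entropy():
        # Hypothetical flag assumed to be set by the SipHash
        # initialisation code; treat "missing" as "entropy was fine".
        if not getattr(sys, "_siphash_had_entropy", True):
            warnings.warn(
                "interpreter started before the kernel RNG was initialised; "
                "hash randomisation may be predictable",
                NoEntropyAtStartupWarning,
                stacklevel=2,
            )

Code like cloud-init could then call check_startup_entropy() itself, or rely on the PYTHONWARNINGS settings shown above.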
Victor, does your PEP differ from these proposals? (My apologies for my lack of time at the moment.)

--
~Ethan~