Can /dev/urandom ever revert from the "good" to the "bad" state?

Before I can possibly start thinking about what to do when the system's CSPRNG is initialized, I need to understand more about how it works. Apparently there's a possible transition from the "not ready yet" ("bad") state to "ready" ("good"), and all it takes is usually waiting for a second or two. But is this a wait that only gets incurred once, somewhere early after a boot, or is this something that can happen at any time?
--Guido van Rossum (python.org/~guido)

On Jun 22, 2016, at 10:15 PM, Guido van Rossum <guido@python.org> wrote:
Before I can possibly start thinking about what to do when the system's CSPRNG is initialized, I need to understand more about how it works. Apparently there's a possible transition from the "not ready yet" ("bad") state to "ready" ("good"), and all it takes is usually waiting for a second or two. But is this a wait that only gets incurred once, somewhere early after a boot, or is this something that can happen at any time?
Once, only after boot. On most (all?) modern Linux systems there’s even part of the boot process that attempts to seed the CSPRNG using random values stored during a previous boot to shorten the time window between when it’s ready and when it’s not yet initialized. However, once it is initialized it will never block (or EAGAIN) again.
-- Donald Stufft

On Wed, Jun 22, 2016 at 7:18 PM, Donald Stufft <donald@stufft.io> wrote:
On Jun 22, 2016, at 10:15 PM, Guido van Rossum <guido@python.org> wrote:
Before I can possibly start thinking about what to do when the system's CSPRNG is initialized, I need to understand more about how it works. Apparently there's a possible transition from the "not ready yet" ("bad") state to "ready" ("good"), and all it takes is usually waiting for a second or two. But is this a wait that only gets incurred once, somewhere early after a boot, or is this something that can happen at any time?
Once, only after boot. On most (all?) modern Linux systems there’s even part of the boot process that attempts to seed the CSPRNG using random values stored during a previous boot to shorten the time window between when it’s ready and when it’s not yet initialized. However, once it is initialized it will never block (or EAGAIN) again.
Then shouldn't it be the responsibility of the boot sequence rather than of the Python stdlib to wait for that event? IIUC that's what OS X does (I think someone described that it even kernel-panics when it can't enter the "good" state).
--Guido van Rossum (python.org/~guido)

On Jun 22, 2016, at 10:29 PM, Guido van Rossum <guido@python.org> wrote:
On Wed, Jun 22, 2016 at 7:18 PM, Donald Stufft <donald@stufft.io> wrote:
On Jun 22, 2016, at 10:15 PM, Guido van Rossum <guido@python.org> wrote:
Before I can possibly start thinking about what to do when the system's CSPRNG is initialized, I need to understand more about how it works. Apparently there's a possible transition from the "not ready yet" ("bad") state to "ready" ("good"), and all it takes is usually waiting for a second or two. But is this a wait that only gets incurred once, somewhere early after a boot, or is this something that can happen at any time?
Once, only after boot. On most (all?) modern Linux systems there’s even part of the boot process that attempts to seed the CSPRNG using random values stored during a previous boot to shorten the time window between when it’s ready and when it’s not yet initialized. However, once it is initialized it will never block (or EAGAIN) again.
Then shouldn't it be the responsibility of the boot sequence rather than of the Python stdlib to wait for that event? IIUC that's what OS X does (I think someone described that it even kernel-panics when it can't enter the "good" state).
In an ideal world? Yes. However, we live in a not ideal world where Linux doesn’t ensure that, so absent Linux deciding to do something like what OS X, FreeBSD, Windows, OpenBSD, etc. do, we have to make a choice: either we pass along the possibility that Linux left us with, and make it so people who attempt to use Python early in the boot sequence can get predictable random numbers (without any way to determine if they’re getting “good” or “bad” numbers), or we use the newer API that Linux has given us to make that assurance. AFAIK Linux (or, well, Ted) has stated that the way for people who care about getting cryptographically secure random out of the kernel is to use getrandom(0) (or getrandom(GRND_NONBLOCK) and fail on an EAGAIN), so the question I think really comes down to whether os.urandom is something we want to provide the best source of (generally) non-blocking CSPRNG or whether we want it to be a narrow wrapper around whatever semantics /dev/urandom specifically has.
-- Donald Stufft

On Jun 22, 2016, at 10:37 PM, Donald Stufft wrote:
so the question I think really comes down to whether os.urandom is something we want to provide the best source of (generally) non blocking CSPRNG or whether we want it to be a narrow wrapper around whatever semantics /dev/urandom specifically has.
... with os.getrandom() exposed on platforms that provide it. Cheers, -Barry

On Thu, Jun 23, 2016 at 5:42 AM, Barry Warsaw <barry@python.org> wrote:
On Jun 22, 2016, at 10:37 PM, Donald Stufft wrote:
so the question I think really comes down to whether os.urandom is something we want to provide the best source of (generally) non blocking CSPRNG or whether we want it to be a narrow wrapper around whatever semantics /dev/urandom specifically has.
... with os.getrandom() exposed on platforms that provide it.
Personally I think it's better to have one API than two, even if it is named after a platform-specific API.
FWIW I don't really buy the philosophy that the os module should only provide thin wrappers over what the platform offers. E.g. in the case of Windows most of what's in the os module is part of Microsoft's libc emulation, and the platform APIs have a totally different shape. os.urandom()'s past is already another example. So I don't see a reason to offer two different APIs and force users of those APIs to either commit to a platform or use an ugly try/except. Especially since in Python <= 3.5 they'll only have os.urandom().
--Guido van Rossum (python.org/~guido)
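To illustrate the "ugly try/except" mentioned above, here is a minimal sketch of what callers might have to write if os.getrandom() were exposed separately from os.urandom(). The helper name portable_random_bytes is hypothetical, and the sketch assumes os.getrandom() is present only on platforms that provide the underlying syscall (it was added to the os module in Python 3.6).

```python
import os


def portable_random_bytes(n):
    """Return n random bytes, preferring os.getrandom() where it exists.

    Hypothetical helper for illustration only; keep n small (<= 256),
    since os.getrandom() may return fewer bytes for large requests.
    """
    try:
        # flags=0: waits until the kernel CSPRNG has been seeded once.
        return os.getrandom(n)
    except (AttributeError, OSError):
        # AttributeError: this platform or Python version has no os.getrandom().
        # OSError: the running kernel does not implement the syscall.
        return os.urandom(n)


if __name__ == "__main__":
    token = portable_random_bytes(16)
    print(len(token), "bytes:", token.hex())
```

With a single os.urandom() that does the right thing on each platform, this wrapper would not be needed at all, which is the point being argued.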

On Jun 23, 2016, at 11:27 AM, Guido van Rossum <guido@python.org> wrote:
On Thu, Jun 23, 2016 at 5:42 AM, Barry Warsaw <barry@python.org> wrote:
On Jun 22, 2016, at 10:37 PM, Donald Stufft wrote:
so the question I think really comes down to whether os.urandom is something we want to provide the best source of (generally) non blocking CSPRNG or whether we want it to be a narrow wrapper around whatever semantics /dev/urandom specifically has.
... with os.getrandom() exposed on platforms that provide it.
Personally I think it's better to have one API than two, even if it is named after a platform-specific API.
FWIW I don't really buy the philosophy that the os module should only provide thin wrappers over what the platform offers. E.g. in the case of Windows most of what's in the os module is part of Microsoft's libc emulation, and the platform APIs have a totally different shape. os.urandom()'s past is already another example. So I don't see a reason to offer two different APIs and force users of those APIs to either commit to a platform or use an ugly try/except. Especially since in Python <= 3.5 they'll only have os.urandom().
For what it’s worth, I agree with this sentiment, though I think calling getrandom() and either blocking or erroring is still a pretty thin wrapper over what the OS provides. It’s just using a different interface to the same underlying functionality with only two real differences: (1) lack of a file descriptor and (2) inability to get insecure values out of the API, both of which I think are good things. As far as I know, nobody has argued that os.urandom should *not* use getrandom(), they just want it to fall back to the same behavior as /dev/urandom in the (2) case… which is actually a thicker wrapper around what the OS provides than just using getrandom(), since that fallback logic needs to be added ;)
-- Donald Stufft

On 22 June 2016 at 19:29, Guido van Rossum <guido@python.org> wrote:
On Wed, Jun 22, 2016 at 7:18 PM, Donald Stufft <donald@stufft.io> wrote:
Once, only after boot. On most (all?) modern Linux systems there’s even part of the boot process that attempts to seed the CSPRNG using random values stored during a previous boot to shorten the time window between when it’s ready and when it’s not yet initialized. However, once it is initialized it will never block (or EAGAIN) again.
Then shouldn't it be the responsibility of the boot sequence rather than of the Python stdlib to wait for that event? IIUC that's what OS X does (I think someone described that it even kernel-panics when it can't enter the "good" state).
I spent some time browsing the (mostly-but-not-all public) results of https://bugzilla.redhat.com/buglist.cgi?quicksearch=getrandom today, and unfortunately that backed up the results of Ted Ts'o's "what if /dev/urandom blocked on Linux startup?" experiments [1].
That is, Linux has the same problem at the distro level that we do at the language runtime level: the historically permissive behaviour means that Linux has existing use cases where it's legitimate to start the init process without waiting for the kernel CSPRNG to be seeded, so distros can't currently unilaterally prevent the entire OS from starting just because that subsystem isn't ready yet.
We have a significant advantage that the kernel and distro devs don't enjoy though, which is a *much* nicer mechanism for runtime error reporting (in the form of exceptions and tracebacks) - by taking advantage of that, I believe we can significantly improve the default behaviour, while also writing a fairly straightforward "if you get this exception when running on Python 3.6, assess your application's needs, then apply one of these remedies" note for the Python 3.6 porting guide.
Regards, Nick.
[1] https://mail.python.org/pipermail/python-dev/2016-June/145146.html
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

[Guido]
Before I can possibly start thinking about what to do when the system's CSPRNG is initialized, I need to understand more about how it works. Apparently there's a possible transition from the "not ready yet" ("bad") state to "ready" ("good"), and all it takes is usually waiting for a second or two. But is this a wait that only gets incurred once, somewhere early after a boot, or is this something that can happen at any time?
[Donald Stufft]
Once, only after boot. On most (all?) modern Linux systems there’s even part of the boot process that attempts to seed the CSPRNG using random values stored during a previous boot to shorten the time window between when it’s ready and when it’s not yet initialized. However, once it is initialized it will never block (or EAGAIN) again.
Donald, at the end you're talking about how getrandom() behaves - /dev/urandom on Linux never blocks, as I understand it (but there's no advertised way to tell when /dev/urandom enters the "good" state).
[Guido]
Then shouldn't it be the responsibility of the boot sequence rather than of the Python stdlib to wait for that event? IIUC that's what OS X does (I think someone described that it even kernel-panics when it can't enter the "good" state).
The rub is that sometimes Python is running soooo early in the boot sequence in these rare Linux cases. That's said to be impossible on OS X (or Windows).

On Jun 22, 2016, at 10:40 PM, Tim Peters <tim.peters@gmail.com> wrote:
[Guido]
Before I can possibly start thinking about what to do when the system's CSPRNG is initialized, I need to understand more about how it works. Apparently there's a possible transition from the "not ready yet" ("bad") state to "ready" ("good"), and all it takes is usually waiting for a second or two. But is this a wait that only gets incurred once, somewhere early after a boot, or is this something that can happen at any time?
[Donald Stufft]
Once, only after boot. On most (all?) modern Linux systems there’s even part of the boot process that attempts to seed the CSPRNG using random values stored during a previous boot to shorten the time window between when it’s ready and when it’s not yet initialized. However, once it is initialized it will never block (or EAGAIN) again.
Donald, at the end you're talking about how getrandom() behaves - /dev/urandom on Linux never blocks, as I understand it (but there's no advertised way to tell when /dev/urandom enters the "good" state).
Yes sorry, Guido asked about the system CSPRNG. In Linux there are three (previously two) basic interfaces to the same CSPRNG:
/dev/urandom - This will never block, but until it gathers enough entropy in the boot process it will silently return data that is not cryptographically secure. Essentially, predictably random; to what degree it is predictable depends on a lot of factors. As far as I am aware, there is no practical way to determine “given a read of /dev/urandom did I get ‘good’ or ‘bad’ data out of it”.
/dev/random - This will randomly block whenever the kernel thinks that the entropy is “running low”. All security experts I’m aware of, with maybe the exception of Ted (I don’t know how he feels about this), believe that this action of counting entropy is pure bollocks and that /dev/random randomly blocking because it thinks the entropy is low achieves nothing except to hurt the performance of things that need randomness at runtime.
And on newer kernels there is the getrandom() syscall, which has flags that enable three different modes of operation:
getrandom(0) - This will block until the same “pool” of entropy that /dev/urandom uses has been initialized once, at boot, and then it will never block again.
getrandom(GRND_NONBLOCK) - This will return -1 and set errno to EAGAIN if the same pool of entropy that /dev/urandom uses has not been initialized, and will otherwise always return data. This is essentially the same as getrandom(0) except instead of blocking it returns an error.
getrandom(GRND_RANDOM) - This is basically just a syscall interface to /dev/random and it doesn’t meaningfully deviate from what /dev/random does, except that it doesn’t require a file descriptor to use it.
This getrandom() interface is the newer way to access these two types of random and I think it is important to notice that this newer interface does *not* have a way to get “sometimes a CSPRNG, sometimes not” data out of it like /dev/urandom does. This newer interface promises that you’ll always get cryptographically secure random and it will either block until it can do that or will EAGAIN to let you take some other action instead of relying on a CSPRNG if that suits your application.
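The first two getrandom() modes described above can be sketched from Python roughly as follows. This is only an illustration: it assumes a Linux system and a Python that exposes os.getrandom() and os.GRND_NONBLOCK (3.6 or later), and the two function names are made up for the example.

```python
import os


def try_secure_bytes(n):
    """Non-blocking mode: return n bytes, or None if the pool is not seeded.

    Corresponds to getrandom(GRND_NONBLOCK); Python surfaces EAGAIN
    as BlockingIOError.
    """
    try:
        return os.getrandom(n, os.GRND_NONBLOCK)
    except BlockingIOError:
        return None


def wait_for_secure_bytes(n):
    """Blocking mode: corresponds to getrandom(0).

    Waits only until the pool has been initialized once after boot.
    """
    return os.getrandom(n)


if __name__ == "__main__" and hasattr(os, "getrandom"):
    data = try_secure_bytes(16)
    if data is None:
        print("kernel CSPRNG not initialized yet; waiting...")
        data = wait_for_secure_bytes(16)
    print(data.hex())
```

Either way the caller learns whether the data is trustworthy, which a plain read of /dev/urandom cannot tell them.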
[Guido]
Then shouldn't it be the responsibility of the boot sequence rather than of the Python stdlib to wait for that event? IIUC that's what OS X does (I think someone described that it even kernel-panics when it can't enter the "good" state).
The rub is that sometimes Python is running soooo early in the boot sequence in these rare Linux cases. That's said to be impossible on OS X (or Windows).
Yes, once the system has booted and initialized then all forms of accessing the /dev/urandom pool (/dev/urandom, getrandom(0), getrandom(GRND_NONBLOCK)) function basically the same (plus or minus a file descriptor). The problem comes in a few flavors but really they all boil down to the same thing: code that is calling os.urandom() prior to the /dev/urandom CSPRNG being initialized. The primary case where this will happen is code that is called early on in the boot sequence, prior to pid 1 initializing the urandom CSPRNG from random data saved in the previous boot [1]. There are other cases where this could happen though, like embedded Linux systems or Raspberry Pis or the like that don’t have great sources of hardware entropy, so the initialization of the CSPRNG will take a longer period of time. This is particularly true on systems that don’t (currently) have an active network connection, since networking is one of the better sources of randomness that the kernel can use to seed these values with.
[1] This is basically what caused the initial report: systemd-cron was a Python script and the SipHash for the dictionary hash randomization was calling os.urandom to seed itself. However, this particular thing isn’t being asked to be made blocking (or an error). As far as I know, most everyone agrees that for SipHash’s purpose it’s reasonably fine to fall back to an insecure source of random if a secure source isn’t available at the moment. What the security side wants is for people explicitly calling os.urandom (directly or indirectly) as part of the execution of their Python program to always get secure random if the platform we are on provides a reasonable interface to get access to it (e.g. /dev/random is not a reasonable interface, but getrandom() is).
-- Donald Stufft
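The distinction drawn above, where hash randomization may fall back to a weak source but explicit callers should not, could look something like this. It is a rough sketch and not how CPython actually seeds SipHash: hash_seed is a hypothetical name, the weak fallback (clock mixed with the pid) is an assumption chosen for brevity, and on platforms without os.getrandom() the sketch conservatively takes the weak path even though a real implementation would likely use os.urandom() there.

```python
import os
import random
import time


def hash_seed(n=16):
    """Return (seed_bytes, is_secure) without ever blocking.

    Illustrative only. Suitable for something like hash randomization,
    where a predictable seed early in boot is tolerable; NOT suitable
    for keys, tokens, or anything else security-sensitive.
    """
    if hasattr(os, "getrandom"):
        try:
            # Secure bytes if the kernel pool is already initialized.
            return os.getrandom(n, os.GRND_NONBLOCK), True
        except BlockingIOError:
            pass  # Pool not initialized yet: fall through to the weak source.
    # Weak fallback (an assumption for this sketch): clock and pid only.
    weak = random.Random(time.time_ns() ^ os.getpid())
    return bytes(weak.getrandbits(8) for _ in range(n)), False


if __name__ == "__main__":
    seed, secure = hash_seed()
    print(seed.hex(), "secure" if secure else "INSECURE fallback")
```

Code that explicitly asks for secure randomness would instead block or raise in the fallthrough case rather than silently returning the weak bytes.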