PEP 522: Allow BlockingIOError in security sensitive APIs on Linux

Hi folks,

Over the weekend, Nathaniel Smith and I put together a proposal to allow security sensitive APIs (os.urandom, random.SystemRandom and the new secrets module) to raise BlockingIOError if the operating system's random number generator isn't ready. We think this approach provides all the desired security guarantees, while being relatively straightforward for affected system integrators to diagnose and appropriately resolve if they're using these APIs in a context where Linux is currently feeding them potentially predictable random values.

Rendered: https://www.python.org/dev/peps/pep-0522/
GitHub: https://github.com/python/peps/blob/master/pep-0522.txt

The "Additional Background" section is mainly for the sake of folks that haven't been following any of the previous discussions, but it also provides the reasoning for why we don't consider retaining consistency with "man urandom" to be a useful design goal (any more than the builtin open tries to retain consistency with "man open").

Cheers, Nick.

=================

PEP: 522
Title: Allow BlockingIOError in security sensitive APIs on Linux
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan@gmail.com>, Nathaniel J. Smith <njs@pobox.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 16 June 2016
Python-Version: 3.6

Abstract
========

A number of APIs in the standard library that return random values nominally suitable for use in security sensitive operations currently have an obscure Linux-specific failure mode that allows them to return values that are not, in fact, suitable for such operations. This PEP proposes changing such failures in Python 3.6 from the current silent, hard to detect, and hard to debug errors to easily detected and debugged errors, by raising ``BlockingIOError`` with a suitable error message, allowing developers the opportunity to unambiguously specify their preferred approach for handling the situation.
The APIs affected by this change would be:

* ``os.urandom``
* ``random.SystemRandom``
* the new ``secrets`` module added by PEP 506

The new exception would potentially be encountered in the following situations:

* Python code calling these APIs during Linux system initialization
* Python code running on improperly initialized Linux systems (e.g. embedded
  hardware without adequate sources of entropy to seed the system random
  number generator, or Linux VMs that aren't configured to accept entropy
  from the VM host)

CPython interpreter initialization and ``random`` module initialization would also be updated to gracefully fall back to alternative seeding options if the system random number generator is not ready.

Proposal
========

Changing ``os.urandom()`` on Linux
----------------------------------

This PEP proposes that in Python 3.6+, ``os.urandom()`` be updated to call the new Linux ``getrandom()`` syscall in non-blocking mode if available, and raise ``BlockingIOError: system random number generator is not ready`` if the kernel reports that the call would block.

This behaviour will then propagate through to higher level standard library APIs that depend on ``os.urandom`` (specifically ``random.SystemRandom`` and the new ``secrets`` module introduced by PEP 506).

In all cases, as soon as a call to one of these security sensitive APIs succeeds, all future calls to these APIs in that process will succeed (once the operating system random number generator is ready after system boot, it remains ready).

Related changes
---------------

Currently, SipHash initialization and ``random`` module initialization both gather random bytes using the same code that underlies ``os.urandom``.
This PEP proposes to modify these so that in situations where ``os.urandom`` would raise a ``BlockingIOError``, they automatically fall back on potentially more predictable sources of randomness (and, in the SipHash case, print a warning message to ``stderr`` indicating that that particular Python process should not be used to process untrusted data).

To transparently accommodate a potential future where Linux adopts the same "potentially blocking during system initialization" ``/dev/urandom`` behaviour used by other \*nix systems, this fallback source of randomness will *not* be the ``/dev/urandom`` device.

Limitations on scope
--------------------

No changes are proposed for Windows or Mac OS X systems, as neither of those platforms provides any mechanism to run Python code before the operating system random number generator has been initialized. Mac OS X goes so far as to kernel panic and abort the boot process if it can't properly initialize the random number generator (although Apple's restrictions on the supported hardware platforms make that exceedingly unlikely in practice).

Similarly, no changes are proposed for other \*nix systems where ``os.urandom()`` will currently block waiting for the system random number generator to be initialized, rather than returning values that are potentially unsuitable for use in security sensitive applications.

While other \*nix systems that offer a non-blocking API for requesting random numbers suitable for use in security sensitive applications could potentially receive a similar update to the one proposed for Linux in this PEP, such changes are out of scope for this particular proposal.

Python's behaviour on older Linux systems that do not offer the new ``getrandom()`` syscall will also remain unchanged.
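Under the semantics proposed above, application code could cheaply probe whether the system random number generator is ready. A minimal sketch (``system_rng_ready`` is a hypothetical helper name invented for illustration; on released Python versions the probe simply succeeds):

```python
import os

def system_rng_ready():
    """Report whether os.urandom() can currently supply random bytes.

    Under this PEP's proposed Linux semantics, os.urandom() raises
    BlockingIOError while the kernel RNG is still uninitialized; on an
    already-initialized system (and on all released Python versions)
    the call simply succeeds and this probe returns True.
    """
    try:
        os.urandom(1)
    except BlockingIOError:
        return False
    return True
```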
Rationale
=========

Raising ``BlockingIOError`` in ``os.urandom()`` on Linux
--------------------------------------------------------

For several years now, the security community's guidance has been to use ``os.urandom()`` (or the ``random.SystemRandom()`` wrapper) when implementing security sensitive operations in Python. To help improve API discoverability and make it clearer that secrecy and simulation are not the same problem (even though they both involve random numbers), PEP 506 collected several of the one line recipes based on the lower level ``os.urandom()`` API into a new ``secrets`` module.

However, this guidance has also come with a longstanding caveat: developers writing security sensitive software at least for Linux, and potentially for some other \*BSD systems, may need to wait until the operating system's random number generator is ready before relying on it for security sensitive operations. This generally only occurs if ``os.urandom()`` is read very early in the system initialization process, or on systems with few sources of available entropy (e.g. some kinds of virtualized or embedded systems), but unfortunately the exact conditions that trigger this are difficult to predict, and when it occurs there is no direct way for userspace to tell it has happened without querying operating system specific interfaces.

On \*BSD systems (if the particular \*BSD variant allows the problem to occur at all), encountering this situation means ``os.urandom()`` will either block waiting for the system random number generator to be ready (the associated symptom would be for the affected script to pause unexpectedly on the first call to ``os.urandom()``), or else will behave the same way as it does on Linux.
On Linux, in Python versions up to and including Python 3.4, and in Python 3.5 maintenance versions following Python 3.5.2, there's no clear indicator to developers that their software may not be working as expected when run early in the Linux boot process, or on hardware without good sources of entropy to seed the operating system's random number generator: due to the behaviour of the underlying ``/dev/urandom`` device, ``os.urandom()`` on Linux returns a result either way, and it takes extensive statistical analysis to show that a security vulnerability exists.

By contrast, if ``BlockingIOError`` is raised in those situations, then developers using Python 3.6+ can easily choose their desired behaviour:

1. Loop until the call succeeds (security sensitive)
2. Switch to using the ``random`` module (non-security sensitive)
3. Switch to reading ``/dev/urandom`` directly (non-security sensitive)

Issuing a warning for potentially predictable internal hash initialization
--------------------------------------------------------------------------

The challenge for internal hash initialization is that it might be very important to initialize SipHash with a reliably unpredictable random seed (for processes that are exposed to potentially hostile input), or it might be totally unimportant (for processes that never have to deal with untrusted data). The Python runtime has no way to know which case a given invocation involves, which means that if we allow SipHash initialization to block or error out, then our intended security enhancement may break code that is already safe and working fine, which is unacceptable -- especially since we are reasonably confident that most Python invocations that might run during Linux system initialization fall into this category (exposure to untrusted input tends to involve network access, which typically isn't brought up until after the system random number generator is initialized).
However, at the same time, since Python has no way to know whether any given invocation needs to handle untrusted data, when the default SipHash initialization fails this *might* indicate a genuine security problem, which should not be allowed to pass silently.

Accordingly, if internal hash initialization needs to fall back to a potentially predictable seed due to the system random number generator not being ready, it will also emit a warning message on ``stderr`` to say that the system random number generator is not available and that processing potentially hostile untrusted data should be avoided.

Allowing potentially predictable ``random`` module initialization
-----------------------------------------------------------------

Other than for ``random.SystemRandom`` (which is a relatively thin wrapper around ``os.urandom``), the ``random`` module has never made any guarantees that the numbers it generates are suitable for use in security sensitive operations, so the use of the system random number generator to seed the default Mersenne Twister instance is mainly beneficial as a harm mitigation measure for code that is using the ``random`` module inappropriately.

Since a single call to ``os.urandom()`` is cheap once the system random number generator has been initialized, it makes sense to retain that as the default behaviour, but there's no need to issue a warning when falling back to a potentially more predictable alternative when necessary (in such cases, a warning will typically already have been issued as part of interpreter startup, as the only way for the call when importing the ``random`` module to fail without the implicit call during interpreter startup also failing is for the latter to have been skipped by entirely disabling the hash randomization mechanism).
Backwards Compatibility Impact Assessment
=========================================

Similar to PEP 476, this is a proposal to turn a previously silent security failure into a noisy exception that requires the application developer to make an explicit decision regarding the behaviour they desire.

As no changes are proposed for operating systems other than Linux, ``os.urandom()`` retains its existing behaviour as a nominally blocking API that is non-blocking in practice due to the difficulty of scheduling Python code to run before the operating system random number generator is ready. We believe it may be possible to encounter problems akin to those described in this PEP on at least some \*BSD variants, but nobody has explicitly demonstrated that. On Mac OS X and Windows, it appears to be straight up impossible to even try to run a Python interpreter that early in the boot process.

On Linux, ``os.urandom()`` retains its status as a guaranteed non-blocking API. However, the means of achieving that status changes in the specific case of the operating system random number generator not being ready for use in security sensitive operations: historically it would return potentially predictable random data; with this PEP it would change to raise ``BlockingIOError``.

Developers of affected applications would then be required to make one of the following changes to gain forward compatibility with Python 3.6, based on the kind of application they're developing.
Unaffected Applications
-----------------------

The following kinds of applications would be entirely unaffected by the change, regardless of whether or not they perform security sensitive operations:

- applications that don't support Linux
- applications that are only run on desktops or conventional servers
- applications that are only run after the system RNG is ready

Applications in this category simply won't encounter the new exception, so it will be reasonable for developers to wait and see if they receive Python 3.6 compatibility bugs related to the new runtime behaviour, rather than attempting to pre-emptively determine whether or not they're affected.

Affected security sensitive applications
----------------------------------------

Security sensitive applications would need to either change their system configuration so the application is only started after the operating system random number generator is ready for security sensitive operations, or else change their code to busy loop until the operating system is ready::

    def blocking_urandom(num_bytes):
        while True:
            try:
                return os.urandom(num_bytes)
            except BlockingIOError:
                pass

Affected non-security sensitive applications
--------------------------------------------

Non-security sensitive applications that don't want to assume access to ``/dev/urandom`` (or assume a non-blocking implementation of that device) can be updated to use the ``random`` module as a fallback option::

    def pseudorandom_fallback(num_bytes):
        try:
            return os.urandom(num_bytes)
        except BlockingIOError:
            return random.getrandbits(num_bytes*8).to_bytes(num_bytes, "little")

Depending on the application, it may also be appropriate to skip accessing ``os.urandom`` at all, and instead rely solely on the ``random`` module.
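As an illustration of that last option, a non-security-sensitive identifier can be generated from the ``random`` module alone (``throwaway_token`` is a hypothetical helper written for this sketch, not a stdlib API):

```python
import random
import string

def throwaway_token(length=12):
    # Built solely from the random module, so it never touches os.urandom()
    # and therefore never encounters the proposed BlockingIOError.
    # NOT suitable for passwords, session keys, or anything security sensitive.
    alphabet = string.ascii_lowercase + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))
```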
Affected Linux specific non-security sensitive applications
-----------------------------------------------------------

Non-security sensitive applications that don't need to worry about cross platform compatibility, and are willing to assume that ``/dev/urandom`` on Linux will always retain its current behaviour, can be updated to access ``/dev/urandom`` directly::

    def dev_urandom(num_bytes):
        with open("/dev/urandom", "rb") as f:
            return f.read(num_bytes)

However, pursuing this option has the downside of helping to ensure that the default behaviour of Linux at the operating system level can never be changed.

Additional Background
=====================

Why propose this now?
---------------------

The main reason is that the Python 3.5.0 release switched to using the new Linux ``getrandom()`` syscall when available in order to avoid consuming a file descriptor [1]_, and this had the side effect of making the following operations block waiting for the system random number generator to be ready:

* ``os.urandom`` (and APIs that depend on it)
* importing the ``random`` module
* initializing the randomized hash algorithm used by some builtin types

While the first of those behaviours is arguably desirable (and consistent with ``os.urandom``'s existing behaviour on other operating systems), the latter two behaviours are unnecessary and undesirable, and the last one is now known to cause a system level deadlock when attempting to run Python scripts during the Linux init process with Python 3.5.0 or 3.5.1 [2]_, while the second one can cause problems when using virtual machines without robust entropy sources configured [3]_.
Since decoupling these behaviours in CPython will involve a number of implementation changes more appropriate for a feature release than a maintenance release, the relatively simple resolution applied in Python 3.5.2 was to revert all three of them to a behaviour similar to that of previous Python versions: if the new Linux syscall indicates it will block, then Python 3.5.2 will implicitly fall back on reading ``/dev/urandom`` directly [4]_.

However, this bug report *also* resulted in a range of proposals to add *new* APIs like ``os.getrandom()`` [5]_, ``os.urandom_block()`` [6]_, ``os.pseudorandom()`` and ``os.cryptorandom()`` [7]_, or to add new optional parameters to ``os.urandom()`` itself [8]_, and then attempt to educate users on when they should call those APIs instead of just using a plain ``os.urandom()`` call.

These proposals represent dramatic overreactions, as the question of reliably obtaining random numbers suitable for security sensitive work on Linux is a relatively obscure problem of interest mainly to operating system developers and embedded systems programmers, and it in no way justifies cluttering up the Python standard library's cross-platform APIs with new Linux-specific concerns. This is especially so with the ``secrets`` module already being added as the "use this and don't worry about the low level details" option for developers writing security sensitive software that for some reason can't rely on even higher level domain specific APIs (like web frameworks), and also don't need to worry about Python versions prior to Python 3.6.

That said, it's also the case that low cost ARM devices are becoming increasingly prevalent, with a lot of them running Linux, and a lot of folks writing Python applications that run on those devices.
That creates an opportunity to take an obscure security problem that currently requires a lot of knowledge about Linux boot processes and provably unpredictable random number generation to diagnose and resolve, and instead turn it into a relatively mundane and easy-to-find-in-an-internet-search runtime exception.

The cross-platform behaviour of ``os.urandom()``
------------------------------------------------

On operating systems other than Linux, ``os.urandom()`` may already block waiting for the operating system's random number generator to be ready. This will happen at most once in the lifetime of the process, and the call is subsequently guaranteed to be non-blocking.

Linux is unique in that, even when the operating system's random number generator doesn't consider itself ready for use in security sensitive operations, reading from the ``/dev/urandom`` device will return random values based on the entropy it has available. This behaviour is potentially problematic, so Linux 3.17 added a new ``getrandom()`` syscall that (amongst other benefits) allows callers to either block waiting for the random number generator to be ready, or else request an error return if the random number generator is not ready. Notably, the new API does *not* support the old behaviour of returning data that is not suitable for security sensitive use cases.

Versions of Python up to and including Python 3.4 access the Linux ``/dev/urandom`` device directly. Python 3.5.0 and 3.5.1 called ``getrandom()`` in blocking mode in order to avoid the use of a file descriptor to access ``/dev/urandom``. While there were no specific problems reported due to ``os.urandom()`` blocking in user code, there *were* problems due to CPython implicitly invoking the blocking behaviour during interpreter startup and when importing the ``random`` module.
Rather than trying to decouple SipHash initialization from the ``os.urandom()`` implementation, Python 3.5.2 switched to calling ``getrandom()`` in non-blocking mode, and falling back to reading from ``/dev/urandom`` if the syscall indicates it will block.

As a result of the above, ``os.urandom()`` in all Python versions up to and including Python 3.5 propagates the behaviour of the underlying ``/dev/urandom`` device to Python code.

Problems with the behaviour of ``/dev/urandom`` on Linux
--------------------------------------------------------

The Python ``os`` module has largely co-evolved with Linux APIs, so having ``os`` module functions closely follow the behaviour of their Linux operating system level counterparts when running on Linux is typically considered to be a desirable feature.

However, ``/dev/urandom`` represents a case where the current behaviour is acknowledged to be problematic, but fixing it unilaterally at the kernel level has been shown to prevent some Linux distributions from booting (at least in part due to components like Python currently using it for non-security-sensitive purposes early in the system initialization process).
As an analogy, consider the following two functions::

    def generate_example_password():
        """Generates passwords solely for use in code examples"""
        return generate_unpredictable_password()

    def generate_actual_password():
        """Generates actual passwords for use in real applications"""
        return generate_unpredictable_password()

If you think of an operating system's random number generator as a method for generating unpredictable, secret passwords, then you can think of Linux's ``/dev/urandom`` as being implemented like::

    # Oversimplified artist's conception of the kernel code
    # implementing /dev/urandom
    def generate_unpredictable_password():
        if system_rng_is_ready:
            return use_system_rng_to_generate_password()
        else:
            # we can't make an unpredictable password; silently return a
            # potentially predictable one instead:
            return "p4ssw0rd"

In this scenario, the author of ``generate_example_password`` is fine -- even if ``"p4ssw0rd"`` shows up a bit more often than they expect, it's only used in examples anyway. However, the author of ``generate_actual_password`` has a problem -- how do they prove that their calls to ``generate_unpredictable_password`` never follow the path that returns a predictable answer?

In real life it's slightly more complicated than this, because there might be some level of system entropy available -- so the fallback might be more like ``return random.choice(["p4ssword", "passw0rd", "p4ssw0rd"])`` or something even more variable, and hence only statistically predictable with better odds than the author of ``generate_actual_password`` was expecting. This doesn't really make things more provably secure, though; mostly it just means that if you try to catch the problem in the obvious way -- ``if returned_password == "p4ssw0rd": raise UhOh`` -- then it doesn't work, because ``returned_password`` might instead be ``p4ssword`` or even ``pa55word``, or just an arbitrary 64 bit sequence selected from fewer than 2**64 possibilities.
So this rough sketch does give the right general idea of the consequences of the "more predictable than expected" fallback behaviour, even though it's thoroughly unfair to the Linux kernel team's efforts to mitigate the practical consequences of this problem without resorting to breaking backwards compatibility.

This design is generally agreed to be a bad idea. As far as we can tell, there are no use cases whatsoever in which this is the behavior you actually want. It has led to the use of insecure ``ssh`` keys on real systems, and many \*nix-like systems (including at least Mac OS X, OpenBSD, and FreeBSD) have modified their ``/dev/urandom`` implementations so that they never return predictable outputs, either by making reads block in this case, or by simply refusing to run any userspace programs until the system RNG has been initialized.

Unfortunately, Linux has so far been unable to follow suit, because it's been empirically determined that enabling the blocking behavior causes some currently extant distributions to fail to boot. Instead, the new ``getrandom()`` syscall was introduced, making it *possible* for userspace applications to access the system random number generator safely, without introducing hard to debug deadlock problems into the system initialization processes of existing Linux distros.

Consequences of ``getrandom()`` availability for Python
-------------------------------------------------------

Prior to the introduction of the ``getrandom()`` syscall, it simply wasn't feasible to access the Linux system random number generator in a provably safe way, so we were forced to settle for reading from ``/dev/urandom`` as the best available option. However, with ``getrandom()`` insisting on raising an error or blocking rather than returning predictable data, as well as having other advantages, it is now the recommended method for accessing the kernel RNG on Linux, with reading ``/dev/urandom`` directly relegated to "legacy" status.
This moves Linux into the same category as other operating systems like Windows, which doesn't provide a ``/dev/urandom`` device at all: the best available option for implementing ``os.urandom()`` is no longer simply reading bytes from the ``/dev/urandom`` device.

This means that what used to be somebody else's problem (the Linux kernel development team's) is now Python's problem -- given a way to detect that the system RNG is not initialized, we have to choose how to handle this situation whenever we try to use the system RNG.

It could simply block, as was somewhat inadvertently implemented in 3.5.0::

    # artist's impression of the CPython 3.5.0-3.5.1 behavior
    def generate_unpredictable_bytes_or_block(num_bytes):
        while not system_rng_is_ready:
            wait
        return unpredictable_bytes(num_bytes)

Or it could raise an error, as this PEP proposes (in *some* cases)::

    # artist's impression of the behavior proposed in this PEP
    def generate_unpredictable_bytes_or_raise(num_bytes):
        if system_rng_is_ready:
            return unpredictable_bytes(num_bytes)
        else:
            raise BlockingIOError

Or it could explicitly emulate the ``/dev/urandom`` fallback behavior, as was implemented in 3.5.2rc1 and is expected to remain for the rest of the 3.5.x cycle::

    # artist's impression of the CPython 3.5.2rc1+ behavior
    def generate_unpredictable_bytes_or_maybe_not(num_bytes):
        if system_rng_is_ready:
            return unpredictable_bytes(num_bytes)
        else:
            return (b"p4ssw0rd" * (num_bytes // 8 + 1))[:num_bytes]

(And the same caveats apply to this sketch as applied to the ``generate_unpredictable_password`` sketch of ``/dev/urandom`` above.)
There are five places where CPython and the standard library attempt to use the operating system's random number generator, and thus five places where this decision has to be made:

* initializing the SipHash used to protect ``str.__hash__`` and friends
  against DoS attacks (called unconditionally at startup)
* initializing the ``random`` module (called when ``random`` is imported)
* servicing user calls to the ``os.urandom`` public API
* the higher level ``random.SystemRandom`` public API
* the new ``secrets`` module public API added by PEP 506

Currently, these five places all use the same underlying code, and thus make this decision in the same way.

This whole problem was first noticed because 3.5.0 switched that underlying code to the ``generate_unpredictable_bytes_or_block`` behavior, and it turns out that there are some rare cases where Linux boot scripts attempted to run a Python program as part of system initialization, the Python startup sequence blocked while trying to initialize SipHash, and then this triggered a deadlock because the system stopped doing anything -- including gathering new entropy -- until the Python script was forcibly terminated by an external timer. This is particularly unfortunate since the scripts in question never processed untrusted input, so there was no need for SipHash to be initialized with provably unpredictable random data in the first place. This motivated the change in 3.5.2rc1 to emulate the old ``/dev/urandom`` behavior in all cases (by calling ``getrandom()`` in non-blocking mode, and then falling back to reading ``/dev/urandom`` if the syscall indicates that the ``/dev/urandom`` pool is not yet fully initialized).

A similar problem was found due to the ``random`` module calling ``os.urandom`` as a side-effect of import in order to seed the default global ``random.Random()`` instance.
We have not received any specific complaints regarding direct calls to ``os.urandom()`` or ``random.SystemRandom()`` blocking with 3.5.0 or 3.5.1 -- only problem reports due to the implicit blocking on interpreter startup and as a side-effect of importing the ``random`` module. Accordingly, this PEP proposes providing consistent shared behaviour for the latter three cases (ensuring that their behaviour is unequivocally suitable for all security sensitive operations), while updating the first two cases to account for that behavioural change.

This approach should mean that the vast majority of Python users never need to even be aware that this change was made, while those few whom it affects will receive an exception at runtime that they can look up online and find suitable guidance on addressing.

References
==========

.. [1] os.urandom() should use Linux 3.17 getrandom() syscall
   (http://bugs.python.org/issue22181)

.. [2] Python 3.5 running on Linux kernel 3.17+ can block at startup or on
   importing the random module on getrandom()
   (http://bugs.python.org/issue26839)

.. [3] "import random" blocks on entropy collection on Linux with low entropy
   (http://bugs.python.org/issue25420)

.. [4] os.urandom() doesn't block on Linux anymore
   (https://hg.python.org/cpython/rev/9de508dc4837)

.. [5] Proposal to add os.getrandom()
   (http://bugs.python.org/issue26839#msg267803)

.. [6] Add os.urandom_block()
   (http://bugs.python.org/issue27250)

.. [7] Add random.cryptorandom() and random.pseudorandom, deprecate os.urandom()
   (http://bugs.python.org/issue27279)

.. [8] Always use getrandom() in os.random() on Linux and add block=False
   parameter to os.urandom()
   (http://bugs.python.org/issue27266)

For additional background details beyond those captured in this PEP, also see Victor Stinner's summary at http://haypo-notes.readthedocs.io/pep_random.html

Copyright
=========

This document has been placed into the public domain.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 22 Jun 2016, at 02:28, Nick Coghlan <ncoghlan@gmail.com> wrote:
> Hi folks,
>
> Over the weekend, Nathaniel Smith and I put together a proposal to allow security sensitive APIs (os.urandom, random.SystemRandom and the new secrets module) to throw BlockingIOError if the operating system's random number generator isn't ready.
In general I like this approach. One note inline below.
> Limitations on scope
> --------------------
>
> No changes are proposed for Windows or Mac OS X systems, as neither of those platforms provides any mechanism to run Python code before the operating system random number generator has been initialized. Mac OS X goes so far as to kernel panic and abort the boot process if it can't properly initialize the random number generator (although Apple's restrictions on the supported hardware platforms make that exceedingly unlikely in practice).
>
> Similarly, no changes are proposed for other \*nix systems where ``os.urandom()`` will currently block waiting for the system random number generator to be initialized, rather than returning values that are potentially unsuitable for use in security sensitive applications.
You may want to be careful around this point. Solaris provides a getrandom() syscall as well, which Python *does* use. Furthermore, if other *nix OSes provide a getrandom() syscall, then the current Python code will favour it over the urandom fallback: care should be taken to clarify what the expected plan is in these cases.

Cory

[Nick Coghlan]
> PEP: 522
> Title: Allow BlockingIOError in security sensitive APIs on Linux
> ...
> Other than for ``random.SystemRandom`` (which is a relatively thin wrapper around ``os.urandom``), the ``random`` module has never made any guarantees that the numbers it generates are suitable for use in security sensitive operations,
To the contrary, it explicitly says it "should not be used for security purposes".
> so the use of the system random number generator to seed the default Mersenne Twister instance is mainly beneficial as a harm mitigation measure for code that is using the ``random`` module inappropriately.
Except that's largely accidental. It so happens that using urandom() left Python immune to the "poor seeding" attacks in the PHP paper widely discussed when `secrets` was gestating, and it's entirely accidental that Python 3 (but not Python 2) happens to implement random.choice(), .randrange(), etc in such a way as to leave it resistant even to the PHP paper's "deduce MT state from partial outputs" attacks. Even wholly naive "generate a password" snippets using small alphabets with random.choice() are highly resistant to state-deducing attacks in Python 3. Those continue to be worth something.

But the _real_ reason MT uses urandom() is that MT has massive internal state, and initialization wants the best chance it can get at picking any of the 2**19937-1 possible initial states. For example, seeding with time.time() and/or pid can't possibly get at more than an infinitesimal fraction of those. This has nothing to do with "security" - it has to do with best practice for simulations. Seeding the Twister (any PRNG with massive state) "fairly" is a puzzle, and seeding from urandom() was the best that could be done. Quite possibly the system CSPRNG has only 512 bits of state, but that's still far better than brewing pseudo-nonsense out of a comparative handful of time.time() (etc) bits.
Since a single call to ``os.urandom()`` is cheap once the system random number generator has been initialized, it makes sense to retain that as the default behaviour, but there's no need to issue a warning when falling back to a potentially more predictable alternative when necessary (in such cases, a warning will typically already have been issued as part of interpreter startup, as the only way for the call when importing the random module to fail without the implicit call during interpreter startup also failing is for the latter to have been skipped by entirely disabling the hash randomization mechanism).
Since the set of people who start simulations very early in the boot sequence is empty, I have no objection to any change here - so long as MT initialization continues using the OS RNG when possible.
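As a rough illustration of the seeding trade-off being discussed (this is a hedged Python-level sketch with a hypothetical helper name; CPython actually performs this seeding in C when the random module is imported):

```python
import os
import random
import time

def make_seeded_twister():
    """Sketch of the random module's seeding strategy: prefer the OS
    RNG for its much larger seed space, and fall back to time/pid
    derived bits only when the OS RNG is unavailable."""
    try:
        # 32 bytes = 256 bits of OS-provided seed material
        seed = int.from_bytes(os.urandom(32), "big")
    except (BlockingIOError, NotImplementedError):
        # Far weaker fallback: only a handful of unpredictable bits
        seed = int(time.time() * 1e6) ^ os.getpid()
    return random.Random(seed)
```

The fallback path illustrates Tim's point: a time/pid seed can only ever reach an infinitesimal fraction of the Twister's 2**19937-1 possible initial states, so the OS RNG is preferred whenever it is available.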

The new exception would potentially be encountered in the following situations:
* Python code calling these APIs during Linux system initialization
I'm not sure that there is such a use case in practice. Can you please try to describe a use case where you would need a blocking system urandom *during Python initialization*? It looks like my use case 1, but I consider that os.urandom() is *not* called in such a use case: https://haypo-notes.readthedocs.io/pep_random.html#use-case-1-init-script
* Python code running on improperly initialized Linux systems (e.g. embedded hardware without adequate sources of entropy to seed the system random number generator, or Linux VMs that aren't configured to accept entropy from the VM host)
If the program doesn't use os.urandom(), well, we don't care, there is no issue :-) IMO the interesting use case is when the application really requires a secure secret. That's my use case 2, a web server: https://haypo-notes.readthedocs.io/pep_random.html#use-case-2-web-server I chose not to give the choice to the developer and to block in such a case. IMO it's acceptable because the application should not have to wait forever for urandom.
Changing ``os.urandom()`` on Linux
----------------------------------
This PEP proposes that in Python 3.6+, ``os.urandom()`` be updated to call the new Linux ``getrandom()`` syscall in non-blocking mode if available and raise ``BlockingIOError: system random number generator is not ready`` if the kernel reports that the call would block.
To be clear, the behaviour is unchanged on other platforms, right? I'm just trying to understand the scope of the PEP. Like mine, it looks like it is written for Linux (even if other platforms may implement the same behaviour later, if needed). If it's deliberate to restrict it to Linux, you may want to be more explicit, at least in the abstract. -- By the way, are you aware of other programming languages or applications using an exception when random would block? (It's not a requirement, I'm just curious.)
By contrast, if ``BlockingIOError`` is raised in those situations, then developers using Python 3.6+ can easily choose their desired behaviour:
1. Loop until the call succeeds (security sensitive)
Is this case different from a blocking os.urandom()?
2. Switch to using the random module (non-security sensitive)
Hum, I disagree on this point. I don't think that you should start with os.urandom() and fall back on random. In fact, I only know *one* use case for this: creating the random.Random instance when the random module is imported. In my PEP, I proposed to have a special case for the random.Random constructor, implemented in C (so as not to have to expose anything at the Python level).
3. Switch to reading ``/dev/urandom`` directly (non-security sensitive)
It is what I propose for the random.Random constructor when the random module is imported. Again, the question is whether there is a real use case for it. And if yes, is the use case common enough to justify the change? The extreme case is that all applications using os.urandom() would need to be modified to add a try/except BlockingIOError. I only exaggerate to try to understand the impact of your PEP. I expect that only a few applications will use such a try/except in practice. As I tried to explain in my PEP, with Python 3.5.2, "the bug" (blocking on random) became very unlikely.
Issuing a warning for potentially predictable internal hash initialization
I don't recall Python logging warnings for similar issues. But I don't recall similar issues either :-)
The challenge for internal hash initialization is that it might be very important to initialize SipHash with a reliably unpredictable random seed (for processes that are exposed to potentially hostile input) or it might be totally unimportant (for processes that never have to deal with untrusted data).
From what I read, /dev/urandom is good even before it is considered initialized, because the kernel collects various data but doesn't increase the entropy estimate.
I'm not completely convinced that a warning is needed. I'm not against it either. I am doubtful. :-) Well, let's say that we have a warning. What should the user do in such a case? Is it advice to dig into the urandom issue and try to get more entropy? The warning is for users, no? I imagine that an application can work perfectly for the developer, but emit the warning only for some users, depending on how they deploy their application.
However, at the same time, since Python has no way to know whether any given invocation needs to handle untrusted data, when the default SipHash initialization fails this *might* indicate a genuine security problem, which should not be allowed to pass silently.
An alternative would be to provide a read-only flag which would indicate whether the hash secret is considered "secure" or not. Applications concerned with security would check the flag and decide for themselves whether to emit a warning or not.
Accordingly, if internal hash initialization needs to fall back to a potentially predictable seed due to the system random number generator not being ready, it will also emit a warning message on ``stderr`` to say that the system random number generator is not available and that processing potentially hostile untrusted data should be avoided.
I know that many of you disagree with me, but I'm not sure that the hash DoS is an important issue. We should not overestimate the importance of this vulnerability.
Affected security sensitive applications
----------------------------------------
Security sensitive applications would need to either change their system configuration so the application is only started after the operating system random number generator is ready for security sensitive operations, or else change their code to busy loop until the operating system is ready::
    def blocking_urandom(num_bytes):
        while True:
            try:
                return os.urandom(num_bytes)
            except BlockingIOError:
                pass
Such a busy loop may use a lot of CPU :-/ You need a time.sleep() or something like that, no? A blocking os.urandom() doesn't have such an issue ;-) Is it possible that os.urandom() works, but the following os.urandom() call raises a BlockingIOError? If yes, there is an issue with "partial reads", and we should use a dedicated exception to return partial data. Hopefully, I understood that the issue doesn't occur in practice: os.urandom() starts with BlockingIOError, but once it "works", it will work forever. Well, at least on Linux. I don't know how Solaris behaves. I hope that it behaves like Linux (once it works, it always works). At least, I see that Solaris getrandom() can also fail with EAGAIN.
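Following Victor's suggestion, a sleep-based variant of the busy loop above avoids spinning on the CPU (the helper name and poll interval are illustrative assumptions, not part of the PEP):

```python
import os
import time

def urandom_with_retry(num_bytes, poll_interval=0.1):
    """Like the blocking_urandom() loop above, but sleeps between
    attempts instead of busy-waiting, so a not-yet-seeded system RNG
    doesn't pin a CPU core."""
    while True:
        try:
            return os.urandom(num_bytes)
        except BlockingIOError:
            time.sleep(poll_interval)
```

On a system whose RNG is already initialized the first call succeeds immediately, so the sleep path is only ever taken during early boot.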
Affected non-security sensitive applications
--------------------------------------------
Non-security sensitive applications that don't want to assume access to ``/dev/urandom`` (or assume a non-blocking implementation of that device) can be updated to use the ``random`` module as a fallback option::
    def pseudorandom_fallback(num_bytes):
        try:
            return os.urandom(num_bytes)
        except BlockingIOError:
            return random.getrandbits(num_bytes*8).to_bytes(num_bytes, "little")
Depending on the application, it may also be appropriate to skip accessing ``os.urandom`` at all, and instead rely solely on the ``random`` module.
Hum, I dislike such a change. It overcomplicates applications for a corner case. If you use os.urandom(), you already expect security. I prefer to simplify the use cases to two: (1) you really need security, or (2) you really don't care about security. If you don't care, use the random module directly. Don't bother with os.urandom() or with having to add try/except BlockingIOError. No? I *hope* that a regular application will never see BlockingIOError on os.urandom() in the wild.
Affected Linux specific non-security sensitive applications
-----------------------------------------------------------
Non-security sensitive applications that don't need to worry about cross platform compatibility and are willing to assume that ``/dev/urandom`` on Linux will always retain its current behaviour can be updated to access ``/dev/urandom`` directly::
    def dev_urandom(num_bytes):
        with open("/dev/urandom", "rb") as f:
            return f.read(num_bytes)
Again, I'm against adding such complexity for a corner case. Just use os.urandom().
For additional background details beyond those captured in this PEP, also see Victor Stinner's summary at http://haypo-notes.readthedocs.io/pep_random.html
Oh, I didn't expect to have references to my document :-) I moved it to: https://haypo-notes.readthedocs.io/summary_python_random_issue.html http://haypo-notes.readthedocs.io/pep_random.html is now really a PEP ;-) Victor

On 23 June 2016 at 15:54, Victor Stinner <victor.stinner@gmail.com> wrote:
The new exception would potentially be encountered in the following situations:
* Python code calling these APIs during Linux system initialization
I'm not sure that there is such a use case in practice.
Can you please try to describe a use case where you would need a blocking system urandom *during Python initialization*?
It looks like my use case 1, but I consider that os.urandom() is *not* called on such use case: https://haypo-notes.readthedocs.io/pep_random.html#use-case-1-init-script
My preference for an exception comes from the fact that we can never prove the non-existence of proprietary software that does certain things, but we *can* ensure that such code gets an easy to debug exception rather than a potential deadlock if it does exist.

The argument chain runs:

- if such software doesn't exist, it doesn't matter which behaviour we choose
- if we're wrong and it does exist, we can choose how it fails:

  - blocking (with associated potential for init system deadlock)
  - throwing an exception

Given the choice between debugging an apparent system hang and an unexpected exception when testing against a new version of a platform, I'll choose the exception every time.
* Python code running on improperly initialized Linux systems (e.g. embedded hardware without adequate sources of entropy to seed the system random number generator, or Linux VMs that aren't configured to accept entropy from the VM host)
If the program doesn't use os.urandom(), well, we don't care, there is no issue :-)
IMO the interesting use case is when the application really requires a secure secret. That's my use case 2, a web server: https://haypo-notes.readthedocs.io/pep_random.html#use-case-2-web-server
I chose not to give the choice to the developer and to block in such a case. IMO it's acceptable because the application should not have to wait forever for urandom.
Should not, but actually can, depending on the characteristics of the underlying system and its runtime environment.
Changing ``os.urandom()`` on Linux
----------------------------------
This PEP proposes that in Python 3.6+, ``os.urandom()`` be updated to call the new Linux ``getrandom()`` syscall in non-blocking mode if available and raise ``BlockingIOError: system random number generator is not ready`` if the kernel reports that the call would block.
To be clear, the behaviour is unchanged on other platforms, right?
Cory Benfield pointed out that the proposal as currently written isn't clear as to whether or not it applies to recent versions of Solaris and Illumos, as they also provide a getrandom() syscall.
I'm just trying to understand the scope of the PEP. It looks like as mine, it is written for Linux. (Even if other platforms may implement the same behaviour later, if needed.)
If it's deliberate to restrict to Linux, you may be more explicit at least in the abstract.
It's in the PEP title: "Allow BlockingIOError in security sensitive APIs on Linux" However, I need to update it to indicate it applies to any system that provides a non-blocking getrandom() syscall.
--
By the way, are you aware of other programming languages or applications using an exception when random would block? (It's not a requirement, I'm just curious.)
No, but I haven't really gone looking either. It's also worth keeping in mind that it's only in the last 12 months that folks have even had the *option* of doing better than just reading from /dev/urandom and hoping it's been initialised properly.
By contrast, if ``BlockingIOError`` is raised in those situations, then developers using Python 3.6+ can easily choose their desired behaviour:
1. Loop until the call succeeds (security sensitive)
Is this case different from a blocking os.urandom()?
Yes, as it's up to the application to decide when it wants to check for the system RNG being ready, and how it wants to report that to the user. For example, it may decide to emit a runtime warning before it enters the busy loop (I'm actually having a discussion with Donald in another thread regarding a possible design for a "secrets.wait_for_system_rng()" API that meshes well with the other changes proposed in PEP 522).
2. Switch to using the random module (non-security sensitive)
Hum, I disagree on this point. I don't think that you should start with os.urandom() and fall back on random.
In fact, I only know *one* use case for this: creating the random.Random instance when the random module is imported.
In my PEP, I proposed to have a special case for the random.Random constructor, implemented in C (so as not to have to expose anything at the Python level).
We have two use cases for a fallback just in the standard library (SipHash initialisation and random module initialisation). Rather than assuming no other use cases for the feature exist, we can expose the fallback mechanism we use ourselves and let people decide for themselves whether or not they want to do something similar.
3. Switch to reading ``/dev/urandom`` directly (non-security sensitive)
It is what I propose for the random.Random constructor when the random module is imported.
Again, the question is whether there is a real use case for it. And if yes, is the use case common enough to justify the change?
The extreme case is that all applications using os.urandom() would need to be modified to add a try/except BlockingIOError. I only exaggerate to try to understand the impact of your PEP. I expect that only a few applications will use such a try/except in practice.
That's where the idea of also adding secrets.wait_for_system_rng() comes in, rather than having to wrap every library call in a try/except block (or risk having those APIs become blocking ones, such that async developers feel obliged to call them in a separate thread).
As I tried to explain in my PEP, with Python 3.5.2, "the bug" (blocking on random) became very unlikely.
Aye, I agree with that (hence the references to this being an obscure, Linux-specific problem in PEP 522). However, I think it makes sense to stipulate that someone porting to Python 3.6 *has* unexpectedly encountered the new behaviour, and is trying to debug what has gone wrong with their application/system when comparing the two designs for usability.
Issuing a warning for potentially predictable internal hash initialization
I don't recall Python logging warnings for similar issues. But I don't recall similar issues neither :-)
It's a pretty unique problem, and not one we've been able to detect in the past.
The challenge for internal hash initialization is that it might be very important to initialize SipHash with a reliably unpredictable random seed (for processes that are exposed to potentially hostile input) or it might be totally unimportant (for processes that never have to deal with untrusted data).
From what I read, /dev/urandom is good even before it is considered initialized, because the kernel collects various data but doesn't increase the entropy estimate.
I'm not completely convinced that a warning is needed. I'm not against it either. I am doubtful. :-)
Well, let's say that we have a warning. What should the user do in such a case? Is it advice to dig into the urandom issue and try to get more entropy?
The warning is for users, no? I imagine that an application can work perfectly for the developer, but emit the warning only for some users, depending on how they deploy their application.
It's a warning primarily for system integrators (i.e. the folks developing a distro, designing an embedded device or configuring a VM) that they need to either:

- reconfigure the application to start later in the boot process (e.g. after the network comes up)
- write a systemd ExecStartPre snippet that waits for the system RNG to be initialised (that will be particularly easy if it can be written as "python3 -c 'import secrets; secrets.wait_for_system_rng()'")
- add a better entropy source to their system

The kind of wording I'm thinking of is along the lines of: "Python hash initialization: using potentially predictable fallback hash seed; avoid handling untrusted potentially hostile data in this process"
However, at the same time, since Python has no way to know whether any given invocation needs to handle untrusted data, when the default SipHash initialization fails this *might* indicate a genuine security problem, which should not be allowed to pass silently.
An alternative would be to provide a read-only flag which would indicate whether the hash secret is considered "secure" or not.
Applications concerned with security would check the flag and decide for themselves whether to emit a warning or not.
I really don't want to add any more knobs and dials that need to be documented and learned if we can possibly avoid it (and I think we can). In this case, turning off hash randomisation entirely will suppress the warning along with hash randomisation itself.
Accordingly, if internal hash initialization needs to fall back to a potentially predictable seed due to the system random number generator not being ready, it will also emit a warning message on ``stderr`` to say that the system random number generator is not available and that processing potentially hostile untrusted data should be avoided.
I know that many of you disagree with me, but I'm not sure that the hash DoS is an important issue.
We should not overestimate the importance of this vulnerability.
It was never particularly important (the payload multiplier on the Denial-of-Service isn't that big), but it was high profile and splashy, and it's relatively cheap to take into account (since folks that know it doesn't apply to them can still turn randomization off entirely)
Affected security sensitive applications
----------------------------------------
Security sensitive applications would need to either change their system configuration so the application is only started after the operating system random number generator is ready for security sensitive operations, or else change their code to busy loop until the operating system is ready::
    def blocking_urandom(num_bytes):
        while True:
            try:
                return os.urandom(num_bytes)
            except BlockingIOError:
                pass
Such a busy loop may use a lot of CPU :-/ You need a time.sleep() or something like that, no?
Maybe - we can work out the exact details once I've added the secrets.wait_for_system_rng() proposal to the PEP.
A blocking os.urandom() doesn't have such issue ;-)
It also doesn't let an app fail gracefully if it opts not to support running without a pre-initialised system RNG :)
Is it possible that os.urandom() works, but the following os.urandom() call raises a BlockingIOError? If yes, there is an issue with "partial reads", and we should use a dedicated exception to return partial data.
No, it's not possible with os.urandom(). (It *can* happen with /dev/random and with getentropy() on OpenBSD and Solaris, which is why folks say "don't use those for anything")
Hopefully, I understood that the issue doesn't occur in practice: os.urandom() starts with BlockingIOError, but once it "works", it will work forever. Well, at least on Linux.
I don't know how Solaris behaves. I hope that it behaves like Linux (once it works, it always works). At least, I see that Solaris getrandom() can also fail with EAGAIN.
It's the same logic as Linux (once a CSPRNG is properly seeded it can never run out of entropy, but seeding it in the first place does require entropy collection)
Affected non-security sensitive applications
--------------------------------------------
Non-security sensitive applications that don't want to assume access to ``/dev/urandom`` (or assume a non-blocking implementation of that device) can be updated to use the ``random`` module as a fallback option::
    def pseudorandom_fallback(num_bytes):
        try:
            return os.urandom(num_bytes)
        except BlockingIOError:
            return random.getrandbits(num_bytes*8).to_bytes(num_bytes, "little")
Depending on the application, it may also be appropriate to skip accessing ``os.urandom`` at all, and instead rely solely on the ``random`` module.
Hum, I dislike such a change. It overcomplicates applications for a corner case.
If you use os.urandom(), you already expect security. I prefer to simplify the use cases to two: (1) you really need security, or (2) you really don't care about security. If you don't care, use the random module directly. Don't bother with os.urandom() or with having to add try/except BlockingIOError. No?
I *hope* that a regular application will never see BlockingIOError on os.urandom() in the wild.
Yeah, hence why I'm shifting more in favour of the secrets.wait_for_system_rng() idea (which folks can then use as inspiration to write their own "wait for the system RNG" helpers for earlier Python and operating system versions)
Affected Linux specific non-security sensitive applications
-----------------------------------------------------------
Non-security sensitive applications that don't need to worry about cross platform compatibility and are willing to assume that ``/dev/urandom`` on Linux will always retain its current behaviour can be updated to access ``/dev/urandom`` directly::
    def dev_urandom(num_bytes):
        with open("/dev/urandom", "rb") as f:
            return f.read(num_bytes)
Again, I'm against adding such complexity for a corner case. Just use os.urandom().
All of this would be triggered by *application* developers actually hitting the BlockingIOError and deciding it was the appropriate course of action for *their* application. The point of this part of the PEP is to highlight that there are some really simple 3-5 line functions that let developers get a wide variety of behaviours in ways that are compatible with single-source Python 2/3 code.
For additional background details beyond those captured in this PEP, also see Victor Stinner's summary at http://haypo-notes.readthedocs.io/pep_random.html
Oh, I didn't expect to have references to my document :-) I moved it to: https://haypo-notes.readthedocs.io/summary_python_random_issue.html
http://haypo-notes.readthedocs.io/pep_random.html is now really a PEP ;-)
Cool, I'll update the first reference and also add a reference to your draft PEP. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Jun 23, 2016, at 8:33 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
The argument chain runs:
- if such software doesn't exist, it doesn't matter which behaviour we choose
- if we're wrong and it does exist, we can choose how it fails:

  - blocking (with associated potential for init system deadlock)
  - throwing an exception
Given the choice between debugging an apparent system hang and an unexpected exception when testing against a new version of a platform, I'll choose the exception every time.
I think the biggest argument for blocking is that there really exist two sorts of situations that blocking can happen in:

* It blocks for a tiny amount of time (maybe <1s), nobody ever notices, and people feel like things “just work”.
* It blocks for a long time (possibly forever, depending on where in the boot sequence Python is being used).

In the second case I think it’s pretty obvious that an exception is better than hanging forever, but in the first case an exception might actually cause people to go out of their way to do something bad to “stop the pain”. My personal preference is waffling back and forth between them based on which of the two above I feel is more likely to occur in practice.

— Donald Stufft

On 06/23/2016 05:46 PM, Donald Stufft wrote:
On Jun 23, 2016, at 8:33 PM, Nick Coghlan wrote:
The argument chain runs:
- if such software doesn't exist, it doesn't matter which behaviour we choose
- if we're wrong and it does exist, we can choose how it fails:

  - blocking (with associated potential for init system deadlock)
  - throwing an exception
Given the choice between debugging an apparent system hang and an unexpected exception when testing against a new version of a platform, I'll choose the exception every time.
I think the biggest argument for blocking is that there really exist two sorts of situations that blocking can happen in:

* It blocks for a tiny amount of time (maybe <1s), nobody ever notices, and people feel like things “just work”.
* It blocks for a long time (possibly forever, depending on where in the boot sequence Python is being used).

In the second case I think it’s pretty obvious that an exception is better than hanging forever, but in the first case an exception might actually cause people to go out of their way to do something bad to “stop the pain”. My personal preference is waffling back and forth between them based on which of the two above I feel is more likely to occur in practice.
Can we build in a small wait? As in, check every second for ten seconds, and if we still don't have entropy then raise? -- ~Ethan~
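A bounded wait along the lines Ethan suggests could be sketched like this (hypothetical helper; the one-second interval and ten-second timeout are taken from his message):

```python
import os
import time

def wait_for_entropy(timeout=10.0, interval=1.0):
    """Poll the system RNG once per second for up to ten seconds,
    then re-raise BlockingIOError if it is still not ready."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            os.urandom(1)
            return
        except BlockingIOError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)
```

This gives the "just works" behaviour for short waits while still failing loudly, rather than hanging, on a system whose RNG never becomes ready.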

On 23 June 2016 at 17:46, Donald Stufft <donald@stufft.io> wrote:
On Jun 23, 2016, at 8:33 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
The argument chain runs:
- if such software doesn't exist, it doesn't matter which behaviour we choose
- if we're wrong and it does exist, we can choose how it fails:

  - blocking (with associated potential for init system deadlock)
  - throwing an exception
Given the choice between debugging an apparent system hang and an unexpected exception when testing against a new version of a platform, I'll choose the exception every time.
I think the biggest argument for blocking is that there really exist two sorts of situations that blocking can happen in:

* It blocks for a tiny amount of time (maybe <1s), nobody ever notices, and people feel like things “just work”.
* It blocks for a long time (possibly forever, depending on where in the boot sequence Python is being used).

In the second case I think it’s pretty obvious that an exception is better than hanging forever, but in the first case an exception might actually cause people to go out of their way to do something bad to “stop the pain”. My personal preference is waffling back and forth between them based on which of the two above I feel is more likely to occur in practice.
That's fair, and it's a large part of why I realised PEP 522 needed a standard library answer for "just wait until the system RNG is ready, please".

I'll also note that I'm open to being convinced that it's OK for "import secrets" to be that answer - my main argument against it is just a general principle that imports shouldn't have side effects, and blocking waiting for an external state change is a side effect.

Standing against that is the argument that we wouldn't want the recommended idiom for using the secrets module to become the boilerplatish:

    import secrets
    secrets.wait_for_system_rng()

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Jun 23, 2016, at 9:40 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 23 June 2016 at 17:46, Donald Stufft <donald@stufft.io> wrote:
On Jun 23, 2016, at 8:33 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
The argument chain runs:
- if such software doesn't exist, it doesn't matter which behaviour we choose
- if we're wrong and it does exist, we can choose how it fails:

  - blocking (with associated potential for init system deadlock)
  - throwing an exception
Given the choice between debugging an apparent system hang and an unexpected exception when testing against a new version of a platform, I'll choose the exception every time.
I think the biggest argument for blocking is that there really exist two sorts of situations that blocking can happen in:

* It blocks for a tiny amount of time (maybe <1s), nobody ever notices, and people feel like things “just work”.
* It blocks for a long time (possibly forever, depending on where in the boot sequence Python is being used).

In the second case I think it’s pretty obvious that an exception is better than hanging forever, but in the first case an exception might actually cause people to go out of their way to do something bad to “stop the pain”. My personal preference is waffling back and forth between them based on which of the two above I feel is more likely to occur in practice.
That's fair, and it's a large part of why I realised PEP 522 needed a standard library answer for "just wait until the system RNG is ready, please".
I'll also note that I'm open to being convinced that it's OK for "import secrets" to be that answer - my main argument against it is just a general principle that imports shouldn't have side effects, and blocking waiting for an external state change is a side effect.
Standing against that is the argument that we wouldn't want the recommended idiom for using the secrets module to become the boilerplatish:
    import secrets
    secrets.wait_for_system_rng()
An alternative here is to just make every function in secrets ensure it waits for the system RNG, possibly by calling said wait_for_system_rng() function (if we still think it's worth making it a public API), with a global that gets set once the system RNG has been seen to be ready.

The fallback to /dev/random may be a bad idea though, even if it's only done once per process: I can imagine a case where someone is using ephemeral processes, so they end up hitting /dev/random regularly. Using getrandom() for this is fine because that state is per machine, not per process, but the Python level "has the RNG been initialized" flag is per process, so that could end up with the unintended side effect of hitting /dev/random a lot.

— Donald Stufft

On 23 June 2016 at 18:47, Donald Stufft <donald@stufft.io> wrote:
Standing against that is the argument that we wouldn't want the recommended idiom for using the secrets module to become the boilerplatish:
import secrets
secrets.wait_for_system_rng()
Alternative here is to just make every function in secrets ensure it waits for the system RNG, possibly by calling said wait_for_system_rng() function if we still think it’s worth making it a public API, with a global flag that gets set once readiness has been recorded.
While we could definitely do that, I think the complexity of it would push me towards Victor's "just make os.urandom potentially blocking at system startup" proposal. If 522 is going to make sense, I think it needs to be framed in a way that makes blocking for the system RNG clearly an at-most-once-per-process activity.
The fallback to /dev/random may be a bad idea though, even if it’s only done once per process. I can imagine a case where someone is using ephemeral processes, so they end up hitting /dev/random regularly. Using getrandom() for this is fine because that state is per machine, not per process, but the Python-level “has the RNG been initialized” flag is per process, so that could end up with an unintended side effect of hitting /dev/random a lot.
That's the bug that led to me changing the suggested code to try os.urandom() once first, before falling back to blocking on /dev/random. Once the system RNG is ready, that first call will always succeed, no matter how many new processes you start.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
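The fallback idiom Nick refers to could look something like this sketch (the function name is illustrative): attempt the non-blocking system RNG first, and only hit /dev/random if that raises BlockingIOError, i.e. at most once per boot rather than once per process.

```python
import os

def urandom_blocking_fallback(nbytes):
    """Sketch of the PEP 522 fallback idiom: non-blocking first,
    /dev/random only as a last resort (Linux-specific)."""
    try:
        # Once the kernel RNG is ready this always succeeds, no matter
        # how many fresh (e.g. ephemeral) processes call it.
        return os.urandom(nbytes)
    except BlockingIOError:
        # Reached only before the kernel RNG is initialised, so the
        # blocking /dev/random read happens at most once per boot.
        with open("/dev/random", "rb", buffering=0) as f:
            return f.read(nbytes)
```

This is why ephemeral processes don't drain /dev/random: after boot, the non-blocking path is the only one ever taken.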

2016-06-24 2:46 GMT+02:00 Donald Stufft <donald@stufft.io>:
I think the biggest argument for blocking is that there really exist two sorts of situations that blocking can happen in:

* It blocks for a tiny amount of time (maybe <1s), nobody ever notices, and people feel like things “just work”.
* It blocks for a long time (possibly forever, depending on where in the boot sequence Python is being used) and it hangs for a long time (or forever).

In the second case I think it’s pretty obvious that an exception is better than hanging forever, but in the first case an exception might actually cause people to go out of their way to do something bad to “stop the pain”. My personal preference waffles back and forth between them based on which of the two above I feel is more likely to occur in practice.
Maybe I'm wrong, but *starting* to raise BlockingIOError looks like the opposite of the direction taken by Python with EINTR (PEP 475). We had to add try/except InterruptedError in many modules (asyncio, io, multiprocessing, selectors, socket, socketserver, subprocess), but it was decided to fix the root issue: retry the syscall directly in the C code if it failed with EINTR, so you never have to handle InterruptedError at the Python level anymore.

For EINTR, it was decided that the common case is to restart the syscall automatically. The rare case is when the user expects the program to really be interrupted, and that case requires raising an exception in the signal handler.

FYI, PEP 475 has a minor incompatible change: programs relying on EINTR with a signal handler that didn't raise a Python exception were broken by it. They had to modify their signal handler to raise an exception. I recall having to fix *one* library and then... nothing, nobody complained. I was surprised, I expected the "rare" case to be more common than that :-)

To come back to urandom: the common case is to wait for random; the exception is to want to be notified and run special code. Maybe it's not worth modifying all libraries and applications for the exceptional case; maybe instead add a special function for it. In a different thread, I proposed to expose os.getrandom() even if my PEP (blocking os.urandom) is accepted, because getrandom() provides features not available using os.urandom() alone.

What do you think of making os.urandom() blocking on Linux, but also adding os.getrandom() to handle the exceptional case?

Victor
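The pre-PEP-475 idiom Victor alludes to can be sketched like this: before Python 3.5, a signal arriving mid-syscall surfaced as InterruptedError (EINTR), so libraries wrapped calls in a retry loop. Since PEP 475 the interpreter performs this retry in C, and a plain os.read() suffices.

```python
import os

def read_retrying_eintr(fd, nbytes):
    """Retry a read until it completes, swallowing EINTR.
    This mirrors what the try/except InterruptedError boilerplate in
    asyncio, io, selectors, etc. looked like before PEP 475."""
    while True:
        try:
            return os.read(fd, nbytes)
        except InterruptedError:
            # Retry the syscall - exactly what CPython >= 3.5 now does
            # in C, unless the signal handler raised an exception.
            continue
```

Victor's point is that PEP 475 resolved the "common case wants a retry, rare case wants an exception" tension by retrying automatically, which is the opposite shape from PEP 522's raise-by-default.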

On 24 June 2016 at 04:34, Victor Stinner <victor.stinner@gmail.com> wrote:
2016-06-24 2:46 GMT+02:00 Donald Stufft <donald@stufft.io>:
I think the biggest argument for blocking is that there really exist two sorts of situations that blocking can happen in:

* It blocks for a tiny amount of time (maybe <1s), nobody ever notices, and people feel like things “just work”.
* It blocks for a long time (possibly forever, depending on where in the boot sequence Python is being used) and it hangs for a long time (or forever).

In the second case I think it’s pretty obvious that an exception is better than hanging forever, but in the first case an exception might actually cause people to go out of their way to do something bad to “stop the pain”. My personal preference waffles back and forth between them based on which of the two above I feel is more likely to occur in practice.
Maybe I'm wrong, but *starting* to raise BlockingIOError looks like the opposite direction taken by Python with EINTR (PEP 475).
The difference I see here is that EINTR really can happen at any time, while the transition from "system RNG is not ready" to "system RNG is ready" is a once-per-boot deal (and in most cases, the operating system itself handles making sure the RNG is initialised before it starts running userspace processes).

As such, the idioms I currently have in PEP 522 are wrong - the "wait for the system RNG or not" decision wouldn't be one to be made on a per-call basis, but rather on a per-__main__ execution basis, with developers choosing which user experience they want to support on systems with a non-blocking /dev/urandom:

* this application will fail if you run it before the system RNG is ready (so you may need to add "ExecStartPre=python3 -c 'import secrets; secrets.wait_for_system_rng()'" in your systemd unit file)
* this application implicitly calls "secrets.wait_for_system_rng()" and hence may block waiting for the system RNG if you run it before the system RNG is ready

The default state of Python 3.6+ applications would be the first one, and I think that's an entirely reasonable default - if you're writing userspace code that runs before the system RNG is ready, you're out of the world of normal software development and into the world of operating system developers, system integrators and embedded system designers.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

2016-06-24 22:05 GMT+02:00 Nick Coghlan <ncoghlan@gmail.com>:
As such, the idioms I currently have in PEP 522 are wrong - the "wait for the system RNG or not" decision wouldn't be one to be made on a per-call basis, but rather on a per-__main__ execution basis, with developers choosing which user experience they want to support on systems with a non-blocking /dev/urandom:
* this application will fail if you run it before the system RNG is ready (so you may need to add "ExecStartPre=python3 -c 'import secrets; secrets.wait_for_system_rng()'" in your systemd unit file)
In short, if an application is not run using systemd but directly on the command line, it *can* fail with a fatal BlockingIOError? Wait, I don't think that is acceptable behaviour from the user's point of view. Compared to Python 2.7, Python 3.4 and Python 3.5.2, where os.urandom() never blocks nor raises an exception on Linux, such a behaviour change can be seen as a major regression.
* this application implicitly calls "secrets.wait_for_system_rng()" and hence may block waiting for the system RNG if you run it before the system RNG is ready
It's hard to guess if os.urandom() is used in a third-party library. Maybe it's not. What if a new library version starts to use os.urandom()? Should you start to call secrets.wait_for_system_rng()? To be safe, I expect that *all* applications would have to start with secrets.wait_for_system_rng()... It doesn't make sense to have to put such code in *all* applications.

The main advantage of PEP 522 is to control how the "system urandom not initialized yet" case is handled. But you are more and more saying that secrets.wait_for_system_rng() should be used to avoid BlockingIOError in most cases. Am I wrong? I expect that some libraries will start to use secrets.wait_for_system_rng() in their own code. ... In the end, it looks like you have basically reimplemented a blocking os.urandom(), no?

--

Why do we have to bother *all* users with secrets.wait_for_system_rng(), while only very few will really care about the exceptional case? Why not add something for users who want to handle the exceptional case, but make os.urandom() blocking?

Sorry, I'm repeating myself, but as I wrote, I don't know yet what the best option is, so I'm "testing" each option.

Victor

On 24 June 2016 at 16:21, Victor Stinner <victor.stinner@gmail.com> wrote:
2016-06-24 22:05 GMT+02:00 Nick Coghlan <ncoghlan@gmail.com>:
As such, the idioms I currently have in PEP 522 are wrong - the "wait for the system RNG or not" decision wouldn't be one to be made on a per-call basis, but rather on a per-__main__ execution basis, with developers choosing which user experience they want to support on systems with a non-blocking /dev/urandom:
* this application will fail if you run it before the system RNG is ready (so you may need to add "ExecStartPre=python3 -c 'import secrets; secrets.wait_for_system_rng()'" in your systemd unit file)
In short, if an application is not run using systemd but directly on the command line, it *can* fail with a fatal BlockingIOError?
From the command line, the answer is equally simple: just run "python3 -c 'import secrets; secrets.wait_for_system_rng()'" before the command you actually care about.
As an added bonus, that will work even if the command you care about isn't written in Python 3, and even if it reads from /dev/urandom rather than using the new syscall.
Wait, I don't think that it is an acceptable behaviour from the user point of view.
Compared to Python 2.7, Python 3.4 and Python 3.5.2 where os.urandom() never blocks nor raises an exception on Linux, such behaviour change can be seen as a major regression.
The *only* way to get it to block (your PEP) or raise an exception (PEP 522) is to call os.urandom() (directly or indirectly) when the kernel RNG isn't ready. I consider the relevant analogy to be PEP 476, where we turned the silent security failure of accepting an invalid or untrusted certificate (or one that didn't cover the named host) into the noisy error of failing to make the connection.
* this application implicitly calls "secrets.wait_for_system_rng()" and hence may block waiting for the system RNG if you run it before the system RNG is ready
It's hard to guess if os.urandom() is used in a third-party library. Maybe it's not. What if a new library version starts to use os.urandom()? Should you start to call secrets.wait_for_system_rng()?
To be safe, I expect that *all* applications should start with secrets.wait_for_system_rng()... It doesn't make sense to have to put such code in *all* applications.
Application developers porting to Python 3.6 can wait and see what their own testing reports and what their users report - they don't need to guess.
The main advantage of PEP 522 is to control how the "system urandom not initialized yet" case is handled. But you are more and more saying that secrets.wait_for_system_rng() should be used to avoid BlockingIOError in most cases. Am I wrong?
I'm saying I think it's an application level decision, not a library level decision.
I expect that some libraries will start to use secrets.wait_for_system_rng() in their own code.
... At the end, it looks you basically reimplemented a blocking os.urandom(), no?
Potentially, but one of the important aspects of PEP 522 is that we're not imposing that outcome by fiat - we're letting developers choose the behaviour they want on a case by case basis, and seeing what the emergent consensus on correct behaviour turns out to be.

It's equally possible that the outcome will be that both Python and Linux developers conclude that this is an operating system integration issue, so systemd ends up adding a standard "kernelrng" target that components can wait for, and that then gets included as a requirement for reaching the single-user state on most distros.

If we *do* reach a point where "always call secrets.wait_for_system_rng() before using secrets, random.SystemRandom or os.urandom" is the idiomatic advice for Pythonistas, *then* we can make os.urandom() blocking, and secrets.wait_for_system_rng() would reduce to:

    def wait_for_system_rng():
        os.urandom(1)
--
Why do we have to bother *all* users with secrets.wait_for_system_rng(), while only very few will really care about the exceptional case?
We don't - only the ones that actually get the exception, since they're necessarily the ones the problem is relevant to.

Runtime system configuration related exceptions aren't something to be avoided at all costs - if they were, we'd never have made the changes we did to the way Unicode handling works. A good example of this at the library level is Armin Ronacher's click command line helper - when you run that in the C locale under Python 3, it just fails immediately, since the actual problem is that something has gone wrong and your system locale isn't configured properly. The right answer is almost always to fix the locale configuration settings, not to change anything in the Python code.
Why not add something for users who want to handle the exceptional case, but make os.urandom() blocking?
The main problem I have with the blocking solution is that if someone hits it unexpectedly, they're left staring at a blinking cursor (at best), with no helpful hints to get started on debugging the problem. If it's a component they didn't write, they also can't really give a good bug report beyond "It hangs when I try to run it".

By contrast, PEP 522 gives them an immediate exception and error message: "BlockingIOError: system random number generator is not ready". If they're a developer themselves, they can plug that into Google and hopefully find a relevant answer (which we can virtually guarantee by preseeding Stack Overflow with a suitable response). If they're *not* the application developer, they can paste the traceback into a bug report or support ticket and say "Hey, what's going on here?". At which point, the developer or support tech handling the ticket can do the appropriate Google search and respond accordingly.

Now, we could gain most of those debuggability benefits for a blocking solution by trying in non-blocking mode first, then falling back to blocking only if we get EAGAIN - that would let us print a Google-friendly warning message before we implicitly block.

That's where the argument of adopting a consistent approach of "try non-blocking first, then maybe fall back to something else if it doesn't work" comes into play - if os.urandom() (and hence indirectly the secrets module) is trying in non-blocking mode and falling back to an alternative, *and* SipHash initialisation is doing that, *and* importing the random module is doing that, it sends a strong message to me that the base primitive here is actually "try to read the system RNG, and maybe fail to do so", rather than "read the system RNG and only return when the requested data is available".

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
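The warn-then-block compromise Nick mentions could be sketched as follows (the function name is illustrative, and the BlockingIOError path assumes PEP 522's proposed non-blocking os.urandom() semantics): try non-blocking first, and if the system RNG isn't ready, emit a searchable warning before implicitly waiting.

```python
import os
import sys
import time

def urandom_warn_then_block(nbytes):
    """Try the system RNG in non-blocking mode; if it isn't ready,
    print a Google-friendly warning, then wait for it."""
    try:
        return os.urandom(nbytes)  # the common, already-initialised case
    except BlockingIOError:
        # Reached only during early boot; the message gives users
        # something concrete to search for instead of a blinking cursor.
        print("warning: system random number generator is not ready; "
              "waiting for it to be initialised", file=sys.stderr)
        while True:
            try:
                return os.urandom(nbytes)
            except BlockingIOError:
                time.sleep(0.1)
```

This keeps the blocking user experience while preserving most of PEP 522's debuggability benefit: the hang is no longer silent.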

2016-06-24 2:33 GMT+02:00 Nick Coghlan <ncoghlan@gmail.com>:
3. Switch to reading ``/dev/urandom`` directly (non-security sensitive)
It is what I propose for the random.Random constructor when the random module is imported.
Again, the question is if there is a real use case for it. And if yes, is the use case common enough to justify the change?
The extreme case is that all applications using os.urandom() would need to be modified to add a try/except BlockingIOError. I only exaggerate to try to understand the impact of your PEP. I expect that only a few applications will use such a try/except in practice.
That's where the idea of also adding secrets.wait_for_system_rng() comes in, rather than having to wrap every library call in a try/except block (or risk having those APIs become blocking ones, such that async developers feel obliged to call them in a separate thread).
I expect that secrets.wait_for_system_rng() will be implemented as consuming at least 1 byte of entropy, to check if urandom is initialized, right?

I'm not a big fan of this API split: an os.urandom() that never blocks plus a secrets.wait_for_system_rng() helper. If you say that some users need to call secrets.wait_for_system_rng() first, then for me there is a use case for a blocking urandom, so I would expect a blocking urandom function directly in the os module. By the way, it would avoid "wasting" 1 random byte of entropy.

Victor
participants (6)
- Cory Benfield
- Donald Stufft
- Ethan Furman
- Nick Coghlan
- Tim Peters
- Victor Stinner