PEP: Make os.urandom() blocking on Linux (version 2)

Hi,

I completed my PEP. Here is the second version. Changes:

* I added new sections:

  - The bug
  - Use Cases
  - Fix system urandom
  - Denial-of-service when reading random

* I added alternatives:

  - Leave os.urandom() unchanged, add os.getrandom()
  - Raise BlockingIOError in os.urandom()
  - Add an optional block parameter to os.urandom()

I added 3 sections to try to describe the context of "the bug". For example, I think that it's important to mention that all operating systems load entropy from disk at boot.

For me, the last tricky question is use case 2 (run a web server) on a VM or embedded device when system urandom is not initialized yet and there is no entropy on disk yet (ex: first boot, or maybe second boot, of a VM). I read quickly that a VM connected to a network should be able to quickly initialize the system urandom. So I'm not sure that use case 2 (web server) is really an issue in practice.

Victor

HTML version:
https://haypo-notes.readthedocs.io/pep_random.html

++++++++++++++++++++++++++++++++++++++++
PEP: Make os.urandom() blocking on Linux
++++++++++++++++++++++++++++++++++++++++

Headers::

    PEP: xxx
    Title: Make os.urandom() blocking on Linux
    Version: $Revision$
    Last-Modified: $Date$
    Author: Victor Stinner <victor.stinner@gmail.com>
    Status: Draft
    Type: Standards Track
    Content-Type: text/x-rst
    Created: 20-June-2016
    Python-Version: 3.6

Abstract
========

Modify ``os.urandom()`` to block on Linux 3.17 and newer until the OS urandom is initialized.

The bug
=======

Python 3.5.0 was enhanced to use the new ``getrandom()`` syscall introduced in Linux 3.17 and Solaris 11.3. The problem is that users started to complain that Python 3.5 blocks at startup on Linux in virtual machines and embedded devices: see issues `#25420 <http://bugs.python.org/issue25420>`_ and `#26839 <http://bugs.python.org/issue26839>`_.

On Linux, ``getrandom(0)`` blocks until the kernel has initialized urandom with 128 bits of entropy.
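The blocking condition can be probed from Python. This is only a rough sketch: it assumes ``os.getrandom()`` and ``os.GRND_NONBLOCK``, which exist in Python 3.6 and newer on Linux but not in Python 3.5, and the helper name ``urandom_is_initialized()`` is hypothetical.

```python
import os


def urandom_is_initialized():
    """Return True or False if the kernel can tell us, None otherwise."""
    # Assumption: os.getrandom() is available (Python 3.6+, Linux 3.17+).
    if not hasattr(os, "getrandom"):
        return None
    try:
        # With GRND_NONBLOCK, getrandom() fails with EAGAIN instead of
        # blocking while the urandom pool is not initialized yet.
        os.getrandom(1, os.GRND_NONBLOCK)
    except BlockingIOError:
        return False
    except OSError:
        # For example ENOSYS on a kernel older than 3.17.
        return None
    return True


print(urandom_is_initialized())
```

On a system that has been running for a while, this is expected to report an initialized pool; the ``False`` case only shows up very early at boot, which is the situation described in the issues above.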
The issue #25420 describes a Linux build platform blocking at ``import random``. The issue #26839 describes a short Python script used to compute an MD5 hash, systemd-cron, a script called very early in the init process. The system initialization blocks on this script, which blocks on ``getrandom(0)`` to initialize Python.

The Python initialization requires random bytes to implement a counter-measure against the hash denial-of-service (hash DoS), see:

* `Issue #13703: Hash collision security issue <http://bugs.python.org/issue13703>`_
* `PEP 456: Secure and interchangeable hash algorithm <https://www.python.org/dev/peps/pep-0456/>`_

Importing the ``random`` module creates an instance of ``random.Random``: ``random._inst``. On Python 3.5, the ``random.Random`` constructor reads 2500 bytes from ``os.urandom()`` to seed a Mersenne Twister RNG (random number generator).

Other platforms may be affected by this bug, but in practice, only Linux systems use Python scripts to initialize the system.

Use Cases
=========

The following use cases are used to help choose the right compromise between security and practicability.

Use Case 1: init script
-----------------------

Use a Python 3 script to initialize the system, like systemd-cron. If the script blocks, the system initialization is stuck too. The issue #26839 is a good example of this use case.

Use Case 2: web server
----------------------

Run a Python 3 web server serving web pages using HTTP and HTTPS protocols. The server is started as soon as possible.

The first target of the hash DoS attack was web servers: it's important that the hash secret cannot be easily guessed by an attacker.

If serving a web page needs a secret to create a cookie, create an encryption key, ..., the secret must be created with good entropy: again, it must be hard to guess the secret.

A web server requires security. If a choice must be made between security and running the server with weak entropy, security is more important.
If there is no good entropy: the server must block or fail with an error.

The question is whether it makes sense to start a web server on a host before system urandom is initialized.

The issues #25420 and #26839 are restricted to the Python startup, not to generating a secret before the system urandom is initialized.

Fix system urandom
==================

Load entropy from disk at boot
-------------------------------

Collecting entropy can take several minutes. To accelerate the system initialization, operating systems store entropy on disk at shutdown, and then reload entropy from disk at boot.

If a system collects enough entropy at least once, the system urandom will be initialized quickly, as soon as the entropy is reloaded from disk.

Virtual machines
----------------

Virtual machines don't have direct access to the hardware and so have fewer sources of entropy than bare metal. A solution is to add a `virtio-rng device <https://fedoraproject.org/wiki/Features/Virtio_RNG>`_ to pass entropy from the host to the virtual machine.

Embedded devices
----------------

A solution for embedded devices is to plug in a hardware RNG.

For example, the Raspberry Pi has a hardware RNG but it's not used by default. See: `Hardware RNG on Raspberry Pi <http://fios.sector16.net/hardware-rng-on-raspberry-pi/>`_.

Denial-of-service when reading random
=====================================

The ``/dev/random`` device should only be used for very specific use cases. Reading from ``/dev/random`` on Linux is likely to block. Users don't like when an application blocks longer than 5 seconds to generate a secret. It is only expected for specific cases like explicitly generating an encryption key.

When the system has no available entropy, choosing between blocking until entropy is available or falling back on lower quality entropy is a matter of compromise between security and practicability. The choice depends on the use case.
On Linux, ``/dev/urandom`` is secure, it should be used instead of ``/dev/random``:

* `Myths about /dev/urandom <http://www.2uo.de/myths-about-urandom/>`_ by Thomas Hühn: "Fact: /dev/urandom is the preferred source of cryptographic randomness on UNIX-like systems"

Rationale
=========

On Linux, reading ``/dev/urandom`` can return "weak" entropy before urandom is fully initialized, that is, before the kernel has collected 128 bits of entropy. Linux 3.17 adds a new ``getrandom()`` syscall which allows blocking until urandom is initialized.

On Python 3.5.2, ``os.urandom()`` uses ``getrandom(GRND_NONBLOCK)``, but falls back on reading the non-blocking ``/dev/urandom`` if ``getrandom(GRND_NONBLOCK)`` fails with ``EAGAIN``.

Security experts promote ``os.urandom()`` to generate cryptographic keys. By the way, ``os.urandom()`` is preferred over ``ssl.RAND_bytes()`` for different reasons.

This PEP proposes to modify ``os.urandom()`` to use ``getrandom()`` in blocking mode to not return weak entropy, but also ensure that Python will not block at startup.

Changes
=======

All changes described in this section are specific to the Linux platform.

* Initialize hash secret from non-blocking system urandom
* Initialize ``random._inst`` with non-blocking system urandom
* Modify os.urandom() to block (until system urandom is initialized)

A new ``_PyOS_URandom_Nonblocking()`` private function is added: it tries to call ``getrandom(GRND_NONBLOCK)``, but falls back on reading ``/dev/urandom`` if it fails with ``EAGAIN``.

``_PyRandom_Init()`` is modified to call ``_PyOS_URandom_Nonblocking()``.

Moreover, a new ``random_inst_seed`` field is added to the ``_Py_HashSecret_t`` structure.

``random._inst`` (an instance of ``random.Random``) is initialized with the new ``random_inst_seed`` secret. A ("fuse") flag is used to ensure that this secret is only used once. If a second instance of random.Random is created, blocking ``os.urandom()`` is used.
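The fallback logic of the non-blocking helper described above can be sketched in Python. This is not the C implementation: the function name ``urandom_nonblocking()`` is hypothetical, ``os.getrandom()`` is assumed to be available (Python 3.6+ on Linux), and the sketch also falls back when the syscall is missing, which goes slightly beyond the ``EAGAIN`` case named in the text.

```python
import os


def urandom_nonblocking(n):
    """Return n random bytes without ever blocking (Linux-only sketch)."""
    if hasattr(os, "getrandom"):
        try:
            data = b""
            while len(data) < n:
                # getrandom() may return fewer bytes than requested.
                data += os.getrandom(n - len(data), os.GRND_NONBLOCK)
            return data
        except OSError:
            # BlockingIOError (EAGAIN): urandom is not initialized yet.
            # Other errors such as ENOSYS: the syscall is unavailable.
            pass
    # Reading /dev/urandom does not block on Linux, but the bytes may be
    # "weak" if the kernel has not collected enough entropy yet.
    with open("/dev/urandom", "rb") as f:
        return f.read(n)


print(len(urandom_nonblocking(16)))
```

Such a helper is only appropriate for the hash secret and the ``random._inst`` seed; anything that generates a real secret should go through the blocking path.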
``os.urandom()`` (C function ``_PyOS_URandom()``) is modified to always call ``getrandom(0)`` (blocking mode).

Alternative
===========

Never use blocking urandom in the random module
-----------------------------------------------

The random module can use ``random_inst_seed`` as a seed, but add other sources of entropy like the process identifier (``os.getpid()``), the current time (``time.time()``), memory addresses, etc.

Reading 2500 bytes from os.urandom() to initialize the Mersenne Twister RNG in random.Random is a deliberate choice to get access to the full range of the RNG. This PEP is a compromise between "security" and "feature". Python should not block at startup before the OS has collected enough entropy. But in the regular use case (system urandom initialized), the random module should continue to use its code to initialize the seed.

Python 3.5.0 was blocked on ``import random``, not on building a second instance of ``random.Random``.

Leave os.urandom() unchanged, add os.getrandom()
------------------------------------------------

os.urandom() remains unchanged: never block, but it can return weak entropy if system urandom is not initialized yet.

A new ``os.getrandom()`` function is added: thin wrapper to the ``getrandom()`` syscall.

Expected usage to write portable code::

    import os

    def my_random(n):
        # Prefer the blocking getrandom() wrapper when it is available.
        if hasattr(os, 'getrandom'):
            return os.getrandom(n, 0)
        return os.urandom(n)

The problem with this change is that it expects users to understand security well and to know each platform well. Python has the tradition of hiding "implementation details". For example, ``os.urandom()`` is not a thin wrapper to the ``/dev/urandom`` device: it uses ``CryptGenRandom()`` on Windows, it uses ``getentropy()`` on OpenBSD, it tries ``getrandom()`` on Linux and Solaris or falls back on reading ``/dev/urandom``. Python already uses the best available system RNG depending on the platform.
This PEP does not change the API, which has not changed since the creation of Python:

* ``os.urandom()``, ``random.SystemRandom`` and ``secrets`` for security
* ``random`` module (except ``random.SystemRandom``) for all other usages

Raise BlockingIOError in os.urandom()
-------------------------------------

This idea was proposed as a compromise to let developers decide themselves how to handle the case:

* catch the exception and use another weaker entropy source: read ``/dev/urandom`` on Linux, the Python ``random`` module (which is not secure at all), time, process identifier, etc.
* don't catch the error, the whole program fails with this fatal exception

First of all, no user complained yet that ``os.urandom()`` blocks. This point is currently theoretical. The Python issues #25420 and #26839 were restricted to the Python startup: users complained that Python was blocked at startup.

Even if reading /dev/urandom blocks on OpenBSD, FreeBSD, Mac OS X, etc. until urandom is initialized, no user complained yet because Python is not used in the process initializing the system and /dev/urandom is quickly initialized. It looks like only Linux users hit the problem on virtual machines or embedded devices, and only in some short Python scripts used to initialize the system. Again, ``os.urandom()`` is not used in such scripts (at least, not yet).

As with `Leave os.urandom() unchanged, add os.getrandom()`_, the problem is that it makes the API more complex and so more error-prone.

Add an optional block parameter to os.urandom()
-----------------------------------------------

Add an optional block parameter to os.urandom(). The default value may be ``True`` (block by default) or ``False`` (non-blocking).

The first technical issue is to implement ``os.urandom(block=False)`` on all platforms. Linux 3.17 and newer has a well-defined non-blocking API.

See the `issue #27250: Add os.urandom_block() <http://bugs.python.org/issue27250>`_.
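To make the portability issue concrete, here is a minimal sketch of what such a parameter could look like in pure Python. The function ``urandom_with_block()`` is hypothetical and is not the API proposed in issue #27250; it assumes ``os.getrandom()`` (Python 3.6+ on Linux) and simply refuses ``block=False`` where no non-blocking API is known.

```python
import os


def urandom_with_block(n, block=True):
    """Hypothetical os.urandom() variant with a block parameter."""
    if not hasattr(os, "getrandom"):
        if block:
            return os.urandom(n)
        # Assumption: no well-defined non-blocking API to rely on here.
        raise NotImplementedError("block=False is not supported here")
    flags = 0 if block else os.GRND_NONBLOCK
    data = b""
    while len(data) < n:
        # Raises BlockingIOError when block=False and the system
        # urandom is not initialized yet.
        data += os.getrandom(n - len(data), flags)
    return data


print(len(urandom_with_block(8)))
```

A caller of the non-blocking mode has to handle both ``BlockingIOError`` and the unsupported-platform case, which illustrates how the extra parameter pushes platform knowledge onto users.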
As with `Raise BlockingIOError in os.urandom()`_, it doesn't seem worth it to make the API more complex for a theoretical (or at least very rare) use case.

As with `Leave os.urandom() unchanged, add os.getrandom()`_, the problem is that it makes the API more complex and so more error-prone.

Annexes
=======

Operating system random functions
---------------------------------

``os.urandom()`` uses the following functions:

* OpenBSD: `getentropy() <http://man.openbsd.org/OpenBSD-current/man2/getentropy.2>`_ (OpenBSD 5.6)
* Linux: `getrandom() <http://man7.org/linux/man-pages/man2/getrandom.2.html>`_ (Linux 3.17) -- see also `A system call for random numbers: getrandom() <https://lwn.net/Articles/606141/>`_
* Solaris: `getentropy() <https://docs.oracle.com/cd/E53394_01/html/E54765/getentropy-2.html#scrolltoc>`_, `getrandom() <https://docs.oracle.com/cd/E53394_01/html/E54765/getrandom-2.html>`_ (both need Solaris 11.3)
* Windows: `CryptGenRandom() <https://msdn.microsoft.com/en-us/library/windows/desktop/aa379942%28v=vs.85%29.aspx>`_ (Windows XP)
* UNIX, BSD: /dev/urandom, /dev/random
* OpenBSD: /dev/srandom

On Linux, commands to get the status of ``/dev/random`` (results are number of bits)::

    $ cat /proc/sys/kernel/random/entropy_avail
    2850
    $ cat /proc/sys/kernel/random/poolsize
    4096

Why use os.urandom()?
---------------------

Since ``os.urandom()`` is implemented in the kernel, it doesn't have some issues of user-space RNGs. For example, it is much harder to get its state. It is usually built on a CSPRNG, so even if its state is obtained, it is hard to compute previously generated numbers.

The kernel has a good knowledge of entropy sources and regularly feeds the entropy pool.

Links
=====

* `Cryptographically secure pseudo-random number generator (CSPRNG) <https://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator>`_

Copyright
=========

This document has been placed in the public domain.

2016-06-23 23:27 GMT+02:00 Victor Stinner <victor.stinner@gmail.com>:
Use Case 1: init script
-----------------------
Use a Python 3 script to initialize the system, like systemd-cron. If the script blocks, the system initialization is stuck too.
The issue #26839 is a good example of this use case.
For me, such a script must not require a secure secret. An application which needs to generate a secure secret must run later, when the system is fully initialized.

What do you think?
Use Case 2: web server
----------------------
Run a Python 3 web server serving web pages using HTTP and HTTPS protocols. The server is started as soon as possible.
The first target of the hash DoS attack was web servers: it's important that the hash secret cannot be easily guessed by an attacker.
Maybe I should elaborate on this point to explain that the specific case of the hash secret is more on the practicability side than on the security side. *IMO* reading the non-blocking /dev/urandom is enough for the hash secret. From what I read, even if the system urandom is not considered as initialized, urandom is able to generate "good enough" entropy. So the hash secret is not easily predictable.

Maybe I should read Ted Ts'o's emails to elaborate on this point ;-)
Embedded devices
----------------
A solution for embedded devices is to plug in a hardware RNG.
Honestly, I'm not fully convinced by my own solution :-) I'm not sure that all embedded devices are "extensible".

Victor

On 23 June 2016 at 14:27, Victor Stinner <victor.stinner@gmail.com> wrote:
Raise BlockingIOError in os.urandom()
-------------------------------------
This idea was proposed as a compromise to let developers decide themselves how to handle the case:
* catch the exception and use another weaker entropy source: read ``/dev/urandom`` on Linux, the Python ``random`` module (which is not secure at all), time, process identifier, etc.
* don't catch the error, the whole program fails with this fatal exception
First of all, no user complained yet that ``os.urandom()`` blocks. This point is currently theoretical. The Python issues #25420 and #26839 were restricted to the Python startup: users complained that Python was blocked at startup.
Even if reading /dev/urandom blocks on OpenBSD, FreeBSD, Mac OS X, etc. until urandom is initialized, no user complained yet because Python is not used in the process initializing the system and /dev/urandom is quickly initialized. It looks like only Linux users hit the problem on virtual machines or embedded devices, and only in some short Python scripts used to initialize the system. Again, ``os.urandom()`` is not used in such scripts (at least, not yet).
As with `Leave os.urandom() unchanged, add os.getrandom()`_, the problem is that it makes the API more complex and so more error-prone.
I have to admit, this is a pretty solid argument, especially if you supplement it with Donald's point that affected scripts and applications will likely split into "doesn't even notice that implicit delay" and "hangs the world after switching to Python 3.6, but the developer/integrator sees 'calling os.urandom() may hang the world on Linux system boot' in the Python 3.6 porting notes".

I'll still keep iterating on PEP 522, but I'm to the point of being +0 on this approach if Guido decides he prefers it :)

Cheers,
Nick.

P.S. DevNation/Red Hat Summit are on next week, so I'll try to get one more version of PEP 522 done before I leave, but will likely be busy for most of that time.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Jun 23, 2016, at 11:27 PM, Victor Stinner wrote:
Alternative
===========
Leave os.urandom() unchanged, add os.getrandom()
------------------------------------------------
os.urandom() remains unchanged: never block, but it can return weak entropy if system urandom is not initialized yet.
A new ``os.getrandom()`` function is added: thin wrapper to the ``getrandom()`` syscall.
Expected usage to write portable code::

    def my_random(n):
        if hasattr(os, 'getrandom'):
            return os.getrandom(n, 0)
        return os.urandom(n)

I would actually expect that this would be handled in the secrets module, so the recommendation would be that most users wouldn't use os.urandom() or os.getrandom() unless they specifically wanted the low-level functions and knew what they were doing. Thus, "expected usage to write portable code" would be to use secrets.token_bytes().

Other than that, thanks for adding this alternative.

Cheers,
-Barry
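For readers following the thread, that recommendation would look roughly like this. It assumes the ``secrets`` module (new in Python 3.6), and ``new_session_key()`` is only a hypothetical name for application code.

```python
import secrets


def new_session_key(nbytes=32):
    """Return nbytes of random data suitable for use as a secret."""
    # secrets.token_bytes() picks the best system source itself, so the
    # caller does not need to choose between os.urandom() and
    # os.getrandom().
    return secrets.token_bytes(nbytes)


key = new_session_key()
print(len(key))
```

The application code then carries no platform checks at all, which is the point being made about portable usage.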

2016-06-24 15:38 GMT+02:00 Barry Warsaw <barry@python.org>:
Expected usage to write portable code::

    def my_random(n):
        if hasattr(os, 'getrandom'):
            return os.getrandom(n, 0)
        return os.urandom(n)

I would actually expect that this would be handled in the secrets module, so the recommendation would be that most users wouldn't use os.urandom() or os.getrandom() unless they specifically wanted the low-level functions and knew what they were doing. Thus, "expected usage to write portable code" would be to use secrets.token_bytes().
Oh ok. I will update this section.

Victor