Mailman 3 Take a decision for os.urandom() in Python 3.6 - Security-SIG

Take a decision for os.urandom() in Python 3.6

Victor Stinner

5 Aug 2016

7:24 p.m.

Hi,

Would it be possible to take a decision on the PEP 522 and PEP 524? Deadline for new features in Python 3.6 is in one month or something like that, no?

My PEP: https://www.python.org/dev/peps/pep-0524/ "PEP 524 -- Make os.urandom() blocking on Linux"

Nick's PEP: https://www.python.org/dev/peps/pep-0522/ "PEP 522 -- Allow BlockingIOError in security sensitive APIs"

Victor


Ethan Furman

5 Aug

7:42 p.m.

On 08/05/2016 12:24 PM, Victor Stinner wrote:

...

Would it be possible to take a decision on the PEP 522 and PEP 524? Deadline for new features in Python 3.6 is in one month or something like that, no?

My PEP: https://www.python.org/dev/peps/pep-0524/ "PEP 524 -- Make os.urandom() blocking on Linux"

Nick's PEP: https://www.python.org/dev/peps/pep-0522/ "PEP 522 -- Allow BlockingIOError in security sensitive APIs"

Can someone write a brief summary of the differences between the two? -- ~Ethan~

Victor Stinner

11:13 p.m.

On 5 Aug 2016 9:42 PM, "Ethan Furman" <ethan@stoneleaf.us> wrote:

...

Can someone write a brief summary of the differences between the two?

Oh, it's hard to summarize. Let me try. As you may expect, my summary is far from being fair :-D

--

The two PEPs propose very different behaviour when os.urandom() would block: raise an exception (522) or wait (524).

PEP 522 forces developers to explicitly handle a rare case (when urandom blocks).

PEP 524 proposes to be optimistic and hope that if urandom hangs, it doesn't hang for too long.

The corner case of the corner case is when urandom blocks for far too long (longer than 60 seconds, or simply forever).

PEP 524 doesn't handle it (it blocks). PEP 522 makes the exceptional corner case as important as the common case (urandom just blocks for a few seconds, or doesn't block at all).

--

Both PEPs want to make Python more secure: don't fall back on the "weak" /dev/urandom (define weak: XXX) in os.urandom() before the system urandom is initialized.

Most differences between the two PEPs only impact applications calling os.urandom() very early during system initialization (before system urandom initialization) on a system with a very slow entropy source or no entropy at all (VM, embedded device, ...).

PEP 522 proposes to raise an exception in such a case. It forces developers to modify their code to decide how to handle this corner case: wait a few seconds, switch to a weaker entropy source (and maybe log a warning/error), etc.

PEP 524 (mine) proposes to block. Applications don't need to be modified. The expectation is that the kernel will be able to get enough entropy fast enough. By the way, blocking on system urandom is nothing new; SSH has the same behaviour, for example (try SSH on such a VM with no entropy...).

--

PEP 524 also proposes to add a new function os.getrandom() for people who understand low-level stuff and security and want to improve their application in the low-entropy case. It makes it possible to reimplement PEP 522 on Linux in a few lines of pure Python, giving control when urandom would block (no black magic, just call os.getrandom() with the os.GRND_NOBLOCK flag, which raises BlockingIOError).

PEP 522 proposes a new function to wait for system urandom initialization. It is similar to PEP 524, but it requires modifying all applications to use it (to get the PEP 524 behaviour).

--

IMO PEP 524 has a lower impact on backward compatibility and is easier to implement.

The risk of the PEP 524 is that developers start to expect that os.urandom() will *never* block, which simply cannot be implemented on all platforms. Both PEPs are specific to Linux (even if Solaris will benefit from the same enhancement), but even just on Linux, os.urandom() can still block (and not raise the expected BlockingIOError) on Linux older than 3.17.

Victor

Guido van Rossum

6 Aug

2:39 a.m.

Thanks, Victor! I've (mostly) read both PEPs and your summary. It seems there are several risks that need to be weighed.

1. An important secret is generated using insufficient entropy.
2. An app blocks unnecessarily.
3. An app crashes unexpectedly.
4. Bad code gets cargo-culted (e.g. through StackOverflow).

Both PEPs prevent bad secrets (1), which the status quo could (in theory) lead to.

PEP 524 (make os.urandom() block) can cause blocking (2), but prevents crashes (3).

PEP 522 (make it raise) can cause crashes (3) but prevents blocking (2).

Re (4): With PEP 524, people worried about blocking may be driven to unnecessarily write more complicated code using os.getrandom(). With PEP 522, people worried about crashes may be driven to unnecessarily call secrets.wait_for_system_rng() or put try/except blocks catching raise BlockingIOError around all their os.urandom()-based calls.

There's some doubt that (2) or (3) will actually ever happen, because you'd have to be really early in the startup process. On several platforms it is known to be impossible (e.g. on Mac you can't run Python before the kernel has enough entropy), on other platforms there is no way to avoid the blocking (e.g. Windows), and the proposals really only differ on Linux (and a few other systems, like Solaris, that have getrandom()).

My own biggest worry is about (4), cargo-culting -- undoubtedly people will worry about (2) or (3) and they will pass on "robust" code that is unnecessarily complicated and risks being wrong (since the failure mode is *very* hard to reproduce for testing). The cargo-cult code shown in PEP 524 (https://www.python.org/dev/peps/pep-0524/#best-effort-rng) is much worse than that recommended by PEP 522 (secrets.wait_for_system_rng() -- though I could also imagine people wrapping os.urandom() in try/except).

But for me personally, it is much easier to stop worrying about a tiny chance of blocking (2) than it would be to believe that the chance of crashes (3) is truly so small that I don't have to do anything about it. Especially since on non-Linux platforms there is a chance of blocking anyway, with no way to prevent it. It's easy to stop worrying about something you can't control. (I don't really worry about that big asteroid that's going to hit the earth in the next millennium either. :-)

So I'm in favor of PEP 524. I think I would be even more in favor of it if it didn't add os.getrandom(), since then the whole possibility of cargo-culting unnecessary countermeasures would be pretty much gone (though for the die-hards there's always ctypes...). But I also agree with the idea of exposing the platform's primitive operations when they exist.

--Guido

On Fri, Aug 5, 2016 at 4:13 PM, Victor Stinner <victor.stinner@gmail.com> wrote:

...


-- --Guido van Rossum (python.org/~guido)

Victor Stinner

8:32 a.m.

On 6 Aug 2016 04:39, "Guido van Rossum" <guido@python.org> wrote:

...

4. Bad code gets cargo-culted (e.g. through StackOverflow).

...

Re (4): With PEP 524, people worried about blocking may be driven to unnecessarily write more complicated code using os.getrandom(). With PEP 522, people worried about crashes may be driven to unnecessarily call secrets.wait_for_system_rng() or put try/except blocks catching raise BlockingIOError around all their os.urandom()-based calls.

What can we do to reduce this issue? Promote the best recipes in the documentation of the random and/or secrets module? Add Nick's secrets.wait_for_system_rng()?

I have to confess that I don't like my own examples :-) I wrote them to show that you can reimplement the PEP 522 use cases and examples in a few lines.

The worst example is "try system urandom, if it would block, use the random module". IMO this use case is artificial. If you need security, the random module must not be used. If you don't need security, why would you take the risk of blocking your application (2) with os.urandom()? Just always use the random module, no?

Victor

Nick Coghlan

8:46 a.m.

On 6 August 2016 at 18:32, Victor Stinner <victor.stinner@gmail.com> wrote:

...

On 6 Aug 2016 04:39, "Guido van Rossum" <guido@python.org> wrote:

...
4. Bad code gets cargo-culted (e.g. through StackOverflow).

...
Re (4): With PEP 524, people worried about blocking may be driven to unnecessarily write more complicated code using os.getrandom(). With PEP 522, people worried about crashes may be driven to unnecessarily call secrets.wait_for_system_rng() or put try/except blocks catching raise BlockingIOError around all their os.urandom()-based calls.

What can we do to reduce this issue? Promote the best recipes in the documentation of the random and/or secrets module? Add Nick's secrets.wait_for_system_rng()?

At the moment, PEP 522 doesn't propose making the secrets API block implicitly. I was already starting to have doubts about that, and given Guido's feedback, I think I should change it so that it does.

That would give the following overall outcome:

- the random APIs will never block (but shouldn't be used for secrets)
- the secrets APIs will block if they need to (including secrets.wait_for_system_rng())
- os.urandom() may raise BlockingIOError if you don't wait for the system RNG first
- random.SystemRandom() may raise BlockingIOError if you don't wait for the system RNG first

And if in the latter two cases someone is directed to the secrets module to wait for the system RNG to be ready (e.g. in the error message we raise), they may find that secrets offers a higher level API for whatever they were trying to do anyway.

Meanwhile, folks that want to do something other than block if the system RNG isn't ready (like log potentially relevant details of the system encountering the lack of entropy) can just catch BlockingIOError, rather than needing to use platform specific APIs like os.getrandom().

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan

10:14 a.m.

On 6 August 2016 at 18:46, Nick Coghlan <ncoghlan@gmail.com> wrote:

...

On 6 August 2016 at 18:32, Victor Stinner <victor.stinner@gmail.com> wrote:

...
On 6 Aug 2016 04:39, "Guido van Rossum" <guido@python.org> wrote:

...
4. Bad code gets cargo-culted (e.g. through StackOverflow).

...
Re (4): With PEP 524, people worried about blocking may be driven to unnecessarily write more complicated code using os.getrandom(). With PEP 522, people worried about crashes may be driven to unnecessarily call secrets.wait_for_system_rng() or put try/except blocks catching raise BlockingIOError around all their os.urandom()-based calls.

What can we do to reduce this issue? Promote the best recipes in the documentation of the random and/or secrets module? Add Nick's secrets.wait_for_system_rng()?

At the moment, PEP 522 doesn't propose making the secrets API block implicitly. I was already starting to have doubts about that, and given Guido's feedback, I think I should change it so that it does.

OK, I've made this change now: https://github.com/python/peps/commit/5392cf9fb86d983b2f06694b742318000ad8bd...

It turned out to have the nice property of making secrets.token_bytes a blocking drop-in replacement for os.urandom, so I appended a "; see secrets.token_bytes()" to the proposed error message. This should make the "boilerplate" answer either using secrets.token_bytes unconditionally, or else a backwards compatibility dance to use it if available, and fall back to os.urandom otherwise.

I also tried to make it more explicit that application frameworks like Django that can make more assumptions about their use cases can easily prevent the BlockingIOError from ever coming up by calling secrets.wait_for_system_rng() when it's available.

Most of the other changes were clearing out references to things that have already been handled outside the PEP process (i.e. agreeing that os.getrandom() is useful to expose as a platform feature, agreeing that SipHash initialisation and random module initialisation shouldn't wait for the system RNG)

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
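The "backwards compatibility dance" mentioned above might be sketched as follows. The fallback function is an assumption for illustration; its default of 32 bytes mirrors the default used by secrets.token_bytes().

```python
import os

try:
    # Preferred: the secrets module, where it is available (Python 3.6+).
    from secrets import token_bytes
except ImportError:
    def token_bytes(nbytes=None):
        """Fallback with the same call shape, backed by os.urandom()."""
        if nbytes is None:
            nbytes = 32  # assumption: match the secrets module default
        return os.urandom(nbytes)
```

Code written against token_bytes() then runs unchanged on interpreters with and without the secrets module.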

Victor Stinner

10:23 a.m.

An alternative would be to add to my PEP 524 an *optional* random.SystemRandomNonblock which is basically the PEP 522 (raise if it would block). "Optional"... or maybe make it always available but block(!) on some platforms? (Bad idea IMO)

I dislike the idea of adding 2 new functions to generate random data in the same PEP (getrandom, SystemRandomNonblock); it's already hard enough to pick the right one in Python 3.5...

Victor

On 6 Aug 2016 10:46 AM, "Nick Coghlan" <ncoghlan@gmail.com> wrote:

...


Nick Coghlan

2:29 p.m.

On 6 August 2016 at 20:23, Victor Stinner <victor.stinner@gmail.com> wrote:

...

An alternative would be to add to my PEP 524 an *optional* random.SystemRandomNonblock which is basically the PEP 522 (raise if it would block). "Optional"... or maybe make it always available but block(!) on some platforms? (Bad idea IMO)

No, we don't want anything new added to the random module for this - outside backwards compatibility considerations, random.SystemRandom should be superseded entirely for security sensitive purposes by the module level APIs in the secrets module.

...

I dislike the idea of adding 2 new functions to generate random data in the same PEP (getrandom, SystemRandomNonblock); it's already hard enough to pick the right one in Python 3.5...

With the changes to PEP 522, secrets.token_bytes will be a cross-platform blocking API regardless of which underlying implementation model we choose - either inheriting that behaviour from os.urandom() (PEP 524), or adding it when encountering BlockingIOError (PEP 522).

That means the essential question becomes: Should os.urandom() just be secrets.token_bytes() without a default number of bytes requested? Or does it make more sense to use it to expose the non-blocking behaviour of the Linux getrandom() system call to Python code in a platform independent way?

Since we're going to have the two level API anyway (os module vs secrets), and have two different behaviours we'd like to expose (blocking vs non-blocking with notification), the latter design is the one I ended up converging on: high level API with implicit blocking, low level API that never blocks, but may throw an exception. It's not where I expected to end up when I first wrote the PEP, but that's the PEP process for you :)

Cheers, Nick.

...


-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Guido van Rossum

5:21 p.m.

I apologize, maybe I wasn't clear. I'm voting in favor of Victor's PEP 524, i.e. making os.urandom() always blocking, over introducing an exception so rare that it's impossible to provoke without mocking entirely.

We may be trying to steer people away from os.urandom(), but it's a venerable API that has been around and stable since Python 2.4. Telling people "oh, BTW, you must now catch BlockingIOError in your code that calls os.urandom()" is a problem for straddling code, since BlockingIOError doesn't even exist in Python 2 (though its base class, OSError, does).

Having to think about the consequences of os.urandom() blocking is necessary regardless of which PEP is implemented, since on some platforms it will block (though rarely). And the correct response is almost always "so let it block in those rare cases, it's just going to be a little hiccup". People working in an asyncio world may want to send secure random calls off to the thread pool using e.g. `loop.run_in_executor(None, os.urandom, 128)` -- but they have to think that way anyways because of other platforms, and they will have to do this for the recommended higher-level secure random APIs too.

I really see a much bigger downside to adding the possibility that os.urandom() raises BlockingIOError, compared to accepting the possibility that it may block (which is hardly news).

There is one thing that is still really unresolved for me, and that is a good understanding of how likely this feared event, "not having enough entropy" actually is, for environments where Python may actually be used. My main question is, can it occur in situations *other* than during very early startup? What's the answer for various platforms? Once I'm past this boot phase, can I safely assume os.urandom() will never block, or is there still a possibility for a system to run out of entropy later (say, by excessive calls to os.urandom(), possibly in another process)? The text of https://www.python.org/dev/peps/pep-0522/#adding-secrets-wait-for-system-rng suggests that that is *not* a possibility (since it recommends putting that call in __main__).

Anyways, if the answer ends up being "yes, some systems may occasionally run out of entropy during normal operation", I would count that as a further point against PEP 522.

But, assuming I am asked for a vote, my vote goes to Victor's PEP 524, making os.urandom() occasionally block even on Linux, and adding os.getrandom() on those platforms that have it.

--Guido

On Sat, Aug 6, 2016 at 7:29 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:

...


-- --Guido van Rossum (python.org/~guido)
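The thread-pool approach from the message above, `loop.run_in_executor(None, os.urandom, 128)`, can be sketched as a small coroutine. The name random_bytes is hypothetical, and asyncio.get_running_loop() and asyncio.run() assume Python 3.7 or newer.

```python
import asyncio
import os


async def random_bytes(size):
    """Fetch `size` bytes from os.urandom() without stalling the event loop."""
    loop = asyncio.get_running_loop()
    # None selects the loop's default thread pool; if os.urandom() blocks,
    # only the worker thread waits, not the other coroutines.
    return await loop.run_in_executor(None, os.urandom, size)


def main():
    # Run the coroutine to completion from synchronous code.
    return asyncio.run(random_bytes(128))
```

The same wrapper shape works for any higher-level call that may block on the system RNG.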

Donald Stufft

5:34 p.m.

...

On Aug 6, 2016, at 1:21 PM, Guido van Rossum <guido@python.org> wrote:

There is one thing that is still really unresolved for me, and that is a good understanding of how likely this feared event, "not having enough entropy" actually is, for environments where Python may actually be used. My main question is, can it occur in situations *other* than during very early startup? What's the answer for various platforms? Once I'm past this boot phase, can I safely assume os.urandom() will never block, or is there still a possibility for a system to run out of entropy later (say, by excessive calls to os.urandom(), possibly in another process)? The text of https://www.python.org/dev/peps/pep-0522/#adding-secrets-wait-for-system-rng suggests that that is *not* a possibility (since it recommends putting that call in __main__).

Anyways, if the answer ends up being "yes, some systems may occasionally run out of entropy during normal operation", I would count that as a further point against PEP 522.

For all of the major platforms that I can think of, once os.urandom is out of this early boot phase it will never block (or it will never block ever on that platform). This covers:

* Linux
* Windows
* OS X
* FreeBSD
* OpenBSD
* Solaris (I think)

I have no idea what more obscure platforms like AIX do, I suspect they’ll behave like older Linux though, where /dev/urandom will never block and might give bad data.

This means that once you’ve gotten any data from a urandom that could possibly block, it will never block. I could be wrong though.

Essentially, you’re waiting for the device to be fully initialized, and once it is initialized it is initialized, it will never revert to an “uninitialized” state.

This has a side effect that if someone wanted to say, ensure that os.urandom was non-blocking before binding to a port with an asyncio daemon they could simply call ``os.urandom(1)`` which will either return immediately if the urandom device is already initialized or block until it is initialized.
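A minimal sketch of that ``os.urandom(1)`` trick, using a plain socket rather than a full asyncio daemon. The function name and the default host and port are assumptions for illustration, and the wait only happens where os.urandom() blocks until the system RNG is initialized (the PEP 524 behaviour).

```python
import os
import socket


def bind_when_rng_ready(host="127.0.0.1", port=0):
    """Wait for the system RNG, then return a listening TCP socket."""
    # Returns immediately if the urandom device is already initialized;
    # otherwise blocks here, before the service accepts any connection.
    os.urandom(1)
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, port))  # port 0 lets the OS pick a free port
    sock.listen()
    return sock
```

Later calls to os.urandom() in request handlers should then return without waiting.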

...

But, assuming I am asked for a vote, my vote goes to Victor's PEP 524, making os.urandom() occasionally block even on Linux, and adding os.getrandom() on those platforms that have it.

I agree, though I’m neutral on os.getrandom. — Donald Stufft

Tim Peters

5:58 p.m.

[Guido]

...

... There is one thing that is still really unresolved for me, and that is a good understanding of how likely this feared event, "not having enough entropy" actually is, for environments where Python may actually be used. My main question is, can it occur in situations *other* than during very early startup? What's the answer for various platforms? Once I'm past this boot phase, can I safely assume os.urandom() will never block, or is there still a possibility for a system to run out of entropy later (say, by excessive calls to os.urandom(), possibly in another process)?

No such platforms have been identified in any of these messages, so "no" is the answer - for now ;-) Under the covers, all these things use _some_ crypto-strength but deterministic PRNG. So the only time they _may_ get in real trouble is at startup, waiting to initialize the CSPRNG's state from "enough" random noise (even then, sane environments save a file of gibberish before shutdown to use to seed the CSPRNG "immediately" at the next boot). The best systems periodically mix fresh "entropy" into the CSPRNG's state all along, but don't wait for it after initialization.

...

... But, assuming I am asked for a vote, my vote goes to Victor's PEP 524, making os.urandom() occasionally block even on Linux, and adding os.getrandom() on those platforms that have it.

+1 here :-)

Nick Coghlan

7 Aug

4:14 p.m.

On 7 August 2016 at 03:21, Guido van Rossum <guido@python.org> wrote:

...

There is one thing that is still really unresolved for me, and that is a good understanding of how likely this feared event, "not having enough entropy" actually is, for environments where Python may actually be used. My main question is, can it occur in situations *other* than during very early startup? What's the answer for various platforms? Once I'm past this boot phase, can I safely assume os.urandom() will never block, or is there still a possibility for a system to run out of entropy later (say, by excessive calls to os.urandom(), possibly in another process)? The text of https://www.python.org/dev/peps/pep-0522/#adding-secrets-wait-for-system-rng suggests that that is *not* a possibility (since it recommends putting that call in __main__).

Anyways, if the answer ends up being "yes, some systems may occasionally run out of entropy during normal operation", I would count that as a further point against PEP 522.

I see folks encountering the new exception proposed in one of two ways:

1. They're writing Linux system initialisation software, and forgot the system RNG may not be ready yet
2. They're running security sensitive Python software on a misconfigured hosting platform that isn't seeding the entropy pool correctly (either in a VM or on an embedded system)

For the first case, I think either approach to blocking (implicit or explicit) is fine.

However, the concern I have with PEP 524 is that in the second case, it makes it incredibly hard for an operations team (who probably aren't going to be Python experts, and are frequently going to be running software they didn't write) to debug the problem - rather than a crashed application with a full Python traceback (which they can take back to the dev team or vendor and ask "What does this mean?", or else look up on the internet themselves), all the platform operators will have to go on is "This application hangs at startup". strace should at least be able to tell them that it's hanging in a getrandom() kernel call, but it's still going to take a pretty capable sysadmin to be able to figure out what's going on.

In a lot of ways, I see it as being similar to our dependency on the Linux platform locale being set correctly to get boundary processing right: if you get an exception, the problem *isn't* generally with the application, it's with the way Linux has been configured. The same holds here - if you get BlockingIOError from os.urandom under PEP 524, there's nothing wrong with your application, but there *is* something wrong with your environment (since security sensitive Python code should only be run after the system RNG is ready)

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
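The trade-off between a silent hang and a traceback can be sketched in application code. This is a hedged illustration only, not something either PEP specifies: the helper name `getrandom_with_deadline` and its `timeout` and `poll_interval` parameters are invented here, and `os.getrandom()` is assumed to be available (Linux, Python 3.6+). It waits a bounded time for the system RNG and then fails with an exception an operator can search for.

```python
import os
import time


def getrandom_with_deadline(size, timeout=5.0, poll_interval=0.1):
    """Return `size` random bytes, waiting at most `timeout` seconds.

    Hypothetical helper: polls the kernel RNG in non-blocking mode and
    re-raises BlockingIOError once the deadline has passed.
    """
    if not hasattr(os, "getrandom"):
        # No getrandom() on this platform; use the regular API instead.
        return os.urandom(size)
    deadline = time.monotonic() + timeout
    while True:
        try:
            return os.getrandom(size, os.GRND_NONBLOCK)
        except BlockingIOError:
            if time.monotonic() >= deadline:
                raise  # surfaces as a traceback rather than a hang
            time.sleep(poll_interval)


token = getrandom_with_deadline(16)
print(len(token))
```

The deadline is a policy choice: a short wait covers the common "blocks for a moment" case, while the final exception covers the misconfigured-environment case described above.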

Nick Coghlan

4:17 p.m.

On 8 August 2016 at 02:14, Nick Coghlan <ncoghlan@gmail.com> wrote:

...

In a lot of ways, I see it as being similar to our dependency on the Linux platform locale being set correctly to get boundary processing right: if you get an exception, the problem *isn't* generally with the application, it's with the way Linux has been configured. The same holds here - if you get BlockingIOError from os.urandom under PEP 524, there's nothing wrong with your application, but there *is* something wrong with your environment (since security sensitive Python code should only be run after the system RNG is ready)

Oops, that reference should have been to PEP 522. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Ethan Furman

4:28 p.m.

On 08/07/2016 09:14 AM, Nick Coghlan wrote:

...

On 7 August 2016 at 03:21, Guido van Rossum wrote:

...
There is one thing that is still really unresolved for me, and that is a good understanding of how likely this feared event, "not having enough entropy" actually is, for environments where Python may actually be used. My main question is, can it occur in situations *other* than during very early startup? What's the answer for various platforms? Once I'm past this boot phase, can I safely assume os.urandom() will never block, or is there still a possibility for a system to run out of entropy later (say, by excessive calls to os.urandom(), possibly in another process)? The text of [PEP 522] suggests that that is *not* a possibility (since it recommends putting that call in __main__).

Anyways, if the answer ends up being "yes, some systems may occasionally run out of entropy during normal operation", I would count that as a further point against PEP 522.

I see folks encountering the new exception proposed in one of two ways:

1. They're writing Linux system initialisation software, and forgot the system RNG may not be ready yet
2. They're running security sensitive Python software on a misconfigured hosting platform that isn't seeding the entropy pool correctly (either in a VM or on an embedded system)

For the first case, I think either approach to blocking (implicit or explicit) is fine.

However, the concern I have with PEP 524 is that in the second case, it makes it incredibly hard for an operations team (who probably aren't going to be Python experts, and are frequently going to be running software they didn't write) to debug the problem - rather than a crashed application with a full Python traceback (which they can take back to the dev team or vendor and ask "What does this mean?", or else look up on the internet themselves), all the platform operators will have to go on is "This application hangs at startup". strace should at least be able to tell them that it's hanging in a getrandom() kernel call, but it's still going to take a pretty capable sysadmin to be able to figure out what's going on.

In a lot of ways, I see it as being similar to our dependency on the Linux platform locale being set correctly to get boundary processing right: if you get an exception, the problem *isn't* generally with the application, it's with the way Linux has been configured. The same holds here - if you get BlockingIOError from os.urandom under PEP [522], there's nothing wrong with your application, but there *is* something wrong with your environment (since security sensitive Python code should only be run after the system RNG is ready)

+1

If I had not been involved in these discussions about early Linux startup, virtual machines, and os.urandom, I would be completely mystified by the error presented when I ran across it (stalled and eventually killed process), with no clue about the nature of the problem.

At this point we have concrete examples of the harm caused by blocking on os.urandom -- do we have any actual use-cases where it is hurtful to raise instead?

-- ~Ethan~

Donald Stufft

4:33 p.m.

...

On Aug 7, 2016, at 12:28 PM, Ethan Furman <ethan@stoneleaf.us> wrote:

At this point we have concrete examples of the harm caused by blocking on os.urandom -- do we have any actual use-cases where it is hurtful to raise instead?

The use cases there are basically any time it would have only blocked for say, half a second or so. It’s hard to point out a specific use case because we’ve never had an error raising there, we’ve either just silently given them bad data because /dev/urandom wasn’t initialized or we blocked and they didn’t notice because it only blocked for a short time.

I suspect that the “can block for a short time” will be the dominant case, because the system generally gets entropy quite quickly in most scenarios. The only time it can’t really is if Python is the only thing running early enough in the boot process *and* that thing is calling os.urandom. The problem we had that started this thread was SipHash initialization calling a blocking urandom by a script called by systemd prior to the point where systemd would attempt to reseed urandom from previous boots and prior to the point that systemd parallelizes the boot process.

Basically any other time the time to block will be relatively short (and in fact, you see daemons like OpenSSH blocking on start up for similar reasons).

— Donald Stufft

Guido van Rossum

5:56 p.m.

Can we stop the discussion please? I have picked a winner. The loser may not like it, but the discussion is OVER. On Sun, Aug 7, 2016 at 9:33 AM, Donald Stufft <donald@stufft.io> wrote:

...

...
On Aug 7, 2016, at 12:28 PM, Ethan Furman <ethan@stoneleaf.us> wrote:

At this point we have concrete examples of the harm caused by blocking on os.urandom -- do we have any actual use-cases where it is hurtful to raise instead?

The use cases there are basically any time it would have only blocked for say, half a second or so. It’s hard to point out a specific use case because we’ve never had an error raising there, we’ve either just silently given them bad data because /dev/urandom wasn’t initialized or we blocked and they didn’t notice because it only blocked for a short time.

I suspect that the “can block for a short time” will be the dominant case, because the system generally gets entropy quite quickly in most scenarios. The only time it can’t really is if Python is the only thing running early enough in the boot process *and* that thing is calling os.urandom. The problem we had that started this thread was SipHash initialization calling a blocking urandom by a script called by systemd prior to the point where systemd would attempt to reseed urandom from previous boots and prior to the point that systemd parallelizes the boot process.

Basically any other time the time to block will be relatively short (and in fact, you see daemons like OpenSSH blocking on start up for similar reasons).

— Donald Stufft

-- --Guido van Rossum (python.org/~guido)

Ethan Furman

6:33 p.m.

On 08/07/2016 09:33 AM, Donald Stufft wrote:

...

On Aug 7, 2016, at 12:28 PM, Ethan Furman wrote:

...

Guido, not trying to change your mind, just trying to understand.

...

...
At this point we have concrete examples of the harm caused by blocking on os.urandom -- do we have any actual use-cases where it is hurtful to raise instead?

...

The problem we had that started this thread was SipHash initialization calling a blocking urandom by a script called by systemd prior to the point where systemd would attempt to reseed urandom from previous boots and prior to the point that systemd parallelizes the boot process.

So if we work around the problem in SipHash, the issue goes away? And does that work-around mean SipHash may not be robust for that instance of Python, but any Python process running that early should be short-lived anyway, so any security issues become vanishingly rare? -- ~Ethan~

Donald Stufft

6:45 p.m.

...

On Aug 7, 2016, at 2:33 PM, Ethan Furman <ethan@stoneleaf.us> wrote:

So if we work around the problem in SipHash, the issue goes away?

The issue goes away in the sense that starting the Python interpreter *at all* no longer relies on urandom being initialized. If someone uses Python early enough and calls os.urandom (directly or indirectly) then the same problem would occur again for that program. Working around the problem in SipHash simply moves the problem from anytime you try to use the Python interpreter early in the boot process, to anytime you ask for secure random from os.urandom early in the boot process.

...

And does that work-around mean SipHash may not be robust for that instance of Python, but any Python process running that early should be short-lived anyway, so any security issues become vanishingly rare?

This is correct. The security properties of SipHash basically only matter for something that accepts a lot of untrusted input *and* lives a long time. This ends up applying pretty much only to some sort of network-accessible daemon (not entirely, but it’s the main case).

It’s also true that the quality of random from urandom doesn’t go from something entirely predictable to entirely random at the exact moment the pool is fully initialized. The quality of the random numbers gets better the closer to pool initialization you are. This isn’t good enough for many use cases, but for SipHash it’s likely going to get reasonably OK random even in these early boot cases.

As a hypothetical, if we wanted to push the needle even further we could *not* work around the SipHash problem and push the need to work around it onto folks calling Python that early by setting a static PYTHONHASHSEED, but the cost is not likely worth the reward.

— Donald Stufft
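The effect of a static PYTHONHASHSEED mentioned above can be observed directly. The sketch below is an illustration under stated assumptions, not part of the original message: the helper name `child_hash` is invented, and the seed value 0 (which disables hash randomization) is just one possible fixed setting. It starts a fresh interpreter with the environment variable set and reads back a string hash.

```python
import os
import subprocess
import sys


def child_hash(text, seed):
    """Return hash(text) as computed by a new interpreter run with
    PYTHONHASHSEED set to `seed` (hypothetical helper for illustration)."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


first = child_hash("security-sig", 0)
second = child_hash("security-sig", 0)
print(first == second)
```

With a fixed seed the two runs agree, which is exactly the loss of unpredictability that makes this acceptable only for short-lived processes that do not handle large amounts of untrusted input.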

Victor Stinner

7:49 p.m.

I am sorry, but I'm on holiday and I'm unable to understand whether your (Guido) email means that PEP 524 is accepted, or whether the PEP still needs to be reworked. Can someone help me? I'm lost. :-( (Why is this specific topic so annoying? :-)) Victor

Guido van Rossum

11:11 p.m.

Sorry, PEP 524 is accepted, and PEP 522 is rejected. Let os.urandom() be blocking, and let os.getrandom() be added. Congrats, Victor! On Sun, Aug 7, 2016 at 12:49 PM, Victor Stinner <victor.stinner@gmail.com> wrote:

...

I am sorry, but I'm on holiday and I'm unable to understand whether your (Guido) email means that PEP 524 is accepted, or whether the PEP still needs to be reworked.

Can someone help me? I'm lost. :-(

(Why is this specific topic so annoying? :-))

Victor

-- --Guido van Rossum (python.org/~guido)

Victor Stinner

11:41 p.m.

2016-08-08 1:11 GMT+02:00 Guido van Rossum <guido@python.org>:

...

Sorry, PEP 524 is accepted, and PEP 522 is rejected. Let os.urandom() be blocking, and let os.getrandom() be added. Congrats, Victor!

Ok. I changed the status of my PEP 524 from Draft to Accepted. I will now start to work on the implementation. For Nick's PEP 522, I don't know if its status should be updated to Rejected or Superseded (by PEP 524). I prefer to let Nick change the status of his PEP ;-) Victor

Nick Coghlan

8 Aug

2:37 a.m.

On 8 August 2016 at 09:41, Victor Stinner <victor.stinner@gmail.com> wrote:

...

2016-08-08 1:11 GMT+02:00 Guido van Rossum <guido@python.org>:

...
Sorry, PEP 524 is accepted, and PEP 522 is rejected. Let os.urandom() be blocking, and let os.getrandom() be added. Congrats, Victor!

Ok. I changed the status of my PEP 524 from Draft to Accepted. I will now start to work on the implementation.

For Nick's PEP 522, I don't know if its status should be updated to Rejected or Superseded (by PEP 524). I prefer to let Nick change the status of his PEP ;-)

...

Rejected, but I'm still quite concerned by the lack of operator input into this discussion, particularly when we're going against what the Linux kernel developers themselves decided to do - Ted Ts'o flicked the equivalent switch for the Linux kernel (to make /dev/urandom blocking) and doing so caused some of the systems in their CI fleet to fail.

From a pure developer point of view, I completely understand Guido's perspective that blocking feels safer than risking throwing an exception, as well as wanting to be able to call the issue done and not worry about it anymore.

However, from an operations perspective, it means the discussion will move downstream to see whether we (Fedora) agree this is the right behaviour for the *system* Python, or whether we should patch that to throw the error instead of implicitly blocking. Such divergence would be unfortunate (if we ultimately decide to go that way), but managing disagreements with upstreams about appropriate default behaviour is one of the reasons distros *have* the ability to carry patches in the first place.

At the very least, I'll be proposing we do this while the 3.6 beta releases are in Fedora Rawhide as a way of gathering objective data about the scope of the problem from ABRT (Fedora's automatic bug reporting tool, which can automatically collect and submit Python stack traces, but can't readily detect system hangs).

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan

3:32 a.m.

On 8 August 2016 at 12:37, Nick Coghlan <ncoghlan@gmail.com> wrote:

...

However, from an operations perspective, it means the discussion will move downstream to see whether we (Fedora) agree this is the right behaviour for the *system* Python, or whether we should patch that to throw the error instead of implicitly blocking. Such divergence would be unfortunate (if we ultimately decide to go that way), but managing disagreements with upstreams about appropriate default behaviour is one of the reasons distros *have* the ability to carry patches in the first place.

At the very least, I'll be proposing we do this while the 3.6 beta releases are in Fedora Rawhide as a way of gathering objective data about the scope of the problem from ABRT (Fedora's automatic bug reporting tool, which can automatically collect and submit Python stack traces, but can't readily detect system hangs).

For folks curious as to what I mean here, this isn't a declaration that Fedora *is* going to diverge from upstream in terms of the way os.urandom() behaves in Python 3.6+. Rather, it's a statement that I think we need more data directly from Fedora's users before deciding whether or not it makes sense to abide by the cross-platform upstream behaviour, or carry a patch that changes the behaviour specifically for the system Python installation: https://lists.fedorahosted.org/archives/list/python-devel@lists.fedoraproject.org/thread/UAB7JJ5VPW2W2QEERZ4HIQZZB3QMB2H5/

While the interests of Linux distro users and CPython upstream users are generally pretty well aligned, the alignment isn't 100%, and distros carrying patches on behalf of their user base is what makes up the difference.

The Fedora Rawhide experiment I'm proposing in that email to the Fedora Python list should give us the data we (Fedora) need to decide whether or not this is one of those cases where it makes sense for us to carry a patch - if we get zero hits from the exception in ABRT, then it means the default blocking behaviour should be relatively safe (since people won't be encountering it), so we can drop the patch before the F26 Beta release, and Guido will have a solid data point backing up his design instincts.

If we *do* get hits on the exception, then exactly what we do will depend on the nature of those hits, and in particular whether or not the change is helping folks find misconfigured Fedora environments they hadn't previously noticed, or if they're spurious notifications in situations where just blocking for a few hundred milliseconds would have resolved the problem on its own (as tested by inserting a "python -c 'import os; os.getrandom(1)'" before whatever application startup is triggering the new exception).

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan

4:22 a.m.

On 8 August 2016 at 13:32, Nick Coghlan <ncoghlan@gmail.com> wrote:

...

The Fedora Rawhide experiment I'm proposing in that email to the Fedora Python list should give us the data we (Fedora) need to decide whether or not this is one of those cases where it makes sense for us to carry a patch - if we get zero hits from the exception in ABRT, then it means the default blocking behaviour should be relatively safe (since people won't be encountering it), so we can drop the patch before the F26 Beta release, and Guido will have a solid data point backing up his design instincts. If we *do* get hits on the exception, then exactly what we do will depend on the nature of those hits, and in particular whether or not the change is helping folks find misconfigured Fedora environments they hadn't previously noticed, or if they're spurious notifications in situations where just blocking for a few hundred milliseconds would have resolved the problem on its own (as tested by inserting a "python -c 'import os; os.getrandom(1)'" before whatever application startup is triggering the new exception).

I started thinking a bit more about the outcomes we'd be looking for from such an experiment [1], and that reminded me of the fact we could potentially do that with a much lower level of divergence if the upstream implementation issued a runtime warning when it needed to fall back to blocking behaviour.

That is, rather than just calling "getrandom(size, 0)" unconditionally, the current logic for trying "getrandom(size, GRND_NONBLOCK)" first could be kept, and only the fallback to reading from "/dev/random/" changed to instead call "getrandom(size, 0)" with a preceding call to PyErr_Warn.

That warning on its own would address almost all my lingering concerns with the implicit blocking approach, since Python's normal warning control machinery can be used to turn it into an exception if that's the desired behaviour in a given context, and the only policy decision distros would need to make for their system Python is whether they treat that warning as an error by default or not.

Regards, Nick.

[1] https://lists.fedorahosted.org/archives/list/python-devel@lists.fedoraprojec...

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
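The warn-then-block idea in this message was proposed for the C implementation. A rough Python-level sketch of the same control flow follows, under several assumptions that are not in the original: the function name `urandom_warn_then_block` is invented, `RuntimeWarning` is only a guess at a suitable category, and requests are assumed to be small (at most 256 bytes), so a single `os.getrandom()` call returns the full amount once the pool is initialized.

```python
import os
import warnings


def urandom_warn_then_block(size):
    """Try a non-blocking read first; warn before falling back to blocking.

    Hypothetical sketch of the proposal, not CPython's implementation.
    """
    if not hasattr(os, "getrandom"):
        # No getrandom() on this platform; defer to os.urandom().
        return os.urandom(size)
    try:
        return os.getrandom(size, os.GRND_NONBLOCK)
    except BlockingIOError:
        # Assumed category; the message does not name one.
        warnings.warn(
            "system RNG is not initialized yet; blocking until it is ready",
            RuntimeWarning,
            stacklevel=2,
        )
        return os.getrandom(size, 0)


data = urandom_warn_then_block(16)
print(len(data))
```

A deployment that prefers the exception could then opt in with the standard warnings machinery, for example `warnings.simplefilter("error", RuntimeWarning)` or the `-W error::RuntimeWarning` command-line option, which is the policy knob the message describes for distros.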

Nick Coghlan

4:24 a.m.

On 8 August 2016 at 14:22, Nick Coghlan <ncoghlan@gmail.com> wrote:

...

On 8 August 2016 at 13:32, Nick Coghlan <ncoghlan@gmail.com> wrote:

...
The Fedora Rawhide experiment I'm proposing in that email to the Fedora Python list should give us the data we (Fedora) need to decide whether or not this is one of those cases where it makes sense for us to carry a patch - if we get zero hits from the exception in ABRT, then it means the default blocking behaviour should be relatively safe (since people won't be encountering it), so we can drop the patch before the F26 Beta release, and Guido will have a solid data point backing up his design instincts. If we *do* get hits on the exception, then exactly what we do will depend on the nature of those hits, and in particular whether or not the change is helping folks find misconfigured Fedora environments they hadn't previously noticed, or if they're spurious notifications in situations where just blocking for a few hundred milliseconds would have resolved the problem on its own (as tested by inserting a "python -c 'import os; os.getrandom(1)'" before whatever application startup is triggering the new exception).

I started thinking a bit more about the outcomes we'd be looking for from such an experiment [1], and that reminded me of the fact we could potentially do that with a much lower level of divergence if the upstream implementation issued a runtime warning when it needed to fall back to blocking behaviour. That is, rather than just calling "getrandom(size, 0)" unconditionally, the current logic for trying "getrandom(size, GRND_NONBLOCK)" first could be kept, and only the fallback to reading from "/dev/random/" changed to instead call "getrandom(size, 0)" with a preceding call to PyErr_Warn.

Sorry, unhelpful typo there: the fallback is to "/dev/urandom", I just mistyped it and missed the error on proofreading. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Victor Stinner

9:59 a.m.

os.urandom() is already blocking in Python 3.5.0 and 3.5.1 :-) For example on Fedora, no need for Rawhide: Fedora 24 provides Python 3.5.1 with a blocking os.urandom() :-) I don't know the exact Python 3.5 version of Ubuntu Xenial. Victor

Nick Coghlan

12:40 p.m.

On 8 August 2016 at 19:59, Victor Stinner <victor.stinner@gmail.com> wrote:

...

os.urandom() is already blocking in Python 3.5.0 and 3.5.1 :-)

For example on Fedora, no need for rawhide: Fedora 24 provides Python 3.5.1 with a blocking os.urandom() :-)

Surprisingly, it doesn't, as due to the way the Fedora buildroots are set up in Koji the "HAVE_GETRANDOM_SYSCALL" configure check ends up returning False when the system Python RPM gets built: https://mail.python.org/pipermail/security-sig/2016-June/000060.html With 3.5.2 reverting to the old behaviour anyway, there's no compelling reason to address that build environment discrepancy for 3.5, but we (Fedora) are going to have to do something about it for Python 3.6 in F26 so that os.getrandom() gets defined properly and os.urandom() can be made blocking (with a warning when it does). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Barry Warsaw

3:57 p.m.

On Aug 08, 2016, at 11:59 AM, Victor Stinner wrote:

...

I don't know the exact Python 3.5 version of Ubuntu Xenial.

Ubuntu 16.04 LTS (Xenial Xerus) has 3.5.1-ish in the primary archive pocket, but 3.5.2-ish in xenial-updates, which most people will have enabled. Of course, the devil is in the details, in this case the patches cherry picked and otherwise on top of those base upstream versions. Cheers, -Barry

participants (7)

Barry Warsaw
Donald Stufft
Ethan Furman
Guido van Rossum
Nick Coghlan
Tim Peters
Victor Stinner
