On Jun 9, 2016, at 9:27 AM, Cory Benfield <cory@lukasa.co.uk> wrote:
On 9 Jun 2016, at 13:53, Doug Hellmann <doug@doughellmann.com> wrote:
I agree with David. We shouldn't break existing behavior in a way that might lead to someone else's software being unusable.
What does ‘usable’ mean? Does it mean “the code must execute from beginning to end”? Or does it mean “the code must maintain the expected invariants”? If it’s the second, what reasonably counts as “the expected invariants”?
The code must not cause the user’s computer to completely freeze in a way that makes their VM appear to be failing to boot?
The problem here is that both definitions of ‘broken’ are unclear. If we leave os.urandom() as it is, there is a small-but-nonzero chance that your program will hang, potentially indefinitely. If we change it back, there is a small-but-nonzero chance your program will generate bad random numbers.
If we assume, for a moment, that os.urandom() doesn’t get called during Python startup (that is, that we adopt Christian’s approach to deal with random and SipHash as separate concerns), what we’ve boiled down to is: your application called os.urandom() so early that you’ve got weak random numbers, does it hang or proceed? Those are literally our two options.
I agree those are the two options. I want the application developer to make the choice, not us.
These two options can be described a different way. If you didn’t actually need strong random numbers but were affected by the hang, that program failed obviously, and it failed closed. You *will* notice that your program didn’t start up, you’ll investigate, and you’ll take action. On the other hand, if you need strong random numbers but were affected by os.urandom() returning bad random numbers, you almost certainly will *not* notice, and your program will have failed *open*: that is, you are exposed to a security risk, and you have no way to be alerted to that fact.
For my part, I think the first failure mode is *vastly* better than the second, even if the first failure mode affects vastly more people than the second one does. Failing early, obviously, and safely is IMO much, much better than failing late, silently, and dangerously.
I’d argue that all the security disagreements that happen on this list boil down to weighting those two failure modes differently. For my part, I want code that expects to be used in a secure context to fail *as loudly as possible* if it is unable to operate securely. And for that reason:
Adding a new API that does block allows anyone to call that when they want guaranteed random values, and the decision about whether to block or not can be placed in the application developer's hands.
I’d rather flip this around. Add a new API that *does not* block. Right now, os.urandom() is trying to fill two niches, one of which is security focused. I’d much rather decide that os.urandom() is the secure API and fail as loudly as possible when people are using it insecurely than to decide that os.urandom() is the *insecure* API and require changes.
This is because, again, people very rarely notice this kind of new API introduction unless their code explodes when they migrate. If you think you can find a way to blow up the secure crypto code only, I’m willing to have that patch too, but otherwise I really think that those who expect this code to be safe should be prioritised over those who expect it to be 100% available.
My ideal solution: change os.urandom() to throw an exception if the kernel CSPRNG is not seeded, and add a new function for saying you don’t care if the CSPRNG isn’t seeded, with all the appropriate “don’t use this unless you’re sure” warnings on it.
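A rough sketch of what that split might look like, for illustration only. It assumes Linux and a Python that exposes os.getrandom() and os.GRND_NONBLOCK (added in Python 3.6, after this thread). The names strict_urandom, lax_urandom and EntropyUnavailable are hypothetical, not an existing or proposed API.

```python
import os


class EntropyUnavailable(RuntimeError):
    """Hypothetical error: the kernel CSPRNG has not been seeded yet."""


def strict_urandom(n):
    """Return n random bytes, raising instead of hanging or degrading."""
    if not hasattr(os, "getrandom"):
        # Assumption: without a getrandom() wrapper we cannot detect the
        # unseeded state, so this sketch simply defers to os.urandom().
        return os.urandom(n)
    buf = b""
    try:
        # getrandom() may return fewer bytes than requested, so loop.
        while len(buf) < n:
            buf += os.getrandom(n - len(buf), os.GRND_NONBLOCK)
    except BlockingIOError as exc:
        # With GRND_NONBLOCK the call fails immediately rather than
        # blocking when the entropy pool is not yet initialised.
        raise EntropyUnavailable("kernel CSPRNG is not seeded") from exc
    return buf


def lax_urandom(n):
    """Return n bytes even if the CSPRNG is unseeded. Not for secrets."""
    try:
        return strict_urandom(n)
    except EntropyUnavailable:
        # On Linux, reading /dev/urandom never blocks, seeded or not.
        with open("/dev/urandom", "rb") as f:
            return f.read(n)
```

Code that needs real security would call strict_urandom() and fail loudly at startup; code that only wants some bytes would have to opt in to lax_urandom() explicitly.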
All of which fails to be backwards compatible (new exceptions and hanging behavior), which means you’re breaking apps. Introducing a new API lets the developers who care about strong random values use them without breaking anyone else.

Doug