[Python-ideas] Python's Source of Randomness and the random.py module Redux

Thu Sep 10 15:10:09 CEST 2015

On September 10, 2015 at 8:29:16 AM, Paul Moore (p.f.moore at gmail.com) wrote:
> On 10 September 2015 at 12:26, Donald Stufft wrote:
> >> There is a fourth basic type. People (like me!) whose code absolutely
> >> doesn't have any security issues, but want a simple, convenient, fast
> >> RNG. Determinism is not an absolute requirement, but is very useful
> >> (for writing tests, maybe, or for offering a deterministic rerun
> >> option to the program). Simulation-style games often provide a way to
> >> find the "map seed", which allows users to share interesting maps -
> >> this is non-essential but a big quality-of-life benefit in such games.
> >
> > This group is the same as #3 except for the map seed thing which is
> > group #1. In particular, it wouldn’t hurt you if the random you were
> > using was cryptographically secure as long as it was fast and if you
> > needed determinism, it would hurt you to say so. Which is the point
> > that Theo was making.
>  
> I don't understand the phrase "if you needed determinism, it would
> hurt you to say so". Could you clarify?

I transposed some words. Fixed:

"If you needed determinism, would it hurt you to say so?"

Essentially, other than typing a little bit more, why is:

    import random
    print(random.choice(["a", "b", "c"]))

better than

    import random
    print(random.DeterministicRandom().choice(["a", "b", "c"]))

As far as I can tell, you've made your code, and the properties it has, much
clearer to someone reading it at the cost of 22 characters. If you're going to
reuse the DeterministicRandom class you can assign an instance to a variable
and actually end up saving characters, as long as the variable you save it to
can be accessed in fewer than 6 characters.
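
A minimal sketch of that reuse pattern. DeterministicRandom is only a proposed
name and does not exist in the stdlib, so this assumes the existing
random.Random class (the MT generator) as a stand-in; the variable names are
made up for illustration.

```python
import random

# Stand-in for the proposed (hypothetical) random.DeterministicRandom:
# random.Random is the existing Mersenne Twister class.
rng = random.Random()

letters = ["a", "b", "c"]
pick = rng.choice(letters)  # "rng" is shorter to type than "random"
print(pick)

# The extra typing in the one-off spelling, counted explicitly.
extra = len("random.DeterministicRandom().choice") - len("random.choice")
print(extra)
```

Once the instance has a short name, every later call is no longer than the
module-level spelling.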

>  
> >>
> >> IMO, the current module perfectly serves this fourth group.
> >
> > Making the user pick between Deterministic and Secure random would serve
> > this purpose too, especially in a language where "In the face of ambiguity,
> > refuse the temptation to guess" is one of the core tenets of the language. The
> > largest downside would be typing a few extra characters, which Python is not
> > a language that attempts to do things in the fewest number of characters.
>  
> And yet I know that I would routinely, and (this is the problem)
> without thinking, choose Deterministic, because I know that my use
> cases all get a (small) benefit from being able to capture the seed,
> but I also know I'm not doing security-related stuff.
>  
> No amount of making me choose is going to help me spot security
> implications that I've missed.

You're allowed to pick DeterministicRandom; you're even allowed to do it
without thinking. This isn't about making it impossible to ever use random
numbers insecurely. That's obviously a boil-the-ocean level of problem. This is
about making it less likely that someone gets hit by a fairly easy-to-hit
footgun when it does matter for them, even if they don't know it. It's also
about making code that is easier to understand on the surface. For example,
without using the prior knowledge that it's using MT, tell me how you'd know
if this was safe or not:

    import random
    import string
    password = "".join(random.choice(string.ascii_letters) for _ in range(9))
    print("Your random password is", password)
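
For contrast, a sketch of the same snippet with the choice of generator
spelled out, assuming the existing random.SystemRandom class (which draws from
os.urandom()) as the secure option. The name secure_random is just
illustrative.

```python
import random
import string

# SystemRandom reads from the operating system's entropy source via
# os.urandom(), not from the Mersenne Twister.
secure_random = random.SystemRandom()

password = "".join(
    secure_random.choice(string.ascii_letters) for _ in range(9)
)
print("Your random password is", password)
```

Here a reviewer can tell which kind of randomness is in play from the code
alone, without knowing what the module-level default happens to be.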

>  
> And also, calling the non-crypto choice "Deterministic" is unhelpful,
> because I *don't* want something deterministic, I want something
> random (I understand PRNGs aren't truly random, but "good enough for
> my purposes" is what I want, and "deterministic" reads to me as saying
> it's *not* good enough…)

But you *DO* want something deterministic. The *ONLY* way you can get this
small benefit of capturing the seed is if you can put that seed back into the
system and get a deterministic result. If the seed didn't exactly determine the
output of the randomness then you wouldn't be able to do that. If you don't
need to be able to capture the seed and essentially "replay" the PRNG in a
deterministic way then there are exactly zero downsides to using a CSPRNG other
than speed, which is why Theo suggested using a very fast, modern CSPRNG to
solve the speed issues.
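
A small sketch of what "replaying" means in practice, using the existing
random.Random and random.SystemRandom classes. The seed value and variable
names are arbitrary examples.

```python
import random

seed = 20150910  # arbitrary example value, e.g. a shared "map seed"

# Two MT generators given the same seed produce the same sequence.
first_run = random.Random(seed)
replay = random.Random(seed)
a = [first_run.randrange(100) for _ in range(5)]
b = [replay.randrange(100) for _ in range(5)]
print(a == b)

# SystemRandom has no state to capture or restore: seed() is a stub that
# is ignored, and getstate() raises NotImplementedError.
secure = random.SystemRandom()
secure.seed(seed)  # accepted, but has no effect
try:
    secure.getstate()
    replayable = True
except NotImplementedError:
    replayable = False
print(replayable)
```

The replay only works because the seed fully determines the output, which is
exactly the property the name "deterministic" describes.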

Can you point out one use case where cryptographically safe random numbers,
assuming we could generate them as quickly as you asked for them, would hurt
you unless you needed/wanted to be able to save the seed and thus require or
want deterministic results?

>  
> >> While I accept your point that far too many people are using insecure
> >> RNGs in "generate a random password" scripts, they are *not* the core
> >> target audience of the default module-level functions in the random
> >> module (did you find any examples of insecure use that *weren't*
> >> password generators?). We should educate people that this is bad
> >> practice, not change the module. Also, while it may be imperfect, it's
> >> still better than what many people *actually* do, which is to use
> >> "password" as a password on sensitive systems :-(
> >
> > You cannot document your way out of a UX problem.
>  
> What I'm trying to say is that this is an education problem more than
> a UX problem.
>  
> Personally, I think I know enough about security for my (not a
> security specialist) purposes. To that extent, if I'm working on
> something with security implications, I'm looking for things that say
> "Crypto" in the name. The rest of the time, I just use non-specialist
> stuff. It's a similar situation to that of the "statistics" module. If
> I'm doing "proper" maths, I'd go for numpy/scipy. If I just want some
> averages and I'm not bothered about numerical stability, rounding
> behaviour, etc, I'd go for the stdlib statistics package.
>  
> > The problem isn’t people doing this once on the command line to generate
> > a password, the problem is people doing it in applications where they
> > generate an API key, a session identifier, a random password which they
> > then give to their users. If you give a way to get the output of the MT
> > base random enough times, it can be used to determine what every random
> > it generated was and will be.
>  
> To me, that's crypto and I'd look to the cryptography module, or to
> something in the stdlib that explicitly said it was suitable for
> crypto.
>  
> Saying people write bad code isn't enough - how does the current
> module *encourage* them to write bad code? How much API change must we
> allow to cater for people who won't read the statement in the docs (in
> a big red box) "Warning: The pseudo-random generators of this module
> should not be used for security purposes." (Specifically people
> writing security related code who won't read the docs).

Reminder that this warning does not show up (in any color, much less red)
if you’re using ``help(random)`` or ``dir(random)`` to explore the random
module. It also does not show up in code review when you see someone doing
random.random.

It encourages you to write bad code, because it has a baked in assumption that
there is a sane default for a random number generator and expects people to
understand a fairly difficult concept, which is that not all "random" is equal.

For instance, you've already made the mistake of saying you wanted "random" not
deterministic, but the two are not mutually exclusive and deterministic is a
property that a source of random can have, and one that you need for one of the
features you say you like.
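
To make concrete the point quoted above, that enough observed MT output
determines everything that follows, here is a rough sketch of the well-known
state-cloning technique. It assumes the easiest case for an attacker: 624
consecutive raw 32-bit outputs from getrandbits(32). Recovering state from
something like choice() takes more work. The helper names are made up for
illustration, and only forward prediction is shown.

```python
import random


def undo_right_shift_xor(y, shift):
    """Invert y = x ^ (x >> shift) for a 32-bit x."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ (x >> shift)
    return x


def undo_left_shift_xor_mask(y, shift, mask):
    """Invert y = x ^ ((x << shift) & mask) for a 32-bit x."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ ((x << shift) & mask)
    return x


def untemper(y):
    """Undo the MT19937 output tempering, last step first."""
    y = undo_right_shift_xor(y, 18)
    y = undo_left_shift_xor_mask(y, 15, 0xEFC60000)
    y = undo_left_shift_xor_mask(y, 7, 0x9D2C5680)
    y = undo_right_shift_xor(y, 11)
    return y


# The "victim" is seeded from OS entropy; the seed is never looked at.
victim = random.Random()
observed = [victim.getrandbits(32) for _ in range(624)]

# Rebuild the 624-word internal state. Index 624 means "regenerate the
# state block before producing the next output".
state = tuple(untemper(word) for word in observed) + (624,)
clone = random.Random()
clone.setstate((3, state, None))

predicted = [clone.getrandbits(32) for _ in range(10)]
actual = [victim.getrandbits(32) for _ in range(10)]
print(predicted == actual)
```

Nothing here needs the seed, only the outputs, which is why handing MT output
to users as tokens or keys is a problem.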

>  
> > Here’s a game a friend of mine created where the purpose of the game is
> > to essentially unrandomize some random data, which is only possible
> > because it’s (purposely) using MT to make it possible
> > https://github.com/reaperhulk/dsa-ctf. This is not an ivory tower paranoia
> > case, it’s a real concern that will absolutely fix some insecure software
> > out there instead of telling them “welp typing a little bit extra once
> > an import is too much of a burden for me and really it’s your own fault
> > anyways”.
>  
> I don't understand how that game (which is an interesting way of
> showing people how attacks on crypto work, sure, but that's just
> education, which you dismissed above) relates to the issue here.
>  
> And I hope you don't really think that your quote is even remotely
> what I'm trying to say (I'm not that selfish) - my point is that not
> everything is security related. Not every application people write,
> and not every API in the stdlib. You're claiming that the random
> module is security related. I'm claiming it's not, it's documented as
> not being, and that's clear to the people who use it for its intended
> purpose. Telling those people that you want to make a module designed
> for their use harder to use because people for whom it's not intended
> can't read the documentation which explicitly states that it's not
> suitable for them, is doing a disservice to those people who are
> already using the module correctly for its stated purpose.

I'm claiming that the term random is ambiguously both security related and
not security related and we should either get rid of the default and expect
people to pick whether or not their use case is security related, or we should
assume that it is unless otherwise instructed. I don't particularly care what
the exact spelling of this looks like, random.(System|Secure)Random and
random.DeterministicRandom is just one option. Another option is to look at
something closer to what Go did and deprecate the "random" module and move the
MT based thing to ``math.random`` and the CSPRNG can be moved to something like
crypto.random.

>  
> By the same argument, we should remove the statistics module because
> it can be used by people with numerically unstable problems. (I doubt
> you'll find StackOverflow questions along these lines yet, but that's
> only because (a) the module's pretty new, and (b) it actually works
> pretty hard to handle the hard corner cases, but I bet they'll start
> turning up in due course, if only from the people who don't understand
> floating point...)
>

No, by this argument we shouldn't have a function called statistics in the
statistics module because there is no globally "right" answer for what the
default should be. Should it be mean? mode? median? Why is *your* use case the
"right" use case for the default option, particularly in a situation where
picking the wrong option can be disastrous?

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA