Does shuffle() produce uniform result?

Paul Rubin
Wed Sep 5 01:01:47 EDT 2007


Steven D'Aprano <steve at REMOVE-THIS-cybersource.com.au> writes:
> > Right.  The idea is that those attacks don't exist and therefore the
> > output is computationally indistinguishable from random.
> 
> It is a huge leap from what the man page says, that they don't exist in 
> the unclassified literature at the time the docs were written, to what 
> you're saying, that they don't exist.

OK.  /dev/random vs /dev/urandom is a perennial topic in sci.crypt and
there are endless long threads about it there, so I tried to give you
the short version, but will give a somewhat longer version here.

There can't be an actual security proof without also proving P!=NP;
however, the cryptographic part of /dev/urandom is designed pretty
conservatively and seems to be in good shape even after all these years.

> The man page is clear: there is a possible vulnerability in /dev/urandom. 

There is the theoretical possibility of such a vulnerability; however,
it also applies to /dev/random.  Note that the man page is neither a
kernel internals document nor a cryptography textbook, so it doesn't
describe the algorithm or the theory behind it.  For that you have to
look at the source code and the cryptographic literature, respectively.

For the random.c source code, see

  http://www.cs.berkeley.edu/~daw/rnd/linux-rand

It's an old version but hasn't changed all that much since then AFAIK.
There is a long comment at the top explaining how it works.  Short
version: /dev/random reads data from a bunch of stochastic sources
around the system (e.g. things like mouse motion) and makes a
seat-of-the-pants estimate of the amount of entropy coming from the
sources.  The sources are not considered to be uniform or
uncorrelated; they're just supposed to contain entropy.  The entropy
is distilled through a cryptographic hash function before being sent
out the API.  /dev/random throttles its output so that the number of
bits emitted is always lower than the estimated input entropy.
/dev/urandom is the same but doesn't do this throttling.
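
To make that concrete, here is a toy sketch in Python 3 of the same
idea (my own illustration, not the kernel code; the class and its
entropy estimates are made up for the example):

  import hashlib
  import os

  class ToyEntropyPool:
      # Rough model of the random.c design: mix noisy samples into a
      # pool, keep a seat-of-the-pants entropy estimate, and hash the
      # pool to produce output.
      def __init__(self):
          self.pool = b""
          self.entropy_bits = 0.0

      def add_sample(self, sample, estimated_bits):
          # Samples need not be uniform or uncorrelated; they only
          # have to contain *some* entropy.
          self.pool = hashlib.sha256(self.pool + sample).digest()
          self.entropy_bits += estimated_bits

      def read(self, nbytes, blocking):
          # /dev/random-style throttling: never emit more bits than
          # the estimated input entropy.
          if blocking and nbytes * 8 > self.entropy_bits:
              raise BlockingIOError("would block: not enough entropy")
          out, counter = b"", 0
          while len(out) < nbytes:
              block = self.pool + counter.to_bytes(4, "big")
              out += hashlib.sha256(block).digest()
              counter += 1
          self.entropy_bits = max(0.0, self.entropy_bits - nbytes * 8)
          return out[:nbytes]

  pool = ToyEntropyPool()
  pool.add_sample(os.urandom(16), estimated_bits=64)  # stand-in for noise
  print(pool.read(8, blocking=True).hex())    # throttled, like /dev/random
  print(pool.read(32, blocking=False).hex())  # unthrottled, like /dev/urandom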

For the theory of what we hope (but have not proved) that these hash
functions do, see for example the HMAC papers:

    http://www-cse.ucsd.edu/~mihir/papers/hmac.html

> Any cryptographer worth his salt (pun intended) would be looking to close 
> that vulnerability BEFORE an attack is made public, and not just wait for 
> the attack to trickle down from the NSA to the script kiddies. The time 
> to close the stable door is _before_ the horse gets away.

No, really: these functions are very well studied.  There are known
ways to construct collisions in the MD5 compression function, but that
doesn't appear to be of any use against the way random.c uses it.

> I agree that this flaw doesn't sound like it will affect the application 
> being discussed, but a problem has already been found and a solution is 
> already known: block until there's enough entropy. That's what /dev/
> random does.

No, having enough entropy does NOT guarantee indistinguishability from
randomness.  Think of a 60-40 biased coin.  Each flip gives you 0.97
bits of entropy, so if you want 64 bits of entropy, 66 flips gives it
to you.  But it's very easy to spot the 60-40 bias and see that the
flips are not drawn uniformly from {heads,tails}.  Running the flips
through MD5 experimentally appears to get rid of any correlation
(except for some very tricky adversarial constructions based on
controlling the input), so that is what /dev/random does.  But that is
the exact same thing that urandom does.  It's possible that some
as-yet-unknown attack could find correlations in the MD5 output, and
likewise in SHA-256 or whatever.
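
The arithmetic in that coin example is easy to check directly (plain
Python, nothing specific to /dev/random):

  import math

  p = 0.6                                  # 60-40 biased coin
  h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
  print(round(h, 2))                       # ~0.97 bits of entropy per flip
  print(math.ceil(64 / h))                 # 66 flips for 64 bits of entropy
  # ...yet the bias is trivially visible: about 40 heads vs. 26 tails
  # expected in those 66 flips, instead of 33-33 for a fair coin.
  print(round(66 * p), round(66 * (1 - p)))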

More recently there have been some theoretical advances; see these:

   http://en.wikipedia.org/wiki/Extractor
   http://citeseer.ist.psu.edu/barak03true.html

so maybe that stuff can be used in implementations.  I haven't looked
into it enough to know if it makes sense.  Chapter 10 of Schneier and
Ferguson's "Practical Cryptography" also discusses how to write these
system-level PRNGs.

> That doesn't follow. Antoon is specifically warning that /dev/urandom is 
> non-blocking. If you knew there was enough entropy available, you 
> wouldn't need /dev/random -- but how do you know there's enough entropy?

/dev/urandom can in fact screw you if you try to use it before the
system has gathered enough entropy to give good output, such as while
the system is still booting up.  That is not a cryptographic algorithm
vulnerability.  It could also screw you if your system is somehow
leaking information about its entropy pool (say through
electromagnetic emissions from the memory bus).  That is also not a
cryptographic algorithm vulnerability.  /dev/random does protect you
somewhat from both of these screws.  /dev/random does NOT protect you
against sufficiently strong hypothetical attacks against the crypto
algorithm.

Overall, the situation is:

 1) /dev/urandom is slightly bogus because it can produce output when
    there's almost no entropy in the system, i.e. during boot or
    immediately after boot.  But once the system has been running
    for a while, the urandom output (in a real PC) is pretty good.

 2) /dev/random is somewhat more bogus because it blocks when it's
    capable of producing good (computationally secure) output.  Note
    that /dev/random's entropy estimates could in fact be wrong--for
    example, in a virtual PC (say a simulated one), there might be ZERO
    actual entropy (restarting the simulation might get you back to 
    exactly the same state).

 3) /dev/random and /dev/urandom are in principle BOTH vulnerable to
    cryptographic attacks.  Some people incorrectly assume that this
    doesn't apply to /dev/random.  So from an information-theoretic
    point of view, /dev/random and /dev/urandom are both bogus, but
    from a computational security point of view, the algorithms look
    ok (as best as we can tell), and if anything is irreparably wrong
    with them then all of cryptography is in trouble, so really neither
    is bogus in this regard.

 4) The man page is fairly seriously bogus because it doesn't explain
    the real situation with either /dev/urandom or /dev/random.

 5) Really, these modern algorithms are much harder to attack than
    fans of pre-computer (e.g. WW2-era) cryptography seem to imagine.
    These algorithms are very fast on modern computers but wouldn't
    have been practical without computers.  Computers have extended
    the capabilities of both attackers and defenders, but it looks
    like they've given a lot more to the defenders (see Kahn "The
    Codebreakers" 2nd ed. note at the end: it says something like
    "in the long-running battle between cryptographers and cryptanalysts,
    it looks like the cryptographers have won").

 6) For a sort of philosophical note on how cryptography fits in with
    the P vs. NP problem, this article is pretty cool:

       http://www.cs.ucsd.edu/users/russell/average.ps

    A brief summary is at:

       http://weblog.fortnow.com/2004/06/impagliazzos-five-worlds.html

> For this specific application, it probably doesn't matter -- using /dev/
> urandom is surely overkill, and on a single-user Linux desktop you're 
> unlikely to have vast numbers of applications reading /dev/urandom 
> without your knowledge. But why not use /dev/random? What's the downside?

/dev/random is in fact quite slow, like a few bits per second, not
practical for dealing zillions of poker hands or whatever this
application was.  Once you exhaust the stored pool (a few hundred
bytes or whatever) it takes quite a long time to get any more output.
Also, it doesn't give the theoretical guarantees that some people
imagine that it does.  Aside from that, it's indistinguishable from
/dev/urandom once /dev/urandom has enough entropy to get started.
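
For what it's worth, for a card-dealing application like the one in
this thread, Python's random.SystemRandom already reads the urandom
pool (via os.urandom), so you get the non-blocking device without
touching the device file yourself, e.g.:

  import random

  # SystemRandom draws from os.urandom(), which on Linux reads
  # /dev/urandom, so shuffling never blocks no matter how many
  # hands you deal.
  rng = random.SystemRandom()
  deck = [rank + suit for rank in "23456789TJQKA" for suit in "shdc"]
  rng.shuffle(deck)
  print(deck[:5])   # one five-card hand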

The right way for /dev/*random to work is to block after system boot
until there is some reasonable fixed amount of entropy (say 256 bits)
gathered, then produce unlimited output without ever blocking again.
Some ongoing folding of new entropy into the entropy pool (see the
Fortuna algorithm in Schneier and Ferguson) is a good practical
precaution, but really high security applications should consider the
whole PC memory and CPU buses to be an insecure broadcast medium
anyway, so such applications generally do their most sensitive
operations in dedicated secure hardware.
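
As a rough sketch of that block-once-then-stream behavior (again my
own toy illustration, not Fortuna and not the kernel's code; the
256-bit threshold is just the figure suggested above):

  import hashlib
  import os
  import threading

  class SeedOnceGenerator:
      # Blocks readers until ~256 bits of entropy have been gathered,
      # then produces unlimited hash-based output forever after,
      # folding in any new entropy that shows up along the way.
      def __init__(self, threshold_bits=256):
          self.threshold_bits = threshold_bits
          self.collected_bits = 0.0
          self.key = b"\x00" * 32
          self.counter = 0
          self.seeded = threading.Event()

      def add_entropy(self, sample, estimated_bits):
          self.key = hashlib.sha256(self.key + sample).digest()
          self.collected_bits += estimated_bits
          if self.collected_bits >= self.threshold_bits:
              self.seeded.set()         # unblock readers, permanently

      def read(self, nbytes):
          self.seeded.wait()            # blocks only before first seeding
          out = b""
          while len(out) < nbytes:
              block = self.key + self.counter.to_bytes(8, "big")
              out += hashlib.sha256(block).digest()
              self.counter += 1
          return out[:nbytes]

  gen = SeedOnceGenerator()
  gen.add_entropy(os.urandom(32), estimated_bits=256)  # boot-noise stand-in
  print(gen.read(16).hex())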

Sorry to go on for so long, but it connects to a never-ending source
of bogosity on sci.crypt (the other newsgroup I mostly hang out in),
where several times a year someone shows up selling some bogus
one-time-pad product based on supposed information-theoretic security.
They're misguided 100% of the time and get shot down, but it always
takes a while.  My guess is that one of them is responsible for the
ongoing sci.crypt sporge attack.


