SHA-based subclass for random module

Sat Mar 20 20:45:29 EST 2004

python at rcn.com (Raymond Hettinger) writes:
> > Why md5 instead of sha?
> 
> You could go either way.  Make the decision based on how many bits you
> need (128 should be plenty) for generating a single 53-bit float and
> then take into account speed considerations (MD5 vs SHA-1) and the
> time to generate the bits going into the message digest (it takes
> longer to generate 160 bits than 128).

I think it's best to stick with SHA, because of its longer hash,
because it's a US federal standard, and because MD5 is already known
to not meet its security goals (see Dobbertin's paper on finding
collisions in MD5's compression function).  Python has already screwed
up choosing its PRNG twice out of speed considerations, first with
WHrandom, and when that wasn't random enough, then with MT, which
still isn't random enough.  If we're selecting a PRNG for a third
time, let's stop cutting corners and take care of the problem once and
for all.  Even the very non-optimized code that I posted runs at
reasonable speed.

> > What's Random.random(self) supposed to do?
> 
> It's called subclassing ;-)

OK, I misunderstood at first what was going on.

> > But doesn't support all the operations specified in the Random API.
> > What's up with that?
> 
> Sure it does.  That is what the subclassing is for.  Again be sure to
> incorporate the two submitted revisions resulting in:

Yes, I understand now.  You're simply taking MT output and hashing it.

> > I'm missing something here--if you're just trying to avoid possible
> > cycles, why do you want to use MT instead of a sequential counter?
> 
> Either my version or your original requires something that generates a
> sequence of randoms from a seed (mine does it with MT and yours with
> s0=sha1(s0)).  The advantages of MT are proven properties, and it can
> more efficiently generate an arbitrary number of bits (thanks to
> genrandbits()) without the messy pure python code for extracting a
> byte at a time.  It was also nice to not have to override jumpahead(),
> getstate(), setstate(), etc.

But MT has no security properties at all.  It's equivalent to just
generating the whole output stream by hashing a sequential counter
with a secret seed.

> If you use sha1 for the base generator, you gain cryptographic
> strength to the left which is useless for your application (lottery
> numbers are selected in a way that is strong to the left and right,
> but no one cares about inferring last weeks numbers from this weeks,

No, I wanted cryptographic strength to both the left and right,
otherwise I would have just hashed a counter.  I do care about
inferring last week's numbers from this week's.  If you're playing
poker with someone, you not only don't want him to see your cards
unless he bets against you, you also don't want him to see the cards
you were holding in the last hand when he folded.  

See Bruce Schneier's paper on Yarrow for desirable PRNG criteria:

   http://www.schneier.com/paper-yarrow.html

> > Applying a collision resistant hash function to itself is a standard
> > construction for making a PRF and is the basis of HMAC.
> 
> Re-reading your sentence carefully, does it say that SHA-1 is
> collision resistant.  The design goal for SHA-1, of necessity,
> produces collisions for messages over 160 bits in length and AFAICT
> there are no citations showing collision resistance at 160 bits.

Collision resistance means there's no practical way to FIND
collisions.  Obviously it doesn't mean that no collision exists.  It's
just like in cryptography, you don't want the adversary to be able to
find the secret key.  That doesn't mean there is no key.

> > My guess is the reason you haven't found any publications specific to
> > SHA1 with this construction is that nobody has found any results
> > against SHA1.  As far as anyone can tell, the output is
> > undistinguishable from random.
> 
> I wish you would abandond this line of reasoning.  In security
> applications, proofs rule and the absence of published flaws means
> nothing (consider the case of Engima where the designers relied on
> "nobody has found any results against" their cipher).

In that case, where's your proof that MT feeding SHA is secure?  I
haven't even seen any assertion made that MT even preserves all the
entropy in the seed that you supply.  The only thing I've heard about
MT is that it passes some statistical tests and is guaranteed to have
a long period because of its large internal state.  For all I know, it
takes the seed the you give it and crunches it down to 32 bits before
initializing that state.  Maybe this question can be answered with a
detailed analysis of MT, but it's simpler if we can avoid the need for
such analysis.

> In this particular case, the finding a short cycle would not really be
> a major flaw; the function still meets its design goal of producing a
> message digest for a given message such that it is difficult to
> produce another message with the same digest.  A few known 160-bit
> collisions would not be an issue except for those exact 160 bit
> messages.  IOW, even if some collisions are known, don't expect to
> find a paper on it.

What do you mean by that?  Finding a collision in SHA would be a major
cryptanalysis result, like the result against MD5.  Even a
distinguisher (i.e. given N gigabytes of output, a way to tell, with
probability greater than 50%, whether it's SHA output or true random
output) would be a publishable result (there are now several such
results published against RC4 and so RC4 is now deprecated for new
applications).

Or do you mean that if a collision were found by a spook agency, then
we wouldn't hear about it?  That will be true of any function we come
up with.  At least in SHA's case, if the NSA finds some problem,
they'll probably advise us to stop using SHA (without telling us the
reason), like they did with SHA-0.  They sure won't tell us that for
some home-cooked function.

Anyway, the current version of the double SHA generator (which
incorporates a sequential counter into the input of the first hash) is
guaranteed to have no short cycles, because of the sequential counter.

Note that the MT-MD5 generator has a distinguisher after about 2**64
calls, because the MD5 call can see birthday collisions on both its
input and its output rather than only on its output.  So it will
repeat outputs slightly more often than expected.

> All of that is moot, MT is fast, subclasses nicely, is threadsafe, has
> proven distribution properties, and has a known (large) period.  Why
> not digest it and produce your shielded random numbers?  Or is is just
> a matter of being glued to your originally published code (which would
> be correct if collision resistance is known)?

I have no attachment to that code, which was provided as a sample
implementation--a real version should be much better optimized, and
maybe even written in C.  But I have to ask you the same question: are
you glued to MT for some reason?  I'm in favor of staying independent
of MT, for these reasons:

 1) I'd rather use primitives that were designed from the beginning as
    security primitives; and also to use as few primitives as
    possible, so that there's fewer parts to go wrong.

 2) The double SHA construction is surely easier to analyze for
    security, at least if we assume SHA meets its design criteria.  MT
    doesn't have any such criteria for us to assume.

 3) The PRNG's goal is to generate a reproducable sequence, and we
    should take that to mean that if someone wants to port an
    application to some other system, they should be able to generate
    the same sequence.  It's one thing to depend on SHA, which is a
    standard and available in lots of environments, but if MT is
    needed then its implementation needs to be ported, and its code is
    a mess.  I don't know if even Jython supports MT, much less
    whether any non-Python systems support it.  It's best to keep the
    generator simple and free-standing, that maybe can even be used as
    a standard.

Basically making the PRNG depend on a hairy and obscure function like
MT for no reason other than a little implementation convenience
strikes me as "worse as better" in a rather pure form.  If we decide
we don't care about "leftward" security, it's simplest to just feed
SHA with a sequential counter.