question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?

Fri Aug 7 16:00:59 EDT 2009

L�������������������������������� wrote:
> Hi all,
> I am a Python novice, and right now I would be happy to simply get my job
> done with it, but I could appreciate some thoughts on the issue below.
>
> I need to assign one of four numbers to names in a list. The assignment
> should be pseudo-random: no pattern whatsoever, but deterministic,
> reproducible, and close to uniform. My understanding was that hash functions
> would do the job. As I only needed 2 bits of treatment, I picked a byte of
> the hashes generated, and even taken mod 4 of it. See the code below.
>
> After I have written a short Python script that hashes my textfile line by
> line and collects the numbers next to the original, I checked what I got.
> Instead of getting around 25% in each treatment, the range is 17.8%-31.3%. I
> understand that the pseudo-randomness means that the treatments should not
> be neat and symmetric. Still, this variation is unacceptable for my purpose.
> My understanding was that good hash functions generate numbers that look
> completely random, and picking only a byte should not change that. I thought
> the promise was also to get close to uniformity:
> http://en.wikipedia.org/wiki/Hash_function#Uniformity. I tried all the
> hashes in the hashlib module, and picked bytes from the beginning and the
> end of the hashes, but treatments never were close to uniform (curiously,
> always the last treatment seems to be too rare).
>
> Maybe it is an obvious CS puzzle, I'm looking forward to standing corrected.
>
> Thanks!
>
> Laszlo
>
> The script:
>
> #! /usr/bin/python
>
> f = open('names.txt', 'r')
> g = open('nameshashed.txt', 'a')
>
> import hashlib
>
> for line in f:
>     line = line.rstrip()
>     h = str(hashlib.sha512(line).hexdigest())
>     s = line + ',' + str(ord(h[64])%4) + '\n'
>     g.write(s),
>
>   
The problem is indeed simple, but really nothing to do with Python.  
You're using a single character from a hexdigest, so the only 
possibilities are 0,1,2,...9,A,B, C, D,E, and F

Taking the ords of each of those, and moduloing with 4, you get
     4 - 0
     5 - 1
     4 - 2
     3 - 3

In other words, there's a greater than expected likelihood of 1, and 
less than expected likelihood of 3.  I'd figure the distribution ought 
to be:
     25%,  31%, 25%, and 19%

I'd use digest() instead of hexdigest(), and of course reduce the 
subscript  to 63 or less.

DaveA