question: why isn't a byte of a hash more uniform? how could I improve my code to cure that?
Tim Chase
python.list at tim.thechases.com
Fri Aug 7 13:19:47 EDT 2009
> After I have written a short Python script that hashes my textfile line by
> line and collects the numbers next to the original, I checked what I got.
> Instead of getting around 25% in each treatment, the range is 17.8%-31.3%.
That sounds suspiciously like 25% with a +/- 7% fluctuation one
might expect to see from non-random source data.
Remember that your outputs are driven purely by your inputs in a
deterministic fashion -- if your inputs are purely random, then
your outputs should more closely match your expected binning. If
your inputs aren't random, you get a taste of your own medicine
("my file has just the number 42 on every line...why isn't my
output random?"). And randomness-of-hash-output is a red herring
since hashing is *not* random.
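To make that concrete, here is a minimal sketch (not your actual
script, which I haven't seen). The treatment() helper and the choice
of "first digest byte modulo 4" are assumptions for illustration
only. It bins 1000 identical lines and 1000 distinct lines:

```python
import hashlib
from collections import Counter


def treatment(line, bins=4):
    """Hypothetical helper: map a line to one of `bins` treatments
    using the first byte of its md5 digest (an assumed scheme)."""
    digest = hashlib.md5(line.encode("utf-8")).digest()
    return digest[0] % bins


# Identical input lines always hash to the same treatment.
same = Counter(treatment("42") for _ in range(1000))

# Distinct input lines spread across the treatments, but only as
# evenly as this particular finite set of inputs happens to allow.
varied = Counter(treatment(str(i)) for i in range(1000))

print("identical lines:", dict(same))
print("distinct lines: ", dict(varied))
```

The first counter ends up with a single bin holding everything; the
second gets close to 25% per bin but not exactly, because the result
depends entirely on which lines went in.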
Your input is also finite -- an aspect which leaves you a far cry
from the full hash-space. An md5 digest has 16 bytes (128 bits) of
data (32 characters when written as hex), so there are 2**128
possible outputs, and your input would have to exercise all of them
to see the full profile of your outputs. That's a lot of input :)
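If you want to check the size of the hash-space yourself, hashlib
will tell you. This is just a quick sketch; the variable names are
mine. For md5 the raw digest is 16 bytes (128 bits), and 32 is the
length of the hex string:

```python
import hashlib

h = hashlib.md5(b"42")

digest_bytes = h.digest_size      # size of the raw digest in bytes
bits = digest_bytes * 8           # same size in bits
hex_len = len(h.hexdigest())      # two hex characters per byte

print(digest_bytes, "bytes,", bits, "bits,", hex_len, "hex chars")
print("possible digests: 2**%d" % bits)
```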
-tkc