Am 01.01.2012 06:57, schrieb Paul McMillan:
I agree that doing anything is better than doing nothing. If we use the earlier suggestion and prepend everything with a fixed-length seed, we need quite a bit of entropy (and so a fairly long string) to make a lasting difference.
Your code at https://gist.github.com/0a91e52efa74f61858b5 reads about 2 MB (2**21 - 1) data from urandom. I'm worried that this is going to exhaust the OS's random pool and suck it dry. We shouldn't forget that Python is used for long running processes as well as short scripts. Your suggestion also increases the process size by 2 MB which is going to be an issue for mobile and embedded platforms.
How about this:
r = [ord(i) for i in os.urandom(256)] rs = os.urandom(4) # or 8 ? seed = rs[-1] + (rs[-2] << 8) + (rs[-3] << 16) + (rs[-4] << 24)
def _hash_string(s): """The algorithm behind compute_hash() for a string or a unicode.""" from pypy.rlib.rarithmetic import intmask length = len(s) if length == 0: return -1 x = intmask(seed + (ord(s) << 7)) i = 0 while i < length: o = ord(s[i]) x = intmask((1000003*x) ^ o ^ r[o % 0xff] i += 1 x ^= length return intmask(x)
This combines a random seed for the hash with your suggestion.
We also need to special case short strings. The above routine hands over the seed to attackers, if he is able to retrieve lots of single character hashes. The randomization shouldn't be used if we can prove that it's not possible to create hash collisions for strings shorter than X. For example 64bit FNV-1 has no collisions for 8 chars or less, 32bit FNV has no collisions for 4 or less cars.