Re: [Patches] fix float_hash and complex_hash for 64-bit *nix

On Wed, May 10, 2000 at 12:53:16AM -0400, Tim Peters wrote:
[Trent Mick]
Discussion:
Okay, it is debatable to call float_hash and complex_hash broken, but their code presumed that sizeof(long) was 32 bits. As a result, the hash values for float and complex objects were not the same on a 64-bit *nix system as on a 32-bit *nix system. With this patch they are.
The goal is laudable but the analysis seems flawed. For example, this new comment:
Firstly, I should have admitted my ignorance with regards to hash functions.
Looks to me like the real problem in the original was here:
    x = hipart + (long)fractpart + (long)intpart + (expo << 15);
                                   ^^^^^^^^^^^^^
The difficulty is that intpart may *not* fit in 32 bits, so the cast of intpart to long is ill-defined when sizeof(long) == 4.
That is, the hash function truly is broken for "large" values with a fractional part, and I expect your after-patch code suffers the same problem:
Yes it did.
The solution to this is to break intpart in this branch into pieces no larger than 32 bits too.
Okay here is another try (only for floatobject.c) for discussion. If it looks good then I will submit a patch for float and complex objects. So do the same for 'intpart' as was done for 'fractpart'.

    static long
    float_hash(v)
        PyFloatObject *v;
    {
        double intpart, fractpart;
        long x;

        fractpart = modf(v->ob_fval, &intpart);

        if (fractpart == 0.0) {
            /* ... snip ... */
        }
        else {
            int expo;
            long hipart;

            fractpart = frexp(fractpart, &expo);
            fractpart = fractpart * 2147483648.0;
            hipart = (long)fractpart;
            fractpart = (fractpart - (double)hipart) * 2147483648.0;
            x = hipart + (long)fractpart + (expo << 15); /* combine the fract parts */

            intpart = frexp(intpart, &expo);
            intpart = intpart * 2147483648.0;
            hipart = (long)intpart;
            intpart = (intpart - (double)hipart) * 2147483648.0;
            x += hipart + (long)intpart + (expo << 15); /* add in the int parts */
        }

        if (x == -1)
            x = -2;
        return x;
    }
Note this consequence under the Win32 Python:
With this change, on Linux32:
    >>> base = 2.**40 + 0.5
    >>> base
    1099511627776.5
    >>> for i in range(32, 45):
    ...     x = base + 2.**i
    ...     print x, hash(x)
    ...
    1.10380659507e+12 -2141945856
    1.10810156237e+12 -2137751552
    1.11669149696e+12 -2129362944
    1.13387136614e+12 -2112585728
    1.16823110451e+12 -2079031296
    1.23695058125e+12 -2011922432
    1.37438953472e+12 -1877704704
    1.64926744166e+12 -1609269248
    2.19902325555e+12 -2146107392
    3.29853488333e+12 -1609236480
    5.49755813888e+12 -1877639168
    9.89560464998e+12 -2011824128
    1.86916976722e+13 -2078900224
On Linux64:
    >>> base = 2.**40 + 0.5
    >>> base
    1099511627776.5
    >>> for i in range(32, 45):
    ...     x = base + 2.**i
    ...     print x, hash(x)
    ...
    1.10380659507e+12 2153021440
    1.10810156237e+12 2157215744
    1.11669149696e+12 2165604352
    1.13387136614e+12 2182381568
    1.16823110451e+12 2215936000
    1.23695058125e+12 2283044864
    1.37438953472e+12 2417262592
    1.64926744166e+12 2685698048
    2.19902325555e+12 2148859904
    3.29853488333e+12 2685730816
    5.49755813888e+12 2417328128
    9.89560464998e+12 2283143168
    1.86916976722e+13 2216067072
-- and that should also fix your 64-bit woes "by magic".
As you can see it did not, but for another reason. The summation of the parts overflows 'x'. Is this a problem? I.e., does it matter if a hash function returns an overflowed integral value (my hash function ignorance is showing)? And if this does not matter, does it matter that a hash returns different values on different platforms?
a hash function should never ignore any bit in its input.
Which brings up a question regarding instance_hash(), func_hash(), meth_hash(), HKEY_hash() [or whatever it is called], and others which cast a pointer to a long (discarding the upper half of the pointer on Win64). Do these really need to be fixed? Am I nitpicking too much on this whole thing?

Thanks,
Trent

--
Trent Mick
trentm@activestate.com

[Trent Mick]
... Okay here is another try (only for floatobject.c) for discussion. If it looks good then I will submit a patch for float and complex objects. So do the same for 'intpart' as was done for 'fractpart'.
    static long
    float_hash(v)
        PyFloatObject *v;
    {
        double intpart, fractpart;
        long x;

        fractpart = modf(v->ob_fval, &intpart);

        if (fractpart == 0.0) {
            /* ... snip ... */
        }
        else {
            int expo;
            long hipart;

            fractpart = frexp(fractpart, &expo);
            fractpart = fractpart * 2147483648.0;
It's OK to use "*=" in C <wink>. Would like a comment that this is 2**31 (which makes the code obvious <wink> instead of mysterious). A comment block at the top would help too, like

    /* Use frexp to get at the bits in intpart and fractpart.
     * Since the VAX D double format has 56 mantissa bits, which is the
     * most of any double format in use, each of these parts may have as
     * many as (but no more than) 56 significant bits.
     * So, assuming sizeof(long) >= 4, each part can be broken into two longs;
     * frexp and multiplication are used to do that.
     * Also, since the Cray double format has 15 exponent bits, which is the
     * most of any double format in use, shifting the exponent field left by
     * 15 won't overflow a long (again assuming sizeof(long) >= 4).
     */

And this code has gotten messy enough that it's probably better to pkg it in a utility function rather than duplicate it. Another approach would be to play with the bits directly, via casting tricks. But then you have to wrestle with platform crap like endianness.
            hipart = (long)fractpart;
            fractpart = (fractpart - (double)hipart) * 2147483648.0;

            x = hipart + (long)fractpart + (expo << 15); /* combine the fract parts */

            intpart = frexp(intpart, &expo);
            intpart = intpart * 2147483648.0;
            hipart = (long)intpart;
            intpart = (intpart - (double)hipart) * 2147483648.0;

            x += hipart + (long)intpart + (expo << 15); /* add in the int parts */
There's no point adding in (expo << 15) a second time.
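A minimal sketch of that utility-function idea, under the thread's assumptions (sizeof(long) >= 4, and 2147483648.0 being 2**31); the helper name float_part_hash is made up here and is not from any posted patch:

    #include <math.h>

    /* Fold one part of a double (intpart or fractpart) into a long.
     * frexp() reduces the part to [0.5, 1.0) (or 0.0), so each multiply
     * by 2147483648.0 (2**31) exposes at most 31 more bits and no cast
     * below can overflow a 32-bit long.
     */
    static long
    float_part_hash(double part)
    {
        int expo;
        long hipart;

        part = frexp(part, &expo);
        part = part * 2147483648.0;                     /* 2**31 */
        hipart = (long)part;                            /* first 31 bits */
        part = (part - (double)hipart) * 2147483648.0;  /* next 31 bits */
        return hipart + (long)part + (expo << 15);
    }

The fractional branch of float_hash could then combine the two results with something like x = float_part_hash(fractpart) + float_part_hash(intpart); whether the (expo << 15) term belongs inside the helper or is added only once by the caller (per the remark just above) is a detail a real patch would have to settle.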
With this change, on Linux32: ...
    >>> base = 2.**40 + 0.5
    >>> base
    1099511627776.5
    >>> for i in range(32, 45):
    ...     x = base + 2.**i
    ...     print x, hash(x)
    ...
    1.10380659507e+12 -2141945856
    1.10810156237e+12 -2137751552
    1.11669149696e+12 -2129362944
    1.13387136614e+12 -2112585728
    1.16823110451e+12 -2079031296
    1.23695058125e+12 -2011922432
    1.37438953472e+12 -1877704704
    1.64926744166e+12 -1609269248
    2.19902325555e+12 -2146107392
    3.29853488333e+12 -1609236480
    5.49755813888e+12 -1877639168
    9.89560464998e+12 -2011824128
    1.86916976722e+13 -2078900224
On Linux64:
    >>> base = 2.**40 + 0.5
    >>> base
    1099511627776.5
    >>> for i in range(32, 45):
    ...     x = base + 2.**i
    ...     print x, hash(x)
    ...
    1.10380659507e+12 2153021440
    1.10810156237e+12 2157215744
    1.11669149696e+12 2165604352
    1.13387136614e+12 2182381568
    1.16823110451e+12 2215936000
    1.23695058125e+12 2283044864
    1.37438953472e+12 2417262592
    1.64926744166e+12 2685698048
    2.19902325555e+12 2148859904
    3.29853488333e+12 2685730816
    5.49755813888e+12 2417328128
    9.89560464998e+12 2283143168
    1.86916976722e+13 2216067072
-- and that should also fix your 64-bit woes "by magic".
As you can see it did not, but for another reason.
I read your original complaint as being that hash(double) yielded different results between two *64*-bit platforms (Linux64 vs Win64), but what you showed above appears to be a comparison between a 64-bit platform and a 32-bit platform, where presumably sizeof(long) is 8 on the former but 4 on the latter. If so, of *course* results may be different: hash returns a C long, and they're different sizes across these platforms. In any case, the results above aren't really different!
    >>> hex(-2141945856)   # 1st result from Linux32
    '0x80548000'
    >>> hex(2153021440L)   # 1st result from Linux64
    '0x80548000L'
That is, the bits are the same. How much more do you want from me <wink>?
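For the record, the same identity can be seen directly in C; a standalone illustration, assuming a typical LP64 box (32-bit int, 64-bit long), where the two printed values are the first results from the Linux32 and Linux64 sessions above:

    #include <stdio.h>

    int main(void)
    {
        unsigned long bits = 0x80548000UL;  /* the shared 32-bit pattern */

        /* Reinterpreted as a signed 32-bit value (implementation-defined,
         * but every mainstream compiler wraps): the Linux32 hash. */
        printf("%ld\n", (long)(int)bits);   /* -2141945856 */

        /* Kept as a 64-bit long: the Linux64 hash. */
        printf("%ld\n", (long)bits);        /* 2153021440 */
        return 0;
    }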
The summation of the parts overflows 'x'. Is this a problem? I.e., does it matter if a hash function returns an overflowed integral value (my hash function ignorance is showing)?
Overflow generally doesn't matter. In fact, it's usual <wink>; e.g., the hash for strings iterates over

    x = (1000003*x) ^ *p++;

and overflows madly. The saving grace is that C defines integer overflow in such a way that losing the high bits on every operation yields the same result as if the entire result were computed to infinite precision and the high bits tossed only at the end. So overflow doesn't stop this from being as reproducible as possible, given that Python's int size differs across platforms. Overflow can be avoided by using xor instead of addition, but addition is generally preferred because it helps to "scramble" the bits a little more.
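As a self-contained illustration of that loop (written from the description above, not copied from CPython's string hash):

    /* Hash len bytes the way the string hash does: the multiply overflows
     * a long on nearly every step, and the surviving low bits are the same
     * as if the whole product had been computed exactly and truncated at
     * the end.
     */
    static long
    string_style_hash(const unsigned char *p, long len)
    {
        long x;
        long i;

        if (len == 0)
            return 0;
        x = p[0] << 7;
        for (i = 0; i < len; i++)
            x = (1000003 * x) ^ p[i];
        x ^= len;
        if (x == -1)    /* -1 is reserved to signal an error */
            x = -2;
        return x;
    }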
And if this does not matter, does it matter that a hash returns different values on different platforms?
No, and it doesn't always stay the same from release to release on a single platform. For example, your patch above will change hash(double) on Win32!
a hash function should never ignore any bit in its input.
Which brings up a question regarding instance_hash(), func_hash(), meth_hash(), HKEY_hash() [or whatever it is called], and others which cast a pointer to a long (discarding the upper half of the pointer on Win64). Do these really need to be fixed? Am I nitpicking too much on this whole thing?
I have to apologize (although only semi-sincerely) for not being meaner about this when I did the first 64-bit port. I did that for my own use, and avoided the problem areas rather than fix them. But unless a language dies, you end up paying for every hole in the end, and the sooner they're plugged the less it costs. That is, no, you're not nitpicking too much! Everyone else probably thinks you are <wink>, *but*, they're not running on 64-bit platforms yet so these issues are still invisible to their gut radar. I'll bet your life that every hole remaining will trip up an end user eventually -- and they're the ones least able to deal with the "mysterious problems".
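One way such a hole could be plugged, sketched here with a made-up helper (not an existing CPython function): fold the whole pointer into the result instead of letting a bare (long) cast drop the top half on Win64.

    #include <stddef.h>

    /* Hash a pointer without ignoring its upper bits.  The right shift is
     * done in two half-width steps so it stays well-defined even where
     * sizeof(void *) == sizeof(long); a single full-width shift would not be.
     */
    static long
    pointer_hash(void *p)
    {
        size_t v = (size_t)p;
        long x = (long)v;    /* low bits (wraps where long is narrower) */

        x ^= (long)((v >> (4 * sizeof(long))) >> (4 * sizeof(long)));
        if (x == -1)
            x = -2;
        return x;
    }

On Win64 (sizeof(void *) == 8, sizeof(long) == 4) the xor folds in the upper 32 bits of the pointer; where pointer and long are the same size the folded term is 0 and the function degenerates to the plain cast.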

I have to admit I have no clue about the details of this debate any more, and I'm cowardly awaiting a patch submission that Tim approves of. (I'm hoping a day will come when Tim can check it in himself. :-)

In the mean time, I'd like to emphasize the key invariant here: we must ensure that (a==b) => (hash(a)==hash(b)).

One quick way to deal with this could be the following pseudo C:

    PyObject *double_hash(double x)
    {
        long l = (long)x;
        if ((double)l == x)
            return long_hash(l);
        ...double-specific code...
    }

This code makes one assumption: that if there exists a long l equal to a double x, the cast (long)x should yield l...

--Guido van Rossum (home page: http://www.python.org/~guido/)

[Guido]
I have to admit I have no clue about the details of this debate any more,
Na, there's no debate here. I believe I confused things by misunderstanding what Trent's original claim was (sorry, Trent!), but we bumped into real flaws in the current hash anyway (even on 32-bit machines). I don't think there's any actual disagreement about anything here.
and I'm cowardly awaiting a patch submission that Tim approves of.
As am I <wink>.
(I'm hoping a day will come when Tim can check it in himself. :-)
Well, all you have to do to make that happen is get a real job and then hire me <wink>.
In the mean time, I'd like to emphasize the key invariant here: we must ensure that (a==b) => (hash(a)==hash(b)).
Absolutely. That's already true, and is so non-controversial that Trent elided ("...") the code for that in his last post.
One quick way to deal with this could be the following pseudo C:
    PyObject *double_hash(double x)
    {
        long l = (long)x;
        if ((double)l == x)
            return long_hash(l);
        ...double-specific code...
    }
This code makes one assumption: that if there exists a long l equal to a double x, the cast (long)x should yield l...
No, that fails on two counts:

1. If x is "too big" to fit in a long (and a great many doubles are), the cast to long is undefined. Don't know about all current platforms, but on the KSR platform such casts raised a fatal hardware exception. The current code already accomplishes this part in a safe way (which Trent's patch improves by using a symbol instead of the current hard-coded hex constant).

2. The key invariant needs to be preserved also when x is an exact integral value that happens to be (possibly very!) much bigger than a C long; e.g.,
    >>> long(1.23e300)   # 1.23e300 is an integer! albeit not the one you think
    12299999999999999456195024356787918820614965027709909500456844293279
    60298864608335541984218516600989160291306221939122973741400364055485
    57167627474369519296563706976894811817595986395177079943535811102573
    51951343133141138298152217970719263233891682157645730823560232757272
    73837119288529943287157489664L
    >>> hash(1.23e300) == hash(_)
    1
The current code already handles that correctly too. All the problems occur when the double has a non-zero fractional part, and Trent knows how to fix that now. hash(x) may differ across platforms because sizeof(long) differs across platforms, but that's just as true of strings as floats (i.e., Python has never computed platform-independent hashes -- if that bothers *you* (doesn't bother me), that's the part you should chime in on).
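For completeness, a sketch of the guard behind point 1 above: test the range in floating point before ever performing the cast. LONG_MIN/LONG_MAX stand in here for whatever symbol the actual patch uses.

    #include <limits.h>

    /* Nonzero if the integral double d can be cast to long without
     * invoking undefined behaviour.  The comparison happens in double,
     * so the dangerous cast itself is never executed.  (When long is
     * 64 bits, (double)LONG_MAX rounds up, so a real patch must treat
     * the upper boundary more carefully than this sketch does.)
     */
    static int
    double_fits_in_long(double d)
    {
        return d >= (double)LONG_MIN && d <= (double)LONG_MAX;
    }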
participants (3):
- Guido van Rossum
- Tim Peters
- Trent Mick