[Python-Dev] Re: [Patches] fix float_hash and complex_hash for 64-bit *nix

Thu, 11 May 2000 00:13:29 -0400

[Trent Mick]
> ...
> Okay here is another try (only for floatobject.c) for discussion.
> If it looks good then I will submit a patch for float and complex
> objects. So do the same for 'intpart' as was done for 'fractpart'.
>
>
> static long
> float_hash(v)
>     PyFloatObject *v;
> {
>     double intpart, fractpart;
>     long x;
>
>     fractpart = modf(v->ob_fval, &intpart);
>
>     if (fractpart == 0.0) {
> 		// ... snip ...
>     }
>     else {
>         int expo;
>         long hipart;
>
>         fractpart = frexp(fractpart, &expo);
>         fractpart = fractpart * 2147483648.0;

It's OK to use "*=" in C <wink>.

Would like a comment that this is 2**31 (which makes the code obvious <wink>
instead of mysterious).  A comment block at the top would help too, like

/* Use frexp to get at the bits in intpart and fractpart.
 * Since the VAX D double format has 56 mantissa bits, which is the
 * most of any double format in use, each of these parts may have as
 * many as (but no more than) 56 significant bits.
 * So, assuming sizeof(long) >= 4, each part can be broken into two longs;
 * frexp and multiplication are used to do that.
 * Also, since the Cray double format has 15 exponent bits, which is the
 * most of any double format in use, shifting the exponent field left by
 * 15 won't overflow a long (again assuming sizeof(long) >= 4).
 */

And this code has gotten messy enough that it's probably better to pkg it in
a utility function rather than duplicate it.

Another approach would be to play with the bits directly, via casting
tricks.  But then you have to wrestle with platform crap like endianness.

>         hipart = (long)fractpart;
>         fractpart = (fractpart - (double)hipart) * 2147483648.0;
>
>         x = hipart + (long)fractpart + (expo << 15); /* combine
> the fract parts */
>
>         intpart = frexp(intpart, &expo);
>         intpart = intpart * 2147483648.0;
>         hipart = (long)intpart;
>         intpart = (intpart - (double)hipart) * 2147483648.0;
>
>         x += hipart + (long)intpart + (expo << 15); /* add in the
> int parts */

There's no point adding in (expo << 15) a second time.

> With this change, on Linux32:
> ...
> >>> base = 2.**40 + 0.5
> >>> base
> 1099511627776.5
> >>> for i in range(32, 45):
> ...     x = base + 2.**i
> ...     print x, hash(x)
> ...
> 1.10380659507e+12 -2141945856
> 1.10810156237e+12 -2137751552
> 1.11669149696e+12 -2129362944
> 1.13387136614e+12 -2112585728
> 1.16823110451e+12 -2079031296
> 1.23695058125e+12 -2011922432
> 1.37438953472e+12 -1877704704
> 1.64926744166e+12 -1609269248
> 2.19902325555e+12 -2146107392
> 3.29853488333e+12 -1609236480
> 5.49755813888e+12 -1877639168
> 9.89560464998e+12 -2011824128
> 1.86916976722e+13 -2078900224
>
>
> On Linux64:
>
> >>> base = 2.**40 + 0.5
> >>> base
> 1099511627776.5
> >>> for i in range(32, 45):
> ...     x = base + 2.**i
> ...     print x, hash(x)
> ...
> 1.10380659507e+12 2153021440
> 1.10810156237e+12 2157215744
> 1.11669149696e+12 2165604352
> 1.13387136614e+12 2182381568
> 1.16823110451e+12 2215936000
> 1.23695058125e+12 2283044864
> 1.37438953472e+12 2417262592
> 1.64926744166e+12 2685698048
> 2.19902325555e+12 2148859904
> 3.29853488333e+12 2685730816
> 5.49755813888e+12 2417328128
> 9.89560464998e+12 2283143168
> 1.86916976722e+13 2216067072

>>-- and that should also fix your 64-bit woes "by magic".

> As you can see it did not, but for another reason.

I read your original complaint as that hash(double) yielded different
results between two *64* bit platforms (Linux64 vs Win64), but what you
showed above appears to be a comparison between a 64-bit platform and a
32-bit platform, and where presumably sizeof(long) is 8 on the former but 4
on the latter.  If so, of *course* results may be different:  hash returns a
C long, and they're different sizes across these platforms.

In any case, the results above aren't really different!

>>> hex(-2141945856)  # 1st result from Linux32
'0x80548000'
>>> hex(2153021440L)  # 1st result from Linux64
'0x80548000L'
>>>

That is, the bits are the same.  How much more do you want from me <wink>?

> The summation of the parts overflows 'x'. Is this a problem? I.e., does
> it matter if a hash function returns an overflowed integral value (my
> hash function ignorance is showing)?

Overflow generally doesn't matter.  In fact, it's usual <wink>; e.g., the
hash for strings iterates over

    x = (1000003*x) ^ *p++;

and overflows madly.  The saving grace is that C defines integer overflow in
such a way that losing the high bits on every operation yields the same
result as if the entire result were computed to infinite precision and the
high bits tossed only at the end.  So overflow doesn't hurt this from being
as reproducible as possible, given that Python's int size is different.

Overflow can be avoided by using xor instead of addition, but addition is
generally preferred because it helps to "scramble" the bits a little more.

> And if this does not matter, does it matter that a hash returns different
> values on different platforms?

No, and it doesn't always stay the same from release to release on a single
platform.  For example, your patch above will change hash(double) on Win32!

>> a hash function should never ignore any bit in its input.

> Which brings up a question regarding instance_hash(), func_hash(),
> meth_hash(), HKEY_hash() [or whatever it is called], and other
> which cast a pointer to a long (discarding the upperhalf of the
> pointer on Win64). Do these really need to be fixed. Am I nitpicking
> too much on this whole thing?

I have to apologize (although only semi-sincerely) for not being meaner
about this when I did the first 64-bit port.  I did that for my own use, and
avoided the problem areas rather than fix them.  But unless a language dies,
you end up paying for every hole in the end, and the sooner they're plugged
the less it costs.

That is, no, you're not nitpicking too much!  Everyone else probably thinks
you are <wink>, *but*, they're not running on 64-bit platforms yet so these
issues are still invisible to their gut radar.  I'll bet your life that
every hole remaining will trip up an end user eventually -- and they're the
ones least able to deal with the "mysterious problems".