[Tutor] Hashing [And why hash() NEVER returns -1]

Fri, 10 Aug 2001 22:44:34 -0700 (PDT)

On Fri, 10 Aug 2001, charlie derr wrote:

> +Hi,
> +
> +Does anyone know exactly how the hash method derives a hash 
> +for tuples? Or strings?
> +
> +>>> a, b, c, d = (1,), (2,), (3,), (4,)
> +>>> map (hash, [a, b, c, d])
> +[-1660579480, -1660579477, -1660579478, -1660579475]
> +
> +The above code does seem slightly bewildering to me, anyone got 
> +any ideas?
> 
> no ideas, but i found something strange while playing around
> 
> 
> >>> hash((0,))
> -1660579479
> >>> hash((-1,))
> 1660579479
> >>> hash((-2,))
> 1660579479
> 
> aren't these values supposed to be unique?

Actually, no; hash values are only guaranteed to be "probably" unique
among different objects.  Think "Social Security Number".  *grin*

[note: this message is long, and it involves looking deeply into the
Python/C API.]

What ends up happening is that certain thing "collide" with the same hash
value, and they have to, since there are so many more "things" than
integers.  A good hash function does try to distribute things as evenly as
possible.  If you look in some algorithm book, you'll hear about things
like "linear probing" or "linear chaining", techniques to deal with hash
collisions.

Let's take a look at the string hashing function, stripped down to it's
C'ish essence, just to get a feel for the thing:

/******/
       /* (dyoo) I cut some code out to make it easier to read */ 
static long
string_hash(PyStringObject *a)
{
        register int len;
        register unsigned char *p;
        register long x;
        len = a->ob_size;
        p = (unsigned char *) a->ob_sval;
        x = *p << 7;
        while (--len >= 0)
                x = (1000003*x) ^ *p++;
        x ^= a->ob_size;
        if (x == -1)
                x = -2;
        return x;
}
/******/

It's doing some weird bitwise mess with the strings: it takes each
character of the string and looks at it as a number, spins it around with
some multplication against a prime, twists it with bitwise XORs... It's
definitely meant to mess with your head.  *grin* There's probably some
number theory involved with this function that guarantees that that most
strings will get different values.

Still, we know that Python integers can only have 2**32 distinct
values...

###
>>> hash(2**32L)
1
>>> hash(1)
1
###

Better get back to your question:

> >>> hash((-1,))
> 1660579479
> >>> hash((-2,))
> 1660579479
> 
> aren't these values supposed to be unique?

All Python hash functions apparently try to avoid return -1 as a hash
value.  The string_hash() function above forces the hash id to -2 if the
function even considered return -1, and the same thing happens with
integers too!  Take a look:

###
static long
int_hash(PyIntObject *v)
{
        /* XXX If this is changed, you also need to change the way
           Python's long, float and complex types are hashed. */
        long x = v -> ob_ival;
        if (x == -1)
                x = -2;
        return x;
}
###

But why?  Because '-1' is being used internally by Python as a "sentinel"
value.  If we take a closer look at the dictobject.c file, we'll see that
a lot of Python's internal functions use '-1' to signal that some hideous
error has occured:

/******/
/*** somewhere in the deep, dark corners of dictobject.h... ***/
        if (!PyString_Check(key) ||
            (hash = ((PyStringObject *) key)->ob_shash) == -1)
#endif
        {
                hash = PyObject_Hash(key);
                if (hash == -1) {
                        PyErr_Clear();
                        return NULL;
                }
        }
/******/

Hmmm.... I wonder...

###
>>> class TestNegative1:
...     def __hash__(self): return -1
...
>>> instance = TestNegative1()
>>> hash(instance)
-2
###

Cute.

I get the feeling we've gone way too deeply into the C code.  We should
stop now.  *grin*

Hope this helps!