[Python-ideas] incremental hashing in __hash__

M.-A. Lemburg mal at egenix.com
Thu Jan 5 04:29:33 EST 2017


On 28.12.2016 04:13, jab at math.brown.edu wrote:
> Suppose you have implemented an immutable Position type to represent
> the state of a game played on an MxN board, where the board size can
> grow quite large.
> ...
> 
> According to https://docs.python.org/3/reference/datamodel.html#object.__hash__
> :
> 
> 
> """
> it is advised to mix together the hash values of the components of the
> object that also play a part in comparison of objects by packing them
> into a tuple and hashing the tuple. Example:
> 
> def __hash__(self):
>     return hash((self.name, self.nick, self.color))
> 
> """
> 
> 
> Applying this advice to the use cases above would require creating an
> arbitrarily large tuple in memory before passing it to hash(), which
> is then just thrown away. It would be preferable if there were a way
> to pass multiple values to hash() in a streaming fashion, such that
> the overall hash were computed incrementally, without building up a
> large object in memory first.

I think there's a misunderstanding here: the hash(obj) built-in
merely interfaces to the obj.__hash__() method (or the tp_hash slot
for C types) and returns whatever these methods give.

It doesn't implement any logic by itself.

If you would like to implement a more efficient hash algorithm
for your types, just go ahead and write them as .__hash__()
method or tp_hash slot method and you're done.

The example from the docs is just to showcase an example of
how such a hash function should work, i.e. to mix in all
relevant data attributes.

In your case, you'd probably use a simple for loop to calculate
the hash without creating tuples or any other temporary
structures.

Here's the hash implementation tuples use as an example

/* The addend 82520, was selected from the range(0, 1000000) for
   generating the greatest number of prime multipliers for tuples
   upto length eight:

     1082527, 1165049, 1082531, 1165057, 1247581, 1330103, 1082533,
     1330111, 1412633, 1165069, 1247599, 1495177, 1577699

   Tests have shown that it's not worth to cache the hash value, see
   issue #9685.
*/

static Py_hash_t
tuplehash(PyTupleObject *v)
{
    Py_uhash_t x;  /* Unsigned for defined overflow behavior. */
    Py_hash_t y;
    Py_ssize_t len = Py_SIZE(v);
    PyObject **p;
    Py_uhash_t mult = _PyHASH_MULTIPLIER;
    x = 0x345678UL;
    p = v->ob_item;
    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        x = (x ^ y) * mult;
        /* the cast might truncate len; that doesn't change hash
stability */
        mult += (Py_hash_t)(82520UL + len + len);
    }
    x += 97531UL;
    if (x == (Py_uhash_t)-1)
        x = -2;
    return x;
}

As you can see, there's some magic going on there to make
sure that the hash values behave well when used as "keys"
for the dictionary implementation (which is their main
purpose in Python).

You are free to create your own hash implementation.
The only characteristic to pay attention to is to have
objects which compare equal give the same hash value.
This is needed to be able to map such objects to the same
dictionary slots.

There should be no need to have a special hash function which
works on iterables. As long as those iterable objects define
their own .__hash__() method or tp_slot, the hash() built-in
(and Python's dict implementation) will use these and, if needed,
those methods can then use an approach to build hash values
using iterators on the object's internal data along similar
lines as the above tuple implementation.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Jan 05 2017)
>>> Python Projects, Coaching and Consulting ...  http://www.egenix.com/
>>> Python Database Interfaces ...           http://products.egenix.com/
>>> Plone/Zope Database Interfaces ...           http://zope.egenix.com/
________________________________________________________________________

::: We implement business ideas - efficiently in both time and costs :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/
                      http://www.malemburg.com/



More information about the Python-ideas mailing list