On Thu, Jan 5, 2017 at 4:00 AM Matt Gilson <matt@getpattern.com> wrote:
But, I think that the problem with adding `__hash__` to collections.abc.Iterable is that not all iterables are immutable -- And if they aren't immutable, then allowing them to be hashed is likely to be a pretty bad idea...

Good point.  A better option is to add a collections.abc.ImmutableIterable that derives from Iterable and provides __hash__.  Since tuple would inherit from it, it could choose to delegate up.  Then I think everyone is happy. 
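
To make that concrete, here is a rough sketch of what such an ABC might look like (ImmutableIterable is hypothetical, and hashing the ordered contents the way a tuple would is just one possible default):

from collections.abc import Iterable

class ImmutableIterable(Iterable):
    """Hypothetical ABC: an iterable whose contents never change."""
    __slots__ = ()

    def __hash__(self):
        # Default: hash the ordered contents, as the equivalent tuple would.
        return hash(tuple(self))

class Coords(ImmutableIterable):
    def __init__(self, x, y):
        self._data = (x, y)

    def __iter__(self):
        return iter(self._data)

assert hash(Coords(1, 2)) == hash(Coords(1, 2))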

I'm still having a hard time being convinced that this is very much of an optimization at all ...

If you start hashing tuples that are large enough for memory to be a concern, then the hashing itself is also going to take a really long time and probably be prohibitive anyway.  Just for kicks, I decided to throw together a simple script to time how much of a penalty you pay for hashing a tuple:

class F(object):
    # Keeps the list and builds a fresh tuple on every hash call.
    def __init__(self, arg):
        self.arg = arg

    def __hash__(self):
        return hash(tuple(self.arg))


class T(object):
    # Converts to a tuple once, up front, and hashes that.
    def __init__(self, arg):
        self.arg = tuple(arg)

    def __hash__(self):
        return hash(self.arg)


class C(object):
    # Converts to a tuple once and caches the hash after the first call.
    def __init__(self, arg):
        self.arg = tuple(arg)
        self._hash = None

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(self.arg)
        return self._hash

import timeit

print(timeit.timeit('hash(f)', 'from __main__ import F; f = F(list(range(500)))'))
print(timeit.timeit('hash(t)', 'from __main__ import T; t = T(list(range(500)))'))
print(timeit.timeit('hash(c)', 'from __main__ import C; c = C(list(range(500)))'))

results = []
for i in range(1, 11):
    n = i * 100  # sizes 100, 200, ..., 1000
    t1 = timeit.timeit('hash(f)', 'from __main__ import F; f = F(list(range(%d)))' % n)
    t2 = timeit.timeit('hash(t)', 'from __main__ import T; t = T(list(range(%d)))' % n)
    results.append(t1 / t2)
print(results)


F creates a new tuple each time and then hashes it.  T already holds a tuple, so we only pay the cost of hashing a tuple, not of constructing one, and C caches the hash value and re-uses it once it is known.  C is the winner by a factor of 10 or more (no surprise there).  But the really interesting thing is that the ratio of the timing results for hashing `F` vs. `T` stays roughly constant over the range of my test (up to 1000 elements), at approximately 1.3.  For most applications, that seems reasonable.  If you really need a speed-up, then I suppose you could recode the thing in Cython and see what happens, but I doubt that will often be necessary.  If you _do_ code it up in Cython, put it up on PyPI and see if people use it...


On Wed, Jan 4, 2017 at 5:04 PM, Neil Girdhar <mistersheik@gmail.com> wrote:
Couldn't you add __hash__ to collections.abc.Iterable?  Essentially, expose __hash__ there; then all iterables automatically have a default hash that hashes their ordered contents.

On Wednesday, January 4, 2017 at 7:37:26 PM UTC-5, Steven D'Aprano wrote:
On Wed, Jan 04, 2017 at 04:38:05PM -0500, j...@math.brown.edu wrote:
> Instead of the proposals like "hash.from_iterable()", would it make sense
> to allow tuple.__hash__() to accept any iterable, when called as a
> classmethod?

The public API for calculating the hash of something is to call the
hash() builtin function on some object, e.g. to call tuple.__hash__ you
write hash((a, b, c)). The __hash__ dunder method is implementation, not
interface, and normally shouldn't be called directly.

Unless I'm missing something obvious, your proposal would require the
caller to call the dunder methods directly:

class X:
    def __hash__(self):
        return tuple.__hash__(iter(self))

I consider that a poor interface design.

But even if we decide to make an exception in this case, tuple.__hash__
is currently an ordinary instance method. There's probably code that
relies on that fact and expects that:

    tuple.__hash__((a, b, c))

is currently the same as

    (a, b, c).__hash__()


(Starting with the hash() builtin itself, I expect, although that is
easy enough to fix if needed.) Your proposal will break backwards
compatibility, as it requires a change in semantics:

(1) (a, b, c).__hash__() must keep the current behaviour, which
means behaving like a bound instance method;

(2) But tuple.__hash__ will no longer be an unbound method (actually
a function object, but the difference is unimportant) and instead will
be something that behaves like a bound class method.

Here's an implementation which does this:

http://code.activestate.com/recipes/577030-dualmethod-descriptor/

so such a thing is possible. But it breaks backwards compatibility and
introduces something which I consider to be an unclean API (calling a
dunder method directly). Unless there's a *really* strong advantage to

    tuple.__hash__(...)

over

    hash.from_iterable(...)

(or equivalent), I would be against this change.
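
(For anyone who doesn't want to chase the link: the heart of that recipe
is a descriptor that binds to the instance when accessed through one and
to the class otherwise.  A rough sketch of the idea, not the recipe
verbatim, looks something like this:)

import functools

class dualmethod:
    """Acts like a classmethod on the class, an instance method on instances."""
    def __init__(self, func):
        self.func = func

    def __get__(self, obj, cls=None):
        # Bind to the instance if there is one, otherwise to the class.
        return functools.partial(self.func, obj if obj is not None else cls)

class Demo:
    @dualmethod
    def show(target):
        return target

print(Demo.show())    # prints the class Demo
print(Demo().show())  # prints a Demo instance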



> (And similarly with frozenset.__hash__(), so that the fast C
> implementation of that algorithm could be used, rather than the slow
> collections.Set._hash() implementation. Then the duplicated implementation
> in _collections_abc.py's Set._hash() could be removed completely,
> delegating to frozenset.__hash__() instead.)

This is a good point. Until now, I've been assuming that
hash.from_iterable should consider order. But frozenset shows us that
sometimes the hash should *not* consider order.

This hints that perhaps the hash.from_iterable() should have its own
optional dunder method. Or maybe we need two functions: an ordered
version and an unordered version.
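
A quick illustration of the distinction, using nothing but today's
built-ins:

    a, b = [1, 2, 3], [3, 2, 1]
    assert hash(tuple(a)) != hash(tuple(b))          # order-sensitive (almost certainly unequal)
    assert hash(frozenset(a)) == hash(frozenset(b))  # order-insensitive: equal sets, equal hashes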

Hmmm... just tossing out a wild idea here... let's get rid of the dunder
method part of your suggestion, and add new public class methods to
tuple and frozenset:

    tuple.hash_from_iter(iterable)
    frozenset.hash_from_iter(iterable)
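
Neither method exists today, of course; a rough pure-Python stand-in for
what they might compute could look like the following (a real C
implementation could presumably avoid materialising the intermediate
tuple/frozenset, which is the motivation upthread):

def tuple_hash_from_iter(iterable):
    # Ordered hash: the same value that hash(tuple(iterable)) gives today.
    return hash(tuple(iterable))

def frozenset_hash_from_iter(iterable):
    # Unordered hash: the same value that hash(frozenset(iterable)) gives today.
    return hash(frozenset(iterable))

assert tuple_hash_from_iter([1, 2, 3]) == hash((1, 2, 3))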


That gets rid of all the objections about backwards compatibility, since
these are new methods. They're not dunder names, so there are no
objections to their being used as part of the public API.

A possible objection is the question, is this functionality *actually*
important enough to bother?

Another possible objection: are these methods part of the sequence/set
API? If not, do they really belong on the tuple/frozenset? Maybe they
belong elsewhere?



> Would this API more cleanly communicate the algorithm being used and the
> implementation,

No. If you want to communicate the algorithm being used, write some
documentation.

Seriously, the public API doesn't communicate the algorithm used for the
implementation. How can it? We can keep the same interface and change
the implementation, or change the interface and keep the implementation.
The two are (mostly) independent.



> while making a smaller increase in API surface area
> compared to introducing a new function?

It's difficult to quantify "API surface area". On the one hand, we have
the addition of one or two new functions or methods. Contrast with:

    * introducing a new kind of method into the built-ins (one which
      behaves like a classmethod when called from the class, and like
      an instance method when called from an instance);

    * changing tuple.__hash__ from an ordinary method to one of the
      above special methods;

    * and likewise for frozenset.__hash__;

    * changing __hash__ from "only used as implementation, not as
      interface" to "sometimes used as interface".


To me, adding one or two new methods/functions is the smaller, or at
least less disruptive, change.



--
Steve

_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/



--

Matt Gilson // SOFTWARE ENGINEER

E: matt@getpattern.com // P: 603.892.7736

