Suppose you have implemented an immutable Position type to represent the state of a game played on an MxN board, where the board size can grow quite large. Or suppose you have implemented an immutable, ordered collection type. For example, the collections-extended package provides a frozensetlist[1]. One of my own packages provides a frozen, ordered bidirectional mapping type.[2]

These types should be hashable so that they can be inserted into sets and mappings. The order-sensitivity of the contents prevents them from using the built-in collections.abc.Set._hash() helper in their __hash__ implementations, to keep from unnecessarily causing hash collisions for objects that compare unequal due only to having a different ordering of the same set of contained items.

According to https://docs.python.org/3/reference/datamodel.html#object.__hash__ :

"""
it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects by packing them into a tuple and hashing the tuple. Example:

def __hash__(self):
    return hash((self.name, self.nick, self.color))

"""

Applying this advice to the use cases above would require creating an arbitrarily large tuple in memory before passing it to hash(), which is then just thrown away. It would be preferable if there were a way to pass multiple values to hash() in a streaming fashion, such that the overall hash were computed incrementally, without building up a large object in memory first.

Should there be better support for this use case? Perhaps hash() could support an alternative signature, allowing it to accept a stream of values whose combined hash would be computed incrementally in *constant* space and linear time, e.g. "hash(items=iter(self))".

In the meantime, what is the best way to incrementally compute a good hash value for such objects using built-in Python routines? (As a library author, it would be preferable to use a routine with explicit support for computing a hash incrementally, rather than having to worry about how to correctly combine results from multiple calls to hash(contained_item) in library code. (Simply XORing such results together would not be order-sensitive, and so wouldn't work.) Using a routine with explicit support for incremental hashing would allow libraries to focus on doing one thing well.[3,4,5])

I know that hashlib provides algorithms that support incremental hashing, but those produce digests of at least 128 bits. Since hash() throws out anything beyond sys.hash_info.hash_bits (e.g. 64) bits, anything in hashlib seems like overkill. Am I right in thinking that's the wrong tool for the job? On the other hand, would binascii.crc32 be suitable, at least for 32-bit systems? (And is there some 64-bit incremental hash algorithm available for 64-bit systems? It seems Python has no support for crc64 built in.) For example:

import binascii, struct

class FrozenOrderedCollection:

    def __hash__(self):
        # Computed lazily on first use, then cached. (A single-underscore
        # attribute name keeps the hasattr() check and the assignment
        # consistent; a '__hashval' name would be mangled.)
        if hasattr(self, '_hashval'):
            return self._hashval
        # Prime with a class-specific value (see point (ii) below).
        hv = binascii.crc32(b'FrozenOrderedCollection')
        for i in self:
            hv = binascii.crc32(struct.pack('@l', hash(i)), hv)
        hv &= 0xffffffff
        self._hashval = hv
        return hv

Note that this example illustrates two other common requirements of these use cases:

(i) lazily computing the hash value on first use, and then caching it for future use

(ii) priming the overall hash value with some class-specific initial value, so that if an instance of a different type of collection, comprising the same items but comparing unequal, were to compute its hash value out of the same constituent items, we make sure our hash value differs.

(On that note, should the documentation in https://docs.python.org/3/reference/datamodel.html#object.__hash__ quoted above be updated to add this advice? The current advice to "return hash((self.name, self.nick, self.color))" would cause a hash collision with a tuple of the same values, even though the tuple should presumably compare unequal with this object.)

To summarize these questions:

1. Should hash() add support for incremental hashing?

2. In the meantime, what is the best way to compute a hash of a combination of many values incrementally (in constant space and linear time), using only what's available in the standard library? Ideally there is some routine available that uses exactly sys.hash_info.hash_bits bits, and that does the combining of incremental results for you.

3. Should the https://docs.python.org/3/reference/datamodel.html#object.__hash__ documentation be updated to include suitable advice for these use cases, in particular, that the overall hash value should be computed lazily, incrementally, and should be primed with a class-unique value?

Thanks in advance for a helpful discussion, and best wishes.

Josh

References:

[1] http://collections-extended.lenzm.net/api.html#collections_extended.frozensetlist
[2] https://bidict.readthedocs.io/en/dev/api.html#bidict.frozenorderedbidict
[3] http://stackoverflow.com/questions/2909106/python-whats-a-correct-and-good-way-to-implement-hash#comment28193015_19073010
[4] http://stackoverflow.com/a/2909572/161642
[5] http://stackoverflow.com/a/27952689/161642
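To make the XOR remark above concrete, here is a minimal illustration of why a plain XOR of item hashes is order-insensitive (the xor_hash name is just illustrative, not anything from the message):

from functools import reduce

def xor_hash(items):
    # XOR is commutative and associative, so any permutation of the
    # same items produces the same combined value.
    return reduce(lambda acc, item: acc ^ hash(item), items, 0)

assert xor_hash([1, 2, 3]) == xor_hash([3, 2, 1])  # collides across orderings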
You could always try to make a Python version of the C tuple hashing function[1] (requires the total # of elements) or PyPy's[2] (seems like it would allow true incremental hashing).

API idea:

hasher = IncrementalHasher()
hasher.add(one_item_to_hash)  # updates hasher.hash property with result
# repeat
return hasher.hash

[1]: https://hg.python.org/cpython/file/dcced3bd22fe/Objects/tupleobject.c#l331
[2]: https://bitbucket.org/pypy/pypy/src/d8febc18447e1f785a384d52413a345d7b3db423/rpython/rlib/objectmodel.py#objectmodel.py-562

--
Ryan (ライアン)
Yoko Shimomura > ryo (supercell/EGOIST) > Hiroyuki Sawano >> everyone else
http://kirbyfan64.github.io/
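A minimal pure-Python sketch of that API might look like the following. It is illustrative only: the class is hypothetical, and instead of porting either C algorithm it just folds each item in via hash((running, item)), so its distribution properties are unverified.

class IncrementalHasher:
    """Hypothetical object providing the add()/.hash API sketched above."""

    def __init__(self):
        self.hash = hash(())  # fixed, arbitrary starting value

    def add(self, item):
        # Order-sensitive, constant-space update of the running hash.
        self.hash = hash((self.hash, item))

Usage would then match the snippet above: construct one, call add() once per item, and read the .hash attribute at the end.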
I actually have been poking around that code already. I also found https://github.com/vperron/python-superfasthash/blob/master/superfasthash.py in case of interest. But it still seems like library authors with this use case should keep their library code free of implementation details like this, and instead use a higher-level API provided by Python. Thanks, Josh On Tue, Dec 27, 2016 at 10:28 PM, Ryan Gonzalez <rymg19@gmail.com> wrote:
You could always try to make a Python version of the C tuple hashing function[1] (requires the total # of elements) or PyPy's[2] (seems like it would allow true incremental hashing). API idea:
hasher = IncrementalHasher() hasher.add(one_item_to_hash) # updates hasher.hash property with result # repeat return hasher.hash
[1]: https://hg.python.org/cpython/file/dcced3bd22fe/ Objects/tupleobject.c#l331 [2]: https://bitbucket.org/pypy/pypy/src/d8febc18447e1f785a384d52413a34 5d7b3db423/rpython/rlib/objectmodel.py#objectmodel.py-562
-- Ryan (ライアン) Yoko Shimomura > ryo (supercell/EGOIST) > Hiroyuki Sawano >> everyone else http://kirbyfan64.github.io/
On 12/27/16 10:13 PM, jab@math.brown.edu wrote:
Applying this advice to the use cases above would require creating an arbitrarily large tuple in memory before passing it to hash(), which is then just thrown away. It would be preferable if there were a way to pass multiple values to hash() in a streaming fashion, such that the overall hash were computed incrementally, without building up a large object in memory first.
Should there be better support for this use case? Perhaps hash() could support an alternative signature, allowing it to accept a stream of values whose combined hash would be computed incrementally in *constant* space and linear time, e.g. "hash(items=iter(self))".

You can write a simple function to use hash iteratively to hash the entire stream in constant space and linear time:
def hash_stream(them):
    val = 0
    for it in them:
        val = hash((val, it))
    return val

Although this creates N 2-tuples, they come and go, so the memory use won't grow. Adjust the code as needed to achieve canonicalization before iterating.

Or maybe I am misunderstanding the requirements?

--Ned.
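To connect this back to the FrozenOrderedCollection example from the first message, Ned's recipe might be dropped in like so (an illustrative combination, not code from either message; the class name doubles as the class-specific seed discussed earlier):

class FrozenOrderedCollection:

    def __hash__(self):
        # Seed with a class-specific value, then fold the items in one
        # at a time; no large temporary tuple is built.
        val = hash('FrozenOrderedCollection')
        for item in self:
            val = hash((val, item))
        return val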
On Wed, Dec 28, 2016 at 11:48 AM, Ned Batchelder <ned@nedbatchelder.com> wrote:
You can write a simple function to use hash iteratively to hash the entire stream in constant space and linear time:
def hash_stream(them):
    val = 0
    for it in them:
        val = hash((val, it))
    return val
Although this creates N 2-tuples, they come and go, so the memory use won't grow. Adjust the code as needed to achieve canonicalization before iterating.
Or maybe I am misunderstanding the requirements?
This is better than solutions like http://stackoverflow.com/a/27952689/161642 in the sense that it's a little higher level (no bit shifting or magic numbers).

But it's not clear that it's any better in the sense that you're still rolling your own incremental hash algorithm out of a lower-level primitive that doesn't document support for this, and therefore taking responsibility yourself for how well it distributes values into buckets. Are you confident this results in good hash performance? Is this better than a solution built on top of a hash function with an explicit API for calculating a hash incrementally, such as the crc32 example I included? (And again, this would ideally be a sys.hash_info.hash_bits-bit algorithm.)

Don't we still probably want either:

1) Python to provide some such hash_stream() function as a built-in, or failing that,

2) the https://docs.python.org/3/reference/datamodel.html#object.__hash__ documentation to bless this as the recommended solution to this problem, thereby providing assurance of its performance?

If that makes sense, I'd be happy to file an issue, and include the start of a patch providing either 1 or 2.

Thanks very much for the helpful response.
On 12/28/16 12:27 PM, jab@math.brown.edu wrote:
On Wed, Dec 28, 2016 at 11:48 AM, Ned Batchelder <ned@nedbatchelder.com> wrote:
You can write a simple function to use hash iteratively to hash the entire stream in constant space and linear time:
def hash_stream(them):
    val = 0
    for it in them:
        val = hash((val, it))
    return val
Although this creates N 2-tuples, they come and go, so the memory use won't grow. Adjust the code as needed to achieve canonicalization before iterating.
Or maybe I am misunderstanding the requirements?
This is better than solutions like http://stackoverflow.com/a/27952689/161642 in the sense that it's a little higher level (no bit shifting or magic numbers).
But it's not clear that it's any better in the sense that you're still rolling your own incremental hash algorithm out of a lower-level primitive that doesn't document support for this, and therefore taking responsibility yourself for how well it distributes values into buckets.
Are you confident this results in good hash performance? Is this better than a solution built on top of a hash function with an explicit API for calculating a hash incrementally, such as the crc32 example I included? (And again, this would ideally be a sys.hash_info.hash_bits -bit algorithm.)
I don't have the theoretical background to defend this function. But it seems to me that if we believe that hash((int, thing)) distributes well, then how could this function not distribute well? --Ned.
On 12/27/2016 07:13 PM, jab@math.brown.edu wrote:
According to the docs [6]:
""" it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects by packing them into a tuple and hashing the tuple. Example:
def __hash__(self):
    return hash((self.name, self.nick, self.color))
"""
Applying this advice to the use cases above would require creating an arbitrarily large tuple in memory before passing it to hash(), which is then just thrown away. It would be preferable if there were a way to pass multiple values to hash() in a streaming fashion, such that the overall hash were computed incrementally, without building up a large object in memory first.
Part of the reason for creating __hash__ like above is that:

- it's simple
- it's reliable

However, it's not the only way to have a hash algorithm that works; in fact, the beginning of the sentence you quoted says:
The only required property is that objects which compare equal have the same hash value;
In other words, objects that do not compare equal can also have the same hash value (although too much of that will reduce the efficiency of Python's containers).
(ii) priming the overall hash value with some class-specific initial value, so that if an instance of a different type of collection, which comprised the same items but which compared unequal, were to compute its hash value out of the same constituent items, we make sure our hash value differs.
This is unnecessary: hashes are compared first as a way to weed out impossible matches, but when the hashes are the same an actual __eq__ test is still done [7].

--
~Ethan~

[6] https://docs.python.org/3/reference/datamodel.html#object.__hash__
[7] some test code to prove above points:

--- 8< ------------------------------------------------------------
from unittest import main, TestCase


class Eggs(object):
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.value == other.value
        return NotImplemented


class Spam(object):
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.value == other.value
        return NotImplemented


e10 = Eggs(1)
e20 = Eggs(2)
e11 = Eggs(1)
e21 = Eggs(2)

s10 = Spam(1)
s20 = Spam(2)
s11 = Spam(1)
s21 = Spam(2)

bag = {}
bag[e10] = 1
bag[s10] = 2
bag[e20] = 3
bag[s20] = 4


class TestEqualityAndHashing(TestCase):

    def test_equal(self):
        # same class, same value --> equal
        self.assertEqual(e10, e11)
        self.assertEqual(e20, e21)
        self.assertEqual(s10, s11)
        self.assertEqual(s20, s21)

    def test_not_equal(self):
        # different class, same value --> not equal
        self.assertEqual(e10.value, s10.value)
        self.assertNotEqual(e10, s10)
        self.assertEqual(e20.value, s20.value)
        self.assertNotEqual(e20, s20)

    def test_same_hash(self):
        # same class, same value, same hash
        self.assertEqual(hash(e10), hash(e11))
        self.assertEqual(hash(e20), hash(e21))
        self.assertEqual(hash(s10), hash(s11))
        self.assertEqual(hash(s20), hash(s21))
        # different class, same value, same hash
        self.assertEqual(hash(e10), hash(s10))
        self.assertEqual(hash(e11), hash(s11))
        self.assertEqual(hash(e20), hash(s20))
        self.assertEqual(hash(e21), hash(s21))

    def test_as_key(self):
        # different objects from different classes with same hash should still be distinct
        self.assertEqual(len(bag), 4)
        self.assertEqual(bag[e10], 1)
        self.assertEqual(bag[s10], 2)
        self.assertEqual(bag[e20], 3)
        self.assertEqual(bag[s20], 4)
        # different objects from same classes with same hash should not be distinct
        self.assertEqual(bag[e11], 1)
        self.assertEqual(bag[s11], 2)
        self.assertEqual(bag[e21], 3)
        self.assertEqual(bag[s21], 4)


main()
--- 8< ------------------------------------------------------------
On Wed, Dec 28, 2016 at 12:10 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
In other words, objects that do not compare equal can also have the same hash value (although too much of that will reduce the efficiency of Python's containers).
Yes, I realize that unequal objects can return the same hash value with only performance, and not correctness, suffering. It's the performance I'm concerned about. That's what I meant by "...to keep from unnecessarily causing hash collisions..." in my original message, but sorry this wasn't clearer. We should be able to do this in a way that doesn't increase hash collisions unnecessarily.
On Wed, Dec 28, 2016 at 12:44:55PM -0500, jab@math.brown.edu wrote:
On Wed, Dec 28, 2016 at 12:10 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
In other words, objects that do not compare equal can also have the same hash value (although too much of that will reduce the efficiency of Python's containers).
Yes, I realize that unequal objects can return the same hash value with only performance, and not correctness, suffering. It's the performance I'm concerned about. That's what I meant by "...to keep from unnecessarily causing hash collisions..." in my original message, but sorry this wasn't clearer. We should be able to do this in a way that doesn't increase hash collisions unnecessarily.
With respect Josh, I feel that this thread is based on premature optimization. It seems to me that you're *assuming* that anything less than some theoretically ideal O(1) space O(N) time hash function is clearly and obviously unsuitable.

Of course I might be completely wrong. Perhaps you have implemented your own __hash__ methods as suggested by the docs, as well as Ned's version, and profiled your code and discovered that __hash__ is a significant bottleneck. In which case, I'll apologise for doubting you, but in my defence I'll say that the language you have used in this thread so far gives no hint that you've actually profiled your code and found the bottleneck.

As I see it, this thread includes a few questions:

(1) What is a good way to generate a hash one item at a time?

I think Ned's answer is "good enough", subject to evidence to the contrary. If somebody cares to spend the time to analyse it, that's excellent, but we're volunteers and it's the holiday period and most people have probably got better things to do. But we shouldn't let the perfect be the enemy of the good.

But for what it's worth, I've done an *extremely* quick and dirty test to see whether the incremental hash function gives a good spread of values, by comparing it to the standard hash() function.

py> import statistics
py> def incrhash(iterable):
...     h = hash(())
...     for x in iterable:
...         h = hash((h, x))
...     return h
...
py>
py> data1 = []
py> data2 = []
py> for i in range(1000):
...     it = range(i, i+100)
...     data1.append(hash(tuple(it)))
...     data2.append(incrhash(it))
...
py> # Are there any collisions?
... len(set(data1)), len(set(data2))
(1000, 1000)
py> # compare spread of values
... statistics.stdev(data1), statistics.stdev(data2)
(1231914201.0980585, 1227850884.443638)
py> max(data1)-min(data1), max(data2)-min(data2)
(4287398438, 4287569008)

Neither the built-in hash() nor the incremental hash gives any collisions over this (admittedly small) data set, and both have very similar spreads of values as measured by either the standard deviation or the statistical range. The means are quite different:

py> statistics.mean(data1), statistics.mean(data2)
(-8577110.944, 2854438.568)

but I don't think that matters. So that's good enough for me.

(2) Should Ned's incremental hash, or some alternative with better properties, go into the standard library?

I'm not convinced that your examples are common enough that the stdlib should be burdened with supporting it. On the other hand, I don't think it is an especially *large* burden, so perhaps it could be justified. Count me as sitting on the fence on this one.

Perhaps a reasonable compromise would be to include it as a recipe in the docs.

(3) If it does go in the stdlib, where should it go?

I'm suspicious of functions that change their behaviour depending on how they are called, so I'm not keen on your suggestion of adding a second API to the hash built-in:

hash(obj)           # return hash of obj
hash(iterable=obj)  # return incrementally calculated hash of obj

That feels wrong to me. I'd rather add a generator to the itertools module:

itertools.iterhash(iterable)  # yield incremental hashes

or, copying the API of itertools.chain, add a method to hash:

hash.from_iterable(iterable)  # return hash calculated incrementally

--
Steve
On Thu, Dec 29, 2016 at 7:20 PM, Steven D'Aprano <steve@pearwood.info> wrote:
I'd rather add a generator to the itertools module:
itertools.iterhash(iterable) # yield incremental hashes
or, copying the API of itertools.chain, add a method to hash:
hash.from_iterable(iterable) # return hash calculated incrementally
The itertools module is mainly designed to be consumed lazily. The hash has to be calculated eagerly, so it's not really a good fit for itertools. The only real advantage of this "hash from iterable" over hash(tuple(it)) is avoiding the intermediate tuple, so I'd want to see evidence that that's actually significant. ChrisA
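One quick way to gather that evidence would be a micro-benchmark along these lines (a rough sketch; the data size and repeat count are arbitrary choices, not figures from the thread):

import timeit

setup = "data = list(range(10000))"
build_tuple = "hash(tuple(data))"
incremental = (
    "h = 0\n"
    "for x in data:\n"
    "    h = hash((h, x))\n"
)

# Compare building a throwaway tuple against folding items in one at a time.
print(timeit.timeit(build_tuple, setup, number=1000))
print(timeit.timeit(incremental, setup, number=1000))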
On 29 December 2016 at 18:35, Chris Angelico <rosuav@gmail.com> wrote:
On Thu, Dec 29, 2016 at 7:20 PM, Steven D'Aprano <steve@pearwood.info> wrote:
I'd rather add a generator to the itertools module:
itertools.iterhash(iterable) # yield incremental hashes
or, copying the API of itertools.chain, add a method to hash:
hash.from_iterable(iterable) # return hash calculated incrementally
The itertools module is mainly designed to be consumed lazily. The hash has to be calculated eagerly, so it's not really a good fit for itertools.
I understood the "iterhash" suggestion to be akin to itertools.accumulate:

>>> for value, tally in enumerate(accumulate(range(10))):
...     print(value, tally)
...
0 0
1 1
2 3
3 6
4 10
5 15
6 21
7 28
8 36
9 45

However, I think including Ned's recipe (or something like it) in https://docs.python.org/3/reference/datamodel.html#object.__hash__ as a tool for avoiding large temporary tuple allocations may be a better way to start off, as:

1. It's applicable to all currently released versions of Python, not just to 3.7+

2. It provides more scope for people to experiment with their own variants of the idea before committing to a *particular* version somewhere in the standard library

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
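In that accumulate-style reading, an iterhash generator could be sketched in a few lines (illustrative only: the name comes from Steven's suggestion, and the combining step is just Ned's recipe, not a proposed stdlib implementation):

def iterhash(iterable):
    # Yield the running hash after each element is folded in,
    # analogous to itertools.accumulate.
    h = hash(())
    for item in iterable:
        h = hash((h, item))
        yield h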
Updating the docs sounds like the more important change for now, given 3.7+.

But before the docs make an official recommendation for that recipe, were the analyses that Steve and I did sufficient to confirm that its hash distribution and performance is good enough at scale, or is more rigorous analysis necessary? I've been trying to find a reasonably detailed and up-to-date reference on Python hash() result requirements and analysis methodology, with instructions on how to confirm if they're met, but am still looking. Would find that an interesting read if it's out there. But I'd take just an authoritative thumbs up here too. Just haven't heard one yet.

And regarding any built-in support that might get added, I just want to make sure Ryan Gonzalez's proposal (the first reply on this thread) didn't get buried:

hasher = IncrementalHasher()
hasher.add(one_item_to_hash)  # updates hasher.hash property with result
# repeat
return hasher.hash

I think this is the only proposal so far that actually adds an explicit API for performing an incremental update. (i.e. the other "hash_stream(iterable)"-style proposals are all-or-nothing.) This would bring Python's built-in hash() algorithm's support up to parity with the other algorithms in the standard library (hashlib, crc32). Maybe that's valuable?
On Fri, Dec 30, 2016 at 9:29 AM, <jab@math.brown.edu> wrote:
Updating the docs sounds like the more important change for now, given 3.7+. But before the docs make an official recommendation for that recipe, were the analyses that Steve and I did sufficient to confirm that its hash distribution and performance is good enough at scale, or is more rigorous analysis necessary?
I've been trying to find a reasonably detailed and up-to-date reference on Python hash() result requirements and analysis methodology, with instructions on how to confirm if they're met, but am still looking. Would find that an interesting read if it's out there. But I'd take just an authoritative thumbs up here too. Just haven't heard one yet.
I'm not an expert on hash table implementation subtleties, but I can quote Objects/dictobject.c: "Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't ..." https://github.com/python/cpython/blob/d0a2f68a5f972e4474f5ecce182752f6f7a22... (Basically IIUC the idea is that Python dicts use a relatively sophisticated probe sequence to resolve collisions which allows the use of relatively "poor" hashes. There are interesting trade-offs here...) -n -- Nathaniel J. Smith -- https://vorpus.org
On Fri, Dec 30, 2016 at 10:30 PM, Nathaniel Smith <njs@pobox.com> wrote:
...
"Most hash schemes depend on having a "good" hash function, in the sense of
simulating randomness. Python doesn't ..."
https://github.com/python/cpython/blob/d0a2f68a/Objects/dictobject.c#L133 ...

Thanks for that link, fascinating to read the rest of that comment!!

Btw, the part you quoted seemed like more a defense for what followed, i.e. the choice to make hash(some_int) == some_int. I'm not sure how much the part you quoted applies generally. e.g. The https://docs.python.org/3/reference/datamodel.html#object.__hash__ docs don't say, "Don't worry about your __hash__ implementation, dict's collision resolution strategy is so good it probably doesn't matter anyway." But they also don't include any discussion of the tradeoffs you mentioned that might be worth considering.

What should you do when there are arbitrarily many "components of the object that play a part in comparison of objects"? The "hash((self._name, self._nick, self._color))" code snippet is the only example the docs give. This leaves people who have use cases like mine wondering whether it's still advised to scale this up to the arbitrarily many items that instances of their class can contain. If not, then what is advised?

- Passing a tuple of fewer items to a single hash() call, e.g. hash(tuple(islice(self, CUTOFF)))?
- Ned's recipe of pairwise-accumulating hash() results over all the items?
- Or only pairwise-accumulating up to a certain cutoff?
- Stephen J. Turnbull's idea to use fewer accumulation steps and larger-than-2-tuples?
- Passing all the items into some other cheap, built-in hash algorithm that actually has an incremental update API (crc32?)?

Still hoping someone can give some authoritative advice, and hope it's still reasonable to be asking. If not, I'll cut it out.

On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico <rosuav@gmail.com> wrote:
How likely is it that you'll have this form of collision, rather than some other? Remember, collisions *will* happen, so don't try to eliminate them all; just try to minimize the chances of *too many* collisions. So if you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3) and (3,1,2), then sure, you need to care about order; but otherwise, one possible cause of a collision is no worse than any other. Keep your algorithm simple, and don't sweat the details that you aren't sure matter.
In the context of writing a collections library, and not an application, it should work well for a diversity of workloads that your users could throw at it. In that context, it's hard to know what to do with advice like this. "Heck, just hash the first three items and call it a day!"
On 31 December 2016 at 15:13, <jab@math.brown.edu> wrote:
On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico <rosuav@gmail.com> wrote:
How likely is it that you'll have this form of collision, rather than some other? Remember, collisions *will* happen, so don't try to eliminate them all; just try to minimize the chances of *too many* collisions. So if you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3) and (3,1,2), then sure, you need to care about order; but otherwise, one possible cause of a collision is no worse than any other. Keep your algorithm simple, and don't sweat the details that you aren't sure matter.
In the context of writing a collections library, and not an application, it should work well for a diversity of workloads that your users could throw at it. In that context, it's hard to know what to do with advice like this. "Heck, just hash the first three items and call it a day!"
Yes, this is essentially what we're suggesting you do - start with a "good enough" hash that may have scalability problems (e.g. due to memory copying) or mathematical distribution problems (e.g. due to a naive mathematical combination of values), and then improve it over time based on real world usage of the library.

Alternatively, you could take the existing tuple hashing algorithm and reimplement that in pure Python: https://hg.python.org/cpython/file/tip/Objects/tupleobject.c#l336

Based on microbenchmarks, you could then find the size breakpoint where it makes sense to switch between "hash(tuple(self))" (with memory copying, but a more optimised implementation of the algorithm) and a pure Python "tuple_hash(self)". In either case, caching the result on the instance would minimise the computational cost over the lifetime of the object.

Cheers,
Nick.

P.S. Having realised that the tuple hash *algorithm* can be re-used without actually creating a tuple, I'm more amenable to the idea of exposing a "hash.from_iterable" callable that produces the same result as "hash(tuple(iterable))" without actually creating the intermediate tuple.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
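For reference, a rough pure-Python transcription of that tuplehash algorithm might look like the following. Treat it as a sketch assuming a 64-bit build: the constants are the ones in the linked tupleobject.c, the masking stands in for C unsigned overflow, and the result should be verified against hash(tuple(...)) rather than taken on faith.

def tuple_hash(items):
    # Pure-Python version of CPython's FNV-style tuplehash; needs
    # len(items) up front because the multiplier update depends on
    # the number of remaining elements.
    mask = (1 << 64) - 1          # Py_uhash_t width on a 64-bit build
    n = len(items)
    x = 0x345678
    mult = 1000003                # _PyHASH_MULTIPLIER
    for item in items:
        n -= 1
        y = hash(item) & mask
        x = ((x ^ y) * mult) & mask
        mult = (mult + 82520 + n + n) & mask
    x = (x + 97531) & mask
    if x >= 1 << 63:              # reinterpret as a signed Py_hash_t
        x -= 1 << 64
    return -2 if x == -1 else x

On a 64-bit CPython of this era, tuple_hash(list_of_items) should agree with hash(tuple(list_of_items)); spot-checking a handful of tuples is the easy way to confirm the transcription.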
On Sat, Dec 31, 2016 at 7:17 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 31 December 2016 at 15:13, <jab@math.brown.edu> wrote:
On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico <rosuav@gmail.com> wrote:
How likely is it that you'll have this form of collision, rather than some other? Remember, collisions *will* happen, so don't try to eliminate them all; just try to minimize the chances of *too many* collisions. So if you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3) and (3,1,2), then sure, you need to care about order; but otherwise, one possible cause of a collision is no worse than any other. Keep your algorithm simple, and don't sweat the details that you aren't sure matter.
In the context of writing a collections library, and not an application, it should work well for a diversity of workloads that your users could throw at it. In that context, it's hard to know what to do with advice like this. "Heck, just hash the first three items and call it a day!"
Yes, this is essentially what we're suggesting you do - start with a "good enough" hash that may have scalability problems (e.g. due to memory copying) or mathematical distribution problems (e.g. due to a naive mathematical combination of values), and then improve it over time based on real world usage of the library.
Alternatively, you could take the existing tuple hashing algorithm and reimplement that in pure Python: https://hg.python.org/cpython/file/tip/Objects/tupleobject.c#l336
Based on microbenchmarks, you could then find the size breakpoint where it makes sense to switch between "hash(tuple(self))" (with memory copying, but a more optimised implementation of the algorithm) and a pure Python "tuple_hash(self)". In either case, caching the result on the instance would minimise the computational cost over the lifetime of the object.
Cheers, Nick.
P.S. Having realised that the tuple hash *algorithm* can be re-used without actually creating a tuple, I'm more amenable to the idea of exposing a "hash.from_iterable" callable that produces the same result as "hash(tuple(iterable))" without actually creating the intermediate tuple.
Nice! I just realized, similar to tupleobject.c's "tuplehash" routine, I think the frozenset_hash algorithm (implemented in setobject.c) can't be reused without actually creating a frozenset either. As mentioned, a set hashing algorithm is exposed as collections.Set._hash() in _collections_abc.py, which can be passed an iterable, but that routine is implemented in Python. So here again it seems like you have to choose between either creating an ephemeral copy of your data so you can use the fast C routine, or streaming your data to a slower Python implementation. At least in this case the Python implementation is built-in though.

Given my current shortage of information, for now I'm thinking of handling this problem in my library by exposing a parameter that users can tune if needed. See bidict/_frozen.py in https://github.com/jab/bidict/commit/485bf98#diff-215302d205b9f3650d58ee0337f77297, and check out the _HASH_NITEMS_MAX attribute.

I have to run for now, but thanks again everyone, and happy new year!
On Sat, Dec 31, 2016 at 5:39 PM, <jab@math.brown.edu> wrote:
a set hashing algorithm is exposed as collections.Set._hash() in _collections_abc.py, which can be passed an iterable
(By which I meant to say, "which can be passed a set-like iterable such as a Keys- or ItemsView of an existing mapping, so that the hash can be computed from the existing data rather than copying it all into a new frozenset." Should have been more precise.)
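For illustration (an example of that usage, not code from the message), passing a view to the ABC helper looks like this; the equality check relies on _collections_abc.py's claim that Set._hash() matches the frozenset algorithm:

from collections.abc import Set

d = {'a': 1, 'b': 2, 'c': 3}
# dict views are set-like (KeysView subclasses Set), so the ABC helper
# accepts the view directly -- no frozenset copy of the keys is made.
h = Set._hash(d.keys())
print(h == hash(frozenset(d.keys())))  # expected True on CPython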
Instead of the proposals like "hash.from_iterable()", would it make sense to allow tuple.__hash__() to accept any iterable, when called as a classmethod? (And similarly with frozenset.__hash__(), so that the fast C implementation of that algorithm could be used, rather than the slow collections.Set._hash() implementation. Then the duplicated implementation in _collections_abc.py's Set._hash() could be removed completely, delegating to frozenset.__hash__() instead.) Would this API more cleanly communicate the algorithm being used and the implementation, while making a smaller increase in API surface area compared to introducing a new function?
jab@math.brown.edu writes:
Instead of the proposals like "hash.from_iterable()", would it make sense to allow tuple.__hash__() to accept any iterable, when called as a classmethod? [...]
Would this API more cleanly communicate the algorithm being used and the implementation, while making a smaller increase in API surface area compared to introducing a new function?
I don't understand what you're proposing. There are three problems with the API. First, the "obvious" meaning of "tuple.__hash__(iterable)" is "hash(tuple(iterable))", and that's the obvious spelling. Second, there's no reason for a dunder method to be polymorphic; this is hardly discoverable. Third, dunders are normally not called from user code (including class implementations, although they're less ugly there), suggesting that there should be a helper for this.

I'm unclear on how the function is supposed to be implemented, since presumably tuple.__hash__ knows about the tuple data structure's implementation (memory layout), but it can't know about an arbitrary iterable.

Steve
On Wed, Jan 04, 2017 at 04:38:05PM -0500, jab@math.brown.edu wrote:
Instead of the proposals like "hash.from_iterable()", would it make sense to allow tuple.__hash__() to accept any iterable, when called as a classmethod?
The public API for calculating the hash of something is to call the hash() builtin function on some object, e.g. to call tuple.__hash__ you write hash((a, b, c)). The __hash__ dunder method is implementation, not interface, and normally shouldn't be called directly.

Unless I'm missing something obvious, your proposal would require the caller to call the dunder methods directly:

class X:
    def __hash__(self):
        return tuple.__hash__(iter(self))

I consider that a poor interface design.

But even if we decide to make an exception in this case, tuple.__hash__ is currently an ordinary instance method right now. There's probably code that relies on that fact and expects that:

tuple.__hash__((a, b, c))

is currently the same as

(a, b, c).__hash__()

(Starting with the hash() builtin itself, I expect, although that is easy enough to fix if needed.) Your proposal will break backwards compatibility, as it requires a change in semantics:

(1) (a, b, c).__hash__() must keep the current behaviour, which means behaving like a bound instance method;

(2) But tuple.__hash__ will no longer return an unbound method (actually a function object, but the difference is unimportant) and instead will return something that behaves like a bound class method.

Here's an implementation which does this:

http://code.activestate.com/recipes/577030-dualmethod-descriptor/

so such a thing is possible. But it breaks backwards compatibility and introduces something which I consider to be an unclean API (calling a dunder method directly). Unless there's a *really* strong advantage to

tuple.__hash__(...)

over

hash.from_iterable(...)

(or equivalent), I would be against this change.
(And similarly with frozenset.__hash__(), so that the fast C implementation of that algorithm could be used, rather than the slow collections.Set._hash() implementation. Then the duplicated implementation in _collections_abc.py's Set._hash() could be removed completely, delegating to frozenset.__hash__() instead.)
This is a good point. Until now, I've been assuming that hash.from_iterable should consider order. But frozenset shows us that sometimes the hash should *not* consider order.

This hints that perhaps hash.from_iterable() should have its own optional dunder method. Or maybe we need two functions: an ordered version and an unordered version.

Hmmm... just tossing out a wild idea here... let's get rid of the dunder method part of your suggestion, and add new public class methods to tuple and frozenset:

tuple.hash_from_iter(iterable)
frozenset.hash_from_iter(iterable)

That gets rid of all the objections about backwards compatibility, since these are new methods. They're not dunder names, so there are no objections to being used as part of the public API.

A possible objection is the question, is this functionality *actually* important enough to bother?

Another possible objection: are these methods part of the sequence/set API? If not, do they really belong on tuple/frozenset? Maybe they belong elsewhere?
Would this API more cleanly communicate the algorithm being used and the implementation,
No. If you want to communicate the algorithm being used, write some documentation. Seriously, the public API doesn't communicate the algorithm used for the implementation. How can it? We can keep the same interface and change the implementation, or change the interface and keep the implementation. The two are (mostly) independent.
while making a smaller increase in API surface area compared to introducing a new function?
It's difficult to quantify "API surface area". On the one hand, we have the addition of one or two new functions or methods. Contrast with:

* introducing a new kind of method into the built-ins (one which behaves like a classmethod when called from the class, and like an instance method when called from an instance);

* changing tuple.__hash__ from an ordinary method to one of the above special methods;

* and likewise for frozenset.__hash__;

* changing __hash__ from "only used as implementation, not as interface" to "sometimes used as interface".

To me, adding one or two new methods/functions is the smaller, or at least less disruptive, change.

--
Steve
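For what it's worth, user code under that spelling might end up reading like this (entirely hypothetical; no such classmethods exist in any Python release):

class FrozenOrderedCollection:

    def __hash__(self):
        # Hypothetical API from the suggestion above; today this would
        # have to be spelled hash(tuple(self)).
        return tuple.hash_from_iter(self)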
Couldn't you add __hash__ to collections.abc.Iterable? Essentially, expose __hash__ there; then all iterables automatically have a default hash that hashes their ordered contents.
But, I think that the problem with adding `__hash__` to collections.abc.Iterable is that not all iterables are immutable -- and if they aren't immutable, then allowing them to be hashed is likely to be a pretty bad idea...

I'm still having a hard time being convinced that this is very much of an optimization at all ... If you start hashing tuples that are large enough that memory is a concern, then that's going to also take a *really* long time and probably be prohibitive anyway.

Just for kicks, I decided to throw together a simple script to time how much penalty you pay for hashing a tuple:

class F(object):
    def __init__(self, arg):
        self.arg = arg

    def __hash__(self):
        return hash(tuple(self.arg))


class T(object):
    def __init__(self, arg):
        self.arg = tuple(arg)

    def __hash__(self):
        return hash(self.arg)


class C(object):
    def __init__(self, arg):
        self.arg = tuple(arg)
        self._hash = None

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(tuple(self.arg))
        return self._hash


import timeit

print(timeit.timeit('hash(f)', 'from __main__ import F; f = F(list(range(500)))'))
print(timeit.timeit('hash(t)', 'from __main__ import T; t = T(list(range(500)))'))
print(timeit.timeit('hash(c)', 'from __main__ import C; c = C(list(range(500)))'))

results = []
for i in range(1, 11):
    n = i * 100
    t1 = timeit.timeit('hash(f)', 'from __main__ import F; f = F(list(range(%d)))' % n)
    t2 = timeit.timeit('hash(t)', 'from __main__ import T; t = T(list(range(%d)))' % n)
    results.append(t1/t2)
print(results)

F is going to create a new tuple each time and then hash it. T already has a tuple, so we'll only pay the cost of hashing a tuple, not the cost of constructing a tuple, and C caches the hash value and re-uses it once it is known. C is the winner by a factor of 10 or more (no surprise there). But the really interesting thing is that the ratio of the timing results from hashing `F` vs. `T` is relatively constant in the range of my test (up to 1000 elements) and that ratio's value is approximately 1.3. For most applications, that seems reasonable. If you really need a speed-up, then I suppose you could recode the thing in Cython and see what happens, but I doubt that will be frequently necessary. If you _do_ code it up in Cython, put it up on PyPI and see if people use it...

On Wed, Jan 4, 2017 at 5:04 PM, Neil Girdhar <mistersheik@gmail.com> wrote:
Couldn't you add __hash__ to collections.abc.Iterable ? Essentially, expose __hash__ there; then all iterables automatically have a default hash that hashes their ordered contents.
--
Matt Gilson // SOFTWARE ENGINEER
E: matt@getpattern.com // P: 603.892.7736
On Thu, Jan 5, 2017 at 4:00 AM Matt Gilson <matt@getpattern.com> wrote:
But, I think that the problem with adding `__hash__` to collections.abc.Iterable is that not all iterables are immutable -- And if they aren't immutable, then allowing them to be hashed is likely to be a pretty bad idea...
Good point. A better option is to add collections.abc.ImmutableIterable that derives from Iterable and provides __hash__. Since tuple inherits from it, it can choose to delegate up. Then I think everyone is happy.
I'm still having a hard time being convinced that this is very much of an optimization at all ...
If you start hashing tuples that are large enough that memory is a concern, then that's going to also take a *really* long time and probably be prohibitive anyway. Just for kicks, I decided to throw together a simple script to time how much penalty you pay for hashing a tuple:
class F(object):
    def __init__(self, arg):
        self.arg = arg

    def __hash__(self):
        return hash(tuple(self.arg))


class T(object):
    def __init__(self, arg):
        self.arg = tuple(arg)

    def __hash__(self):
        return hash(self.arg)


class C(object):
    def __init__(self, arg):
        self.arg = tuple(arg)
        self._hash = None

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(tuple(self.arg))
        return self._hash


import timeit

print(timeit.timeit('hash(f)', 'from __main__ import F; f = F(list(range(500)))'))
print(timeit.timeit('hash(t)', 'from __main__ import T; t = T(list(range(500)))'))
print(timeit.timeit('hash(c)', 'from __main__ import C; c = C(list(range(500)))'))

results = []
for i in range(1, 11):
    n = i * 100  # collection sizes from 100 up to 1000 elements
    t1 = timeit.timeit('hash(f)', 'from __main__ import F; f = F(list(range(%d)))' % n)
    t2 = timeit.timeit('hash(t)', 'from __main__ import T; t = T(list(range(%d)))' % n)
    results.append(t1 / t2)
print(results)
F is going to create a new tuple each time and then hash it. T already has a tuple, so we'll only pay the cost of hashing a tuple, not the cost of constructing a tuple, and C caches the hash value and re-uses it once it is known. C is the winner by a factor of 10 or more (no surprise there). But the really interesting thing is that the ratio of the timing results from hashing `F` vs. `T` is relatively constant in the range of my test (up to 1000 elements) and that ratio's value is approximately 1.3. For most applications, that seems reasonable. If you really need a speed-up, then I suppose you could recode the thing in Cython and see what happens, but I doubt that will be frequently necessary. If you _do_ code it up in Cython, put it up on PyPI and see if people use it...
On Wed, Jan 4, 2017 at 5:04 PM, Neil Girdhar <mistersheik@gmail.com> wrote:
Couldn't you add __hash__ to collections.abc.Iterable ? Essentially, expose __hash__ there; then all iterables automatically have a default hash that hashes their ordered contents.
On Thu, Jan 5, 2017, at 04:00, Matt Gilson wrote:
But, I think that the problem with adding `__hash__` to collections.abc.Iterable is that not all iterables are immutable -- And if they aren't immutable, then allowing them to be hashed is likely to be a pretty bad idea...
Why? This should never cause an interpreter-crashing bug, because user-defined types can have bad hash methods anyway. And without that, the reason for not applying the "consenting adults" principle and allowing people to add mutable objects to a *short-lived* dict without intending to change them while the dict is in use has never been clear to me. I think mutable types not having a hash method was a mistake in the first place.
On 5 January 2017 at 00:31, Steven D'Aprano <steve@pearwood.info> wrote:
This is a good point. Until now, I've been assuming that hash.from_iterable should consider order. But frozenset shows us that sometimes the hash should *not* consider order.
This hints that perhaps the hash.from_iterable() should have its own optional dunder method. Or maybe we need two functions: an ordered version and an unordered version.
Hmmm... just tossing out a wild idea here... let's get rid of the dunder method part of your suggestion, and add new public class methods to tuple and frozenset:
tuple.hash_from_iter(iterable)
frozenset.hash_from_iter(iterable)
That gets rid of all the objections about backwards compatibility, since these are new methods. They're not dunder names, so there are no objections to being used as part of the public API.
A possible objection is the question, is this functionality *actually* important enough to bother?
Another possible objection: are these methods part of the sequence/set API? If not, do they really belong on the tuple/frozenset? Maybe they belong elsewhere?
At this point I'd be inclined to say that a 3rd party hashing_utils module would be a reasonable place to thrash out these design decisions before committing to a permanent design in the stdlib. The popularity of such a module would also give a level of indication as to whether this is an important optimisation in practice. Paul
On Fri, Dec 30, 2016 at 5:24 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
I understood the "iterhash" suggestion to be akin to itertools.accumulate:
>>> for value, tally in enumerate(accumulate(range(10))): print(value, ...
It reminds me of hmac[1]/hashlib[2], with the API : h.update(...) before a .digest(). It is slightly lower level than a `from_iterable`, but would be a bit more flexible. If the API were kept similar things would be easier to remember. -- M [1]: https://docs.python.org/3/library/hmac.html#hmac.HMAC.update [2]: https://docs.python.org/3/library/hashlib-blake2.html#module-hashlib
On Fri, Dec 30, 2016 at 3:54 PM, Christian Heimes <christian@python.org> wrote:
Hi,
I'm the author of PEP 456 (SipHash24) and one of the maintainers of the hashlib module.
Before we come up with a new API or recipe, I would like to understand the problem first. Why does the initial op consider hash(large_tuple) a performance issue? If you have an object with lots of members that affect both __hash__ and __eq__, then __hash__ is really least of your concern. The hash has to be computed just once and then will stay the same over the life time of the object. Once computed the hash can be cached.
On the other hand __eq__ is at least called once for every successful hash lookup. In the worst case it is called n-1 times for a dict of size n for a match *and* a hashmap miss. Every __eq__ call has to compare between 1 and m member attributes. For a dict with 1,000 elements with 1,000 members each, that's just 1,000 hash computations with roughly 8 kB memory allocation, but almost a million comparisons in the worst case.

A hasher object adds further overhead, e.g. object allocation, creation of a bound method for each call, etc. It's also less CPU cache friendly than the linear data structure of a tuple.
Thanks for joining the discussion, great to have your input.

In the use cases I described, the objects' members are ordered. So in the unlikely event that two objects hash to the same value but are unequal, the __eq__ call should be cheap, because they probably differ in length or on their first member, and can skip further comparison. After a successful hash lookup of an object that's already in a set or dict, a successful identity check avoids an expensive equality check. Perhaps I misunderstood?

Here is some example code:

    from functools import reduce

    class FrozenOrderedCollection:
        ...

        def __eq__(self, other):
            if not isinstance(other, FrozenOrderedCollection):
                return NotImplemented
            if len(self) != len(other):
                return False
            return all(i == j for (i, j) in zip(self, other))

        def __hash__(self):
            if hasattr(self, '_hashval'):  # computed lazily, then cached
                return self._hashval
            hash_initial = hash(self.__class__)
            self._hashval = reduce(lambda h, i: hash((h, i)), self, hash_initial)
            return self._hashval

Is it the case that internally, the code is mostly there to compute the hash of such an object in incremental fashion, and it's just a matter of figuring out the right API to expose it?

If creating an arbitrarily large tuple, passing that to hash(), and then throwing it away really is the best practice, then perhaps that should be explicitly documented, since I think many would find that counterintuitive?

@Stephen J. Turnbull, thank you for your input -- still digesting, but very interesting.

Thanks again to everyone for the helpful discussion.
On 12/30/2016 03:36 PM, jab@math.brown.edu wrote:
In the use cases I described, the objects' members are ordered. So in the unlikely event that two objects hash to the same value but are unequal, the __eq__ call should be cheap, because they probably differ in length or on their first member, and can skip further comparison. After a successful hash lookup of an object that's already in a set or dict, a successful identity check avoids an expensive equality check. Perhaps I misunderstood?
If you are relying on an identity check for equality then no two FrozenOrderedCollection instances can be equal. Was that your intention? If it was, then just hash the instance's id() and you're done. If that wasn't your intention then, while there can certainly be a few quick checks to rule out equality (such as length), if those match, the expensive full equality check will have to happen.

-- ~Ethan~
On Fri, Dec 30, 2016 at 7:20 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
On 12/30/2016 03:36 PM, jab@math.brown.edu wrote:
In the use cases I described, the objects' members are ordered. So in the
unlikely event that two objects hash to the same value but are unequal, the __eq__ call should be cheap, because they probably differ in length or on their first member, and can skip further comparison. After a successful hash lookup of an object that's already in a set or dict, a successful identity check avoids an expensive equality check. Perhaps I misunderstood?
If you are relying on an identity check for equality then no two FrozenOrderedCollection instances can be equal. Was that your intention? If it was, then just hash the instance's id() and you're done.
No, I was talking about the identity check done by a set or dict when doing a lookup to check if the object in a hash bucket is identical to the object being looked up. In that case, there is no need for the set or dict to even call __eq__. Right? The FrozenOrderedCollection.__eq__ implementation I sketched out was as intended -- no identity check there.

If that wasn't your intention then, while there can certainly be a few quick checks to rule out equality (such as length), if those match, the expensive full equality check will have to happen.
I think we're misunderstanding each other, but at the risk of saying the same thing again: Because it's an ordered collection, the equality check for two unequal instances with the same hash value will very likely be able to bail after comparing lengths or the first items. With a good hash function, the odds of two unequal instances both hashing to the same value and having their first N items equal should be vanishingly small, no?
On 12/30/2016 04:31 PM, jab@math.brown.edu wrote:
On Fri, Dec 30, 2016 at 7:20 PM, Ethan Furman wrote:
If you are relying on an identity check for equality then no two FrozenOrderedCollection instances can be equal. Was that your intention? If it was, then just hash the instance's id() and you're done.
No, I was talking about the identity check done by a set or dict when doing a lookup to check if the object in a hash bucket is identical to the object being looked up. In that case, there is no need for the set or dict to even call __eq__. Right?
No. It is possible to have two keys be equal but different -- an easy example is 1 and 1.0; they both hash the same, equal the same, but are not identical. dict has to check equality when two different objects hash the same but have non-matching identities. -- ~Ethan~
On Fri, Dec 30, 2016 at 8:04 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
No. It is possible to have two keys be equal but different -- an easy example is 1 and 1.0; they both hash the same, equal the same, but are not identical. dict has to check equality when two different objects hash the same but have non-matching identities.
Python 3.6.0 (default, Dec 24 2016, 00:01:50)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> d = {1: 'int', 1.0: 'float'}
>>> d
{1: 'float'}

IPython 5.1.0 -- An enhanced Interactive Python.

In [1]: class Foo:
   ...:     def __eq__(self, other):
   ...:         return True
   ...:     def __init__(self, val):
   ...:         self.val = val
   ...:     def __repr__(self):
   ...:         return '<Foo %r>' % self.val
   ...:     def __hash__(self):
   ...:         return 42
   ...:

In [2]: f1 = Foo(1)

In [3]: f2 = Foo(2)

In [4]: x = {f1: 1, f2: 2}

In [5]: x
Out[5]: {<Foo 1>: 2}

I'm having trouble showing that two equal but nonidentical objects can both be in the same dict.
On Fri, Dec 30, 2016 at 8:10 PM, <jab@math.brown.edu> wrote:
On Fri, Dec 30, 2016 at 8:04 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
No. It is possible to have two keys be equal but different -- an easy example is 1 and 1.0; they both hash the same, equal the same, but are not identical. dict has to check equality when two different objects hash the same but have non-matching identities.
... I'm having trouble showing that two equal but nonidentical objects can both be in the same dict.
(Because they can't, though) your point stands that Python had to call __eq__ in these cases. But with instances of large, immutable, ordered collections, an application could either: 1. accept that it might create a duplicate, equivalent instance of an existing instance with sufficient infrequency that it's okay taking the performance hit, or 2. try to avoid creating duplicate instances in the first place, using the existing, equivalent instance it created as a singleton. Then a set or dict lookup could use the identity check, and avoid the expensive __eq__ call. I think it's much more important to focus on what happens with unequal instances here, since there are usually many more of them. And with them, the performance hit of the __eq__ calls definitely does not necessarily dominate that of __hash__. Right?
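A minimal sketch of that second option, interning instances through a keyed cache so that set and dict lookups can succeed on identity alone; the class, the cache key, and the WeakValueDictionary choice are all assumptions made for illustration:

    import weakref

    class FrozenOrderedCollection:
        _interned = weakref.WeakValueDictionary()

        def __init__(self, items=()):
            self._items = tuple(items)

        @classmethod
        def intern(cls, items):
            # Reuse an existing instance built from the same items, so a later
            # set/dict lookup finds the identical object and skips __eq__.
            key = tuple(items)
            obj = cls._interned.get(key)
            if obj is None:
                obj = cls(key)
                cls._interned[key] = obj
            return obj

        def __iter__(self):
            return iter(self._items)

        def __hash__(self):
            return hash(self._items)

        def __eq__(self, other):
            if not isinstance(other, FrozenOrderedCollection):
                return NotImplemented
            return self._items == other._items

    x = FrozenOrderedCollection.intern(range(3))
    y = FrozenOrderedCollection.intern(range(3))
    print(x is y)  # True: a dict keyed on x never needs to call x.__eq__(y)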
On 12/30/2016 06:12 PM, jab@math.brown.edu wrote:
... your point stands that Python had to call __eq__ in these cases.
But with instances of large, immutable, ordered collections, an application could either:
1. accept that it might create a duplicate, equivalent instance of an existing instance with sufficient infrequency that it's okay taking the performance hit, or
2. try to avoid creating duplicate instances in the first place, using the existing, equivalent instance it created as a singleton. Then a set or dict lookup could use the identity check, and avoid the expensive __eq__ call.
I think it's much more important to focus on what happens with unequal instances here, since there are usually many more of them. And with them, the performance hit of the __eq__ calls definitely does not necessarily dominate that of __hash__. Right?
I don't think so. As someone else said, a hash can be calculated once and then cached, but __eq__ has to be called every time. Depending on the clustering of your data that could be quick... or not. -- ~Ethan~
On Fri, Dec 30, 2016 at 9:21 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
I don't think so. As someone else said, a hash can be calculated once and then cached, but __eq__ has to be called every time. Depending on the clustering of your data that could be quick... or not.
__eq__ only has to be called when a hash bucket is non-empty. In that case, it may be O(n) in pathological cases, but it could also be O(1) every time. On the other hand, __hash__ has to be called on every lookup, is O(n) on the first call, and ideally O(1) after. I'd expect that __eq__ may often not dominate, and avoiding an unnecessary large tuple allocation on the first __hash__ call could be helpful. If a __hash__ implementation can avoid creating an arbitrarily large tuple unnecessarily and still perform well, why not save the space? If the resulting hash distribution is worse, that's another story, but it seems worth documenting one way or the other, since the current docs give memory-conscious Python programmers pause for this use case.
On 12/30/2016 06:47 PM, jab@math.brown.edu wrote:
__eq__ only has to be called when a hash bucket is non-empty. In that case, it may be O(n) in pathological cases, but it could also be O(1) every time. On the other hand, __hash__ has to be called on every lookup, is O(n) on the first call, and ideally O(1) after. I'd expect that __eq__ may often not dominate, and avoiding an unnecessary large tuple allocation on the first __hash__ call could be helpful. If a __hash__ implementation can avoid creating an arbitrarily large tuple unnecessarily and still perform well, why not save the space? If the resulting hash distribution is worse, that's another story, but it seems worth documenting one way or the other, since the current docs give memory-conscious Python programmers pause for this use case.
A quick list of what we have agreed on:

- __hash__ is called once for every dict/set lookup
- __hash__ is calculated once because we are caching the value
- __eq__ is being called an unknown number of times, but can be quite expensive in terms of time
- items with the same hash are probably cheap in terms of __eq__ (?)

So maybe this will work?

    def __hash__(self):
        return hash(self.name) * hash(self.nick) * hash(self.color)

In other words, don't create a new tuple, just use the ones you already have and toss in a couple maths operations. (and have somebody vet that ;)

-- ~Ethan~
On Fri, Dec 30, 2016 at 10:08 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
So maybe this will work?
def __hash__(self): return hash(self.name) * hash(self.nick) * hash(self.color)
In other words, don't create a new tuple, just use the ones you already have and toss in a couple maths operations. (and have somebody vet that ;)
See the "Simply XORing such results together would not be order-sensitive, and so wouldn't work" from my original post. (Like XOR, multiplication is also commutative.) e.g. Since FrozenOrderedCollection([1, 2]) != FrozenOrderedCollection([2, 1]), we should try to avoid making their hashes equal, or else we increase collisions unnecessarily. More generally, I think the trick is to get an even, chaotic distribution into sys.hash_info.hash_bits out of this hash algorithm, and I think simply multiplying hash values together like this wouldn't do that.
On Sat, Dec 31, 2016 at 2:24 PM, <jab@math.brown.edu> wrote:
See the "Simply XORing such results together would not be order-sensitive, and so wouldn't work" from my original post. (Like XOR, multiplication is also commutative.)
e.g. Since FrozenOrderedCollection([1, 2]) != FrozenOrderedCollection([2, 1]), we should try to avoid making their hashes equal, or else we increase collisions unnecessarily.
How likely is it that you'll have this form of collision, rather than some other? Remember, collisions *will* happen, so don't try to eliminate them all; just try to minimize the chances of *too many* collisions. So if you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3) and (3,1,2), then sure, you need to care about order; but otherwise, one possible cause of a collision is no worse than any other. Keep your algorithm simple, and don't sweat the details that you aren't sure matter. ChrisA
On Fri, Dec 30, 2016 at 07:08:27PM -0800, Ethan Furman wrote:
So maybe this will work?
def __hash__(self): return hash(self.name) * hash(self.nick) * hash(self.color)
I don't like the multiplications. If any of the three hashes return zero, the overall hash will be zero. I think you need better mixing than that. Look at tuple:

py> hash((0, 1, 2))
-421559672
py> hash(0) * hash(1) * hash(2)
0

-- Steve
On Fri, Dec 30, 2016 at 09:47:54PM -0500, jab@math.brown.edu wrote:
__eq__ only has to be called when a hash bucket is non-empty. In that case, it may be O(n) in pathological cases, but it could also be O(1) every time. On the other hand, __hash__ has to be called on every lookup, is O(n) on the first call, and ideally O(1) after. I'd expect that __eq__ may often not dominate, and avoiding an unnecessary large tuple allocation on the first __hash__ call could be helpful.
Sorry to be the broken record repeating himself, but this sounds *exactly* like premature optimization here. My suggestion is that you are overthinking things, or at least worrying about issues before you've got any evidence that they are going to be real issues. I expect that the amount of time and effort you've spent in analysing the theoretical problems here and writing to this list is more than the time it would have taken for you to write the simplest __hash__ method that could work (using the advice in the docs to make a tuple). You could have implemented that in five minutes, and be running code by now.

Of course I understand that performance issues may not be visible when you have 100 or 100 thousand items but may be horrific when you have 100 million. I get that. But you're aware of the possibility, so you can write a performance test that generates 100 million objects and tests performance. *If* you find an actual problem, then you can look into changing your __hash__ method. You could come back here and talk about actual performance issues instead of hypothetical issues, or you could hire an expert to tune your hash function. (And if you do pay for an expert to solve this, please consider giving back to the community.)

Remember that the specific details of __hash__ should be an internal implementation detail for your class. You shouldn't fear changing the hash algorithm as often as you need to, including in bug fix releases. You don't have to wait for the perfect hash function, just a "good enough for now" one to get started.

I'm not trying to be dismissive of your concerns. They may be real issues that you have to solve. I'm just saying, you should check your facts first rather than solve hypothetical problems. I have seen, and even written, far too much pessimized code in the name of "this must be better, it stands to reason" to give much credence to theoretical arguments about performance.

-- Steve
On 2016-12-30 17:04, Ethan Furman wrote:
On 12/30/2016 04:31 PM, jab@math.brown.edu wrote:
On Fri, Dec 30, 2016 at 7:20 PM, Ethan Furman wrote:
If you are relying on an identity check for equality then no two FrozenOrderedCollection instances can be equal. Was that your intention? If it was, then just hash the instance's id() and you're done.

No, I was talking about the identity check done by a set or dict when doing a lookup to check if the object in a hash bucket is identical to the object being looked up. In that case, there is no need for the set or dict to even call __eq__. Right?

No. It is possible to have two keys be equal but different -- an easy example is 1 and 1.0; they both hash the same, equal the same, but are not identical. dict has to check equality when two different objects hash the same but have non-matching identities.
I think that is the same as what he said. The point is that if they *are* the same object, you *don't* need to check equality. -- Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown
Thanks for the thoughtful discussion, it's been very interesting. Hash algorithms seem particularly sensitive and tricky to get right, with a great deal of research going into choices of constants, etc. and lots of gotchas. So it seemed worth asking about. If I had to bet on whether repeatedly accumulating pairwise hash() results would maintain the same desired properties that hash(tuple(items)) guarantees, I'd want to get confirmation from someone with expertise in this first, hence my starting this thread.

But as you showed, it's certainly possible to do some exploration in the meantime. Prompted by your helpful comparison, I just put together https://gist.github.com/jab/fd78b3acd25b3530e0e21f5aaee3c674 to further compare hash_tuple vs. hash_incremental. Based on that, hash_incremental seems to have a comparable distribution to hash_tuple. I'm not sure if the methodology there is sound, as I'm new to analysis like this. So I'd still welcome confirmation from someone with actual expertise in Python's internal hash algorithms. But so far this sure seems good enough for the use cases I described.

Given sufficiently good distribution, I'd expect there to be unanimous agreement that an immutable collection, which could contain arbitrarily many items, should strongly prefer hash_incremental(self) over hash(tuple(self)), for the same reason we use generator comprehensions instead of list comprehensions when appropriate. Please correct me if I'm wrong.

+1 for the "hash.from_iterable" API you proposed, if some additional support for this is added to Python. Otherwise +1 for including Ned's recipe in the docs. Again, happy to submit a patch for either of these if it would be helpful.

And to be clear, I really appreciate the time that contributors have put into this thread, and into Python in general. Thoughtful responses are always appreciated, and never expected. I'm just interested in learning and in helping improve Python when I might have an opportunity. My Python open source work has been done on a voluntary basis too, and I haven't even gotten to use Python for paid/closed source work in several years, alas.

Thanks again,
Josh

On Thu, Dec 29, 2016 at 3:20 AM, Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Dec 28, 2016 at 12:44:55PM -0500, jab@math.brown.edu wrote:
On Wed, Dec 28, 2016 at 12:10 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
In other words, objects that do not compare equal can also have the same hash value (although too much of that will reduce the efficiency of Python's containers).
Yes, I realize that unequal objects can return the same hash value with only performance, and not correctness, suffering. It's the performance I'm concerned about. That's what I meant by "...to keep from unnecessarily causing hash collisions..." in my original message, but sorry this wasn't clearer. We should be able to do this in a way that doesn't increase hash collisions unnecessarily.
With respect Josh, I feel that this thread is based on premature optimization. It seems to me that you're *assuming* that anything less than some theoretically ideal O(1) space O(N) time hash function is clearly and obviously unsuitable.
Of course I might be completely wrong. Perhaps you have implemented your own __hash__ methods as suggested by the docs, as well as Ned's version, and profiled your code and discovered that __hash__ is a significant bottleneck. In which case, I'll apologise for doubting you, but in my defence I'll say that the language you have used in this thread so far gives no hint that you've actually profiled your code and found the bottleneck.
As I see it, this thread includes a few questions:
(1) What is a good way to generate a hash one item at a time?
I think Ned's answer is "good enough", subject to evidence to the contrary. If somebody cares to spend the time to analyse it, that's excellent, but we're volunteers and its the holiday period and most people have probably got better things to do. But we shouldn't let the perfect be the enemy of the good.
But for what it's worth, I've done an *extremely* quick and dirty test to see whether the incremental hash function gives a good spread of values, by comparing it to the standard hash() function.
py> import statistics
py> def incrhash(iterable):
...     h = hash(())
...     for x in iterable:
...         h = hash((h, x))
...     return h
...
py>
py> data1 = []
py> data2 = []
py> for i in range(1000):
...     it = range(i, i+100)
...     data1.append(hash(tuple(it)))
...     data2.append(incrhash(it))
...
py> # Are there any collisions?
... len(set(data1)), len(set(data2))
(1000, 1000)
py> # compare spread of values
... statistics.stdev(data1), statistics.stdev(data2)
(1231914201.0980585, 1227850884.443638)
py> max(data1)-min(data1), max(data2)-min(data2)
(4287398438, 4287569008)
Neither the built-in hash() nor the incremental hash gives any collisions over this (admittedly small) data set, and both have very similar spreads of values as measured by either the standard deviation or the statistical range. The means are quite different:
py> statistics.mean(data1), statistics.mean(data2)
(-8577110.944, 2854438.568)
but I don't think that matters. So that's good enough for me.
(2) Should Ned's incremental hash, or some alternative with better properties, go into the standard library?
I'm not convinced that your examples are common enough that the stdlib should be burdened with supporting it. On the other hand, I don't think it is an especially *large* burden, so perhaps it could be justified. Count me as sitting on the fence on this one.
Perhaps a reasonable compromise would be to include it as a recipe in the docs.
(3) If it does go in the stdlib, where should it go?
I'm suspicious of functions that change their behaviour depending on how they are called, so I'm not keen on your suggestion of adding a second API to the hash built-in:
hash(obj) # return hash of obj
hash(iterable=obj) # return incrementally calculated hash of obj
That feels wrong to me. I'd rather add a generator to the itertools module:
itertools.iterhash(iterable) # yield incremental hashes
or, copying the API of itertools.chain, add a method to hash:
hash.from_iterable(iterable) # return hash calculated incrementally
-- Steve
jab@math.brown.edu writes:
But as you showed, it's certainly possible to do some exploration in the meantime. Prompted by your helpful comparison, I just put together https://gist.github.com/jab/fd78b3acd25b3530e0e21f5aaee3c674 to further compare hash_tuple vs. hash_incremental.
It's been a while :-) since I read Knuth[1] (and that was when Knuth was authoritative on this subject 8^O), but neither methodology is particularly helpful here. The ideal is effectively a uniform distribution on a circle, which has no mean. Therefore standard deviation is also uninteresting, since its definition uses the mean. The right test is something like a χ² test against a uniform distribution. The scatter plots (of hash against test data) simply show the same thing, without the precision of a statistical test. (Note: do not deprecate the computer visualization + human eye + human brain system. It is the best known machine for detecting significant patterns and anomalies, though YMMV.) But it's not very good at detecting high-dimensional patterns.

However, a uniform distribution on random data is nowhere near good enough for a hash function. It actually needs to be "chaotic" in the sense that it also spreads "nearby" data out, where "nearby" here would probably mean that, viewed as 4000-bit strings, less than m% of the bits (for small m) differ, as in real data you usually expect a certain amount of structure (continuity in time, clustering around a mean, line noise in a communication channel) so that you'd be likely to get lots of "clusters" of very similar data. You don't want them pounding on a small subset of buckets.
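For what it's worth, a minimal sketch of that kind of bucket-uniformity check on "nearby" data, using the chained-tuple recipe discussed in this thread; the bucket count and the rough cutoff are illustrative assumptions rather than a rigorous test:

    def incrhash(iterable):
        h = hash(())
        for x in iterable:
            h = hash((h, x))
        return h

    def chi_squared(hashes, n_buckets=256):
        # Bin hashes into buckets and compare counts to the uniform expectation.
        counts = [0] * n_buckets
        for h in hashes:
            counts[h % n_buckets] += 1
        expected = len(hashes) / n_buckets
        return sum((c - expected) ** 2 / expected for c in counts)

    # "Nearby" data: many tuples differing only slightly from one another.
    hashes = [incrhash((i, i + 1, i + 2)) for i in range(100000)]
    stat = chi_squared(hashes)
    # With 255 degrees of freedom the statistic should sit near 255 for a
    # uniform spread; values far above roughly 300 would be suspicious.
    # (scipy.stats.chisquare would turn this into a proper p-value.)
    print(stat)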
I'm not sure if the methodology there is sound, as I'm new to analysis like this.
Even decades later, starting with Knuth[1] can't hurt. :-)
Given sufficiently good distribution, I'd expect there to be unanimous agreement that an immutable collection, which could contain arbitrarily many items, should strongly prefer hash_incremental(self) over hash(tuple(self)), for the same reason we use generator comprehensions instead of list comprehensions when appropriate. Please correct me if I'm wrong.
I think that's a misleading analogy. Just stick to the concrete case: for very large collections where we already have the data, duplicating the data is very expensive. Furthermore, since the purpose of hashing is making equality comparisons cheap, this is likely to happen in a loop. On the other hand, if there are going to be a *lot* of "large" collections being stored, then they're actually not all that large compared to the system, and you might not actually care that much about the cost of the ephemeral tuples, because the real cost is in an n^2 set of comparisons. From the point of view of NumPy, this is an "are you kidding?" argument because large datasets are its stock in trade, but for core Python, this may be sufficiently esoteric that it should be delegated to a third-party package.

On balance, the arguments that Steven d'Aprano advanced for having a statistics module in the stdlib vs. importing pandas apply here. In particular, I think there are a huge number of options for an iterative hash. For example, Ned chained 2-tuples. But perhaps for efficient time in bounded space you want to use bounded but larger tuples. I don't know -- and that's the point. If we have a TOOWTDI API like hash.from_iterable then smart people (and people with time on their hands to do exhaustive experiments!) can tune that over time, as has been done with the sort API. Another option is yielding partials, as Steven says:
itertools.iterhash(iterable) # yield incremental hashes
That's a very interesting idea, though I suspect it rarely would make a big performance improvement.

I'm not sure I like the "hash.from_iterable" name for this API. The problem is that if it's not a concrete collection[3], then you throw away the data. If the hash algorithm happens to suck for certain data, you'll get a lot of collisions, and conclude that your data is much more concentrated than it actually is. I find it hard to imagine a use case where you have large data where you only care about whether two data points are different (cf. equality comparisons for floats). You want to know how they're different. So I think I would prefer to name it "hash.from_collection" or similar. Of course whether the implementation actually raises on a generator or other non-concrete iterable is a fine point.

Footnotes:
[1] Of course I mean The Art of Computer Programming, Ch. 3.
[2] Including the newly ordered dicts, maybe? Somebody tweet @dabeaz! What evil can he do with this?
[3] Yeah, I know, "re-iterables". But we don't have a definition, let alone an API for identifying, those Things.
On 29 December 2016 at 19:20, Steven D'Aprano <steve@pearwood.info> wrote:
With respect Josh, I feel that this thread is based on premature optimization. It seems to me that you're *assuming* that anything less than some theoretically ideal O(1) space O(N) time hash function is clearly and obviously unsuitable.
Of course I might be completely wrong. Perhaps you have implemented your own __hash__ methods as suggested by the docs, as well as Ned's version, and profiled your code and discovered that __hash__ is a significant bottleneck. In which case, I'll apologise for doubting you, but in my defence I'll say that the language you have used in this thread so far gives no hint that you've actually profiled your code and found the bottleneck.
In Josh's defence, the initial use case he put forward is exactly the kind of scenario where it's worthwhile optimising ahead of time. Quite often a poorly implemented hash function doesn't manifest as a problem until you scale up massively - and a developer may not have the capability to scale up to a suitable level in-house, resulting in performance issues at customer sites.

I had one particular case (fortunately discovered before going to customers) where a field was included in the equality check, but wasn't part of the hash. Unfortunately, the lack of this one field resulted in objects only being allocated to a few buckets (in a Java HashMap), resulting in every access having to walk a potentially very long chain doing equality comparisons - O(N) behaviour from an amortised O(1) data structure.

Unit tests - no worries. Small-scale tests - everything looked fine. Once we started our load tests though everything slowed to a crawl. 100% CPU, throughput at a standstill ... it didn't look good. Adding that one field to the hash resulted in the ability to scale up to hundreds of thousands of objects with minimal CPU. I can't remember if it was millions we tested to (it was around 10 years ago ...).

Tim Delaney
The point is that the OP doesn't want to write his own hash function, but wants Python to provide a standard way of hashing an iterable. Today, the standard way is to convert to tuple and call hash on that. That may not be efficient. FWIW from a style perspective, I agree with OP.

On Thu, Jan 5, 2017 at 4:30 AM M.-A. Lemburg <mal@egenix.com> wrote:
On 28.12.2016 04:13, jab@math.brown.edu wrote:
Suppose you have implemented an immutable Position type to represent the state of a game played on an MxN board, where the board size can grow quite large. ...
According to https://docs.python.org/3/reference/datamodel.html#object.__hash__ :
""" it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects by packing them into a tuple and hashing the tuple. Example:
def __hash__(self): return hash((self.name, self.nick, self.color))
"""
Applying this advice to the use cases above would require creating an arbitrarily large tuple in memory before passing it to hash(), which is then just thrown away. It would be preferable if there were a way to pass multiple values to hash() in a streaming fashion, such that the overall hash were computed incrementally, without building up a large object in memory first.
I think there's a misunderstanding here: the hash(obj) built-in merely interfaces to the obj.__hash__() method (or the tp_hash slot for C types) and returns whatever these methods give.
It doesn't implement any logic by itself.
If you would like to implement a more efficient hash algorithm for your types, just go ahead and write them as .__hash__() method or tp_hash slot method and you're done.
The example from the docs is just to showcase an example of how such a hash function should work, i.e. to mix in all relevant data attributes.
In your case, you'd probably use a simple for loop to calculate the hash without creating tuples or any other temporary structures.
Here's the hash implementation tuples use as an example
/* The addend 82520, was selected from the range(0, 1000000) for
   generating the greatest number of prime multipliers for tuples
   upto length eight:

     1082527, 1165049, 1082531, 1165057, 1247581, 1330103, 1082533,
     1330111, 1412633, 1165069, 1247599, 1495177, 1577699

   Tests have shown that it's not worth to cache the hash value, see
   issue #9685. */

static Py_hash_t
tuplehash(PyTupleObject *v)
{
    Py_uhash_t x;  /* Unsigned for defined overflow behavior. */
    Py_hash_t y;
    Py_ssize_t len = Py_SIZE(v);
    PyObject **p;
    Py_uhash_t mult = _PyHASH_MULTIPLIER;

    x = 0x345678UL;
    p = v->ob_item;
    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        x = (x ^ y) * mult;
        /* the cast might truncate len; that doesn't change hash stability */
        mult += (Py_hash_t)(82520UL + len + len);
    }
    x += 97531UL;
    if (x == (Py_uhash_t)-1)
        x = -2;
    return x;
}
As you can see, there's some magic going on there to make sure that the hash values behave well when used as "keys" for the dictionary implementation (which is their main purpose in Python).
You are free to create your own hash implementation. The only characteristic to pay attention to is to have objects which compare equal give the same hash value. This is needed to be able to map such objects to the same dictionary slots.
There should be no need to have a special hash function which works on iterables. As long as those iterable objects define their own .__hash__() method or tp_slot, the hash() built-in (and Python's dict implementation) will use these and, if needed, those methods can then use an approach to build hash values using iterators on the object's internal data along similar lines as the above tuple implementation.
-- Marc-Andre Lemburg, eGenix.com
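For reference, a rough pure-Python transcription of the tuplehash() C code quoted above, adapted to consume an arbitrary iterable one item at a time. The 64-bit masking (to imitate C's unsigned overflow) and the length_hint parameter are assumptions made for illustration, so it will not reproduce CPython's exact hash values:

    _MASK = (1 << 64) - 1          # assume a 64-bit build
    _MULT = 1000003                # value of _PyHASH_MULTIPLIER

    def hash_from_iterable(iterable, length_hint=0):
        x = 0x345678
        mult = _MULT
        remaining = length_hint    # tuplehash() knows the real length up front;
                                   # a streaming version has to guess or be told
        for item in iterable:
            remaining -= 1
            y = hash(item)
            if y == -1:            # mirrors the C error check; hash() in Python
                return -1          # never actually returns -1 for valid objects
            x = ((x ^ y) * mult) & _MASK
            mult = (mult + 82520 + 2 * remaining) & _MASK
        x = (x + 97531) & _MASK
        if x == _MASK:             # i.e. (Py_uhash_t)-1
            x = _MASK - 1
        return x                   # unsigned 64-bit value, not a signed Py_hash_t

    print(hash_from_iterable(range(1000)))

Note that the running multiplier depends on how many items remain, which is exactly why a streaming version needs a length hint (or a different constant schedule) when the input's length isn't known up front.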
On 5 January 2017 at 13:28, Neil Girdhar <mistersheik@gmail.com> wrote:
The point is that the OP doesn't want to write his own hash function, but wants Python to provide a standard way of hashing an iterable. Today, the standard way is to convert to tuple and call hash on that. That may not be efficient. FWIW from a style perspective, I agree with OP.
The debate here regarding tuple/frozenset indicates that there may not be a "standard way" of hashing an iterable (should order matter?). Although I agree that assuming order matters is a reasonable assumption to make in the absence of any better information.

Hashing is low enough level that providing helpers in the stdlib is not unreasonable. It's not obvious (to me, at least) that it's a common enough need to warrant it, though. Do we have any information on how often people implement their own __hash__, or how often hash(tuple(my_iterable)) would be an acceptable hash, except for the cost of creating the tuple? The OP's request is the only time this has come up as a requirement, to my knowledge. Hence my suggestion to copy the tuple implementation, modify it to work with general iterables, and publish it as a 3rd party module - its usage might give us an idea of how often this need arises. (The other option would be for someone to do some analysis of published code).

Assuming it is a sufficiently useful primitive to add, then we can debate naming. But I'd prefer it to be named in such a way that it makes it clear that it's a low-level helper for people writing their own __hash__ function, and not some sort of variant of hashing (which hash.from_iterable implies to me).

Paul
I agree with Paul -- I'm not convinced that this is common enough or that the benefits are big enough to warrant something builtin.

However, I did decide to dust off some of my old skills and I threw together a simple gist to see how hard it would be to create something using Cython based on the CPython tuple hash algorithm. I don't know how well it works for arbitrary iterables without a `__length_hint__`, but it seems to work as intended for iterables that have the length hint:

https://gist.github.com/mgilson/129859a79487a483163980db25b709bf

If you're interested, or want to pick this up and actually do something with it, feel free... Also, I haven't written anything using Cython for ages, so if this could be further optimized, feel free to let me know.
On Thu, Jan 5, 2017 at 10:58 AM Paul Moore <p.f.moore@gmail.com> wrote:
On 5 January 2017 at 13:28, Neil Girdhar <mistersheik@gmail.com> wrote:

The point is that the OP doesn't want to write his own hash function, but wants Python to provide a standard way of hashing an iterable. Today, the standard way is to convert to tuple and call hash on that. That may not be efficient. FWIW from a style perspective, I agree with OP.

The debate here regarding tuple/frozenset indicates that there may not be a "standard way" of hashing an iterable (should order matter?). Although I agree that assuming order matters is a reasonable assumption to make in the absence of any better information.

That's another good point. In keeping with my abc proposal, why not add abstract base classes with __hash__:

* ImmutableIterable, and
* ImmutableSet.

ImmutableSet inherits from ImmutableIterable, and overrides __hash__ in such a way that order is ignored. This presumably involves very little new code — it's just a propagating up of the code that's already in set and tuple. The advantage is that instead of implementing __hash__ for your type, you declare your intention by inheriting from an abc and get an automatically-provided hash function.

Hashing is low enough level that providing helpers in the stdlib is
not unreasonable. It's not obvious (to me, at least) that it's a common enough need to warrant it, though. Do we have any information on how often people implement their own __hash__, or how often hash(tuple(my_iterable)) would be an acceptable hash, except for the cost of creating the tuple? The OP's request is the only time this has come up as a requirement, to my knowledge. Hence my suggestion to copy the tuple implementation, modify it to work with general iterables, and publish it as a 3rd party module - its usage might give us an idea of how often this need arises. (The other option would be for someone to do some analysis of published code).
Assuming it is a sufficiently useful primitive to add, then we can debate naming. But I'd prefer it to be named in such a way that it makes it clear that it's a low-level helper for people writing their own __hash__ function, and not some sort of variant of hashing (which hash.from_iterable implies to me).
Paul
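[Editor's sketch] For concreteness, here is a minimal sketch of what the proposed ABCs might look like. Neither ImmutableIterable nor ImmutableSet exists in collections.abc; the names come from Neil's proposal, and the hash-combining bodies are deliberately naive placeholders rather than the set/tuple code he suggests propagating up.

import sys
from collections.abc import Hashable, Iterable

class ImmutableIterable(Iterable, Hashable):
    """Hypothetical ABC: order-sensitive default __hash__ over the items."""

    def __hash__(self):
        # Placeholder mixing only; a real version would reuse the tuple
        # hash algorithm incrementally rather than this simple arithmetic.
        mask = (1 << sys.hash_info.width) - 1
        h = 0x345678
        for item in self:
            h = ((h * 1000003) ^ hash(item)) & mask
        return h

class ImmutableSet(ImmutableIterable):
    """Hypothetical ABC: order-insensitive default __hash__, like frozenset."""

    def __hash__(self):
        # Commutative fold, so iteration order is ignored; real code would
        # reuse collections.abc.Set._hash for better distribution.
        h = 0
        for item in self:
            h ^= hash(item)
        return h

A concrete subclass would then only supply __iter__ (and __eq__) and inherit the hash behaviour, which is exactly the trade-off Paul questions in the follow-ups below.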
Paul Moore writes:
The debate here regarding tuple/frozenset indicates that there may not be a "standard way" of hashing an iterable (should order matter?).
If part of the data structure, yes, if an implementation accident, no.
Although I agree that assuming order matters is a reasonable assumption to make in the absence of any better information.
I don't think so. E.g., with dicts now ordered by insertion, an order-dependent default hash for collections means

a = {}
b = {}
a['1'] = 1
a['2'] = 2
b['2'] = 2
b['1'] = 1
hash(a) != hash(b)  # modulo usual probability of collision

(and modulo normally not hashing mutables). For the same reason I expect I'd disagree with Neil's proposal for an ImmutableWhatever default __hash__: although the hash comparison is "cheap", it's still a pessimization. Haven't thought that through, though.

BTW, it occurs to me that now that dictionaries are versioned, in some cases it *may* make sense to hash dictionaries even though they are mutable, although the "hash" would need to somehow account for the version changing. Seems messy, but maybe someone has an idea?

Steve
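[Editor's sketch] The same point spelled out as runnable code; the frozenset/tuple comparison is an added illustration of order-insensitive versus order-sensitive combining, not something from Stephen's message.

a = {}
b = {}
a['1'] = 1
a['2'] = 2
b['2'] = 2
b['1'] = 1

assert a == b  # dicts compare equal regardless of insertion order

# Order-insensitive combination: equal dicts give equal hashes.
assert hash(frozenset(a.items())) == hash(frozenset(b.items()))

# Order-sensitive combination: insertion order leaks into the hash, so
# these almost certainly differ even though a == b.
print(hash(tuple(a.items())) == hash(tuple(b.items())))  # usually False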
On Thu, Jan 5, 2017 at 9:10 PM Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Paul Moore writes:
The debate here regarding tuple/frozenset indicates that there may not be a "standard way" of hashing an iterable (should order matter?).
If part of the data structure, yes, if an implementation accident, no.
Although I agree that assuming order matters is a reasonable assumption to make in the absence of any better information.
I don't think so. Eg, with dicts now ordered by insertion, an order-dependent default hash for collections means
a = {}
b = {}
a['1'] = 1
a['2'] = 2
b['2'] = 2
b['1'] = 1
hash(a) != hash(b)  # modulo usual probability of collision
(and modulo normally not hashing mutables). For the same reason I expect I'd disagree with Neil's proposal for an ImmutableWhatever default __hash__: although the hash comparison is "cheap", it's still a pessimization. Haven't thought that through, though.

BTW, it occurs to me that now that dictionaries are versioned, in some cases it *may* make sense to hash dictionaries even though they are mutable, although the "hash" would need to somehow account for the version changing. Seems messy but maybe someone has an idea?
I don't understand this? How is providing a default method in an abstract base class a pessimization? If it happens to be slower than the code in the current methods, it can still be overridden.
BTW, it occurs to me that now that dictionaries are versioned, in some cases it *may* make sense to hash dictionaries even though they are mutable, although the "hash" would need to somehow account for the version changing. Seems messy but maybe someone has an idea?
I think it's important to keep in mind that dictionaries are not versioned in Python. They happen to be versioned in CPython as an unexposed implementation detail. I don't think that such details should have any bearing on potential changes to Python.
Steve
Neil Girdhar writes:
I don't understand this? How is providing a default method in an abstract base class a pessimization? If it happens to be slower than the code in the current methods, it can still be overridden.
How often will people override until it's bitten them? How many people will not even notice until they've lost business due to slow response? If you don't have a default, that's much more obvious. Note that if there is a default, the collections are "large", and equality comparisons are "rare", this could be a substantial overhead.
BTW, it occurs to me that now that dictionaries are versioned, in some cases it *may* make sense to hash dictionaries even though they are mutable, although the "hash" would need to somehow account for the version changing. Seems messy but maybe someone has an idea?
I think it's important to keep in mind that dictionaries are not versioned in Python. They happen to be versioned in CPython as an unexposed implementation detail. I don't think that such details should have any bearing on potential changes to Python.
AFAIK the use of the hash member for equality checking is an implementation detail too, although the language reference does mention that set, frozenset and dict are "hashed collections". The basic requirements on hashes are that (1) objects that compare equal must hash to the same value, and (2) the hash bucket must not change over an object's lifetime (this is the "messy" aspect that probably kills the idea -- you'd need to fix up all hashed collections that contain the object as a key).
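[Editor's sketch] A small, assumed demonstration of requirement (2): if a key's hash changes while it sits in a dict, the entry is stranded under the old hash, which is exactly the fix-up problem described above. BadKey is a made-up class, deliberately broken to make the failure visible.

class BadKey:
    """Deliberately broken: hash tracks mutable state."""

    def __init__(self, x):
        self.x = x

    def __eq__(self, other):
        return isinstance(other, BadKey) and self.x == other.x

    def __hash__(self):
        return hash(self.x)  # changes when self.x changes -- do not do this

k = BadKey(1)
d = {k: 'value'}
k.x = 2                            # mutating the key changes its hash
print(k in d)                      # False: lookup probes the new hash's bucket
print(any(key is k for key in d))  # True: the entry is still stored inside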
On Fri, Jan 6, 2017 at 2:07 AM Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Neil Girdhar writes:
I don't understand this? How is providing a default method in an abstract base class a pessimization? If it happens to be slower than the code in the current methods, it can still be overridden.
How often will people override until it's bitten them? How many people will not even notice until they've lost business due to slow response? If you don't have a default, that's much more obvious. Note that if there is a default, the collections are "large", and equality comparisons are "rare", this could be a substantial overhead.
I still don't understand what you're talking about here. You're saying that we shouldn't provide a __hash__ in case the default hash happens to be slower than what the user wants and so by not providing it, we force the user to write a fast one? Doesn't that argument apply to all methods provided by abcs?
BTW, it occurs to me that now that dictionaries are versioned, in some cases it *may* make sense to hash dictionaries even though they are mutable, although the "hash" would need to somehow account for the version changing. Seems messy but maybe someone has an idea?
I think it's important to keep in mind that dictionaries are not versioned in Python. They happen to be versioned in CPython as an unexposed implementation detail. I don't think that such details should have any bearing on potential changes to Python.
AFAIK the use of the hash member for equality checking is an implementation detail too, although the language reference does mention that set, frozenset and dict are "hashed collections". The basic requirements on hashes are that (1) objects that compare equal must hash to the same value, and (2) the hash bucket must not change over an object's lifetime (this is the "messy" aspect that probably kills the idea -- you'd need to fix up all hashed collections that
contain the object as a key).
Are you saying that __hash__ is called by __eq__?
On 6 January 2017 at 07:26, Neil Girdhar <mistersheik@gmail.com> wrote:
On Fri, Jan 6, 2017 at 2:07 AM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Neil Girdhar writes:
I don't understand this? How is providing a default method in an abstract base class a pessimization? If it happens to be slower than the code in the current methods, it can still be overridden.
How often will people override until it's bitten them? How many people will not even notice until they've lost business due to slow response? If you don't have a default, that's much more obvious. Note that if there is a default, the collections are "large", and equality comparisons are "rare", this could be a substantial overhead.
I still don't understand what you're talking about here. You're saying that we shouldn't provide a __hash__ in case the default hash happens to be slower than what the user wants and so by not providing it, we force the user to write a fast one? Doesn't that argument apply to all methods provided by abcs?
The point here is that ABCs should provide defaults for methods where there is an *obvious* default. It's not at all clear that there's an obvious default for __hash__. Unless I missed a revision of your proposal, what you suggested was:
A better option is to add collections.abc.ImmutableIterable that derives from Iterable and provides __hash__.
So what classes would derive from ImmutableIterable? Frozenset? A user-defined frozendict? There's no "obvious" default that would work for both those cases. And that's before we even get to the question of whether the default has the right performance characteristics, which is highly application-dependent.

It's not clear to me if you expect ImmutableIterable to provide anything other than a default implementation of hash. If not, then how is making it an ABC any better than simply providing a helper function that people can use in their own __hash__ implementation? That would make it explicit what people are doing, and avoid any tendency towards people thinking they "should" inherit from ImmutableIterable and yet needing to override the only method it provides.

Paul
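[Editor's sketch] To illustrate the helper-function alternative Paul describes: a class calls the helper explicitly from its own __hash__. Both the helper name hash_from_iterable and the FrozenOrderedThing class are made up for this sketch; nothing below is an existing stdlib API.

from functools import reduce

def hash_from_iterable(items, seed=0):
    """Order-sensitive combination of item hashes (made-up helper name)."""
    # Constant space: each step hashes a fresh two-tuple of (so-far, item).
    return reduce(lambda acc, item: hash((acc, item)), items, seed)

class FrozenOrderedThing:
    """Toy immutable, ordered collection opting in to the helper."""

    def __init__(self, items):
        self._items = tuple(items)

    def __iter__(self):
        return iter(self._items)

    def __eq__(self, other):
        return (isinstance(other, FrozenOrderedThing)
                and self._items == other._items)

    def __hash__(self):
        # Explicit call, seeded with a class-specific value, rather than
        # inheriting behaviour from an ImmutableIterable-style base class.
        return hash_from_iterable(self, seed=hash(type(self).__qualname__))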
On Fri, Jan 6, 2017 at 3:59 AM Paul Moore <p.f.moore@gmail.com> wrote:
On Fri, Jan 6, 2017 at 2:07 AM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Neil Girdhar writes:
I don't understand this? How is providing a default method in an abstract base class a pessimization? If it happens to be slower than the code in the current methods, it can still be overridden.
How often will people override until it's bitten them? How many people will not even notice until they've lost business due to slow response? If you don't have a default, that's much more obvious. Note that if there is a default, the collections are "large", and equality comparisons are "rare", this could be a substantial overhead.
On 6 January 2017 at 07:26, Neil Girdhar <mistersheik@gmail.com> wrote:
I still don't understand what you're talking about here. You're saying that we shouldn't provide a __hash__ in case the default hash happens to be slower than what the user wants and so by not providing it, we force the user to write a fast one? Doesn't that argument apply to all methods provided by abcs?
The point here is that ABCs should provide defaults for methods where there is an *obvious* default. It's not at all clear that there's an obvious default for __hash__.
Unless I missed a revision of your proposal, what you suggested was:
Yeah, looks like you missed a revision. There were two emails. I suggested adding ImmutableIterable and ImmutableSet, and so there is an obvious implementation of __hash__ for both.
A better option is to add collections.abc.ImmutableIterable that derives from Iterable and provides __hash__.
So what classes would derive from ImmutableIterable? Frozenset? A user-defined frozendict? There's no "obvious" default that would work for both those cases. And that's before we even get to the question of whether the default has the right performance characteristics, which is highly application-dependent.
It's not clear to me if you expect ImmutableIterable to provide anything other than a default implementation of hash. If not, then how is making it an ABC any better than simply providing a helper function that people can use in their own __hash__ implementation? That would make it explicit what people are doing, and avoid any tendency towards people thinking they "should" inherit from ImmutableIterable and yet needing to override the only method it provides.
Paul
On 6 January 2017 at 09:02, Neil Girdhar <mistersheik@gmail.com> wrote:
Yeah, looks like you missed a revision. There were two emails. I suggested adding ImmutableIterable and ImmutableSet, and so there is an obvious implementation of __hash__ for both.
OK, sorry. The proposal is still getting more complicated, though, and I really don't see how it's better than having some low-level helper functions for people who need to build custom __hash__ implementations. The "one obvious way" to customise hashing is to implement __hash__, not to derive from a base class that says my class is an Immutable{Iterable,Set}. Paul
participants (18)
- Brendan Barnwell
- Chris Angelico
- Christian Heimes
- Ethan Furman
- jab@math.brown.edu
- M.-A. Lemburg
- Matt Gilson
- Matthias Bussonnier
- Nathaniel Smith
- Ned Batchelder
- Neil Girdhar
- Nick Coghlan
- Paul Moore
- Random832
- Ryan Gonzalez
- Stephen J. Turnbull
- Steven D'Aprano
- Tim Delaney