[Python-ideas] incremental hashing in __hash__

Steven D'Aprano steve at pearwood.info
Sat Dec 31 01:57:05 EST 2016


On Fri, Dec 30, 2016 at 09:47:54PM -0500, jab at math.brown.edu wrote:

> __eq__ only has to be called when a hash bucket is non-empty. In that case,
> it may be O(n) in pathological cases, but it could also be O(1) every time.
> On the other hand, __hash__ has to be called on every lookup, is O(n) on
> the first call, and ideally O(1) after. I'd expect that __eq__ may often
> not dominate, and avoiding an unnecessary large tuple allocation on the
> first __hash__ call could be helpful.

Sorry to be the broken record repeating himself, but this sounds 
*exactly* like premature optimization here.

My suggestion is that you are overthinking things, or at least worrying 
about issues before you've got any evidence that they are going to be 
real issues. I expect that the amount of time and effort you've spent in 
analysing the theorectical problems here and writing to this list is 
more than the time it would have taken for you to write the simplest 
__hash__ method that could work (using the advice in the docs to make a 
tuple). You could have implemented that in five minutes, and be running 
code by now.

Of course I understand that performance issues may not be visible when 
you have 100 or 100 thousand items but may be horrific when you have 100 
million. I get that. But you're aware of the possibility, so you can 
write a performance test that generates 100 million objects and tests 
performance. *If* you find an actual problem, then you can look into 
changing your __hash__ method. You could come back here and talk about 
actual performance issues instead of hypothetical issues, or you could 
hire an expert to tune your hash function. (And if you do pay for an 
expert to solve this, please consider giving back to the community.)

Remember that the specific details of __hash__ should be an internal 
implementation detail for your class. You shouldn't fear changing the 
hash algorithm as often as you need to, including in bug fix releases. 
You don't have to wait for the perfect hash function, just a "good 
enough for now" one to get started.

I'm not trying to be dismissive of your concerns. They may be real 
issues that you have to solve. I'm just saying, you should check your 
facts first rather than solve hypothetic problems. I have seen, and even 
written, far too much pessimized code in the name of "this must be 
better, it stands to reason" to give much credence to theoretical 
arguments about performance.



-- 
Steve


More information about the Python-ideas mailing list