[Python-ideas] incremental hashing in __hash__
Steven D'Aprano
steve at pearwood.info
Sat Dec 31 01:57:05 EST 2016
On Fri, Dec 30, 2016 at 09:47:54PM -0500, jab at math.brown.edu wrote:
> __eq__ only has to be called when a hash bucket is non-empty. In that case,
> it may be O(n) in pathological cases, but it could also be O(1) every time.
> On the other hand, __hash__ has to be called on every lookup, is O(n) on
> the first call, and ideally O(1) after. I'd expect that __eq__ may often
> not dominate, and avoiding an unnecessary large tuple allocation on the
> first __hash__ call could be helpful.
Sorry to be the broken record repeating himself, but this sounds
*exactly* like premature optimization here.
My suggestion is that you are overthinking things, or at least worrying
about issues before you've got any evidence that they are going to be
real issues. I expect that the amount of time and effort you've spent in
analysing the theorectical problems here and writing to this list is
more than the time it would have taken for you to write the simplest
__hash__ method that could work (using the advice in the docs to make a
tuple). You could have implemented that in five minutes, and be running
code by now.
Of course I understand that performance issues may not be visible when
you have 100 or 100 thousand items but may be horrific when you have 100
million. I get that. But you're aware of the possibility, so you can
write a performance test that generates 100 million objects and tests
performance. *If* you find an actual problem, then you can look into
changing your __hash__ method. You could come back here and talk about
actual performance issues instead of hypothetical issues, or you could
hire an expert to tune your hash function. (And if you do pay for an
expert to solve this, please consider giving back to the community.)
Remember that the specific details of __hash__ should be an internal
implementation detail for your class. You shouldn't fear changing the
hash algorithm as often as you need to, including in bug fix releases.
You don't have to wait for the perfect hash function, just a "good
enough for now" one to get started.
I'm not trying to be dismissive of your concerns. They may be real
issues that you have to solve. I'm just saying, you should check your
facts first rather than solve hypothetic problems. I have seen, and even
written, far too much pessimized code in the name of "this must be
better, it stands to reason" to give much credence to theoretical
arguments about performance.
--
Steve
More information about the Python-ideas
mailing list