[Python-ideas] incremental hashing in __hash__
Tim Delaney
timothy.c.delaney at gmail.com
Thu Dec 29 20:59:43 EST 2016
On 29 December 2016 at 19:20, Steven D'Aprano <steve at pearwood.info> wrote:
>
> With respect Josh, I feel that this thread is based on premature
> optimization. It seems to me that you're *assuming* that anything less
> than some theoretically ideal O(1) space O(N) time hash function is
> clearly and obviously unsuitable.
>
> Of course I might be completely wrong. Perhaps you have implemented your
> own __hash__ methods as suggested by the docs, as well as Ned's version,
> and profiled your code and discovered that __hash__ is a significant
> bottleneck. In which case, I'll apologise for doubting you, but in my
> defence I'll say that the language you have used in this thread so far
> gives no hint that you've actually profiled your code and found the
> bottleneck.
>
In Josh's defence, the initial use case he put forward is exactly the kind
of scenario where it's worthwhile optimising ahead of time. Quite often a
poorly implemented hash function doesn't manifest as a problem until you
scale up massively - and a developer may not have the capability to scale
up to a suitable level in-house, resulting in performance issues at
customer sites.
I had one particular case (fortunately discovered before going to
customers) where a field was included in the equality check, but wasn't
part of the hash. Unfortunately, the lack of this one field resulted in
objects only being allocated to a few buckets (in a Java HashMap),
resulting in every access having to walk a potentially very long chain
doing equality comparisons - O(N) behaviour from an amortised O(1) data
structure.
Unit tests - no worries. Small-scale tests - everything looked fine. Once
we started our load tests though everything slowed to a crawl. 100% CPU,
throughput at a standstill ... it didn't look good.
Adding that one field to the hash resulted in the ability to scale up to
hundreds of thousands of objects with minimal CPU. I can't remember if it
was millions we tested to (it was around 10 years ago ...).
Tim Delaney
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20161230/ef81e841/attachment-0001.html>
More information about the Python-ideas
mailing list