[Python-ideas] incremental hashing in __hash__
Stephen J. Turnbull
turnbull.stephen.fw at u.tsukuba.ac.jp
Fri Dec 30 12:39:53 EST 2016
jab at math.brown.edu writes:
> But as you showed, it's certainly possible to do some exploration in the
> meantime. Prompted by your helpful comparison, I just put together
> https://gist.github.com/jab/fd78b3acd25b3530e0e21f5aaee3c674 to further
> compare hash_tuple vs. hash_incremental.
It's been a while :-) since I read Knuth[1] (and that was when Knuth
was authoritative on this subject 8^O), but neither methodology is
particularly helpful here. The ideal is effectively a uniform
distribution on a circle, which has no mean. Therefore standard
deviation is also uninteresting, since its definition uses the mean.
The right test is something like a χ² test against a uniform
distribution.
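For concreteness, the kind of check I mean looks roughly like the
sketch below -- bucket the hashes and compare the observed counts to a
uniform expectation.  The bucket count and the synthetic data are
arbitrary placeholders, not a serious benchmark:

    import random

    NBUCKETS = 256
    SAMPLES = 100000

    counts = [0] * NBUCKETS
    for _ in range(SAMPLES):
        item = tuple(random.getrandbits(32) for _ in range(8))
        counts[hash(item) % NBUCKETS] += 1

    expected = SAMPLES / NBUCKETS
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    # Compare chi2 against the chi-squared distribution with
    # NBUCKETS - 1 degrees of freedom, e.g. via
    # scipy.stats.chi2.sf(chi2, NBUCKETS - 1) if SciPy is available.
    print(chi2)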
The scatter plots (of hash against test data) simply show the same
thing, without the precision of a statistical test. (Note: do not
deprecate the computer visualization + human eye + human brain system.
It is the best known machine for detecting significant patterns and
anomalies, though YMMV.) But it's not very good at detecting high-
dimensional patterns.
However, it's nowhere near good enough for a hash function to have a
uniform distribution on random data.  It also needs to be "chaotic"
in the sense that it spreads "nearby" data out, where "nearby" would
probably mean something like "differing in less than m% of bits (for
small m) when viewed as, say, 4000-bit strings".  Real data usually
has a certain amount of structure (continuity in time, clustering
around a mean, line noise in a communication channel), so you're
likely to get lots of "clusters" of very similar data, and you don't
want them pounding on a small subset of buckets.
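As a rough illustration (not a real test of that property), you can
look at how tuples differing in a single element scatter over a small
table; the bucket count and base tuple here are arbitrary:

    NBUCKETS = 1024
    base = tuple(range(500))

    buckets = set()
    for i in range(len(base)):
        # A "nearby" tuple: identical except for one element.
        nearby = base[:i] + (base[i] + 1,) + base[i + 1:]
        buckets.add(hash(nearby) % NBUCKETS)

    # Ideally the 500 neighbors land in many different buckets.
    print(len(buckets), "of", NBUCKETS, "buckets hit")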
> I'm not sure if the methodology there is sound, as I'm new to
> analysis like this.
Even decades later, starting with Knuth[1] can't hurt. :-)
> Given sufficiently good distribution, I'd expect there to be unanimous
> agreement that an immutable collection, which could contain arbitrarily
> many items, should strongly prefer hash_incremental(self) over
> hash(tuple(self)), for the same reason we use generator comprehensions
> instead of list comprehensions when appropriate. Please correct me if I'm
> wrong.
I think that's a misleading analogy.  Just stick to:

    For very large collections where we already have the data,
    duplicating the data is very expensive.  Furthermore, since the
    purpose of hashing is making equality comparisons cheap, this is
    likely to happen in a loop.
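For concreteness, the pattern at issue is an immutable collection
whose __hash__ does something like the class below (a made-up
stand-in, not anybody's actual proposal):

    class FrozenSeq:
        def __init__(self, items):
            self._items = list(items)

        def __iter__(self):
            return iter(self._items)

        def __eq__(self, other):
            return (isinstance(other, FrozenSeq)
                    and self._items == other._items)

        def __hash__(self):
            # Builds a second, throwaway copy of all the data just to
            # hash it -- the cost being discussed above.
            return hash(tuple(self._items))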
On the other hand, if there are going to be a *lot* of "large"
collections being stored, then they're actually not all that large
compared to the system, and you might not actually care that much
about the cost of the ephemeral tuples, because the real cost is in
an n^2 set of comparisons. From the point of view of NumPy, this is
an "are you kidding?" argument because large datasets are its stock in
trade, but for core Python, this may be sufficiently esoteric that it
should be delegated to a third-party package.
On balance, the arguments that Steven d'Aprano advanced for having a
statistics module in the stdlib vs. importing pandas apply here. In
particular, I think there are a huge number of options for an
iterative hash. For example, Ned chained 2-tuples. But perhaps for
efficient time in bounded space you want to use bounded but larger
tuples. I don't know -- and that's the point. If we have a TOOWTDI
API like hash.from_iterable then smart people (and people with time on
their hands to do exhaustive experiments!) can tune that over time, as
has been done with the sort API.
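To make the "huge number of options" point concrete, the simplest
shape such an API could take is roughly Ned's chained 2-tuples, folded
left to right.  The name hash_from_iterable is mine, and this is only
one possible tuning, not a proposed implementation:

    def hash_from_iterable(iterable):
        # Fold each element into a running hash via a 2-tuple, so the
        # space used stays bounded no matter how big the collection is.
        h = hash(())
        for item in iterable:
            h = hash((h, item))
        return h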
Another option is yielding partials, as Steven says:
> > itertools.iterhash(iterable) # yield incremental hashes
That's a very interesting idea, though I suspect it rarely would make
a big performance improvement.
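For what it's worth, a minimal sketch of what that generator might
look like, reusing the chained-2-tuple mixing above (iterhash is just
the name from the quoted line, not an existing itertools function):

    def iterhash(iterable):
        # Yield the running hash after each element, so a consumer
        # can bail out as soon as two streams are known to differ.
        h = hash(())
        for item in iterable:
            h = hash((h, item))
            yield h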
I'm not sure I like the "hash.from_iterable" name for this API. The
problem is that if it's not a concrete collection[3], then you throw
away the data. If the hash algorithm happens to suck for certain
data, you'll get a lot of collisions, and conclude that your data is
much more concentrated than it actually is. I find it hard to imagine
a use case where you have large data where you only care about whether
two data points are different (cf. equality comparisons for floats).
You want to know how they're different. So I think I would prefer to
name it "hash.from_collection" or similar. Of course whether the
implementation actually raises on a generator or other non-concrete
iterable is a fine point.
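If the implementation did want to insist on a concrete collection, one
way that fine point could go is simply to refuse anything that isn't a
Collection in the ABC sense (hash_from_collection is a hypothetical
name, and the mixing is the same sketch as above):

    from collections.abc import Collection

    def hash_from_collection(coll):
        # Generators and other one-shot iterators are not Collections
        # (no __len__/__contains__), so they get rejected here.
        if not isinstance(coll, Collection):
            raise TypeError("expected a concrete collection, got %r"
                            % type(coll).__name__)
        h = hash(())
        for item in coll:
            h = hash((h, item))
        return h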
Footnotes:
[1] Of course I mean The Art of Computer Programming, Ch. 3.
[2] Including the newly ordered dicts, maybe? Somebody tweet
@dabeaz! What evil can he do with this?
[3] Yeah, I know, "re-iterables". But we don't have a definition,
let alone an API for identifying, those Things.