[David Abrahams]
... Noticing that also left me with a question: how come everybody in the world hasn't stolen as much as possible from the Python hashing implementation? Are there a billion such 10-years'-tweaked implementations lying around which all perform comparably well?
It's a Mystery, and in all directions. Python has virtually no code from, say, Tcl or Perl either, and the latter certainly do some things better than Python does them. I've studied all their hash implementations, but didn't find anything worth stealing <wink>; OTOH, multiple attempts by multiple groups to steal Perl's regexp engine years ago fizzled out in a tarpit of frustration. Curious: Python independently developed a string hash *very* similar to what later became "the standard" Fowler-Noll-Vo string hash: http://www.isthe.com/chongo/tech/comp/fnv/ The multiplier is different, and the initial value, but that's it. I'm sure there was no communication in either direction. So ya, given enough time, a billion other projects will rediscover it too.
OTOH, it can be very hard to write an efficient, correct "<" ordering, while testing just "equal or not?" can be easier and run quicker than that. Dict comparison is a good example from the Python core: computing "<" for dicts is a nightmare, but computing "==" for dicts is easy (contrast the straightforward dict_equal() with the brain-busting dict_compare() + characterize() pair).
Well, OK, ordering hash tables is hard, unless the bucket count is a deterministic function of the element count.
I don't know how the latter could help; for that matter, I'm not even sure what it means.
If they were sorted containers, of course, < would be a simple matter.
Yes.
And I assume that testing equality still involves a lot of hashing...
No more times than the common length of the two dicts. It's just: def dict_equal(dict1, dict2): if len(dict1) != len(dict2): return False for key, value in dict1.iteritems(): if key not in dict2 or not value == dict2[key]: return False return True Searching dict2 for key *may* involve hashing key again (or it may not; for example, Python string objects are immutable and cache their 32-bit hash in the string object the first time it's computed). There's a world of pain involved in the "==" there, though, as a dict can very well have itself as a value in itself, and the time required for completion appears to be exponential in some pathological cases of that kind (Python does detect the unbounded recursion in such cases -- eventually).
Hmm, looking at the 3 C++ implementations of hashed containers that I have available to me, only one provides operator<(), which is rather strange since the other two implement operator == by first comparing sizes, then iterating through consecutive elements of each set looking for a difference. The implementation supplying operator<() uses a (IMO misguided) design that rehashes incrementally, but it seems to me that if the more straightforward approaches can implement operator==() as described, operator<() shouldn't have to be a big challenge for an everyday hash table.
I'm obviously missing something, but what...?
I don't know, but I didn't follow what you were saying (like, "rehashes incrementally" doesn't mean anything to me). If there's a simpler way to get "the right" answer, I'd love to see it. I once spent two hours trying to prove that the dict_compare() + characterize() pair in Python was correct, but gave up in a mushrooming forest of end cases. In The Beginning, Python implemented dict comparison by materializing the .items(), sorting both, and then doing list comparison. The correctness of that was easy to show. But it turned out that in real life all anyone ever used was == comparison on dicts, and sorting was enormously expensive compared to what was possible. characterize() is a very clever hack Guido dreamt up to get the same result in no more than two passes -- but I've never been sure it's a thoroughly correct hack. OTOH, since nobody appears to care about "<" for dicts, if it's wrong we may never know that.
This was one of the motivations for introducing "rich comparisons".
I don't see how that helps. Got a link? Or a clue?
Sorry, I don't understand the question. When Python funneled all comparisons through cmp(), it wasn't possible for a type implementation to do anything faster for, say, "==", because it had no idea why cmp() was being called. Allowing people to ask for the specific comparison they wanted is part of what "rich comparisons" was about, and speed was one of the reasons for adopting it. Comparing strings for equality/inequality alone is also done faster than needing to resolve string ordering. And complex numbers have no accepted "<" relation at all. So comparing dicts isn't the only place it's easier and quicker to restrict the burden on the type to implementing equality testing. For user-defined types, I've often found it *much* easier. For example, I can easily tell whether two chessboards are equal (do they have the same pieces on the same squares?), but a concept of "<" for chessboards is strained.
I don't know what that means.
There's too much of that on both sides here, so I delcare this mercifully ended now <0.9 wink>.
If you represent your sets as sorted containers, getting a strict weak ordering on sets is trivial; you just do it with a lexicographical comparison of the two sequences.
And if you don't, that conclusion doesn't follow.
.,, No, I suppose not. But python's dicts are general-purpose containers, and you can put any key you like in there. It's still surprising to me given my (much less than 10 years') experience with hash implementations that you can design something that performs well over all those different cases.
You probably can't a priori, but after a decade people stumble into all the cases that don't work well, and you eventually fiddle the type-specific hash functions and the general implementation until surprises appear to stop. It remains a probabilistic method, though, and there are no guarantees. BTW, I believe that of all Python's builtin types, only the hash function for integers remains in its original form (hash(i) == i). So even if I don't want to, I'm forced to agree that finding a good hash function isn't trivial. [on Zope's B-Trees]
Aww, heck, you just need a good C++ exception-handling implementation to get rid of the error-checking overheads ;-)
I'd love to use C++ for this. This is one of those things that defines 5 families of 4 related data structures each via a pile of .c and .h files that get #include'd and recompiled 5 times after #define'ing a pile of macros. It would be *so* much more pleasant using templates.
... Thanks for the perspective!
still-learning-ly y'rs,
You're too old for that now -- start making money instead <wink>. the-psf-will-put-it-to-good-use-ly y'rs - tim