From: "Tim Peters" <tim@zope.com>
There's more to a speedy hash implementation than just the hash function, of course.
'course.
Furthermore, even if I succeeded, it was like black magic: it was inconsistent across my test cases and there was no way to understand why it worked well, or to get a feeling for how it would scale to problems outside those cases.
Python's dictobject.c and .h have extensive comments about how Python's dicts work. Their behavior isn't mysterious, at least not after 10 years of thinking about it <0.9 wink>. Python's dicts also use tricks that I've never seen in print -- many people have contributed clever tricks.
I noticed that, and I think the next time I try hashing I'm going to steal as much as possible from Python's implementation to get a head start. Noticing that also left me with a question: how come everybody in the world hasn't stolen as much as possible from the Python hashing implementation? Are there a billion such 10-years'-tweaked implementations lying around which all perform comparably well?
The example I posted built a mapping with a million entries. A red-black tree of that size needs to chase between 20 and 40 pointers to determine membership. By sheer instruction count alone, that's way more instructions than the Python dict usually has to do, although it's comparable to the number the BTree had to do. The BTree probably has better cache behavior than a red-black tree; for example, all the keys in the million-element example were on the third level, and a query required exactly two pointer-chases to get to the right leaf bucket. All the rest is binary search on 60-120 element contiguous vectors (in fact, it sounds a lot like your custom "two-level scheme").
Yeah, I think it ended up being something like that. Of course, the container I ended up with used domain-specific knowledge which would have been inappropriate for general-purpose use.
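For concreteness, here's a minimal sketch of the lookup shape described above: two pointer-chases, then binary search over small contiguous vectors. The node layout and names are illustrative assumptions on my part, not Zope's actual BTree code.

    #include <algorithm>
    #include <vector>

    // Sketch only: a shallow B-tree-ish node. Interior nodes hold sorted
    // keys plus child pointers; leaves hold a sorted, contiguous bucket.
    // The 60-120 fan-out comes from the description above; the layout is
    // invented for illustration.
    struct Node {
        std::vector<int>   keys;      // sorted, ~60-120 entries per node
        std::vector<Node*> children;  // empty in a leaf; size keys+1 otherwise
        bool leaf() const { return children.empty(); }
    };

    // Membership test: in the million-entry example this chases exactly two
    // child pointers (root -> interior -> leaf); everything else is binary
    // search on contiguous vectors, which is kind to the cache.
    bool contains(const Node* n, int key) {
        while (!n->leaf()) {
            auto it = std::upper_bound(n->keys.begin(), n->keys.end(), key);
            n = n->children[it - n->keys.begin()];
        }
        return std::binary_search(n->keys.begin(), n->keys.end(), key);
    }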
it places easy requirements on the user (supply a strict weak ordering) and provides predictable and smooth performance even asymptotically.
OTOH, it can be very hard to write an efficient, correct "<" ordering, while testing just "equal or not?" can be easier and run quicker than that. Dict comparison is a good example from the Python core: computing "<" for dicts is a nightmare, but computing "==" for dicts is easy (contrast the straightforward dict_equal() with the brain-busting dict_compare() + characterize() pair).
Well, OK, ordering hash tables is hard, unless the bucket count is a deterministic function of the element count. If they were sorted containers, of course, < would be a simple matter. And I assume that testing equality still involves a lot of hashing... Hmm, looking at the 3 C++ implementations of hashed containers that I have available to me, only one provides operator<(), which is rather strange since the other two implement operator== by first comparing sizes, then iterating through consecutive elements of each set looking for a difference. The implementation supplying operator<() uses an (IMO misguided) design that rehashes incrementally, but it seems to me that if the more straightforward approaches can implement operator==() as described, operator<() shouldn't have to be a big challenge for an everyday hash table. I'm obviously missing something, but what...?
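To pin down why "==" is the easy direction, here's a hedged sketch, mine rather than any of the three implementations mentioned: sizes first, then per-element membership probes, with no assumptions about enumeration order. The iterate-in-step scheme described above quietly assumes that equal containers enumerate equal elements in the same order, and an operator<() would need a genuinely canonical order, which hash buckets don't supply.

    #include <unordered_set>

    // Order-independent equality for a hashed container: compare sizes,
    // then probe b for every element of a. No iteration-order agreement
    // is required, and only lookups (hence hashing) are involved.
    bool equal_sets(const std::unordered_set<int>& a,
                    const std::unordered_set<int>& b) {
        if (a.size() != b.size()) return false;
        for (int x : a)
            if (b.find(x) == b.end()) return false;
        return true;
    }
    // No such order-free trick exists for operator<(): a lexicographic
    // comparison needs some canonical element order, and a hash table's
    // iteration order varies with bucket count and insertion history --
    // hence "hard, unless the bucket count is a deterministic function
    // of the element count".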
This was one of the motivations for introducing "rich comparisons".
I don't see how that helps. Got a link? Or a clue?
On the other hand, hashing requires that the user supply both a hash function and an equality detector which must agree with one another,
I've rarely found this to be a challenge. For example, for sets that contain sets as elements, a suitable hash function can simply xor the hash codes of the set elements. Since equal sets have equal elements, such a scheme delivers equal hash codes for equal sets, and independent of the order in which set elements get enumerated. In contrast, defining a sensible *total* ordering on sets is a delicate undertaking (yes, I know that "strict weak ordering" is weaker than "total", but in real life you couldn't buy a hot dog with the difference <wink>).
I don't know what that means. If you represent your sets as sorted containers, getting a strict weak ordering on sets is trivial; you just do it with a lexicographical comparison of the two sequences.
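Both halves of this exchange fit in a few lines. A sketch, assuming std::set<int> as the sorted representation: the xor scheme makes equal sets hash equal regardless of enumeration order, and the sorted representation makes a strict weak ordering just a lexicographical comparison.

    #include <algorithm>
    #include <functional>
    #include <set>

    // Xor the element hashes: xor is commutative and associative, so the
    // result ignores enumeration order, and equal sets (equal elements)
    // get equal hash codes by construction.
    std::size_t hash_set(const std::set<int>& s) {
        std::size_t h = 0;
        for (int x : s) h ^= std::hash<int>{}(x);
        return h;
    }

    // With a sorted representation, a strict weak ordering on sets is just
    // lexicographic comparison of the two sequences.
    bool less_set(const std::set<int>& a, const std::set<int>& b) {
        return std::lexicographical_compare(a.begin(), a.end(),
                                            b.begin(), b.end());
    }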
I'm ready to believe that hashing is the most appropriate choice for Python, but I wonder what makes the difference?
Well, I'm intimately familiar with the details of how Python dicts and Zope BTrees are implemented, down to staring at the machine instructions generated, and there's no mystery here to me. I'm not familiar with any of the details of what you tried. Understanding speed differences at this level isn't a "general principles" kind of discussion.
I should note that Zope's BTrees pay a lot for playing nice with persistence, about a factor of two: upon visiting and leaving each BTree node, there are messy test+branch sequences ensuring that the object isn't a ghost, notifying the persistence machinery that fields have been accessed and/or changed, and telling the persistence machinery when the object is no longer in active use. Most of these bookkeeping operations can fail too, so there's another layer of "and did that fail?" test+branches around all that.
Aww, heck, you just need a good C++ exception-handling implementation to get rid of the error-checking overheads ;-)
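The shape of that per-node overhead, as a hedged sketch: every helper below (unghost, note_access, release) is a hypothetical stand-in for the real persistence hooks, not Zope's API.

    // Hypothetical stand-ins for the persistence hooks described above.
    struct PNode { bool ghost = false; /* plus keys, children, ... */ };
    bool unghost(PNode* n)   { n->ghost = false; return true; }  // load state; can fail
    bool note_access(PNode*) { return true; }  // record that fields were read; can fail
    void release(PNode*)     {}                // node no longer in active use

    // Each node visit pays the ghost check, the access notification, the
    // release, and an "and did that fail?" branch around the fallible ones.
    bool visit(PNode* n, bool (*work)(PNode*)) {
        if (!unghost(n))     return false;
        if (!note_access(n)) { release(n); return false; }
        bool ok = work(n);   // the useful part
        release(n);
        return ok;
    }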
The saving grace for BTrees (why this doesn't cost a factor of, say, 10) is that each BTree node contains a fair amount of "stuff", so that the guts of each function can do a reasonable amount of useful work. The persistence overhead could be a disaster if visiting an object only moved one bit closer to the result.
But Python's dicts aren't aware of persistence at all, and that did give dicts an ~= factor-of-2 advantage in the example. While they're still not as zippy as dicts after factoring that out, B-Trees certainly aren't pigs.
No, I suppose not. But Python's dicts are general-purpose containers, and you can put any key you like in there. It's still surprising to me, given my (much less than 10 years') experience with hash implementations, that you can design something that performs well over all those different cases.
BTW, note that Python's dicts originally catered only to string keys, as they were the implementation of Python's namespaces, and dicts remain highly optimized for that specific purpose. Indeed, there's a distinct dict lookup routine dedicated to dicts with string keys. Namespaces have no compelling use for range search or lexicographic traversal, just association, and peak lookup speed was the design imperative.
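As I understand it, CPython arranges this with a per-dict lookup function pointer that starts on the string-only routine and gets swapped to the general one the first time a non-string key appears. A hedged C++ rendering of that dispatch idea, with all names invented:

    #include <string>
    #include <vector>

    // Illustrative only: a dict that begins on a string-only fast path.
    struct Entry { std::string key; int value; };

    struct Dict {
        std::vector<Entry> slots;  // stand-in for the real open-addressed table

        // Fast path: every key is known to be a string, so probing needs
        // only a cheap byte-wise compare -- no general comparison dispatch.
        static int* string_lookup(Dict& d, const std::string& k) {
            for (Entry& e : d.slots)
                if (e.key == k) return &e.value;
            return nullptr;
        }

        // General path: in CPython this is the routine that may end up
        // calling arbitrary user comparison code; placeholder body here.
        static int* general_lookup(Dict& d, const std::string& k) {
            return string_lookup(d, k);  // placeholder
        }

        // Namespaces never insert non-string keys, so they stay on the
        // fast path for their whole lifetime.
        int* (*lookup)(Dict&, const std::string&) = string_lookup;
    };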
Thanks for the perspective!
still-learning-ly y'rs, dave