[David Abrahams]
This is really interesting. When I was at Dragon (well, actually, after Tim left and it became L&H), I ported my natural language parsing/understanding system from Python to C++ so it could run quickly enough for embedded devices. The core of this system was an associative container, so I knew that its performance would be crucial. I used C++ generics, which made it really easy to swap in different associative container implementations, and I tried lots, including the red-black tree containers built into most C++ implementations, and hash tables. My experience was that getting a significant speed increase over the tree containers was extremely difficult, because it was really hard to come up with a good hash function.
There's more to a speedy hash implementation than just the hash function, of course.
Furthermore, even if I succeeded, it was like black magic: the results were inconsistent across my test cases, and there was no way to understand why it worked well, or to get a feeling for how it would scale to problems outside those cases.
Python's dictobject.c and .h have extensive comments about how Python's dicts work. Their behavior isn't mysterious, at least not after 10 years of thinking about it <0.9 wink>. Python's dicts also use tricks that I've never seen in print -- many people have contributed clever tricks.
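For concreteness, here's a minimal Python sketch of the probe sequence those comments describe -- illustrative only (the real thing is C, and the constant below just follows the comments):

    PERTURB_SHIFT = 5   # value given in the dictobject.c comments

    def probe_sequence(h, table_size, count):
        # Yield the first `count` slots probed for hash code h in a
        # table whose size is a power of 2.  Folding `perturb` (the
        # shifting remains of the full hash) into the recurrence lets
        # the high bits break up collisions among the low bits.
        mask = table_size - 1
        perturb = h & (2**64 - 1)   # treat the hash as unsigned
        j = perturb
        for _ in range(count):
            yield j & mask
            perturb >>= PERTURB_SHIFT
            j = 5 * j + 1 + perturb

    # e.g. list(probe_sequence(hash("name"), 8, 4)) -> four slot indices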
I ended up hand-coding a two-level scheme based on binary searches in contiguous arrays which blew away anything I'd been able to do with a hash table. My conclusion was that for general-purpose use, the red-black tree was pretty good, despite its relatively high memory overhead of 3 pointers per node:
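[I can only guess at the details of that scheme, but a sketch of the general shape, in Python -- the names and the block size here are invented:

    import bisect

    class TwoLevelMap:
        # Contiguous sorted blocks of (key, value) pairs, plus a sorted
        # top-level index of each block's first key; a lookup is two
        # binary searches and no per-element pointer chasing.
        BLOCK = 64  # block size is a guess

        def __init__(self, sorted_items):
            self.blocks = [sorted_items[i:i + self.BLOCK]
                           for i in range(0, len(sorted_items), self.BLOCK)]
            self.firsts = [blk[0][0] for blk in self.blocks]

        def get(self, key, default=None):
            i = bisect.bisect_right(self.firsts, key) - 1
            if i < 0:
                return default
            blk = self.blocks[i]
            j = bisect.bisect_left(blk, (key,))
            if j < len(blk) and blk[j][0] == key:
                return blk[j][1]
            return default

    m = TwoLevelMap(sorted((str(n), n) for n in range(1000)))
    assert m.get("42") == 42
]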
The example I posted built a mapping with a million entries. A red-black tree of that size needs to chase between 20 and 40 pointers to determine membership. On sheer instruction count alone, that's way more than the Python dict usually has to do, although it's comparable to the number the BTree had to do. The BTree probably has better cache behavior than a red-black tree, too: all the keys in the million-element example were on the third level, and a query required exactly two pointer-chases to get to the right leaf bucket. All the rest is binary search on 60-120 element contiguous vectors (which, in fact, sounds a lot like your custom "two-level scheme").
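Those depth figures fall straight out of the branching factors; a back-of-envelope check, using the 60-120 fanout mentioned above:

    import math

    n = 1_000_000
    # red-black tree: height lies between log2(n+1) and 2*log2(n+1)
    print(math.log2(n + 1), 2 * math.log2(n + 1))   # ~19.9 .. ~39.9 chases
    # BTree with 60-120 keys per node: levels needed to hold n keys
    print(math.log(n, 120), math.log(n, 60))        # ~2.9 .. ~3.4 -> 3 levels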
it places easy requirements on the user (supply a strict weak ordering) and provides predictable and smooth performance, even asymptotically.
OTOH, it can be very hard to write an efficient, correct "<" ordering, while testing just "equal or not?" can be easier and run quicker than that. Dict comparison is a good example from the Python core: computing "<" for dicts is a nightmare, but computing "==" for dicts is easy (contrast the straightforward dict_equal() with the brain-busting dict_compare() + characterize() pair). This was one of the motivations for introducing "rich comparisons".
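A pure-Python paraphrase of the dict_equal() logic makes the asymmetry plain (this restates the idea, not the C code):

    def dict_equal(a, b):
        # Two dicts are equal iff they're the same size and every key
        # of one maps to an equal value in the other; no ordering of
        # the keys is ever needed.
        if len(a) != len(b):
            return False
        for key, value in a.items():
            if key not in b or b[key] != value:
                return False
        return True

There's no comparably simple story for "<": you first have to characterize where the two dicts differ, and that's where the brain-busting starts.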
On the other hand, hashing requires that the user supply both a hash function and an equality detector, which must agree with one another,
I've rarely found this to be a challenge. For example, for sets that contain sets as elements, a suitable hash function can simply xor the hash codes of the set elements. Since equal sets have equal elements, such a scheme delivers equal hash codes for equal sets, regardless of the order in which set elements get enumerated. In contrast, defining a sensible *total* ordering on sets is a delicate undertaking (yes, I know that "strict weak ordering" is weaker than "total", but in real life you couldn't buy a hot dog with the difference <wink>).
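In Python terms, such a set hash could be as simple as this sketch (the real frozenset hash does more bit-mixing, but the order-independence argument is the same):

    from functools import reduce

    def set_hash(s):
        # xor is commutative and associative, so the result doesn't
        # depend on enumeration order; equal sets have equal elements
        # and therefore get equal hash codes.
        return reduce(lambda acc, x: acc ^ hash(x), s, 0)

    assert set_hash(frozenset([1, 2, 3])) == set_hash(frozenset([3, 2, 1]))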
requires hand-tuning of the hash function for performance, and is rather more unpredictable. We've been talking about adding hash-based containers to the C++ standard library, but I'm reluctant on these grounds. It seems to me that when you really care about speed, some kind of hand-coded solution might be a better investment than trying to come up with a good hash function.
I'm ready to believe that hashing is the most appropriate choice for Python, but I wonder what makes the difference?
Well, I'm intimately familiar with the details of how Python dicts and Zope BTrees are implemented, down to staring at the machine instructions generated, and there's no mystery here to me. I'm not familiar with any of the details of what you tried, though, and understanding speed differences at this level isn't a "general principles" kind of discussion.

I should note that Zope's BTrees pay a lot for playing nice with persistence, about a factor of two: upon visiting and leaving each BTree node, there are messy test+branch sequences ensuring that the object isn't a ghost, notifying the persistence machinery that fields have been accessed and/or changed, and telling the persistence machinery when the object is no longer in active use. Most of these bookkeeping operations can fail too, so there's another layer of "and did that fail?" test+branches around all that.

The saving grace for BTrees (the reason this doesn't cost a factor of, say, 10) is that each BTree node contains a fair amount of "stuff", so the guts of each function can do a reasonable amount of useful work. The persistence overhead could be a disaster if visiting an object only moved one bit closer to the result. But Python's dicts aren't aware of persistence at all, and that did give dicts roughly a factor-of-2 advantage in the example. While BTrees still aren't as zippy as dicts after factoring that out, they certainly aren't pigs.

BTW, note that Python's dicts originally catered only to string keys, as they were the implementation of Python's namespaces, and dicts remain highly optimized for that specific purpose. Indeed, there's a distinct dict lookup routine dedicated to dicts with string keys. Namespaces have no compelling use for range search or lexicographic traversal, just association, and peak lookup speed was the design imperative.
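To give a feel for that string-key fast path, here's a hedged Python sketch of the shortcuts such a specialized routine can take (the real routine is C; the staging is the point here, not the names):

    def string_keys_match(candidate, query, candidate_hash, query_hash):
        # Fast path 1: object identity.  Namespace lookups hit this
        # constantly, because name strings are interned.
        if candidate is query:
            return True
        # Fast path 2: the table stores each key's hash code, so
        # differing hashes rule out equality without ever touching
        # the characters.
        if candidate_hash != query_hash:
            return False
        # Slow path: an actual character-by-character comparison.
        return candidate == query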