This is really interesting. When I was at Dragon (well, actually, after Tim left and it became L&H), I ported my natural language parsing/understanding system from Python to C++ so it could run quickly enough for embedded devices. The core of this system was an associative container, so I knew that its performance would be crucial. I used C++ generics, which made it really easy to swap in different associative container implementations, and I tried lots, including the red-black tree containers built into most C++ implementations, and hash tables.

My experience was that getting a significant speed increase over the tree containers was extremely difficult, because it was really hard to come up with a good hash function. Furthermore, even when I succeeded, the result was like black magic: it was inconsistent across my test cases, and there was no way to understand why it worked well, or to get a feeling for how it would scale to problems outside those cases. I ended up hand-coding a two-level scheme based on binary searches in contiguous arrays, which blew away anything I'd been able to do with a hash table.

My conclusion was that for general-purpose use, the red-black tree is pretty good, despite its relatively high memory overhead of 3 pointers per node: it places easy requirements on the user (supply a strict weak ordering) and provides predictable and smooth performance, even asymptotically. Hashing, on the other hand, requires that the user supply both a hash function and an equality predicate that must agree with one another, requires hand-tuning of the hash function for performance, and is rather more unpredictable. We've been talking about adding hash-based containers to the C++ standard library, but I'm reluctant on these grounds. It seems to me that when you really care about speed, some kind of hand-coded solution might be a better investment than trying to come up with a good hash function.
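The message doesn't give details of the two-level scheme, so the following is only a guess at its shape: one binary search over the smallest key of each contiguous segment, then a second binary search inside that segment. The real thing was hand-coded C++; this Python sketch (class name, segment size, and API all invented) just shows the idea.

    # Illustrative two-level binary-search map over contiguous arrays.
    from bisect import bisect_left, bisect_right

    class TwoLevelMap:
        """Read-mostly map: two binary searches over contiguous arrays."""

        SEGMENT = 64  # illustrative chunk size; the original tuning is unknown

        def __init__(self, pairs):
            pairs = sorted(pairs)  # (key, value) tuples
            self.seg_keys, self.seg_vals = [], []  # second level: sorted runs
            for i in range(0, len(pairs), self.SEGMENT):
                chunk = pairs[i:i + self.SEGMENT]
                self.seg_keys.append([k for k, _ in chunk])
                self.seg_vals.append([v for _, v in chunk])
            # First level: the smallest key in each segment.
            self.firsts = [keys[0] for keys in self.seg_keys]

        def get(self, key, default=None):
            i = bisect_right(self.firsts, key) - 1  # 1st search: pick a segment
            if i < 0:
                return default
            keys = self.seg_keys[i]
            j = bisect_left(keys, key)              # 2nd search: within segment
            if j < len(keys) and keys[j] == key:
                return self.seg_vals[i][j]
            return default

    m = TwoLevelMap([(3, "c"), (1, "a"), (2, "b")])
    assert m.get(2) == "b" and m.get(9) is None

Both searches touch small contiguous arrays rather than chasing pointers, so lookups are cache-friendly, which plausibly mattered more on the embedded hardware he targeted than asymptotic complexity did.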
I'm ready to believe that hashing is the most appropriate choice for Python, but I wonder what makes the difference?
From: "Tim Peters" email@example.com
There's just no contest here. BTrees have many other virtues, like supporting range searches, and automatically playing nice with ZODB persistence, but they're plain sluggish compared to dicts. To be fair and unfair at the same time <wink>, there are also 4 other flavors of Zope BTree, purely for optimization reasons. In particular, the IIBTree maps Python ints to Python ints, and does so by avoiding Python int objects altogether, storing C longs directly and comparing them at native "compare a long to a long" C speed. That's *almost* as fast as Python 2.1 int->int dicts (which endure all-purpose Python object comparison), except for deletion (the BTree spends a lot of time tearing apart all the tree pointers).
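For concreteness, the IIBTree Tim mentions lives on in the standalone BTrees package extracted from Zope; a minimal usage sketch, assuming a modern install of that package:

    # Sketch of the Zope/ZODB IIBTree: an int->int mapping storing C longs
    # directly, so no Python int objects are created per key or value.
    from BTrees.IIBTree import IIBTree  # nowadays: pip install BTrees

    t = IIBTree()
    for k in range(10):
        t[k] = k * k

    print(t[4])                 # plain lookup -> 16
    print(list(t.items(3, 6)))  # range search -> [(3, 9), (4, 16), (5, 25), (6, 36)]
    del t[4]                    # deletion: the slow path Tim notes above

The items(min, max) call is the "range searches" virtue mentioned above, something a dict can't offer without sorting its keys first.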
Now that's a perfectly height-balanced search tree that "chunks up" bunches of keys for storage and speed efficiency, and rarely needs more than a simple local adjustment to maintain balance. I expect that puts it at the fast end of what can be achieved with a balanced tree scheme.
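To put a rough number on why chunking helps: tree height grows as log base fanout of n, so wide nodes flatten the tree dramatically. A toy computation (the fanout values are illustrative, not Zope's actual node sizes):

    import math

    n = 1_000_000
    for fanout in (2, 32, 250):  # binary node vs. progressively chunkier nodes
        print(fanout, math.ceil(math.log(n, fanout)))  # heights: 20, 4, 3

Fewer levels means fewer node visits per lookup, and rebalancing usually stays local to one chunk, which matches the "simple local adjustment" remark.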
The ABC language (which Guido worked on before Python) used AVL trees for just about everything under the covers. It's not a coincidence that Python doesn't use balanced trees for anything <wink>.