There's just no contest here. BTrees have many other virtues, like supporting range searches and automatically playing nice with ZODB persistence, but they're plain sluggish compared to dicts. To be completely fair and unfair at the same time <wink>, there are also 4 other flavors of Zope BTree, purely for optimization reasons. In particular, the IIBTree maps Python ints to Python ints, and does so by avoiding Python int objects altogether, storing C longs directly and comparing them at native "compare a long to a long" C speed. That's *almost* as fast as Python 2.1 int->int dicts (which endure all-purpose Python object comparison), except for deletion (the BTree spends a lot of time tearing apart all the tree pointers again).
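Anyone who wants to poke at that trade-off today can do so with a rough timing sketch like the one below. It assumes the standalone BTrees package that Zope's implementation later became (pip install BTrees); the sizes and repeat counts are arbitrary, and the fill+delete number deliberately lumps a fill in with the deletions, so read it as a rough comparison only.

    # Rough timing sketch (not from the thread): dict vs. IIBTree for
    # fill, lookup, and delete. Assumes the standalone "BTrees" package
    # from the ZODB project (pip install BTrees).
    import timeit
    from BTrees.IIBTree import IIBTree

    N = 100_000
    keys = list(range(N))

    def fill(make):
        m = make()
        for k in keys:
            m[k] = k
        return m

    for name, make in (("dict", dict), ("IIBTree", IIBTree)):
        t_fill = timeit.timeit(lambda: fill(make), number=3)
        m = fill(make)
        t_get = timeit.timeit(lambda: sum(m[k] for k in keys), number=3)

        def drain():
            fresh = fill(make)   # rough: timing includes one fill
            for k in keys:
                del fresh[k]     # delete key by key, worst case for the tree
        t_del = timeit.timeit(drain, number=3)

        print(f"{name:8s} fill {t_fill:.3f}s  lookup {t_get:.3f}s  "
              f"fill+delete {t_del:.3f}s")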
Now that's a perfectly height-balanced search tree that "chunks up" blocks of keys for storage and speed efficiency, and rarely needs more than a simple local adjustment to maintain balance. I expect that puts it at the fast end of what can be achieved with a balanced tree scheme.
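The range searches mentioned at the top are where that tree structure pays off. A minimal sketch, again assuming the standalone BTrees package, whose trees accept inclusive min/max bounds on keys() and items() — something a dict can't do without scanning every key:

    # Minimal range-search sketch (assumes the standalone BTrees package).
    from BTrees.IIBTree import IIBTree

    tree = IIBTree()
    for k in range(1000):
        tree[k] = k * k

    print(list(tree.keys(100, 110)))   # keys in [100, 110], in sorted order
    print(list(tree.items(100, 110)))  # the matching (key, value) pairs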
This is really interesting. When I was at Dragon (well, actually, after Tim left and it became L&H), I ported my natural language parsing/understanding system from Python to C++ so it could run quickly enough for embedded devices. The core of this system was an associative container, so I knew that its performance would be crucial. I used C++ generics, which made it really easy to swap in different associative container implementations, and I tried lots, including the red-black tree containers built into most C++ implementations, and hash tables.

My experience was that coming up with a hash function that would give a significant speed increase over the tree containers was extremely difficult, because it was really hard to come up with a good hash function. Furthermore, even when I succeeded, it was like black magic: the results were inconsistent across my test cases, and there was no way to understand why it worked well, or to get a feeling for how it would scale to problems outside those cases. I ended up hand-coding a two-level scheme based on binary searches in contiguous arrays, which blew away anything I'd been able to do with a hash table.

My conclusion was that for general-purpose use, the red-black tree was pretty good, despite its relatively high memory overhead of 3 pointers per node: it places easy requirements on the user (supply a strict weak ordering) and provides predictable and smooth performance, even asymptotically. On the other hand, hashing requires that the user supply both a hash function and an equality detector which must agree with one another, requires hand-tuning of the hash function for performance, and is rather more unpredictable.

We've been talking about adding hash-based containers to the C++ standard library, but I'm reluctant on these grounds. It seems to me that when you really care about speed, some kind of hand-coded solution might be a better investment than trying to come up with a good hash function. I'm ready to believe that hashing is the most appropriate choice for Python, but I wonder what makes the difference?

-Dave
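Dave's two-level container isn't shown in the thread; as a stand-in, here is the simplest one-level form of the binary-searches-in-contiguous-arrays idea in Python, using the stdlib bisect module. SortedArrayMap is illustrative only, not his actual code, and it skips the second level entirely:

    # Illustrative stand-in for the "binary searches in contiguous arrays"
    # idea; a one-level simplification, not Dave's actual two-level scheme.
    import bisect

    class SortedArrayMap:
        """Associative container: parallel sorted-key and value arrays."""

        def __init__(self):
            self._keys = []
            self._values = []

        def __setitem__(self, key, value):
            i = bisect.bisect_left(self._keys, key)
            if i < len(self._keys) and self._keys[i] == key:
                self._values[i] = value       # key present: overwrite
            else:
                self._keys.insert(i, key)     # O(n) shift, but contiguous
                self._values.insert(i, value)

        def __getitem__(self, key):
            i = bisect.bisect_left(self._keys, key)
            if i < len(self._keys) and self._keys[i] == key:
                return self._values[i]
            raise KeyError(key)

    m = SortedArrayMap()
    for word in "the quick brown fox".split():
        m[word] = len(word)
    print(m["quick"])   # -> 5

Lookups are O(log n) over a contiguous array, which is very cache-friendly; the cost Dave's second level presumably amortized is the O(n) insertion shift.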
From: "Tim Peters" <tim.one@comcast.net>

The ABC language (which Guido worked on before Python) used AVL trees for just about everything under the covers. It's not a coincidence that Python doesn't use balanced trees for anything <wink>.