This is really interesting. When I was at Dragon (well, actually, after Tim left and it became L&H), I ported my natural language parsing/understanding system from Python to C++ so it could run quickly enough for embedded devices. The core of this system was an associative container, so I knew that its performance would be crucial. I used C++ generics, which made it really easy to swap in different associative container implementations, and I tried lots, including the red-black tree containers built into most C++ implementations, and hash tables.

My experience was that getting a significant speed increase over the tree containers was extremely difficult, because it was really hard to come up with a good hash function. Furthermore, even when I succeeded, the result was like black magic: it was inconsistent across my test cases, and there was no way to understand why it worked well, or to get a feeling for how it would scale to problems outside those cases. I ended up hand-coding a two-level scheme based on binary searches in contiguous arrays, which blew away anything I'd been able to do with a hash table.

My conclusion was that for general-purpose use, the red-black tree is pretty good, despite its relatively high memory overhead of 3 pointers per node: it places easy requirements on the user (supply a strict weak ordering) and provides predictable and smooth performance, even asymptotically. Hashing, on the other hand, requires that the user supply both a hash function and an equality predicate that must agree with one another, requires hand-tuning of the hash function for performance, and is rather more unpredictable. We've been talking about adding hash-based containers to the C++ standard library, but I'm reluctant on these grounds. It seems to me that when you really care about speed, some kind of hand-coded solution might be a better investment than trying to come up with a good hash function.
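The message doesn't give details of the two-level scheme, so the following is only a guess at its shape: one binary search over the smallest key of each contiguous segment, then a second binary search inside that segment. The real thing was hand-coded C++; this Python sketch (class name, segment size, and API all invented) just shows the idea.

    # Illustrative two-level binary-search map over contiguous arrays.
    from bisect import bisect_left, bisect_right

    class TwoLevelMap:
        """Read-mostly map: two binary searches over contiguous arrays."""

        SEGMENT = 64  # illustrative chunk size; the original tuning is unknown

        def __init__(self, pairs):
            pairs = sorted(pairs)  # (key, value) tuples
            self.seg_keys, self.seg_vals = [], []  # second level: sorted runs
            for i in range(0, len(pairs), self.SEGMENT):
                chunk = pairs[i:i + self.SEGMENT]
                self.seg_keys.append([k for k, _ in chunk])
                self.seg_vals.append([v for _, v in chunk])
            # First level: the smallest key in each segment.
            self.firsts = [keys[0] for keys in self.seg_keys]

        def get(self, key, default=None):
            i = bisect_right(self.firsts, key) - 1  # 1st search: pick a segment
            if i < 0:
                return default
            keys = self.seg_keys[i]
            j = bisect_left(keys, key)              # 2nd search: within segment
            if j < len(keys) and keys[j] == key:
                return self.seg_vals[i][j]
            return default

    m = TwoLevelMap([(3, "c"), (1, "a"), (2, "b")])
    assert m.get(2) == "b" and m.get(9) is None

Both searches touch small contiguous arrays rather than chasing pointers, so lookups are cache-friendly, which plausibly mattered more on the embedded hardware he targeted than asymptotic complexity did.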
I'm ready to believe that hashing is the most appropriate choice for Python, but I wonder what makes the difference?
From: "Tim Peters" email@example.com
There's just no contest here. BTrees have many other virtues, like supporting range searches, and automatically playing nice with ZODB persistence, but they're plain sluggish compared to dicts. To be fair and unfair at the same time <wink>, there are also 4 other flavors of Zope BTree, purely for optimization reasons. In particular, the IIBTree maps Python ints to Python ints, and does so by avoiding Python int objects altogether, storing C longs directly and comparing them at native "compare a long to a long" C speed. That's *almost* as fast as Python 2.1 int->int dicts (which endure all-purpose Python object comparison), except for deletion (the BTree spends a lot of time tearing apart all the tree pointers).
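For concreteness, the IIBTree Tim mentions lives on in the standalone BTrees package extracted from Zope; a minimal usage sketch, assuming a modern install of that package:

    # Sketch of the Zope/ZODB IIBTree: an int->int mapping storing C longs
    # directly, so no Python int objects are created per key or value.
    from BTrees.IIBTree import IIBTree  # nowadays: pip install BTrees

    t = IIBTree()
    for k in range(10):
        t[k] = k * k

    print(t[4])                 # plain lookup -> 16
    print(list(t.items(3, 6)))  # range search -> [(3, 9), (4, 16), (5, 25), (6, 36)]
    del t[4]                    # deletion: the slow path Tim notes above

The items(min, max) call is the "range searches" virtue mentioned above, something a dict can't offer without sorting its keys first.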
Now that's a perfectly height-balanced search tree that "chunks up" bunches of keys for storage and speed efficiency, and rarely needs more than a simple local adjustment to maintain balance. I expect that puts it at the fast end of what can be achieved with a balanced tree scheme.
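To put a rough number on why chunking helps: tree height grows as log base fanout of n, so wide nodes flatten the tree dramatically. A toy computation (the fanout values are illustrative, not Zope's actual node sizes):

    import math

    n = 1_000_000
    for fanout in (2, 32, 250):  # binary node vs. progressively chunkier nodes
        print(fanout, math.ceil(math.log(n, fanout)))  # heights: 20, 4, 3

Fewer levels means fewer node visits per lookup, and rebalancing usually stays local to one chunk, which matches the "simple local adjustment" remark.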
The ABC language (which Guido worked on before Python) used AVL trees for just about everything under the covers. It's not a coincidence that Python doesn't use balanced trees for anything <wink>.