[Python-Dev] Priority queue (binary heap) python code

Tim Peters tim@zope.com
Wed, 26 Jun 2002 18:09:46 -0400


[David Abrahams]
> This is really interesting. When I was at Dragon (well, actually,
> after Tim left and it became L&H), I ported my natural language
> parsing/understanding system from Python to C++ so it could run
> quickly enough for embedded devices.  The core of this system was an
> associative container, so I knew that its performance would be
> crucial.  I used C++ generics which made it really easy to swap in
> different associative container implementations, and I tried lots,
> including the red-black tree containers built into most C++
> implementations, and hash tables. My experience was that trying to
> come up with a hash function that would give a significant speed
> increase over the tree containers was extremely difficult, because it
> was really hard to come up with a good hash function.

There's more to a speedy hash implementation than just the hash function, of
course.

> Furthermore, even if I succeeded, it was like black magic: it was
> inconsistent across my test cases and there was no way to understand
> why it worked well, and to get a feeling for how it would scale to
> problems outside those cases.

Python's dictobject.c and .h have extensive comments about how Python's
dicts work.  Their behavior isn't mysterious, at least not after 10 years of
thinking about it <0.9 wink>.  Python's dicts also use tricks that I've
never seen in print -- many people have contributed clever tricks.

> I ended up hand-coding a two-level scheme based on binary searches in
> contiguous arrays which blew away anything I'd been able to do with a
> hash table. My conclusion was that for general-purpose use, the red-
> black tree was pretty good, despite its relatively high memory overhead
> of 3 pointers per node:

The example I posted built a mapping with a million entries.  A red-black
tree of that size needs to chase between 20 and 40 pointers to determine
membership.  By sheer instruction count alone, that's way more instructions
than the Python dict usually has to do, although it's comparable to the
number the BTree had to do.  The BTree probably has better cache behavior
than a red-black tree; for example, all the keys in the million-element
example were on the third level, and a query required exactly two
pointer-chases to get to the right leaf bucket.  All the rest is binary
search on 60-120 element contiguous vectors (in fact, sounds a lot like your
custom "two-level scheme")

> it places easy requirements on the user (supply a strict weak ordering)
> and provides predictable and smooth performance even asymptotically.

OTOH, it can be very hard to write an efficient, correct "<" ordering, while
testing just "equal or not?" can be easier and run quicker than that.  Dict
comparison is a good example from the Python core:  computing "<" for dicts
is a nightmare, but computing "==" for dicts is easy (contrast the
straightforward dict_equal() with the brain-busting dict_compare() +
characterize() pair).  This was one of the motivations for introducing "rich
comparisons".

> On the other hand, hashing requires that the user supply both a hash
> function and an equality detector which must agree with one-another,

I've rarely found this to be a challenge.  For example, for sets that
contain sets as elements, a suitable hash function can simply xor the hash
codes of the set elements.  Since equal sets have equal elements, such a
scheme delivers equal hash codes for equal sets, independent of the
order in which set elements get enumerated.  In contrast, defining a
sensible *total* ordering on sets is a delicate undertaking (yes, I know
that "strict weak ordering" is weaker than "total", but in real life you
couldn't buy a hot dog with the difference <wink>).
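
A sketch of that xor scheme, for concreteness:

    def set_hash(elements):
        # xor is commutative and associative, so the result doesn't depend
        # on the order in which the elements happen to be enumerated
        h = 0
        for element in elements:
            h ^= hash(element)
        return h

Equal sets then hash equal for free, and no ordering on the elements is ever
needed.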

> requires hand-tuning of the hash function for performance, and is rather
> more unpredictable.  We've been talking about adding hash-based
> containers to the C++ standard library but I'm reluctant on these
> grounds. It seems to me that when you really care about speed, some kind
> of hand-coded solution might be a better investment than trying to come
> up with a good hash function.
>
> I'm ready to believe that hashing is the most appropriate choice for
> Python, but I wonder what makes the difference?

Well, I'm intimately familiar with the details of how Python dicts and Zope
BTrees are implemented, down to staring at the machine instructions
generated, and there's no mystery here to me.  I'm not familiar with any of
the details of what you tried.  Understanding speed differences at this
level isn't a "general principles" kind of discussion.

I should note that Zope's BTrees pay a lot for playing nice with
persistence, about a factor of two:  upon visiting and leaving each BTree
node, there are messy test+branch sequences ensuring that the object isn't a
ghost, notifying the persistence machinery that fields have been accessed
and/or changed, and telling the persistence machinery when the object is no
longer in active use.  Most of these bookkeeping operations can fail too, so
there's another layer of "and did that fail?" test+branches around all that.
The saving grace for BTrees (why this doesn't cost a factor of, say, 10) is
that each BTree node contains a fair amount of "stuff", so that the guts of
each function can do a reasonable amount of useful work.  The persistence
overhead could be a disaster if visiting an object only moved one bit closer
to the result.

But Python's dicts aren't aware of persistence at all, and that did give
dicts an ~= factor-of-2 advantage in the example.  While they're still not
as zippy as dicts after factoring that out, B-Trees certainly aren't pigs.

BTW, note that Python's dicts originally catered only to string keys, as
they were the implementation of Python's namespaces, and dicts remain highly
optimized for that specific purpose.  Indeed, there's a distinct dict lookup
routine dedicated to dicts with string keys.  Namespaces have no compelling
use for range search or lexicographic traversal, just association, and peak
lookup speed was the design imperative.
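
To put it concretely, a name lookup is just a dict lookup with a str key
(the sketch below uses present-day spellings, but the point is the same):

    x = 42
    print(globals()["x"])          # module-level name lookup -> 42

    class C:
        pass

    obj = C()
    obj.attr = "hello"
    print(obj.__dict__["attr"])    # instance attribute lookup -> 'hello'

Every global name reference and most attribute references bottom out in that
kind of string-keyed dict probe, which is why the string-key fast path pays
off so broadly.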