[Python-Dev] Priority queue (binary heap) python code

David Abrahams <david.abrahams@rcn.com>
Wed, 26 Jun 2002 19:20:21 -0400


From: "Tim Peters" <tim@zope.com>

> There's more to a speedy hash implementation than just the hash function,
> of course.

'course.

> > Furthermore, even if I succeeded, it was like black magic: it was
> > inconsistent across my test cases and there was no way to understand
> > why it worked well, and to get a feeling for how it would scale to
> > problems outside those cases.
>
> Python's dictobject.c and .h have extensive comments about how Python's
> dicts work.  Their behavior isn't mysterious, at least not after 10 years
> of thinking about it <0.9 wink>.  Python's dicts also use tricks that I've
> never seen in print -- many people have contributed clever tricks.

I noticed that, and I think the next time I try hashing I'm going to steal
as much as possible from Python's implementation to get a head start.

Noticing that also left me with a question: how come everybody in the world
hasn't stolen as much as possible from the Python hashing implementation?
Are there a billion such 10-years'-tweaked implementations lying around
which all perform comparably well?
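
For concreteness, the piece I'd want to steal first is the string hash
itself.  Working from memory -- so take this as a rough paraphrase of what
stringobject.c does, not a verbatim copy -- the scheme is something like:

    def string_hash(s):
        # Paraphrase (from memory) of CPython's string hash; the real thing
        # works on C longs, so it wraps at the machine word size instead of
        # growing without bound the way this Python version does.
        if not s:
            return 0
        x = ord(s[0]) << 7
        for ch in s:
            x = (1000003 * x) ^ ord(ch)
        x ^= len(s)
        return x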

> The example I posted built a mapping with a million entries.  A red-black
> tree of that size needs to chase between 20 and 40 pointers to determine
> membership.  By sheer instruction count alone, that's way more instructions
> than the Python dict usually has to do, although it's comparable to the
> number the BTree had to do.  The BTree probably has better cache behavior
> than a red-black tree; for example, all the keys in the million-element
> example were on the third level, and a query required exactly two
> pointer-chases to get to the right leaf bucket.  All the rest is binary
> search on 60-120 element contiguous vectors (in fact, sounds a lot like
> your custom "two-level scheme")

Yeah, I think it ended up being something like that. Of course, the
container I ended up with used domain-specific knowledge which would have
been inappropriate for general-purpose use.
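
For my own reference, the shape being described is roughly this (an
illustrative sketch only, not Zope's BTree code): one bisection over an
index of separator keys, then a binary search inside a contiguous, sorted
leaf bucket.

    from bisect import bisect_right, bisect_left

    class TwoLevelMap:
        """Toy two-level lookup: an index of separator keys, plus
        contiguous sorted buckets searched with binary search."""

        def __init__(self, sorted_items, bucket_size=100):
            # split the already-sorted (key, value) pairs into buckets
            self.buckets = [sorted_items[i:i + bucket_size]
                            for i in range(0, len(sorted_items), bucket_size)]
            # the first key of each bucket serves as its separator
            self.index = [bucket[0][0] for bucket in self.buckets]

        def get(self, key, default=None):
            i = bisect_right(self.index, key) - 1   # one "pointer chase"
            if i < 0:
                return default
            bucket = self.buckets[i]
            j = bisect_left(bucket, (key,))         # binary search in bucket
            if j < len(bucket) and bucket[j][0] == key:
                return bucket[j][1]
            return default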

> > it places easy requirements on the user (supply a strict weak ordering)
> > and provides predictable and smooth performance even asymptotically.
>
> OTOH, it can be very hard to write an efficient, correct "<" ordering,
> while testing just "equal or not?" can be easier and run quicker than
> that.  Dict comparison is a good example from the Python core:  computing
> "<" for dicts is a nightmare, but computing "==" for dicts is easy
> (contrast the straightforward dict_equal() with the brain-busting
> dict_compare() + characterize() pair).

Well, OK, ordering hash tables is hard, unless the bucket count is a
deterministic function of the element count. If they were sorted
containers, of course, < would be a simple matter. And I assume that
testing equality still involves a lot of hashing...
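
To pin down what I mean by "a lot of hashing": the natural equality test
for hash-based mappings is something like the sketch below (my sketch, not
a transcription of Python's dict_equal()), and every b.get() in it has to
hash a key all over again.

    def mapping_equal(a, b):
        # equal size first, then look each key of one up in the other
        if len(a) != len(b):
            return False
        missing = object()              # sentinel distinct from any value
        for key, value in a.items():
            if b.get(key, missing) != value:
                return False
        return True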

Hmm, looking at the 3 C++ implementations of hashed containers that I have
available to me, only one provides operator<(), which is rather strange
since the other two implement operator== by first comparing sizes, then
iterating through consecutive elements of each set looking for a
difference. The implementation supplying operator<() uses a (IMO misguided)
design that rehashes incrementally, but it seems to me that if the more
straightforward approaches can implement operator==() as described,
operator<() shouldn't have to be a big challenge for an everyday hash
table.

I'm obviously missing something, but what...?

> This was one of the motivations for introducing "rich
> comparisons".

I don't see how that helps. Got a link? Or a clue?

> > On the other hand, hashing requires that the user supply both a hash
> > function and an equality detector which must agree with one-another,
>
> I've rarely found this to be a challenge.  For example, for sets that
> contain sets as elements, a suitable hash function can simply xor the hash
> codes of the set elements.  Since equal sets have equal elements, such a
> scheme delivers equal hash codes for equal sets, and independent of the
> order in which set elements get enumerated.  In contrast, defining a
> sensible *total* ordering on sets is a delicate undertaking (yes, I know
> that "strict weak ordering" is weaker than "total", but in real life you
> couldn't buy a hot dog with the difference <wink>).

I don't know what that means. If you represent your sets as sorted
containers, getting a strict weak ordering on sets is trivial; you just do
it with a lexicographical comparison of the two sequences.
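
In code, the two halves of that look something like this (a sketch of my
reading of Tim's xor suggestion, plus the sorted-container comparison I
have in mind; not anybody's shipping implementation):

    def set_hash(s):
        # order-independent: xor is commutative and associative, so the
        # enumeration order of the elements doesn't matter
        h = 0
        for elem in s:
            h ^= hash(elem)
        return h

    def set_less(a, b):
        # strict weak ordering for sets kept (or copied) in sorted order:
        # plain lexicographical comparison of the element sequences;
        # assumes the elements themselves are totally ordered
        return sorted(a) < sorted(b)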

> > I'm ready to believe that hashing is the most appropriate choice for
> > Python, but I wonder what makes the difference?
>
> Well, I'm intimately familiar with the details of how Python dicts and
> Zope BTrees are implemented, down to staring at the machine instructions
> generated, and there's no mystery here to me.  I'm not familiar with any
> of the details of what you tried.  Understanding speed differences at
> this level isn't a "general principles" kind of discussion.

No, I suppose not. But Python's dicts are general-purpose containers, and
you can put any key you like in there. It's still surprising to me, given
my (much less than 10 years') experience with hash implementations, that
you can design something that performs well over all those different cases.

> I should note that Zope's BTrees pay a lot for playing nice with
> persistence, about a factor of two:  upon visiting and leaving each BTree
> node, there are messy test+branch sequences ensuring that the object isn't
> a ghost, notifying the persistence machinery that fields have been
> accessed and/or changed, and telling the persistence machinery when the
> object is no longer in active use.  Most of these bookkeeping operations
> can fail too, so there's another layer of "and did that fail?"
> test+branches around all that.

Aww, heck, you just need a good C++ exception-handling implementation to
get rid of the error-checking overheads ;-)

> The saving grace for BTrees (why this doesn't cost a factor of, say, 10)
> is that each BTree node contains a fair amount of "stuff", so that the
> guts of each function can do a reasonable amount of useful work.  The
> persistence overhead could be a disaster if visiting an object only moved
> one bit closer to the result.
>
> But Python's dicts aren't aware of persistence at all, and that did give
> dicts an ~= factor-of-2 advantage in the example.  While they're still not
> as zippy as dicts after factoring that out, B-Trees certainly aren't pigs.
>
> BTW, note that Python's dicts originally catered only to string keys, as
> they were the implementation of Python's namespaces, and dicts remain
> highly optimized for that specific purpose.  Indeed, there's a distinct
> dict lookup routine dedicated to dicts with string keys.  Namespaces have
> no compelling use for range search or lexicographic traversal, just
> association, and peak lookup speed was the design imperative.

Thanks for the perspective!

still-learning-ly y'rs,
dave