From: "Tim Peters" <tim@zope.com>
There's more to a speedy hash implementation than just the hash function, of course.
'course.
Furthermore, even if I succeeded, it was like black magic: it was inconsistent across my test cases and there was no way to understand why it worked well, or to get a feeling for how it would scale to problems outside those cases.
Python's dictobject.c and .h have extensive comments about how Python's dicts work. Their behavior isn't mysterious, at least not after 10 years of thinking about it <0.9 wink>. Python's dicts also use tricks that I've never seen in print -- many people have contributed clever tricks.
I noticed that, and I think the next time I try hashing I'm going to steal as much as possible from Python's implementation to get a head start. Noticing that also left me with a question: how come everybody in the world hasn't stolen as much as possible from the Python hashing implementation? Are there a billion such 10-years'-tweaked implementations lying around which all perform comparably well?
The example I posted built a mapping with a million entries. A red-black tree of that size needs to chase between 20 and 40 pointers to determine membership. By sheer instruction count alone, that's way more instructions than the Python dict usually has to do, although it's comparable to the number the BTree had to do. The BTree probably has better cache behavior than a red-black tree; for example, all the keys in the million-element example were on the third level, and a query required exactly two pointer-chases to get to the right leaf bucket. All the rest is binary search on 60-120 element contiguous vectors (in fact, it sounds a lot like your custom "two-level scheme").
Yeah, I think it ended up being something like that. Of course, the container I ended up with used domain-specific knowledge which would have been inappropriate for general-purpose use.
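For concreteness, here's a minimal sketch of the lookup shape described above: two pointer-chases, then binary search over small contiguous vectors. The node layout and names are illustrative assumptions on my part, not Zope's actual BTree code.

    #include <algorithm>
    #include <vector>

    // Sketch only: a shallow B-tree-ish node. Interior nodes hold sorted
    // keys plus child pointers; leaves hold a sorted, contiguous bucket.
    // The 60-120 fan-out comes from the description above; the layout is
    // invented for illustration.
    struct Node {
        std::vector<int>   keys;      // sorted, ~60-120 entries per node
        std::vector<Node*> children;  // empty in a leaf; size keys+1 otherwise
        bool leaf() const { return children.empty(); }
    };

    // Membership test: in the million-entry example this chases exactly two
    // child pointers (root -> interior -> leaf); everything else is binary
    // search on contiguous vectors, which is kind to the cache.
    bool contains(const Node* n, int key) {
        while (!n->leaf()) {
            auto it = std::upper_bound(n->keys.begin(), n->keys.end(), key);
            n = n->children[it - n->keys.begin()];
        }
        return std::binary_search(n->keys.begin(), n->keys.end(), key);
    }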
it places easy requirements on the user (supply a strict weak ordering) and provides predictable and smooth performance even asymptotically.
OTOH, it can be very hard to write an efficient, correct "<" ordering, while testing just "equal or not?" can be easier and run quicker than that. Dict comparison is a good example from the Python core: computing "<" for dicts is a nightmare, but computing "==" for dicts is easy (contrast the straightforward dict_equal() with the brain-busting dict_compare() + characterize() pair).
Well, OK, ordering hash tables is hard, unless the bucket count is a deterministic function of the element count. If they were sorted containers, of course, < would be a simple matter. And I assume that testing equality still involves a lot of hashing... Hmm, looking at the 3 C++ implementations of hashed containers that I have available to me, only one provides operator<(), which is rather strange since the other two implement operator== by first comparing sizes, then iterating through consecutive elements of each set looking for a difference. The implementation supplying operator<() uses an (IMO misguided) design that rehashes incrementally, but it seems to me that if the more straightforward approaches can implement operator==() as described, operator<() shouldn't have to be a big challenge for an everyday hash table. I'm obviously missing something, but what...?
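To pin down why "==" is the easy direction, here's a hedged sketch, mine rather than any of the three implementations mentioned: sizes first, then per-element membership probes, with no assumptions about enumeration order. The iterate-in-step scheme described above quietly assumes that equal containers enumerate equal elements in the same order, and an operator<() would need a genuinely canonical order, which hash buckets don't supply.

    #include <unordered_set>

    // Order-independent equality for a hashed container: compare sizes,
    // then probe b for every element of a. No iteration-order agreement
    // is required, and only lookups (hence hashing) are involved.
    bool equal_sets(const std::unordered_set<int>& a,
                    const std::unordered_set<int>& b) {
        if (a.size() != b.size()) return false;
        for (int x : a)
            if (b.find(x) == b.end()) return false;
        return true;
    }
    // No such order-free trick exists for operator<(): a lexicographic
    // comparison needs some canonical element order, and a hash table's
    // iteration order varies with bucket count and insertion history --
    // hence "hard, unless the bucket count is a deterministic function
    // of the element count".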
This was one of the motivations for introducing "rich comparisons".
I don't see how that helps. Got a link? Or a clue?
On the other hand, hashing requires that the user supply both a hash function and an equality detector which must agree with one another,
I've rarely found this to be a challenge. For example, for sets that contain sets as elements, a suitable hash function can simply xor the hash codes of the set elements. Since equal sets have equal elements, such a scheme delivers equal hash codes for equal sets, and independent of the order in which set elements get enumerated. In contrast, defining a sensible *total* ordering on sets is a delicate undertaking (yes, I know that "strict weak ordering" is weaker than "total", but in real life you couldn't buy a hot dog with the difference <wink>).
I don't know what that means. If you represent your sets as sorted containers, getting a strict weak ordering on sets is trivial; you just do it with a lexicographical comparison of the two sequences.
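Both halves of this exchange fit in a few lines. A sketch, assuming std::set<int> as the sorted representation: the xor scheme makes equal sets hash equal regardless of enumeration order, and the sorted representation makes a strict weak ordering just a lexicographical comparison.

    #include <algorithm>
    #include <functional>
    #include <set>

    // Xor the element hashes: xor is commutative and associative, so the
    // result ignores enumeration order, and equal sets (equal elements)
    // get equal hash codes by construction.
    std::size_t hash_set(const std::set<int>& s) {
        std::size_t h = 0;
        for (int x : s) h ^= std::hash<int>{}(x);
        return h;
    }

    // With a sorted representation, a strict weak ordering on sets is just
    // lexicographic comparison of the two sequences.
    bool less_set(const std::set<int>& a, const std::set<int>& b) {
        return std::lexicographical_compare(a.begin(), a.end(),
                                            b.begin(), b.end());
    }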
I'm ready to believe that hashing is the most appropriate choice for Python, but I wonder what makes the difference?
Well, I'm intimately familiar with the details of how Python dicts and Zope BTrees are implemented, down to staring at the machine instructions generated, and there's no mystery here to me. I'm not familiar with any of the details of what you tried. Understanding speed differences at this level isn't a "general principles" kind of discussion.
I should note that Zope's BTrees pay a lot for playing nice with persistence, about a factor of two: upon visiting and leaving each BTree node, there are messy test+branch sequences ensuring that the object isn't a ghost, notifying the persistence machinery that fields have been accessed and/or changed, and telling the persistence machinery when the object is no longer in active use. Most of these bookkeeping operations can fail too, so there's another layer of "and did that fail?" test+branches around all that.
Aww, heck, you just need a good C++ exception-handling implementation to get rid of the error-checking overheads ;-)
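The shape of that per-node overhead, as a hedged sketch: every helper below (unghost, note_access, release) is a hypothetical stand-in for the real persistence hooks, not Zope's API.

    // Hypothetical stand-ins for the persistence hooks described above.
    struct PNode { bool ghost = false; /* plus keys, children, ... */ };
    bool unghost(PNode* n)   { n->ghost = false; return true; }  // load state; can fail
    bool note_access(PNode*) { return true; }  // record that fields were read; can fail
    void release(PNode*)     {}                // node no longer in active use

    // Each node visit pays the ghost check, the access notification, the
    // release, and an "and did that fail?" branch around the fallible ones.
    bool visit(PNode* n, bool (*work)(PNode*)) {
        if (!unghost(n))     return false;
        if (!note_access(n)) { release(n); return false; }
        bool ok = work(n);   // the useful part
        release(n);
        return ok;
    }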
The saving grace for BTrees (why this doesn't cost a factor of, say, 10) is that each BTree node contains a fair amount of "stuff", so that the guts of each function can do a reasonable amount of useful work. The persistence overhead could be a disaster if visiting an object only moved one bit closer to the result.
But Python's dicts aren't aware of persistence at all, and that did give dicts an ~= factor-of-2 advantage in the example. While they're still not as zippy as dicts after factoring that out, B-Trees certainly aren't pigs.
No, I suppose not. But Python's dicts are general-purpose containers, and you can put any key you like in there. It's still surprising to me, given my (much less than 10 years') experience with hash implementations, that you can design something that performs well over all those different cases.
BTW, note that Python's dicts originally catered only to string keys, as they were the implementation of Python's namespaces, and dicts remain highly optimized for that specific purpose. Indeed, there's a distinct dict lookup routine dedicated to dicts with string keys. Namespaces have no compelling use for range search or lexicographic traversal, just association, and peak lookup speed was the design imperative.
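As I understand it, CPython arranges this with a per-dict lookup function pointer that starts on the string-only routine and gets swapped to the general one the first time a non-string key appears. A hedged C++ rendering of that dispatch idea, with all names invented:

    #include <string>
    #include <vector>

    // Illustrative only: a dict that begins on a string-only fast path.
    struct Entry { std::string key; int value; };

    struct Dict {
        std::vector<Entry> slots;  // stand-in for the real open-addressed table

        // Fast path: every key is known to be a string, so probing needs
        // only a cheap byte-wise compare -- no general comparison dispatch.
        static int* string_lookup(Dict& d, const std::string& k) {
            for (Entry& e : d.slots)
                if (e.key == k) return &e.value;
            return nullptr;
        }

        // General path: in CPython this is the routine that may end up
        // calling arbitrary user comparison code; placeholder body here.
        static int* general_lookup(Dict& d, const std::string& k) {
            return string_lookup(d, k);  // placeholder
        }

        // Namespaces never insert non-string keys, so they stay on the
        // fast path for their whole lifetime.
        int* (*lookup)(Dict&, const std::string&) = string_lookup;
    };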
Thanks for the perspective!
still-learning-ly y'rs, dave