[Python-Dev] Priority queue (binary heap) python code

Tim Peters tim.one@comcast.net
Fri, 28 Jun 2002 01:01:41 -0400


[David Abrahams]
> ...
> Noticing that also left me with a question: how come everybody in
> the world hasn't stolen as much as possible from the Python hashing
> implementation?  Are there a billion such 10-years'-tweaked
> implementations lying around which all perform comparably well?

It's a Mystery, and in all directions.  Python has virtually no code from,
say, Tcl or Perl either, and the latter certainly do some things better than
Python does them.  I've studied all their hash implementations, but didn't
find anything worth stealing <wink>; OTOH, multiple attempts by multiple
groups to steal Perl's regexp engine years ago fizzled out in a tarpit of
frustration.

Curious:  Python independently developed a string hash *very* similar to
what later became "the standard" Fowler-Noll-Vo string hash:

    http://www.isthe.com/chongo/tech/comp/fnv/

The multiplier is different, and the initial value, but that's it.  I'm sure
there was no communication in either direction.  So ya, given enough time, a
billion other projects will rediscover it too.
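To make the family resemblance concrete, here's a sketch (not CPython's actual C code, and with made-up function names) of 32-bit FNV-1 next to the shape of Python's old string hash.  Both run the same multiply-then-XOR loop; only the multiplier and the starting value differ:

```python
MASK32 = 0xFFFFFFFF

def fnv1_32(data):
    """32-bit FNV-1: multiply by the FNV prime, then XOR in each byte.
    Constants are the published 32-bit FNV parameters."""
    h = 2166136261                        # FNV offset basis
    for byte in data:
        h = (h * 16777619) & MASK32      # FNV prime
        h ^= byte
    return h

def old_python_style_hash(data):
    """Simplified sketch of CPython's old string hash: same
    multiply/XOR loop, different multiplier (1000003) and a
    different initial value (first byte shifted left 7)."""
    if not data:
        return 0
    h = (data[0] << 7) & MASK32
    for byte in data:
        h = ((h * 1000003) ^ byte) & MASK32
    h ^= len(data)
    return h
```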

>> OTOH, it can be very hard to write an efficient, correct "<" ordering,
>> while testing just "equal or not?" can be easier and run quicker than
>> that.  Dict comparison is a good example from the Python core:
>> computing "<" for dicts is a nightmare, but computing "==" for dicts is
>> easy (contrast the straightforward dict_equal() with the brain-busting
>> dict_compare() + characterize() pair).

> Well, OK, ordering hash tables is hard, unless the bucket count is a
> deterministic function of the element count.

I don't know how the latter could help; for that matter, I'm not even sure
what it means.

> If they were sorted containers, of course, < would be a simple matter.

Yes.

> And I assume that testing equality still involves a lot of hashing...

No more times than the common length of the two dicts.  It's just:

def dict_equal(dict1, dict2):
    if len(dict1) != len(dict2):
        return False
    for key, value in dict1.iteritems():
        # "not ... ==" rather than "!=" so only the type's "==" is needed
        if key not in dict2 or not value == dict2[key]:
            return False
    return True

Searching dict2 for key *may* involve hashing key again (or it may not; for
example, Python string objects are immutable and cache their 32-bit hash in
the string object the first time it's computed).

There's a world of pain involved in the "==" there, though, as a dict can
very well have itself as a value in itself, and the time required for
completion appears to be exponential in some pathological cases of that kind
(Python does detect the unbounded recursion in such cases -- eventually).
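A tiny illustration of the pathology (a sketch; the exact exception -- RuntimeError in old Pythons, RecursionError in current ones -- depends on the interpreter version):

```python
# Two distinct dicts, each containing itself as a value: legal to
# build, painful to compare.
d1 = {}
d1['self'] = d1
d2 = {}
d2['self'] = d2

# d1 == d1 short-circuits on object identity, but d1 == d2 must
# recurse through the values forever; CPython eventually stops it
# with a recursion error rather than looping without bound.
try:
    d1 == d2
except RecursionError:
    print("recursion detected")
```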

> Hmm, looking at the 3 C++ implementations of hashed containers that I have
> available to me, only one provides operator<(), which is rather strange
> since the other two implement operator == by first comparing sizes, then
> iterating through consecutive elements of each set looking for a
> difference. The implementation supplying operator<() uses a (IMO
> misguided) design that rehashes incrementally, but it seems to me that if
> the more straightforward approaches can implement operator==() as
> described, operator<() shouldn't have to be a big challenge for an
> everyday hash table.
>
> I'm obviously missing something, but what...?

I don't know, but I didn't follow what you were saying (like, "rehashes
incrementally" doesn't mean anything to me).  If there's a simpler way to
get "the right" answer, I'd love to see it.  I once spent two hours trying
to prove that the dict_compare() + characterize() pair in Python was
correct, but gave up in a mushrooming forest of end cases.

In The Beginning, Python implemented dict comparison by materializing the
.items(), sorting both, and then doing list comparison.  The correctness of
that was easy to show.  But it turned out that in real life all anyone ever
used was == comparison on dicts, and sorting was enormously expensive
compared to what was possible.  characterize() is a very clever hack Guido
dreamt up to get the same result in no more than two passes -- but I've
never been sure it's a thoroughly correct hack.  OTOH, since nobody appears
to care about "<" for dicts, if it's wrong we may never know that.
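The original scheme can be sketched in a few lines (a sketch only, under the assumption that all keys and values are mutually orderable -- which Python of that era guaranteed and modern Python does not):

```python
def old_style_dict_cmp(d1, d2):
    """Sketch of Python's original dict comparison: materialize the
    items, sort both lists, and compare lexicographically.
    Correctness is easy to see; the two sorts make it far more
    expensive than a plain equality scan."""
    items1 = sorted(d1.items())
    items2 = sorted(d2.items())
    if items1 < items2:
        return -1
    if items1 > items2:
        return 1
    return 0
```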

>> This was one of the motivations for introducing "rich comparisons".

> I don't see how that helps. Got a link? Or a clue?

Sorry, I don't understand the question.  When Python funneled all
comparisons through cmp(), it wasn't possible for a type implementation to
do anything faster for, say, "==", because it had no idea why cmp() was
being called.  Allowing people to ask for the specific comparison they
wanted is part of what "rich comparisons" was about, and speed was one of
the reasons for adopting it.  Comparing strings for equality/inequality
alone is also faster than resolving their full ordering.  And
complex numbers have no accepted "<" relation at all.  So comparing dicts
isn't the only place it's easier and quicker to restrict the burden on the
type to implementing equality testing.  For user-defined types, I've often
found it *much* easier.  For example, I can easily tell whether two
chessboards are equal (do they have the same pieces on the same squares?),
but a concept of "<" for chessboards is strained.
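The chessboard point can be sketched in modern terms (a toy class, not anyone's real code): implement only equality and leave "<" undefined.

```python
class Chessboard:
    """Toy sketch: equality is natural (same pieces on the same
    squares), but there is no obvious "<" for boards, so we define
    only __eq__ (and a matching __hash__)."""

    def __init__(self, pieces):
        # pieces: mapping of square -> piece, e.g. {'e1': 'K', 'e8': 'k'}
        self.pieces = dict(pieces)

    def __eq__(self, other):
        if not isinstance(other, Chessboard):
            return NotImplemented
        return self.pieces == other.pieces

    def __hash__(self):
        # hash the board as a frozen set of (square, piece) pairs
        return hash(frozenset(self.pieces.items()))
```

With no ordering methods defined, `board1 < board2` simply raises TypeError, which is the honest answer.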

> I don't know what that means.

There's too much of that on both sides here, so I declare this mercifully
ended now <0.9 wink>.

> If you represent your sets as sorted containers, getting a strict weak
> ordering on sets is trivial; you just do it with a lexicographical
> comparison of the two sequences.

And if you don't, that conclusion doesn't follow.
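For the record, the sorted-container scheme David describes really is trivial -- here's a sketch (assuming all elements are mutually orderable; Python's hashed containers don't keep elements sorted, which is exactly the rub):

```python
def set_less(a, b):
    """Strict weak ordering on sets via lexicographic comparison of
    their sorted elements -- trivial *if* the containers are kept
    sorted; otherwise you pay for the sorts here."""
    return sorted(a) < sorted(b)
```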

> ...
> No, I suppose not. But python's dicts are general-purpose containers, and
> you can put any key you like in there. It's still surprising to
> me given my (much less than 10 years') experience with hash
> implementations that you can design something that performs well over
> all those different cases.

You probably can't a priori, but after a decade people stumble into all the
cases that don't work well, and you eventually fiddle the type-specific hash
functions and the general implementation until surprises appear to stop.  It
remains a probabilistic method, though, and there are no guarantees.  BTW, I
believe that of all Python's builtin types, only the hash function for
integers remains in its original form (hash(i) == i).  So even if I don't
want to, I'm forced to agree that finding a good hash function isn't
trivial.

[on Zope's B-Trees]
> Aww, heck, you just need a good C++ exception-handling implementation to
> get rid of the error-checking overheads ;-)

I'd love to use C++ for this.  This is one of those things that defines 5
families of 4 related data structures each via a pile of .c and .h files
that get #include'd and recompiled 5 times after #define'ing a pile of
macros.  It would be *so* much more pleasant using templates.

> ...
> Thanks for the perspective!
>
> still-learning-ly y'rs,

You're too old for that now -- start making money instead <wink>.

the-psf-will-put-it-to-good-use-ly y'rs  - tim