[Python-Dev] Priority queue (binary heap) python code

David Abrahams <david.abrahams@rcn.com>
Fri, 28 Jun 2002 01:56:23 -0400

----- Original Message -----
From: "Tim Peters" <tim.one@comcast.net>
To: "David Abrahams" <david.abrahams@rcn.com>
Cc: <python-dev@python.org>
Sent: Friday, June 28, 2002 1:01 AM
Subject: RE: [Python-Dev] Priority queue (binary heap) python code

> [David Abrahams]
> > ...
> > Noticing that also left me with a question: how come everybody in
> > the world hasn't stolen as much as possible from the Python hashing
> > implementation?  Are there a billion such 10-years'-tweaked
> > implementations lying around which all perform comparably well?
> It's a Mystery, and in all directions.  Python has virtually no code from,
> say, Tcl or Perl either, and the latter certainly do some things better than
> Python does them.  I've studied all their hash implementations, but didn't
> find anything worth stealing <wink>;

Well of course not!

> OTOH, multiple attempts by multiple
> groups to steal Perl's regexp engine years ago fizzled out in a tarpit of
> frustration.

Oh, I had the impression that Python's re *was* pilfered from Perl.

> Curious:  Python independently developed a string hash *very* similar to
> what later became "the standard" Fowler-Noll-Vo string hash:
>     http://www.isthe.com/chongo/tech/comp/fnv/
> The multiplier is different, and the initial value, but that's it.  I'm sure
> there was no communication in either direction.  So ya, given enough time, a
> billion other projects will rediscover it too.
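For concreteness, here is a sketch of 32-bit FNV-1a next to a hash of the same multiply/xor shape; the constants in the second function follow my recollection of the old CPython string hash, so treat them as illustrative rather than an exact transcription:

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: fold each byte in with xor, then multiply by the FNV prime."""
    h = 2166136261                               # FNV-1a 32-bit offset basis
    for b in data:
        h = ((h ^ b) * 16777619) % (1 << 32)     # 16777619 is the 32-bit FNV prime
    return h

def multiply_xor_hash(data: bytes) -> int:
    """Same multiply/xor shape with different constants, in the spirit of the
    CPython 2.x string hash (constants here are from memory, not gospel)."""
    if not data:
        return 0
    h = data[0] << 7
    for b in data:
        h = ((1000003 * h) ^ b) % (1 << 32)
    return h ^ len(data)
```

The two differ only in the seed and the multiplier, which is exactly the similarity Tim is pointing at.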


> > Well, OK, ordering hash tables is hard, unless the bucket count is a
> > deterministic function of the element count.
> I don't know how the latter could help; for that matter, I'm not even sure
> what it means.

I know what I meant, but I was wrong. My brain cell musta jammed. Ordering
hash tables is hard if collisions are possible.
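To illustrate: two equal hash tables need not iterate in the same order, so any ordering derived from iteration order is unstable. A minimal sketch in modern Python, where dicts happen to preserve insertion order:

```python
d1 = {"a": 1, "b": 2}
d2 = {"b": 2, "a": 1}        # the same mapping, built in a different order

assert d1 == d2              # equal as mappings...
assert list(d1) != list(d2)  # ...yet they iterate in different orders
```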

> > And I assume that testing equality still involves a lot of hashing...
> No more times than the common length of the two dicts.

Of course.

> It's just:
> def dict_equal(dict1, dict2):
>     if len(dict1) != len(dict2):
>         return False
>     for key, value in dict1.iteritems():
>         if key not in dict2 or not value == dict2[key]:
>              return False
>     return True
> Searching dict2 for key *may* involve hashing key again (or it may not; for
> example, Python string objects are immutable and cache their 32-bit hash in
> the string object the first time it's computed).

Tricky. I guess a C++ object could be designed to cooperate with hash
tables in that way also.
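A user-defined Python class can play the same caching trick as CPython's string objects. A sketch (the class name `Interned` is made up for illustration):

```python
class Interned:
    """Immutable wrapper that caches its hash, like CPython str objects do."""
    __slots__ = ("value", "_hash")

    def __init__(self, value):
        self.value = value
        self._hash = None            # not computed yet

    def __hash__(self):
        if self._hash is None:       # computed once, reused on every lookup
            self._hash = hash(self.value)
        return self._hash

    def __eq__(self, other):
        return isinstance(other, Interned) and self.value == other.value
```

The caveat, as with strings, is that the wrapped value must never mutate after the hash is cached.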

> There's a world of pain involved in the "==" there, though, as a dict can
> very well have itself as a value in itself, and the time required for
> completion appears to be exponential in some pathological cases of that
> (Python does detect the unbounded recursion in such cases -- eventually).

Yuck. I wouldn't expect any C++ implementation to handle that issue.
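The pathological case Tim describes is easy to construct; in modern CPython the eventual detection surfaces as a RecursionError:

```python
a = {}
a["self"] = a          # a dict that contains itself as a value
b = {}
b["self"] = b          # a second, distinct cyclic dict

assert a == a          # fast: identity short-circuits the per-value comparison

try:
    a == b             # distinct cycles: the comparison recurses...
except RecursionError:
    pass               # ...until CPython detects the unbounded recursion
```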

> > Hmm, looking at the 3 C++ implementations of hashed containers that I have
> > available to me, only one provides operator<(), which is rather strange
> > since the other two implement operator== by first comparing sizes, then
> > iterating through consecutive elements of each set looking for a
> > difference. The implementation supplying operator<() uses a (IMO
> > misguided) design that rehashes incrementally, but it seems to me that if
> > the more straightforward approaches can implement operator==() as
> > described, operator<() shouldn't have to be a big challenge for an
> > everyday hash table.
> >
> > I'm obviously missing something, but what...?
> I don't know, but I didn't follow what you were saying (like, "rehashes
> incrementally" doesn't mean anything to me).

Get ahold of MSVC7 and look at the hash_set implementation. IIRC how
Plauger described it, it is constantly maintaining the load factor across
insertions, so there's never a big cost to grow the table. It also keeps
the items in each bucket sorted, so hash table comparisons are a lot
easier. My gut tells me that this isn't worth what you pay for it, but so
far my gut hasn't had very much of any value to say about hashing...
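A toy sketch of the sorted-bucket idea in Python (my own illustration of the design described above, not that library's actual code):

```python
import bisect

class SortedBucketSet:
    """Toy hash set that keeps each bucket sorted, so that two sets with the
    same bucket count compare positionally, bucket by bucket."""

    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def add(self, item):
        bucket = self.buckets[hash(item) % len(self.buckets)]
        i = bisect.bisect_left(bucket, item)
        if i == len(bucket) or bucket[i] != item:  # sorted, duplicate-free insert
            bucket.insert(i, item)

    def __eq__(self, other):
        # With identical bucket counts and sorted buckets, equality reduces to
        # a plain positional comparison -- no per-element lookups needed.
        return self.buckets == other.buckets
```

Insertion order no longer matters to the comparison, which is the whole point; whether the per-insert sorting cost is worth it is the question David's gut is raising.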

The other implementations seem to implement equality as something like:

template <class T, class Hash, class Compare, class Allocator>
bool operator==(const hash_set<T, Hash, Compare, Allocator>& x,
                const hash_set<T, Hash, Compare, Allocator>& y)
{
    return x.size() == y.size() && std::equal(x.begin(), x.end(), y.begin());
}

Which has to be a bug unless they've got a very strange way of defining
equality, or some kind of ordering built into the iterators.
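Positional comparison like std::equal is only correct if both containers iterate in the same order. A sketch of an order-insensitive comparison, done by counting occurrences:

```python
from collections import Counter

def multiset_equal(xs, ys):
    """Order-insensitive equality of two collections of hashable elements."""
    return Counter(xs) == Counter(ys)

# Positional comparison is order-sensitive, hence the bug:
assert [2, 1] != [1, 2]
# Counting fixes it:
assert multiset_equal([2, 1], [1, 2])
```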

> If there's a simpler way to
> get "the right" answer, I'd love to see it.  I once spent two hours trying
> to prove that the dict_compare() + characterize() pair in Python was
> correct, but gave up in a mushrooming forest of end cases.

I think it's a tougher problem in Python than in languages with value
semantics, where an object can't actually contain itself.

> In The Beginning, Python implemented dict comparison by materializing the
> .items(), sorting both, and then doing list comparison.  The correctness of
> that was easy to show.  But it turned out that in real life all anyone ever
> used was == comparison on dicts, and sorting was enormously expensive
> compared to what was possible.  characterize() is a very clever hack that was
> dreamt up to get the same result in no more than two passes -- but I've
> never been sure it's a thoroughly correct hack.
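The "In The Beginning" scheme Tim describes can be sketched like this (assuming all keys and values are mutually comparable, which Python 2 guaranteed and Python 3 does not):

```python
def dict_compare_sorted(d1, d2):
    """Three-way dict comparison via sorted item lists: returns -1, 0, or 1."""
    items1 = sorted(d1.items())
    items2 = sorted(d2.items())
    return (items1 > items2) - (items1 < items2)
```

Easy to prove correct, but it pays for two sorts even when the caller only wanted `==`.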

??? I can't find characterize() described anywhere, nor can I find it on my
trusty dict objects:

>>> help({}.characterize)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'dict' object has no attribute 'characterize'

> OTOH, since nobody appears
> to care about "<" for dicts, if it's wrong we may never know that.

As long as the Python associative world is built around hash + ==, you're
probably OK.

> >> This was one of the motivations for introducing "rich comparisons".
> > I don't see how that helps. Got a link? Or a clue?
> Sorry, I don't understand the question.

Well, you answered it pretty damn well anyway...

> Comparing strings for equality/inequality
> alone is also done faster than needing to resolve string ordering.  And
> complex numbers have no accepted "<" relation at all.

Yeah, good point. C++ has a less<T>/operator< dichotomy mostly to
accommodate pointer types in segmented memory models, but there's no such
accommodation for complex<T>.
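Python takes the same line: complex supports equality but refuses to invent an ordering.

```python
# Complex numbers compare for equality just fine:
assert (1 + 2j) == complex(1, 2)

try:
    (1 + 2j) < (2 + 1j)
except TypeError:
    pass    # unorderable: Python raises rather than pick an arbitrary "<"
```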

> So comparing dicts
> isn't the only place it's easier and quicker to restrict the burden on a
> type to implementing equality testing.  For user-defined types, I've often
> found it *much* easier.  For example, I can easily tell whether two
> chessboards are equal (do they have the same pieces on the same squares?),
> but a concept of "<" for chessboards is strained.

Strained, maybe, but easy. You can do a lexicographic comparison of the
square contents.
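The lexicographic idea in a sketch (the board representation here is made up for illustration):

```python
def board_key(board):
    """board: 8x8 list of one-character piece codes, '.' for an empty square.
    Tuples compare lexicographically, so '<' and '==' both come for free."""
    return tuple(tuple(row) for row in board)

empty = [["."] * 8 for _ in range(8)]
with_pawn = [row[:] for row in empty]
with_pawn[6][0] = "P"          # a lone pawn on a2

assert board_key(empty) != board_key(with_pawn)
# The ordering is total, even if it means nothing chess-wise:
assert board_key(empty) < board_key(with_pawn) or board_key(with_pawn) < board_key(empty)
```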

> > I don't know what that means.
> There's too much of that on both sides here, so I declare this mercifully
> ended now <0.9 wink>.

I, of course, will drag it on to the bitter end.

> [on Zope's B-Trees]
> > Aww, heck, you just need a good C++ exception-handling implementation to
> > get rid of the error-checking overheads ;-)
> I'd love to use C++ for this.  This is one of those things that defines 5
> families of 4 related data structures each via a pile of .c and .h files
> that get #include'd and recompiled 5 times after #define'ing a pile of
> macros.  It would be *so* much more pleasant using templates.

I have *just* the library for you. Works with 'C', too!
Believe it or not, people are still pushing this technology to improve
compilation times and debuggability of the result.

> > ...
> > Thanks for the perspective!
> >
> > still-learning-ly y'rs,
> You're too old for that now -- start making money instead <wink>.

Sorry, I'll try hard to grow up now.