[David Abrahams]
... Noticing that also left me with a question: how come everybody in the world hasn't stolen as much as possible from the Python hashing implementation? Are there a billion such 10-years'-tweaked implementations lying around which all perform comparably well?
----- Original Message -----
From: "Tim Peters" <tim.one@comcast.net>
To: "David Abrahams" <david.abrahams@rcn.com>
Cc: <python-dev@python.org>
Sent: Friday, June 28, 2002 1:01 AM
Subject: RE: [Python-Dev] Priority queue (binary heap) python code

It's a Mystery, and in all directions. Python has virtually no code from, say, Tcl or Perl either, and the latter certainly do some things better than Python does them. I've studied all their hash implementations, but didn't find anything worth stealing <wink>;
Well of course not!
OTOH, multiple attempts by multiple groups to steal Perl's regexp engine years ago fizzled out in a tarpit of frustration.
Oh, I had the impression that Python's re *was* pilfered from Perl.
Curious: Python independently developed a string hash *very* similar to what later became "the standard" Fowler-Noll-Vo string hash:
http://www.isthe.com/chongo/tech/comp/fnv/
The multiplier is different, and the initial value, but that's it. I'm sure there was no communication in either direction. So ya, given enough time, a billion other projects will rediscover it too.
Nifty.
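Both recurrences are simple enough to sketch side by side. This is a from-memory reconstruction for comparison, not either project's actual code: the constants are the published 32-bit FNV-1 parameters, and 1000003 was Python 2's string-hash multiplier, with everything masked to 32 bits here.

```python
MASK = 0xFFFFFFFF

def fnv1_32(data):
    """32-bit FNV-1: multiply by the FNV prime, then XOR in the next byte."""
    h = 2166136261                         # FNV offset basis
    for b in data:
        h = ((h * 16777619) ^ b) & MASK    # FNV prime
    return h

def py2_string_hash(data):
    """Python 2's string hash, truncated to 32 bits for comparison:
    the same multiply-then-XOR inner step, but a different multiplier
    (1000003) and a different initial value (first byte shifted left 7)."""
    if not data:
        return 0
    h = (data[0] << 7) & MASK
    for b in data:
        h = ((1000003 * h) ^ b) & MASK
    h ^= len(data)
    return h
```

The inner loops are term-for-term the same shape; only the two constants and the setup differ, which is the coincidence being marveled at.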
Well, OK, ordering hash tables is hard, unless the bucket count is a deterministic function of the element count.
I don't know how the latter could help; for that matter, I'm not even sure what it means.
I know what I meant, but I was wrong. My brain cell musta jammed. Ordering hash tables is hard if collisions are possible.
And I assume that testing equality still involves a lot of hashing...
No more times than the common length of the two dicts.
Of course.
It's just:
    def dict_equal(dict1, dict2):
        if len(dict1) != len(dict2):
            return False
        for key, value in dict1.iteritems():
            if key not in dict2 or not value == dict2[key]:
                return False
        return True
Searching dict2 for key *may* involve hashing key again (or it may not; for example, Python string objects are immutable and cache their 32-bit hash in the string object the first time it's computed).
Tricky. I guess a C++ object could be designed to cooperate with hash tables in that way also.
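It can, and the same trick works for any immutable key type. A hypothetical Python sketch of the idea (the Interned class is made up for illustration; real CPython strings do this in C, in the string object itself):

```python
class Interned:
    """A made-up immutable key that computes its hash once and caches it,
    the way Python string objects cache theirs."""
    __slots__ = ("value", "_hash")

    def __init__(self, value):
        self.value = value
        self._hash = None              # sentinel: not yet computed

    def __hash__(self):
        if self._hash is None:         # first lookup pays for the hash
            self._hash = hash(self.value)
        return self._hash              # later lookups reuse the cached value

    def __eq__(self, other):
        return isinstance(other, Interned) and self.value == other.value
```

After the first dict lookup, re-probing with the same key object never rehashes the underlying value again.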
There's a world of pain involved in the "==" there, though, as a dict can very well have itself as a value in itself, and the time required for completion appears to be exponential in some pathological cases of that kind (Python does detect the unbounded recursion in such cases -- eventually).
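A small demonstration of the self-containment pain; the exact failure mode depends on the interpreter version, and in modern CPython the distinct-objects case ends in a RecursionError rather than running forever:

```python
# A dict can contain itself as a value.
d1 = {}
d1["self"] = d1
d2 = {}
d2["self"] = d2

# Same object: the identity shortcut answers without recursing.
assert d1 == d1

# Distinct self-referential objects: comparing the values recurses
# back into comparing the dicts, until the interpreter gives up.
try:
    d1 == d2
    blew_up = False
except RecursionError:
    blew_up = True
```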
Yuck. I wouldn't expect any C++ implementation to handle that issue.

Hmm, looking at the 3 C++ implementations of hashed containers that I have available to me, only one provides operator<(), which is rather strange since the other two implement operator==() by first comparing sizes, then iterating through consecutive elements of each set looking for a difference. The implementation supplying operator<() uses a (IMO misguided) design that rehashes incrementally, but it seems to me that if the more straightforward approaches can implement operator==() as described, operator<() shouldn't have to be a big challenge for an everyday hash table.
I'm obviously missing something, but what...?
I don't know, but I didn't follow what you were saying (like, "rehashes incrementally" doesn't mean anything to me).
If there's a simpler way to get "the right" answer, I'd love to see it. I once spent two hours trying to prove that the dict_compare() + characterize() pair in Python was correct, but gave up in a mushrooming forest of end cases.

Get ahold of MSVC7 and look at the hash_set implementation. IIRC how Plauger described it, it is constantly maintaining the load factor across insertions, so there's never a big cost to grow the table. It also keeps the items in each bucket sorted, so hash table comparisons are a lot easier. My gut tells me that this isn't worth what you pay for it, but so far my gut hasn't had very much of any value to say about hashing... The other implementations seem to implement equality as something like:

    template <class T, class Hash, class Compare, class Allocator>
    inline bool operator==(const hash_set<T, Hash, Compare, Allocator>& x,
                           const hash_set<T, Hash, Compare, Allocator>& y)
    {
        return x.size() == y.size()
            && std::equal(x.begin(), x.end(), y.begin());
    }

Which has to be a bug unless they've got a very strange way of defining equality, or some kind of ordering built into the iterators.
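Modern CPython dicts happen to remember insertion order, which makes that kind of bug easy to demonstrate: two hash tables can be equal as mappings while iterating in different orders, so a size-check-plus-element-by-element walk of the iterators gives the wrong answer.

```python
# Two dicts built in different insertion orders: equal as mappings,
# but their iteration orders differ -- so a std::equal-style walk
# over consecutive elements would wrongly report them unequal.
d1 = {"a": 1, "b": 2}
d2 = {"b": 2, "a": 1}

assert d1 == d2               # mapping equality ignores order
assert list(d1) != list(d2)   # iteration order differs (CPython 3.7+)
```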
I think it's a tougher problem in Python than in languages with value semantics, where an object can't actually contain itself.
In The Beginning, Python implemented dict comparison by materializing the .items(), sorting both, and then doing list comparison. The correctness of that was easy to show. But it turned out that in real life all anyone ever used was == comparison on dicts, and sorting was enormously expensive compared to what was possible. characterize() is a very clever hack Guido dreamt up to get the same result in no more than two passes -- but I've never been sure it's a thoroughly correct hack.
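That original scheme is easy to sketch; characterize() is the part that's hard to reproduce, so this shows only the sort-both-sides approach whose correctness was easy to show:

```python
def dict_cmp_sorted(d1, d2):
    """Three-way dict comparison by materializing the items, sorting
    both sides, and comparing the sorted lists lexicographically.
    Easy to prove correct, but O(n log n) sorting dominates, where
    plain == needs only one O(n) pass.  Requires mutually orderable
    keys and values."""
    items1 = sorted(d1.items())
    items2 = sorted(d2.items())
    if items1 < items2:
        return -1
    if items1 > items2:
        return 1
    return 0
```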
??? I can't find characterize() described anywhere, nor can I find it on my trusty dict objects:
    >>> help({}.characterize)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    AttributeError: 'dict' object has no attribute 'characterize'
OTOH, since nobody appears to care about "<" for dicts, if it's wrong we may never know that.
As long as the Python associative world is built around hash + ==, you're probably OK.
This was one of the motivations for introducing "rich comparisons".
I don't see how that helps. Got a link? Or a clue?
Sorry, I don't understand the question.
Well, you answered it pretty damn well anyway...
Comparing strings for equality/inequality alone is also done faster than needing to resolve string ordering. And complex numbers have no accepted "<" relation at all.
Yeah, good point. C++ has a less<T>/operator< dichotomy mostly to accommodate pointer types in segmented memory models, but there's no such accommodation for complex<T>.

So comparing dicts isn't the only place it's easier and quicker to restrict the burden on the type to implementing equality testing. For user-defined types, I've often found it *much* easier. For example, I can easily tell whether two chessboards are equal (do they have the same pieces on the same squares?), but a concept of "<" for chessboards is strained.
Strained, maybe, but easy. You can do a lexicographic comparison of the square contents.
I don't know what that means.
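One reading of Dave's suggestion: encode the board as a fixed-length sequence of square contents, and the language's built-in sequence comparison then supplies both == and a lexicographic <. A hedged sketch with a made-up square encoding:

```python
# Board as a tuple of 64 square contents, "" for an empty square.
# The piece codes ("wK" etc.) are a hypothetical encoding for
# illustration, not any real library's.
def empty_board():
    return ("",) * 64

b1 = empty_board()
b2 = ("wK",) + ("",) * 63   # white king on square 0

# Equality: same pieces on the same squares.
assert b1 == empty_board()

# Lexicographic "<" falls out of tuple comparison for free:
# compare square 0, then square 1, and so on.
assert b1 < b2
```

Whether that order *means* anything for chess is exactly Tim's point; it's merely a total order, not a sensible one.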
There's too much of that on both sides here, so I declare this mercifully ended now <0.9 wink>.
I, of course, will drag it on to the bitter end.
[on Zope's B-Trees]
Aww, heck, you just need a good C++ exception-handling implementation to get rid of the error-checking overheads ;-)
I'd love to use C++ for this. This is one of those things that defines 5 families of 4 related data structures each via a pile of .c and .h files that get #include'd and recompiled 5 times after #define'ing a pile of macros. It would be *so* much more pleasant using templates.
I have *just* the library for you. Works with 'C', too! http://www.boost.org/libs/preprocessor/doc/ Believe it or not, people are still pushing this technology to improve compilation times and debuggability of the result.
... Thanks for the perspective!
still-learning-ly y'rs,
You're too old for that now -- start making money instead <wink>.
Sorry, I'll try hard to grow up now. -Dave