[David Abrahams]
... Noticing that also left me with a question: how come everybody in the world hasn't stolen as much as possible from the Python hashing implementation? Are there a billion such 10-years'-tweaked implementations lying around which all perform comparably well?
----- Original Message -----
From: "Tim Peters" <tim.one@comcast.net>
To: "David Abrahams" <david.abrahams@rcn.com>
Cc: <python-dev@python.org>
Sent: Friday, June 28, 2002 1:01 AM
Subject: RE: [Python-Dev] Priority queue (binary heap) python code

It's a Mystery, and in all directions. Python has virtually no code from, say, Tcl or Perl either, and the latter certainly do some things better than Python does them. I've studied all their hash implementations, but didn't find anything worth stealing <wink>;
Well of course not!
OTOH, multiple attempts by multiple groups to steal Perl's regexp engine years ago fizzled out in a tarpit of frustration.
Oh, I had the impression that Python's re *was* pilfered from Perl.
Curious: Python independently developed a string hash *very* similar to what later became "the standard" Fowler-Noll-Vo string hash:
http://www.isthe.com/chongo/tech/comp/fnv/
The multiplier is different, and the initial value, but that's it. I'm sure there was no communication in either direction. So ya, given enough time, a billion other projects will rediscover it too.
Nifty.
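Both recurrences are simple enough to sketch side by side. This is a from-memory reconstruction for comparison, not either project's actual code: the constants are the published 32-bit FNV-1 parameters, and 1000003 was Python 2's string-hash multiplier, with everything masked to 32 bits here.

```python
MASK = 0xFFFFFFFF

def fnv1_32(data):
    """32-bit FNV-1: multiply by the FNV prime, then XOR in the next byte."""
    h = 2166136261                         # FNV offset basis
    for b in data:
        h = ((h * 16777619) ^ b) & MASK    # FNV prime
    return h

def py2_string_hash(data):
    """Python 2's string hash, truncated to 32 bits for comparison:
    the same multiply-then-XOR inner step, but a different multiplier
    (1000003) and a different initial value (first byte shifted left 7)."""
    if not data:
        return 0
    h = (data[0] << 7) & MASK
    for b in data:
        h = ((1000003 * h) ^ b) & MASK
    h ^= len(data)
    return h
```

The inner loops are term-for-term the same shape; only the two constants and the setup differ, which is the coincidence being marveled at.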
Well, OK, ordering hash tables is hard, unless the bucket count is a deterministic function of the element count.
I don't know how the latter could help; for that matter, I'm not even sure what it means.
I know what I meant, but I was wrong. My brain cell musta jammed. Ordering hash tables is hard if collisions are possible.
And I assume that testing equality still involves a lot of hashing...
No more times than the common length of the two dicts.
Of course.
It's just:
    def dict_equal(dict1, dict2):
        if len(dict1) != len(dict2):
            return False
        for key, value in dict1.iteritems():
            if key not in dict2 or not value == dict2[key]:
                return False
        return True
Searching dict2 for key *may* involve hashing key again (or it may not; for example, Python string objects are immutable and cache their 32-bit hash in the string object the first time it's computed).
Tricky. I guess a C++ object could be designed to cooperate with hash tables in that way also.
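It can, and the same trick works for any immutable key type. A hypothetical Python sketch of the idea (the Interned class is made up for illustration; real CPython strings do this in C, in the string object itself):

```python
class Interned:
    """A made-up immutable key that computes its hash once and caches it,
    the way Python string objects cache theirs."""
    __slots__ = ("value", "_hash")

    def __init__(self, value):
        self.value = value
        self._hash = None              # sentinel: not yet computed

    def __hash__(self):
        if self._hash is None:         # first lookup pays for the hash
            self._hash = hash(self.value)
        return self._hash              # later lookups reuse the cached value

    def __eq__(self, other):
        return isinstance(other, Interned) and self.value == other.value
```

After the first dict lookup, re-probing with the same key object never rehashes the underlying value again.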
There's a world of pain involved in the "==" there, though, as a dict can very well have itself as a value in itself, and the time required for completion appears to be exponential in some pathological cases of that kind (Python does detect the unbounded recursion in such cases -- eventually).
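A small demonstration of the self-containment pain; the exact failure mode depends on the interpreter version, and in modern CPython the distinct-objects case ends in a RecursionError rather than running forever:

```python
# A dict can contain itself as a value.
d1 = {}
d1["self"] = d1
d2 = {}
d2["self"] = d2

# Same object: the identity shortcut answers without recursing.
assert d1 == d1

# Distinct self-referential objects: comparing the values recurses
# back into comparing the dicts, until the interpreter gives up.
try:
    d1 == d2
    blew_up = False
except RecursionError:
    blew_up = True
```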
Yuck. I wouldn't expect any C++ implementation to handle that issue.

Hmm, looking at the 3 C++ implementations of hashed containers that I have available to me, only one provides operator<(), which is rather strange since the other two implement operator==() by first comparing sizes, then iterating through consecutive elements of each set looking for a difference. The implementation supplying operator<() uses a (IMO misguided) design that rehashes incrementally, but it seems to me that if the more straightforward approaches can implement operator==() as described, operator<() shouldn't have to be a big challenge for an everyday hash table.
I'm obviously missing something, but what...?
I don't know, but I didn't follow what you were saying (like, "rehashes incrementally" doesn't mean anything to me).
If there's a simpler way to get "the right" answer, I'd love to see it. I once spent two hours trying to prove that the dict_compare() + characterize() pair in Python was correct, but gave up in a mushrooming forest of end cases.

Get ahold of MSVC7 and look at the hash_set implementation. IIRC how Plauger described it, it is constantly maintaining the load factor across insertions, so there's never a big cost to grow the table. It also keeps the items in each bucket sorted, so hash table comparisons are a lot easier. My gut tells me that this isn't worth what you pay for it, but so far my gut hasn't had very much of any value to say about hashing... The other implementations seem to implement equality as something like:

    template <class T, class Hash, class Compare, class Allocator>
    inline bool operator==(const hash_set<T, Hash, Compare, Allocator>& x,
                           const hash_set<T, Hash, Compare, Allocator>& y)
    {
        return x.size() == y.size()
            && std::equal(x.begin(), x.end(), y.begin());
    }

Which has to be a bug unless they've got a very strange way of defining equality, or some kind of ordering built into the iterators.
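Modern CPython dicts happen to remember insertion order, which makes that kind of bug easy to demonstrate: two hash tables can be equal as mappings while iterating in different orders, so a size-check-plus-element-by-element walk of the iterators gives the wrong answer.

```python
# Two dicts built in different insertion orders: equal as mappings,
# but their iteration orders differ -- so a std::equal-style walk
# over consecutive elements would wrongly report them unequal.
d1 = {"a": 1, "b": 2}
d2 = {"b": 2, "a": 1}

assert d1 == d2               # mapping equality ignores order
assert list(d1) != list(d2)   # iteration order differs (CPython 3.7+)
```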
I think it's a tougher problem in Python than in languages with value semantics, where an object can't actually contain itself.
In The Beginning, Python implemented dict comparison by materializing the .items(), sorting both, and then doing list comparison. The correctness of that was easy to show. But it turned out that in real life all anyone ever used was == comparison on dicts, and sorting was enormously expensive compared to what was possible. characterize() is a very clever hack Guido dreamt up to get the same result in no more than two passes -- but I've never been sure it's a thoroughly correct hack.
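That original scheme is easy to sketch; characterize() is the part that's hard to reproduce, so this shows only the sort-both-sides approach whose correctness was easy to show:

```python
def dict_cmp_sorted(d1, d2):
    """Three-way dict comparison by materializing the items, sorting
    both sides, and comparing the sorted lists lexicographically.
    Easy to prove correct, but O(n log n) sorting dominates, where
    plain == needs only one O(n) pass.  Requires mutually orderable
    keys and values."""
    items1 = sorted(d1.items())
    items2 = sorted(d2.items())
    if items1 < items2:
        return -1
    if items1 > items2:
        return 1
    return 0
```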
??? I can't find characterize() described anywhere, nor can I find it on my trusty dict objects:
    >>> help({}.characterize)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    AttributeError: 'dict' object has no attribute 'characterize'
OTOH, since nobody appears to care about "<" for dicts, if it's wrong we may never know that.
As long as the Python associative world is built around hash + ==, you're probably OK.
This was one of the motivations for introducing "rich comparisons".
I don't see how that helps. Got a link? Or a clue?
Sorry, I don't understand the question.
Well, you answered it pretty damn well anyway...
Comparing strings for equality/inequality alone is also done faster than needing to resolve string ordering. And complex numbers have no accepted "<" relation at all.
Yeah, good point. C++ has a less<T>/operator< dichotomy mostly to accommodate pointer types in segmented memory models, but there's no such accommodation for complex<T>.

So comparing dicts isn't the only place it's easier and quicker to restrict the burden on the type to implementing equality testing. For user-defined types, I've often found it *much* easier. For example, I can easily tell whether two chessboards are equal (do they have the same pieces on the same squares?), but a concept of "<" for chessboards is strained.
Strained, maybe, but easy. You can do a lexicographic comparison of the square contents.
I don't know what that means.
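One reading of Dave's suggestion: encode the board as a fixed-length sequence of square contents, and the language's built-in sequence comparison then supplies both == and a lexicographic <. A hedged sketch with a made-up square encoding:

```python
# Board as a tuple of 64 square contents, "" for an empty square.
# The piece codes ("wK" etc.) are a hypothetical encoding for
# illustration, not any real library's.
def empty_board():
    return ("",) * 64

b1 = empty_board()
b2 = ("wK",) + ("",) * 63   # white king on square 0

# Equality: same pieces on the same squares.
assert b1 == empty_board()

# Lexicographic "<" falls out of tuple comparison for free:
# compare square 0, then square 1, and so on.
assert b1 < b2
```

Whether that order *means* anything for chess is exactly Tim's point; it's merely a total order, not a sensible one.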
There's too much of that on both sides here, so I declare this mercifully ended now <0.9 wink>.
I, of course, will drag it on to the bitter end.
[on Zope's B-Trees]
Aww, heck, you just need a good C++ exception-handling implementation to get rid of the error-checking overheads ;-)
I'd love to use C++ for this. This is one of those things that defines 5 families of 4 related data structures each via a pile of .c and .h files that get #include'd and recompiled 5 times after #define'ing a pile of macros. It would be *so* much more pleasant using templates.
I have *just* the library for you. Works with 'C', too! http://www.boost.org/libs/preprocessor/doc/ Believe it or not, people are still pushing this technology to improve compilation times and debuggability of the result.
... Thanks for the perspective!
still-learning-ly y'rs,
You're too old for that now -- start making money instead <wink>.
Sorry, I'll try hard to grow up now. -Dave