[Vladimir Marangozov, *almost* seems ready to give up on a counter-productive dict pessimization <wink>]
... There is, however, a small percentage of dicts which are used below 1/3 of their size. I must say, below 1/3 of their peak size, because downsizing is also rare. To trigger a downsize, 1/3 new entries of the peak size must be inserted.
Not so, although "on average" 1/6 may be correct. Look at an extreme: Say a dict has size 333 (it can't, but it makes the math obvious ...). Say it contains 221 items. Now someone deletes them all, one at a time. used==0 and fill==221 at this point. They insert one new key that happens to hit one of the 333-221 = 112 remaining NULL slots. Then used==1 and fill==222. They insert a 2nd key, and before the dict is searched the new fill of 222 triggers the 2/3rds load-factor resizing -- which asks for a new size of 1*2 == 2. For the minority of dicts whose size swings up and down wildly many times, the current behavior is fine.
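To make that arithmetic concrete, here's a toy replay of the example -- just the two counters and the 2/3rds rule described above, not the real dictobject.c; the insert_hitting_null and delete_one helpers are inventions for illustration, and every insert in the replay is assumed to land on a NULL slot, as in the example:

    # "used" counts live keys; "fill" counts live keys plus dummies.
    size = 333
    used = 0
    fill = 0

    def insert_hitting_null():
        # a new key lands on a NULL slot: both counters go up
        global used, fill
        used = used + 1
        fill = fill + 1

    def delete_one():
        # a delete turns a live slot into a dummy: only "used" drops
        global used
        used = used - 1

    for i in range(221):
        insert_hitting_null()   # used == fill == 221
    for i in range(221):
        delete_one()            # used == 0, fill still == 221

    insert_hitting_null()       # used == 1, fill == 222
    # On the next insert attempt the load-factor check fires, because
    # fill has reached 2/3rds of the table size, and the resize asks
    # for a new size based on "used":
    if fill * 3 >= size * 2:
        print "resize triggered; new size request is about", used * 2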
Besides these observations, after looking at the code one more time, I can't really understand why the resize logic is based on the "fill" watermark and not on "used". fill == used + number of dummies, but since lookdict returns the first free slot (NULL or dummy), I don't really see the point of using a fill watermark...
Let's just consider an unsuccessful search. Then it does return "the first" free slot, but not necessarily at the time it *sees* the first free slot. So long as it sees a dummy, it has to keep searching; the search doesn't end until it finds a NULL. So consider this, assuming the resize is triggered only on "used":

    import random
    import sys

    d = {}
    for i in xrange(50000):
        d[random.randrange(1000000)] = 1
    for k in d.keys():
        del d[k]
    # now there are 50000 dummy dict keys, and some number of NULLs
    # loop invariant: used == 0
    for i in xrange(sys.maxint):
        j = random.randrange(10000000)
        d[j] = 1
        del d[j]
        assert not d.has_key(i)

However many NULL slots remained, the last loop eventually transforms them *all* into dummies. The dummies act exactly like "real keys" with respect to expected time for an unsuccessful search, which is why it's thoroughly appropriate to include dummies in the load-factor computation. The loop will run slower and slower as the percentage of dummies approaches 100%, and each failing has_key approaches O(N) time. In most hash table implementations that's the worst that can happen (and it's a disaster), but under Python's implementation it's worse: Python never checks to see whether the probe sequence "wraps around", so the first search after the last NULL is changed to a dummy never ends.

Counting the dummies in the load-factor computation prevents all that: no matter how inserts and deletes are intermixed, the "effective load factor" stays under 2/3rds, so it gives excellent expected-case behavior; and it also protects against an all-dummy dict, making the lack of an expensive inner-loop "wrapped around?" check safe.
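To see why a dummy can't end an unsuccessful search, here's a bare-bones sketch of open-addressing lookup -- not lookdict itself; the linear probing and the NULL_SLOT/DUMMY sentinels are made up for illustration -- showing that a dummy keeps the probe loop going and only a NULL stops it:

    # Stand-ins for an empty slot and a deleted slot; the real table
    # uses NULL pointers and a shared dummy key object.
    NULL_SLOT = "<null>"
    DUMMY = "<dummy>"

    def probes_for_missing_key(table, key):
        # Count the slots an unsuccessful search has to examine,
        # using plain linear probing for simplicity.
        i = hash(key) % len(table)
        n = 0
        while 1:
            n = n + 1
            slot = table[i]
            if slot is NULL_SLOT:
                return n            # only a NULL ends the search
            # a dummy (or a live, non-matching key) keeps us probing
            i = (i + 1) % len(table)

    # With every slot but one turned into a dummy, an unsuccessful
    # search keeps probing until it hits the lone NULL; with *no* NULL
    # left and no "wrapped around?" check, the loop above never ends.
    table = [DUMMY] * 8
    table[3] = NULL_SLOT
    print probes_for_missing_key(table, 12345)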
Perhaps you can enlighten me on this. Using only the "used" metric seems fine to me. I even deactivated "fill" and replaced it with "used" to see what happens -- no visible changes, except a tiny speedup I'm willing to neglect.
You need a mix of deletes and inserts for the dummies to make a difference; dicts that always grow don't have dummies, so they're not likely to have any dummy-related problems either <wink>. Try this (untested):

    import time
    from random import randrange

    N = 1000
    thatmany = [None] * N
    d = {}
    while 1:
        start = time.clock()
        for i in thatmany:
            d[randrange(10000000)] = 1
        for i in d.keys():
            del d[i]
        finish = time.clock()
        print round(finish - start, 3)

With "fill" replaced by "used" in the resize check, succeeding iterations of the outer loop should grow dramatically slower, and finally get into an infinite loop, even though "used" never exceeds N.

Short course rewording: for purposes of predicting expected search time, a dummy is the same as a live key, because finding a dummy doesn't end a search -- it has to press on until it either finds the key it was looking for or finds a NULL. And with a mix of insertions and deletions, and if the hash function is doing a good job, then over time all the slots in the table will become either live or dummy, even if "used" stays within a very small range.

So, that's why <wink>.

dictobject-may-be-the-subtlest-object-there-is-ly y'rs - tim