[Python-Dev] shrinking dicts

Tim Peters tim_one@email.msn.com
Fri, 13 Aug 1999 00:53:38 -0400


[Vladimir Marangozov, *almost* seems ready to give up on a counter-
 productive dict pessimization <wink>]

> ...
> There is, however, a small percentage of dicts which are used
> below 1/3 of their size. I must say, below 1/3 of their peak size,
> because downsizing is also rare. To trigger a downsize, 1/3 new
> entries of the peak size must be inserted.

Not so, although "on average" 1/6 may be correct.  Look at an extreme:  Say
a dict has size 333 (it can't, but it makes the math obvious ...).  Say it
contains 221 items.  Now someone deletes them all, one at a time.  used==0
and fill==221 at this point.  They insert one new key that happens to hit
one of the 333-221 = 112 remaining NULL slots.  Then used==1 and fill==222.
They insert a 2nd key, and before the dict is searched the new fill of 222
triggers the 2/3rds load-factor resizing -- which asks for a new size of 1*2
== 2.
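
To make the bookkeeping concrete, here's a toy sketch of just the counters
in that walkthrough (untested, and *not* the real dictobject.c code -- it
assumes the 2/3rds check has the form fill*3 >= size*2 and that a resize
asks for room for used*2 entries, per the description above):

# toy bookkeeping only, tracking the counters in the example above
size, used, fill = 333, 221, 221

# deleting all 221 items one at a time leaves 221 dummies behind
used = 0                        # fill stays at 221

# insert a key that lands on one of the 333-221 = 112 NULL slots
used, fill = 1, 222

# the next insert checks the load factor before searching
print "resize triggered:", fill * 3 >= size * 2    # true:  666 >= 666
print "new size asked for:", used * 2              # 1*2 == 2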

For the minority of dicts that go up and down in size wildly many times, the
current behavior is fine.

> Besides these observations, after looking at the code one more
> time, I can't really understand why the resize logic is based on
> the "fill" watermark and not on "used". fill = used + dummy, but
> since lookdict returns the first free slot (null or dummy), I don't
> really see what the point of using a fill watermark is...

Let's just consider an unsuccessful search.  Then it does return "the first"
free slot, but not necessarily at the time it *sees* the first free slot.
So long as it sees a dummy, it has to keep searching; the search doesn't end
until it finds a NULL.  So consider this, assuming the resize triggered only
on "used":

import random
import sys

d = {}
for i in xrange(50000):
    d[random.randrange(1000000)] = 1
for k in d.keys():
    del d[k]
# now there are close to 50000 dummy slots, and some number of NULLs

# loop invariant:  used == 0
for i in xrange(sys.maxint):
    j = random.randrange(10000000)
    d[j] = 1
    del d[j]
    # an always-failing lookup:  it has to probe past every dummy it meets
    assert not d.has_key(i)

However many NULL slots remained, the last loop eventually transforms them
*all* into dummies.  The dummies act exactly like "real keys" with respect
to expected time for an unsuccessful search, which is why it's thoroughly
appropriate to include dummies in the load factor computation.  The loop
will run slower and slower as the percentage of dummies approaches 100%, and
each failing has_key approaches O(N) time.

In most hash table implementations that's the worst that can happen (and
it's a disaster), but under Python's implementation it's worse:  Python
never checks to see whether the probe sequence "wraps around", so the first
search after the last NULL is changed to a dummy never ends.

Counting the dummies in the load-factor computation prevents all that:  no
matter how thoroughly inserts and deletes are intermixed, the "effective load
factor" stays under 2/3rds, which gives excellent expected-case behavior; and
it also protects against an all-dummy dict, making the lack of an expensive
inner-loop "wrapped around?" check safe.
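
To see those mechanics in code, here's a minimal open-addressing lookup
sketch (untested, and *not* lookdict itself -- it uses plain linear probing
rather than Python's actual probe sequence, but the NULL/dummy handling is
the part that matters here):

NULL, DUMMY = "<null>", "<dummy>"   # stand-ins for empty and deleted slots

def lookup(table, key):
    # table is a list of slots, each holding NULL, DUMMY, or a live key
    size = len(table)
    i = hash(key) % size
    freeslot = None                 # first reusable slot seen, if any
    while 1:                        # note:  no "wrapped around?" check
        slot = table[i]
        if slot is NULL:            # only a NULL ends an unsuccessful search
            if freeslot is not None:
                return freeslot     # "the first" free slot, seen earlier
            return i
        if slot is DUMMY:
            if freeslot is None:
                freeslot = i        # remember it, but keep probing
        elif slot == key:
            return i                # found the live key
        i = (i + 1) % size          # linear probing, for illustration only

Hand it a table that's nothing but DUMMY slots and a lookup of a missing
key never returns -- exactly the failure mode described above.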

> Perhaps you can enlighten me on this. Using only the "used" metrics
> seems fine to me. I even deactivated "fill" and replaced it with "used"
> to see what happens -- no visible changes, except a tiny speedup I'm
> willing to neglect.

You need a mix of deletes and inserts for the dummies to make a difference;
dicts that always grow don't have dummies, so they're not likely to have any
dummy-related problems either <wink>.  Try this (untested):

import time
from random import randrange
N = 1000
thatmany = [None] * N
d = {}

while 1:
    start = time.clock()
    for i in thatmany:
        d[randrange(10000000)] = 1
    for k in d.keys():
        del d[k]        # every key just inserted becomes a dummy
    finish = time.clock()
    print round(finish - start, 3)

Succeeding iterations of the outer loop should grow dramatically slower, and
finally get into an infinite loop, even though "used" never exceeds N.

Short course rewording:  for purposes of predicting expected search time, a
dummy is the same as a live key, because finding a dummy doesn't end a
search -- the search has to press on until it either finds the key it was
looking for or finds a NULL.  And with a mix of insertions and deletions, and if
the hash function is doing a good job, then over time all the slots in the
table will become either live or dummy, even if "used" stays within a very
small range.

So, that's why <wink>.

dictobject-may-be-the-subtlest-object-there-is-ly y'rs  - tim