
Old topic: {}.popitem() (was Re: {}.first[key,value,item] ...)

Christian Tismer wrote:
Fredrik Lundh wrote:
christian wrote:
That algorithm is really a gem which you should know, so let me try to explain it.
I think someone just won the "brain exploder 2000" award ;-)
<snip>
As you might have guessed, I didn't do this just for fun. It is the old game of explaining what is there, convincing everybody that you at least know what you are talking about, and then three days later coming up with an improved application of the theory.
Today is Monday, 2 days left. :-)
Ok, today is Sunday. I had no time to finish this earlier, but now it is here.

===========================
=====      Claim:     =====
===========================

- Dictionary access time can be improved with a minimal change.

On the hash() function:
All objects are supposed to provide a hash function which is as good as possible. "Good" means providing a wide range of different hash values for different keys.

Problem: There are hash functions which are "good" in this sense, but they do not spread their randomness uniformly over the 32 bits.

Example: Integers use their own value as hash. This is fine as long as the integers are uniformly distributed. But if they all contain a high power of two, for instance, the low bits give a very bad hash function.

Take a dictionary with the integers range(1000) as keys and access all entries. Then use a dictionary with the same integers shifted left by 16. Access time is slowed down by a factor of 100, since every access is a linear search now.

This is not an urgent problem, although applications exist where it can play a role (memory addresses, for instance, can have high factors of two when people do statistics on page accesses...). While it is not a big problem, it is ugly enough to think about a solution.

Solution 1:
-----------
Try to involve more bits of the hash value by doing extra shuffling, either
a) in the dictlook function, or
b) in the hash generation itself.
I believe neither can really be justified for a rare problem. But how about changing the existing solution in a way that gains an improvement without extra cost?

Solution 2: (*the* solution)
----------------------------
Some people may remember what I wrote about re-hashing functions through the multiplicative group GF(2^n)*, and I don't want to repeat that here. The simple idea can be summarized quickly:

The original algorithm uses multiplication by polynomials, and it is guaranteed that these re-hash values jitter through all possible nonzero patterns of the n bits.

Observation: We are using an operation of a finite field. This means that the inverse of multiplication also exists!

Old algorithm (multiplication):
    shift the index left by 1
    if index > mask:
        xor the index with the generator polynomial

New algorithm (division):
    if low bit of index set:
        xor the index with the generator polynomial
    shift the index right by 1

What does this mean? Not much by itself; we are just cycling through our bit patterns in reverse order. But now for the big difference.

First change:  We change from multiplication to division.
Second change: We do not mask the hash value beforehand!

The second change is what I was after: by not masking the hash value when computing the probe increment, all the existing bits in the hash come into play. This can be seen as a polynomial division where the initial remainder (the hash value) was not normalized. After a number of steps, all the extra bits are wheeled into our index instead of being wasted by masking them off. That gives our re-hash some more randomness. Once all the extra bits are sucked in, the guaranteed single-visit cycle begins. There cannot be more than 27 extra cycles in the worst case (dict size = 32, so there are 27 extra bits to consume). I do not expect any bad effect from this modification.
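To make the two step rules concrete, here is a tiny standalone sketch in plain Python (just an illustration of the rules above; the names incr, mask and poly are borrowed from the attached dictest.py, and these helpers are not part of dictobject.c):

def rehash_old(incr, mask, poly):
    # multiplication in GF(2^n): shift left, and reduce with the
    # generator polynomial if the value left the n-bit range
    incr = incr << 1
    if incr > mask:
        incr = incr ^ poly
    return incr

def rehash_new(incr, poly):
    # division in GF(2^n): reduce first if the low bit is set,
    # then shift right -- the exact inverse of the step above
    if incr & 1:
        incr = incr ^ poly
    return incr >> 1

# Example with table size 32 (mask = 31, poly = 32 + 5):
#   rehash_old(16, 31, 37) == 5    and    rehash_new(5, 37) == 16
# Since the new scheme feeds an unmasked value into the division step,
# any high bits of the hash are shifted down and folded in, one per step.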
Here are some results; the dictionaries have 1000 entries:

timing for strings             old=  5.097  new= 5.088
timing for bad integers (<<10) old=101.540  new=12.610
timing for bad integers (<<16) old=571.210  new=19.220

On strings, both algorithms behave the same. On numbers, they differ dramatically. While the current algorithm is 110 times slower on a worst-case dict (quadratic behavior), the new algorithm pays a small penalty for the extra cycles, but is only 4 times slower.

Alternative implementation:
---------------------------
The above approach is conservative in the sense that it tries not to slow down the current implementation in any way. An alternative would be to consume all of the extra bits at once. But this would add an extra "warmup" loop like this to the algorithm:

    while index > mask:
        if low bit of index set:
            xor the index with the generator polynomial
        shift the index right by 1

This is of course a very good digest of the higher bits, since it is a polynomial division and not just some bit xor-ing which might give quite predictable cancellations; therefore it is "the right way" in my sense. It might be cheap, but it would add over 20 cycles to every small dict, so I don't think it is worth doing. Personally, I prefer merging the bits in during the actual lookup, since that already suffices to get access time down from quadratic to logarithmic. (A standalone sketch of this warmup loop follows the attached listing.)

Attached is a direct translation of the relevant parts of dictobject.c into Python, with both algorithms implemented.

cheers - chris

--
Christian Tismer             :^)   <mailto:tismer@tismer.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9  9D15 D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com

## dictest.py
## Test of a new rehash algorithm
## Chris Tismer
## 2000-12-17
## Mission Impossible 5oftware Team

# The following is a partial re-implementation of
# Python dictionaries in Python.
# The original algorithm was literally turned
# into Python code.

##/*
##Table of irreducible polynomials to efficiently cycle through
##GF(2^n)-{0}, 2<=n<=30.
##*/
polys = [
    4 + 3, 8 + 3, 16 + 3, 32 + 5, 64 + 3, 128 + 3, 256 + 29,
    512 + 17, 1024 + 9, 2048 + 5, 4096 + 83, 8192 + 27, 16384 + 43,
    32768 + 3, 65536 + 45, 131072 + 9, 262144 + 39, 524288 + 39,
    1048576 + 9, 2097152 + 5, 4194304 + 3, 8388608 + 33, 16777216 + 27,
    33554432 + 9, 67108864 + 71, 134217728 + 39, 268435456 + 9,
    536870912 + 5, 1073741824 + 83,
    0
]

class NULL: pass

class Dictionary:
    dummy = "<dummy key>"

    def __init__(mp, newalg=0):
        mp.ma_size = 0
        mp.ma_poly = 0
        mp.ma_table = []
        mp.ma_fill = 0
        mp.ma_used = 0
        mp.oldalg = not newalg

    def lookdict(mp, key, _hash):
        me_hash, me_key, me_value = range(3) # rec slots
        dummy = mp.dummy

        mask = mp.ma_size-1
        ep0 = mp.ma_table
        i = (~_hash) & mask
        ep = ep0[i]
        if ep[me_key] is NULL or ep[me_key] == key:
            return ep
        if ep[me_key] == dummy:
            freeslot = ep
        else:
            if (ep[me_hash] == _hash and
                cmp(ep[me_key], key) == 0):
                return ep
            freeslot = NULL

        ###### FROM HERE
        if mp.oldalg:
            incr = (_hash ^ (_hash >> 3)) & mask
        else:
            # note that we do not mask!
            # even the shifting may not be worth it.
            incr = _hash ^ (_hash >> 3)
        ###### TO HERE

        if (not incr):
            incr = mask

        while 1:
            ep = ep0[(i+incr)&mask]
            if (ep[me_key] is NULL):
                if (freeslot != NULL):
                    return freeslot
                else:
                    return ep
            if (ep[me_key] == dummy):
                if (freeslot == NULL):
                    freeslot = ep
            elif (ep[me_key] == key or
                  (ep[me_hash] == _hash and
                   cmp(ep[me_key], key) == 0)):
                return ep

            # Cycle through GF(2^n)-{0}
            ###### FROM HERE
            if mp.oldalg:
                incr = incr << 1
                if (incr > mask):
                    incr = incr ^ mp.ma_poly
            else:
                # new algorithm: do a division
                if (incr & 1):
                    incr = incr ^ mp.ma_poly
                incr = incr >> 1
            ###### TO HERE

    def insertdict(mp, key, _hash, value):
        me_hash, me_key, me_value = range(3) # rec slots

        ep = mp.lookdict(key, _hash)
        if (ep[me_value] is not NULL):
            old_value = ep[me_value]
            ep[me_value] = value
        else:
            if (ep[me_key] is NULL):
                mp.ma_fill = mp.ma_fill+1
            ep[me_key] = key
            ep[me_hash] = _hash
            ep[me_value] = value
            mp.ma_used = mp.ma_used+1

    def dictresize(mp, minused):
        me_hash, me_key, me_value = range(3) # rec slots
        oldsize = mp.ma_size
        oldtable = mp.ma_table
        MINSIZE = 4
        newsize = MINSIZE
        for i in range(len(polys)):
            if (newsize > minused):
                newpoly = polys[i]
                break
            newsize = newsize << 1
        else:
            return -1
        _nullentry = range(3)
        _nullentry[me_hash] = 0
        _nullentry[me_key] = NULL
        _nullentry[me_value] = NULL
        newtable = map(lambda x, y=_nullentry: y[:], range(newsize))

        mp.ma_size = newsize
        mp.ma_poly = newpoly
        mp.ma_table = newtable
        mp.ma_fill = 0
        mp.ma_used = 0
        for ep in oldtable:
            if (ep[me_value] is not NULL):
                mp.insertdict(ep[me_key], ep[me_hash], ep[me_value])
        return 0

    # PyDict_GetItem
    def __getitem__(op, key):
        me_hash, me_key, me_value = range(3) # rec slots

        if not op.ma_table:
            raise KeyError, key
        _hash = hash(key)
        return op.lookdict(key, _hash)[me_value]

    # PyDict_SetItem
    def __setitem__(op, key, value):
        mp = op
        _hash = hash(key)
        ## /* if fill >= 2/3 size, double in size */
        if (mp.ma_fill*3 >= mp.ma_size*2):
            if (mp.dictresize(mp.ma_used*2) != 0):
                if (mp.ma_fill+1 > mp.ma_size):
                    raise MemoryError
        mp.insertdict(key, _hash, value)

    # more interface functions
    def keys(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append(_key)
        return res

    def values(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append(_value)
        return res

    def items(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append((_key, _value))
        return res

    def __cmp__(self, other):
        mine = self.items()
        others = other.items()
        mine.sort()
        others.sort()
        return cmp(mine, others)

######################################################
## tests

def timing(func, args=None, n=1, **keywords):
    import time
    time = time.time
    appl = apply
    if args is None: args = ()
    if type(args) != type(()): args = (args,)
    rep = range(n)
    dummyarg = ("",)
    dummykw = {}
    dummyfunc = len
    if keywords:
        before = time()
        for i in rep: res = appl(dummyfunc, dummyarg, dummykw)
        empty = time()-before
        before = time()
        for i in rep: res = appl(func, args, keywords)
    else:
        before = time()
        for i in rep: res = appl(dummyfunc, dummyarg)
        empty = time()-before
        before = time()
        for i in rep: res = appl(func, args)
    after = time()
    return round(after-before-empty, 4), res

def test(lis, dic):
    for key in lis: dic[key]

def nulltest(lis, dic):
    for key in lis: dic

def string_dicts():
    d1 = Dictionary()    # original
    d2 = Dictionary(1)   # other rehash
    for i in range(1000):
        s = str(i) * 5
        d1[s] = d2[s] = i
    return d1, d2
def badnum_dicts():
    d1 = Dictionary()    # original
    d2 = Dictionary(1)   # other rehash
    for i in range(1000):
        bad = i << 16
        d1[bad] = d2[bad] = i
    return d1, d2

def do_test(dict, keys, n):
    t0 = timing(nulltest, (keys, dict), n)[0]
    t1 = timing(test, (keys, dict), n)[0]
    return t1-t0

if __name__ == "__main__":
    sdold, sdnew = string_dicts()
    bdold, bdnew = badnum_dicts()
    print "timing for strings old=%.3f new=%.3f" % (
        do_test(sdold, sdold.keys(), 100),
        do_test(sdnew, sdnew.keys(), 100))
    print "timing for bad integers old=%.3f new=%.3f" % (
        do_test(bdold, bdold.keys(), 10) * 10,
        do_test(bdnew, bdnew.keys(), 10) * 10)

"""
D:\crml_doc\platf\py>python dictest.py
timing for strings old=5.097 new=5.088
timing for bad integers old=101.540 new=12.610
"""
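As promised above, here is the "warmup" loop from the alternative implementation as a standalone sketch (just an illustration in the style of dictest.py, using incr rather than "index"; it is not wired into the Dictionary class, and lookdict's existing "if not incr: incr = mask" guard would still apply afterwards):

def consume_extra_bits(incr, mask, poly):
    # digest all hash bits above the table size by polynomial division
    # before the normal probe loop starts
    while incr > mask:
        if incr & 1:
            incr = incr ^ poly
        incr = incr >> 1
    return incr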