[Python-Dev] The Dictionary Gem is polished!

Christian Tismer tismer@tismer.com
Tue, 19 Dec 2000 02:16:27 +0100


This is a multi-part message in MIME format.
--------------E592273E7D1C3FC9F78A4489
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit



Greg Wilson wrote:
> 
> > > > Here some results, dictionaries have 1000 entries:
> > I will publish some results later today.
> 
> In Doctor Dobb's Journal, right? :-)  We'd *really* like this article...

Well, the results are not so bad:

I stopped testing computation time for the Python dictionary
implementation, in favor of "trips". How many trips does
the re-hash take in a dictionary?

Tests were run for dictionaries of size 1000, 2000, 3000, 4000.

Dictionary 1 consists of i, formatted as string.
Dictionary 2 consists of strings containig the binary of i.
Dictionary 3 consists of random numbers.
Dictionary 4 consists of i << 16.

Algorithms:
old is the original dictionary algorithm implemented
in Python (probably quite correct now, using longs :-)

new is the proposed incremental bits-suck-in-division
algorithm.

new2 is a version of new, where all extra bits of the
hash function are wheeled in in advance. The computation
time of this is not neglectible, so please use this result
for reference, only.


Here the results:
(bad integers(old) not computed for n>1000 )

"""
D:\crml_doc\platf\py>python dictest.py
N=1000
trips for strings old=293 new=302 new2=221
trips for bin strings old=0 new=0 new2=0
trips for bad integers old=499500 new=13187 new2=999
trips for random integers old=377 new=371 new2=393
trips for windows names old=230 new=207 new2=200
N=2000
trips for strings old=1093 new=1109 new2=786
trips for bin strings old=0 new=0 new2=0
trips for bad integers old=0 new=26455 new2=1999
trips for random integers old=691 new=710 new2=741
trips for windows names old=503 new=542 new2=564
N=3000
trips for strings old=810 new=839 new2=609
trips for bin strings old=0 new=0 new2=0
trips for bad integers old=0 new=38681 new2=2999
trips for random integers old=762 new=740 new2=735
trips for windows names old=712 new=711 new2=691
N=4000
trips for strings old=1850 new=1843 new2=1375
trips for bin strings old=0 new=0 new2=0
trips for bad integers old=0 new=52994 new2=3999
trips for random integers old=1440 new=1450 new2=1414
trips for windows names old=1449 new=1434 new2=1457

D:\crml_doc\platf\py>
"""

Interpretation:
---------------
Short numeric strings show a slightly too high trip number.
This means that the hash() function could be enhanced.
But the effect would be below 10 percent compared to
random hashes, therefore not worth it.

Binary representations of numbers as strings still create
perfect hash numbers.

Bad integers (complete hash clash due to high power of 2)
are handled fairly well by the new algorithm. "new2"
shows that they can be brought down to nearly perfect
hashes just by applying the "hash melting wheel":

Windows names are almost upper case, and almost verbose.
They appear to perform nearly as well as random numbers.
This means: The Python string has function is very good
for a wide area of applications.

In Summary: I would try to modify the string hash function
slightly for short strings, but only if this does not
negatively affect the results of above.

Summary of summary:
There is no really low hanging fruit in string hashing.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer@tismer.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com
--------------E592273E7D1C3FC9F78A4489
Content-Type: text/plain; charset=us-ascii;
 name="dictest.py"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="dictest.py"

## dictest.py
## Test of a new rehash algorithm
## Chris Tismer
## 2000-12-17
## Mission Impossible 5oftware Team

# The following is a partial re-implementation of
# Python dictionaries in Python.
# The original algorithm was literally turned
# into Python code.

##/*
##Table of irreducible polynomials to efficiently cycle through
##GF(2^n)-{0}, 2<=n<=30.
##*/
polys = [
    4 + 3,
    8 + 3,
    16 + 3,
    32 + 5,
    64 + 3,
    128 + 3,
    256 + 29,
    512 + 17,
    1024 + 9,
    2048 + 5,
    4096 + 83,
    8192 + 27,
    16384 + 43,
    32768 + 3,
    65536 + 45,
    131072 + 9,
    262144 + 39,
    524288 + 39,
    1048576 + 9,
    2097152 + 5,
    4194304 + 3,
    8388608 + 33,
    16777216 + 27,
    33554432 + 9,
    67108864 + 71,
    134217728 + 39,
    268435456 + 9,
    536870912 + 5,
    1073741824 + 83,
    0
]
polys = map(long, polys)

class NULL: pass

class Dictionary:
    dummy = "<dummy key>"
    def __init__(mp, newalg=0):
        mp.ma_size = 0
        mp.ma_poly = 0
        mp.ma_table = []
        mp.ma_fill = 0
        mp.ma_used = 0
        mp.oldalg = not newalg
        mp.warmup = newalg>1
        mp.trips = 0

    def getTrips(self):
        trips = self.trips
        self.trips = 0
        return trips

    def lookdict(mp, key, _hash):
        me_hash, me_key, me_value = range(3) # rec slots
        dummy = mp.dummy
        
        mask = mp.ma_size-1
        ep0 = mp.ma_table
        i = (~_hash) & mask
        ep = ep0[i]
        if ep[me_key] is NULL or ep[me_key] == key:
            return ep
        if ep[me_key] == dummy:
            freeslot = ep
        else:
            if (ep[me_hash] == _hash and
                cmp(ep[me_key], key) == 0) :
                return ep
            freeslot = NULL

###### FROM HERE
        if mp.oldalg:
            incr = (_hash ^ (_hash >> 3)) & mask
        else:
            # note that we do not mask!
            # the shifting is worth it in the incremental case.

            ## added after posting to python-dev:
            uhash = _hash & 0xffffffffl
            if mp.warmup:
                incr = uhash
                mask2 = 0xffffffffl ^ mask
                while mask2 > mask:
                    if (incr & 1):
                        incr = incr ^ mp.ma_poly
                    incr = incr >> 1
                    mask2 = mask2>>1
                # this loop *can* be sped up by tables
                # with precomputed multiple shifts.
                # But I'm not sure if it is worth it at all.
            else:
                incr = uhash ^ (uhash >> 3)

###### TO HERE
            
        if (not incr):
            incr = mask
        while 1:
            mp.trips = mp.trips+1
            
            ep = ep0[int((i+incr)&mask)]
            if (ep[me_key] is NULL) :
                if (freeslot is not NULL) :
                    return freeslot
                else:
                    return ep
            if (ep[me_key] == dummy) :
                if (freeslot == NULL):
                    freeslot = ep
            elif (ep[me_key] == key or
                 (ep[me_hash] == _hash and
                  cmp(ep[me_key], key) == 0)) :
                return ep

            # Cycle through GF(2^n)-{0}
###### FROM HERE
            if mp.oldalg:
                incr = incr << 1
                if (incr > mask):
                    incr = incr ^ mp.ma_poly
            else:
                # new algorithm: do a division
                if (incr & 1):
                    incr = incr ^ mp.ma_poly
                incr = incr >> 1
###### TO HERE

    def insertdict(mp, key, _hash, value):
        me_hash, me_key, me_value = range(3) # rec slots
        
        ep = mp.lookdict(key, _hash)
        if (ep[me_value] is not NULL) :
            old_value = ep[me_value]
            ep[me_value] = value
        else :
            if (ep[me_key] is NULL):
                mp.ma_fill=mp.ma_fill+1
            ep[me_key] = key
            ep[me_hash] = _hash
            ep[me_value] = value
            mp.ma_used = mp.ma_used+1

    def dictresize(mp, minused):
        me_hash, me_key, me_value = range(3) # rec slots

        oldsize = mp.ma_size
        oldtable = mp.ma_table
        MINSIZE = 4
        newsize = MINSIZE
        for i in range(len(polys)):
            if (newsize > minused) :
                newpoly = polys[i]
                break
            newsize = newsize << 1
        else:
            return -1

        _nullentry = range(3)
        _nullentry[me_hash] = 0
        _nullentry[me_key] = NULL
        _nullentry[me_value] = NULL

        newtable = map(lambda x,y=_nullentry:y[:], range(newsize))      

        mp.ma_size = newsize
        mp.ma_poly = newpoly
        mp.ma_table = newtable
        mp.ma_fill = 0
        mp.ma_used = 0

        for ep in oldtable:
            if (ep[me_value] is not NULL):
                mp.insertdict(ep[me_key],ep[me_hash],ep[me_value])
        return 0

    # PyDict_GetItem
    def __getitem__(op, key):
        me_hash, me_key, me_value = range(3) # rec slots

        if not op.ma_table:
            raise KeyError, key
        _hash = hash(key)
        return op.lookdict(key, _hash)[me_value]

    # PyDict_SetItem
    def __setitem__(op, key, value):
        mp = op
        _hash = hash(key)
##      /* if fill >= 2/3 size, double in size */
        if (mp.ma_fill*3 >= mp.ma_size*2) :
            if (mp.dictresize(mp.ma_used*2) != 0):
                if (mp.ma_fill+1 > mp.ma_size):
                    raise MemoryError
        mp.insertdict(key, _hash, value)

    # more interface functions
    def keys(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append( _key)
        return res

    def values(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append( _value)
        return res

    def items(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append( (_key, _value) )
        return res

    def __cmp__(self, other):
        mine = self.items()
        others = other.items()
        mine.sort()
        others.sort()
        return cmp(mine, others)

######################################################
## tests

def test(lis, dic):
    for key in lis: dic[key]

def nulltest(lis, dic):
    for key in lis: dic

def string_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    for i in range(n):
        s = str(i) #* 5
        #s = chr(i%256) + chr(i>>8)##
        d1[s] = d2[s] = d3[s] = i
    return d1, d2, d3

def istring_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    for i in range(n):
        s = chr(i%256) + chr(i>>8)
        d1[s] = d2[s] = d3[s] = i
    return d1, d2, d3

def random_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    from whrandom import randint
    import sys
    keys = []
    for i in range(n):
        keys.append(randint(0, sys.maxint-1))
    for i in keys:
        d1[i] = d2[i] = d3[i] = i
    return d1, d2, d3

def badnum_dicts(n):
	d1 = Dictionary()   # original
	d2 = Dictionary(1)  # other rehash
	d3 = Dictionary(2)  # with warmup
	shift = 10
	if EXTREME:
		shift = 16
	for i in range(n):
		bad = i << 16
		d2[bad] = d3[bad] = i
		if n <= 1000: d1[bad] = i
	return d1, d2, d3

def names_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    import win32con
    keys = win32con.__dict__.keys()
    if len(keys) < n:
        keys = []
    for s in keys[:n]:
        d1[s] = d2[s] = d3[s] = s
    return d1, d2, d3

def do_test(dict):
    keys = dict.keys()
    dict.getTrips() # reset
    test(keys, dict)
    return dict.getTrips()

EXTREME=1

if __name__ == "__main__":

    for N in (1000,2000,3000,4000):    

        sdold, sdnew, sdnew2 = string_dicts(N)
        idold, idnew, idnew2 = istring_dicts(N)
        bdold, bdnew, bdnew2 = badnum_dicts(N)
        rdold, rdnew, rdnew2 = random_dicts(N)
        ndold, ndnew, ndnew2 = names_dicts(N)

        print "N=%d" %N        
        print "trips for strings old=%d new=%d new2=%d" % tuple(
            map(do_test, (sdold, sdnew, sdnew2)) )
        print "trips for bin strings old=%d new=%d new2=%d" % tuple(
            map(do_test, (idold, idnew, idnew2)) )
        print "trips for bad integers old=%d new=%d new2=%d" % tuple(
            map(do_test, (bdold, bdnew, bdnew2)))
        print "trips for random integers old=%d new=%d new2=%d" % tuple(
            map(do_test, (rdold, rdnew, rdnew2)))
        print "trips for windows names old=%d new=%d new2=%d" % tuple(
            map(do_test, (ndold, ndnew, ndnew2)))

"""
Results with a shift of 10 (EXTREME=0):
D:\crml_doc\platf\py>python dictest.py
timing for strings old=5.097 new=5.088
timing for bad integers old=101.540 new=12.610

Results with a shift of 16 (EXTREME=1):
D:\crml_doc\platf\py>python dictest.py
timing for strings old=5.218 new=5.147
timing for bad integers old=571.210 new=19.220
"""


--------------E592273E7D1C3FC9F78A4489--