BTW, one lesson to take from SETL: a vital set operation in practice is a mutating "pick a 'random' element E from the set, remove it from the set, and return it". An enormous number of set algorithms are of the form
while not S.empty():
    pick some element from S
    deal with it, possibly mutating S
This operation can't be done efficiently in Python code if the set is represented by a dict (the best you can do is materialize the full list of keys first, and pick one of those). That means my Set class often takes quadratic time for what *should* be linear-time algorithms.
Hmmm...actually, I've been wanting a method .key() for dictionaries a long time. So if we give dictionaries this one small method, then we *can* do this in Python. -- Moshe Zadka <sig@zadka.site.co.il> This is a signature anti-virus. Please stop the spread of signature viruses!
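To make the loop above concrete, here is a minimal sketch (in the 2.0-era Python of this thread) of the destructive consumption pattern being discussed; the dict-as-set representation and the key-list workaround are illustrative assumptions, not code from any of the messages:

    # S is a dict used as a set: the keys are the elements.
    def consume(S):
        while S:
            k = S.keys()[0]    # materializes the full key list on every pass
            del S[k]
            # ... deal with k; the body may add new elements to S ...

Rebuilding the key list on each pass is what turns an intended linear-time algorithm quadratic; a dict method that removes and returns one item directly is what the rest of the thread converges on.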
Moshe Zadka wrote:
BTW, one lesson to take from SETL: a vital set operation in practice is a mutating "pick a 'random' element E from the set, remove it from the set, and return it". An enormous number of set algorithms are of the form
while not S.empty():
    pick some element from S
    deal with it, possibly mutating S
This operation can't be done efficiently in Python code if the set is represented by a dict (the best you can do is materialize the full list of keys first, and pick one of those). That means my Set class often takes quadratic time for what *should* be linear-time algorithms.
Hmmm...actually, I've been wanting a method .key() for dictionaries a long time. So if we give dictionaries this one small method, then we *can* do this in Python.
Shouldn't be hard to do... the C API for this is already in place: PyDict_Next(). I'd prefer a method .getitem(), though, which then returns a tuple (key, value). -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
[Me]
Hmmm...actually, I've been wanting a method .key() for dictionaries a long time. So if we give dictionaries this one small method, then we *can* do this in Python.
[MAL]
Shouldn't be hard to do... the C API for this is already in place: PyDict_Next(). I'd prefer a method .getitem(), though, which then returns a tuple (key, value).
Well, I'm sure the C API allows it. And if we do it, we might as well do it right: add .key(), .value() and .item(). -- Moshe Zadka <moshez@math.huji.ac.il> -- 95855124 http://advogato.org/person/moshez
M.-A. Lemburg:
Shouldn't be hard to do... the C API for this is already in place: PyDict_Next(). I'd prefer a method .getitem(), though, which then returns a tuple (key, value).
IMO '.pickitem()' would be a better name, since many people would think that 'getitem()' takes some kind of index as a parameter. Nevertheless, I think this is a nice idea. Regards, Peter -- Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260 office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)
Peter Funk wrote:
M.-A. Lemburg:
Shouldn't be hard to do... the C API for this is already in place: PyDict_Next(). I'd prefer a method .getitem(), though, which then returns a tuple (key, value).
IMO '.pickitem()' would be a better name, since many people would think that 'getitem()' takes some kind of index as a parameter. Nevertheless, I think this is a nice idea.
Point taken. How about .fetchitem() -- this doesn't imply any ordering and can't get confused with __getitem__() et al. BTW, Moshe's .key() and .value() would break the strong binding between dictionary keys and values -- I don't see much use in them. Besides, .fetchitem() provides the same implied functionality. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
M.-A. Lemburg:
Shouldn't be hard to do... the C API for this is already in place: PyDict_Next(). I'd prefer a method .getitem(), though, which then returns a tuple (key, value).
IMO '.pickitem()' would be a better name, since many people would think that 'getitem()' takes some kind of index as a parameter. Nevertheless, I think this is a nice idea.
Pronouncement:

It is only efficient to get the *first* item, so let's make that explicit. The method names will be:

    .firstkey()
    .firstvalue()
    .firstitem()

Moshe will check in a patch.

Thinking aloud:

Would it be useful to also implement popkey(), popvalue(), popitem(), which would remove the first item and then return the relevant part of it?

--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
M.-A. Lemburg:
Shouldn't be hard to do... the C API for this is already in place: PyDict_Next(). I'd prefer a method .getitem(), though, which then returns a tuple (key, value).
IMO '.pickitem()' would be a better name, since many people would think that 'getitem()' takes some kind of index as a parameter. Nevertheless, I think this is a nice idea.
Pronouncement:
It is only efficient to get the *first* item, so let's make that explicit. The method names will be:
.firstkey()
-1
.firstvalue()
-1
.firstitem()
+1

I really think that .firstitem() is both more intuitive and efficient -- .firstvalue() and .firstkey() are simply derivatives of this method; or how would you define the return values of those two methods?
Moshe will check in a patch.
Thinking aloud:
Would it be useful to also implement popkey(), popvalue(), popitem(), which would remove the first item and then return the relevant part of it?
+1

Same comment as above... .popitem() suffices.
--Guido van Rossum (home page: http://www.python.org/~guido/)
-- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
Pronouncement:
It is only efficient to get the *first* item, so let's make that explicit. The method names will be:
.firstkey()
-1
.firstvalue()
-1
.firstitem()
+1
I really think that .firstitem() is both more intuitive and efficient -- .firstvalue() and .firstkey() are simply derivatives of this method; or how would you define the return values of those two methods?
Yes, firstvalue() and firstkey() return just the value or just the key. I can see that the first value is not very useful (since you don't have the key, you can't mutate the dict). firstkey() might be just the ticket e.g. for set implementations. OTOH you may be right that it's best to add the minimal number of new methods.
Moshe will check in a patch.
Thinking aloud:
Would it be useful to also implement popkey(), popvalue(), popitem(), which would remove the first item and then return the relevant part of it?
+1
Same comment as above... .popitem() suffices.
And for minimality reasons I'm at most -0 on it. --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum: [...]
Thinking aloud:
Would it be useful to also implement popkey(), popvalue(), popitem(), which would remove the first item and then return the relevant part of it?
Since [].pop() takes an optional index parameter which defaults to the *LAST* element, it would be better to make the name explicit:

    .popfirstitem()

This is however somewhat long. DEJAVU: the {}.setdefault() discussion. Maybe it is better to leave these methods out for the sake of simplicity, although they look useful.

Regards, Peter
"PF" == Peter Funk <pf@artcom-gmbh.de> writes:
PF> Since [].pop() takes an optional index parameter which
PF> defaults to the *LAST* element, it would be better to make the
PF> name explicit: .popfirstitem()

{}.pop() sounds like the right thing given its symmetry with [].pop(). It would have the semantics that it would remove and return a random /item/ from the dictionary, which the implementation can make the "first" item for convenience purposes.

It shouldn't dictate, though, that {}.pop() == dict.items().pop() or make any other guarantees about which item gets popped, because other implementations (or indeed, other mapping-conformant objects) may use different underlying representations that could make this inefficient.

For added symmetry, {}.pop() should take an optional argument which is the key of the item to remove and return.

-Barry
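A sketch of the semantics Barry describes, written here as a plain helper function rather than the eventual C method; the function name and the key-list trick are hypothetical, for illustration only:

    def dict_pop(d, key=None):
        # No argument: remove and return some (key, value) item, with no
        # promise about which one.  With a key: remove and return that item.
        if key is None:
            key = d.keys()[0]
        value = d[key]
        del d[key]
        return (key, value)

    d = {'a': 1, 'b': 2, 'c': 3}
    print dict_pop(d, 'b')     # ('b', 2)
    print dict_pop(d)          # one of the remaining items

Guido's follow-up below points out the wrinkle in this symmetry: [].pop() returns just a value, while this returns a pair.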
Barry Warsaw wrote: {}.pop() sounds like the right thing given its symmetry with [].pop(). [details]
Should we then add dict.push(), which would add a key/value pair to the dictionary? I realize it's redundant (same effect as 'dict[x] = y'), but I think anyone who's gone through a CS course in data structures and algorithms will expect push() where there's pop(), and it will allow functions to be polymorphic over lists and dicts. Greg (who's not lobbying for it, just wondering if it's a loose end)
Greg Wilson writes:
Should we then add dict.push(), which would add a key/value pair to the dictionary? I realize it's redundant (same effect as 'dict[x] = y'),
We don't have [].push(), and I don't think the metaphor really works well with dictionaries. There's also an implied ordering relationship between push() and pop(), and this doesn't hold for dictionaries either.
but I think anyone who's gone through a CS course in data structures and algorithms will expect push() where there's pop(), and it will allow functions to be polymorphic over lists and dicts.
There's only a limited amount of conceptual polymorphism between the two; artificially extending that may lead to a lot of programmers using the wrong structure for their application. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> PythonLabs at Digital Creations
On Thu, 30 Nov 2000, Greg Wilson wrote:
Barry Warsaw wrote: {}.pop() sounds like the right thing given its symmetry with [].pop(). [details]
Should we then add dict.push(), which would add a key/value pair to the dictionary? I realize it's redundant (same effect as 'dict[x] = y'), but I think anyone who's gone through a CS course in data structures and algorithms will expect push() where there's pop(), and it will allow functions to be polymorphic over lists and dicts.
'cept there's no list.push - and I don't think you're proposing spelling "dict.push" "dict.append", are you? Cheers, M.
{}.pop() sounds like the right thing given its symmetry with [].pop().
It would have the semantics that it would remove and return a random /item/ from the dictionary, which the implementation can make the "first" item for convenience purposes.
It shouldn't dictate though, that {}.pop() == dict.items().pop() or make any other guarantees about which item gets popped, because other implementations (or indeed, other mapping-conformant objects) may use different underlying representations that could make this inefficient.
For added symmetry, {}.pop() should take an optional argument which is the key of the item to remove and return.
That seems a nice touch, until you realize that [].pop() returns the value, while {}.pop() returns a (key, value) pair. Also: if you already know the key, the argument that you can't do it in O(1) time is no longer valid. So, a -0.5.

One more concern: if you repeatedly remove the *first* item, the hash table will start looking lopsided. Since we don't resize the hash table on deletes, maybe picking an item at random (but not using an expensive random generator!) would be better.

--Guido van Rossum (home page: http://www.python.org/~guido/)
[Guido]
... One more concern: if you repeatedly remove the *first* item, the hash table will start looking lopsided. Since we don't resize the hash table on deletes, maybe picking an item at random (but not using an expensive random generator!) would be better.
Which is the reason SETL doesn't specify *which* set item is removed: if you always start looking at "the front" of a dict that's being consumed, the dict fills with turds without shrinking, you skip over them again and again, and consuming the entire dict is still quadratic time. Unfortunately, while using a random start point is almost always quicker than that, the expected time for consuming the whole dict remains quadratic. The clearest way around that is to save a per-dict search finger, recording where the last search left off. Start from its current value. Failure if it wraps around. This is linear time in non-pathological cases (a pathological case is one in which it isn't linear time <wink>).
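A toy model of the search finger Tim describes, operating on a bare list of slots where None marks an empty slot; keeping the finger in a one-element list is just a stand-in for the per-dict field the real C code would use:

    def pop_slot(slots, finger):
        # Resume scanning where the previous call stopped; wrap around once.
        n = len(slots)
        i = finger[0]
        for attempt in range(n):
            if i >= n:
                i = 0
            if slots[i] is not None:
                item = slots[i]
                slots[i] = None        # leave a "turd", as the C table does
                finger[0] = i + 1
                return item
            i = i + 1
        raise KeyError, "all slots empty"

    slots = ['a', None, 'b', None, 'c']
    finger = [0]
    while 1:
        try:
            print pop_slot(slots, finger)
        except KeyError:
            break

Because each call resumes after the last hit instead of rescanning from slot 0, consuming the whole table stays linear in the non-pathological cases Tim mentions.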
After the last round of discussion, I was left with the idea that the best thing we could do to help destructive iteration is to introduce a {}.popitem() that returns an arbitrary (key, value) pair and deletes it. I wrote about this:
One more concern: if you repeatedly remove the *first* item, the hash table will start looking lopsided. Since we don't resize the hash table on deletes, maybe picking an item at random (but not using an expensive random generator!) would be better.
and Tim replied:
Which is the reason SETL doesn't specify *which* set item is removed: if you always start looking at "the front" of a dict that's being consumed, the dict fills with turds without shrinking, you skip over them again and again, and consuming the entire dict is still quadratic time.
Unfortunately, while using a random start point is almost always quicker than that, the expected time for consuming the whole dict remains quadratic.
The clearest way around that is to save a per-dict search finger, recording where the last search left off. Start from its current value. Failure if it wraps around. This is linear time in non-pathological cases (a pathological case is one in which it isn't linear time <wink>).
I've implemented this, except I use a static variable for the finger instead of a per-dict finger. I'm concerned about adding 4-8 extra bytes to each dict object for a feature that most dictionaries never need.

So, instead, I use a single shared finger. This works just as well as long as this is used for a single dictionary. For multiple dictionaries (either used by the same thread or in different threads), it'll work almost as well, although it's possible to make up a pathological example that would work quadratically.

An easy example of such a pathological case is to call popitem() for two identical dictionaries in lock step.

Comments please!  We could:

- Live with the pathological cases.

- Forget the whole thing; and then also forget about firstkey() etc. which has the same problem only worse.

- Fix the algorithm. Maybe jumping criss-cross through the hash table like lookdict does would improve that; but I don't understand the math used for that ("Cycle through GF(2^n)-{0}" ???).

I've placed a patch on SourceForge:

http://sourceforge.net/patch/?func=detailpatch&patch_id=102733&group_id=5470

The algorithm is:

    static PyObject *
    dict_popitem(dictobject *mp, PyObject *args)
    {
        static int finger = 0;
        int i;
        dictentry *ep;
        PyObject *res;

        if (!PyArg_NoArgs(args))
            return NULL;
        if (mp->ma_used == 0) {
            PyErr_SetString(PyExc_KeyError,
                            "popitem(): dictionary is empty");
            return NULL;
        }
        i = finger;
        if (i >= mp->ma_size)
            ir = 0;
        while ((ep = &mp->ma_table[i])->me_value == NULL) {
            i++;
            if (i >= mp->ma_size)
                i = 0;
        }
        finger = i+1;
        res = PyTuple_New(2);
        if (res != NULL) {
            PyTuple_SET_ITEM(res, 0, ep->me_key);
            PyTuple_SET_ITEM(res, 1, ep->me_value);
            Py_INCREF(dummy);
            ep->me_key = dummy;
            ep->me_value = NULL;
            mp->ma_used--;
        }
        return res;
    }

--Guido van Rossum (home page: http://www.python.org/~guido/)
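For readers following the patch, this is how the new method is meant to be used from Python once it is in (the dict contents here are made up):

    work = {'red': 1, 'green': 2, 'blue': 3}
    while work:
        key, value = work.popitem()   # removes and returns some (key, value)
        print key, value

Unlike looping over work.keys(), nothing is materialized up front, which is the whole point of the destructive-iteration discussion above.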
[Guido, on sharing a search finger and getting worse-than-linear behavior in a simple test case] See my reply on SourceForge (crossed in the mails). I predict that fixing this in an acceptable way (not bulletproof, but linear-time for all predictably common cases) is a two-character change. Surprise, although maybe I'm hallucinating (would someone please confirm?): when I went to the SF patch manager page to look for your patch (using the Open Patches view), I couldn't find it. My guess is that if there are "too many" patches to fit on one screen, then unlike the SF *bug* manager, you don't get any indication that more patches exist or any control to go to the next page.
"TP" == Tim Peters <tim.one@home.com> writes:
TP> Surprise, although maybe I'm hallucinating (would someone
TP> please confirm?): when I went to the SF patch manager page to
TP> look for your patch (using the Open Patches view), I couldn't
TP> find it.  My guess is that if there are "too many" patches to
TP> fit on one screen, then unlike the SF *bug* manager, you don't
TP> get any indication that more patches exist or any control to
TP> go to the next page.

I haven't checked recently, but this was definitely true a few weeks ago. I think I even submitted an admin request on it, but I don't remember for sure.

-Barry
On Fri, Dec 08, 2000 at 01:43:39PM -0500, Guido van Rossum wrote:
... Comments please! We could:
- Live with the pathological cases.
I agree: live with it. The typical case will operate just fine.
- Forget the whole thing; and then also forget about firstkey() etc. which has the same problem only worse.
No opinion.
- Fix the algorithm. Maybe jumping criss-cross through the hash table like lookdict does would improve that; but I don't understand the math used for that ("Cycle through GF(2^n)-{0}" ???).
No need. The keys were inserted randomly, so sequencing through is effectively random. :-)
...

    static PyObject *
    dict_popitem(dictobject *mp, PyObject *args)
    {
        static int finger = 0;
        int i;
        dictentry *ep;
        PyObject *res;

        if (!PyArg_NoArgs(args))
            return NULL;
        if (mp->ma_used == 0) {
            PyErr_SetString(PyExc_KeyError,
                            "popitem(): dictionary is empty");
            return NULL;
        }
        i = finger;
        if (i >= mp->ma_size)
            ir = 0;
Should be "i = 0" Cheers, -g -- Greg Stein, http://www.lyra.org/
Guido van Rossum wrote:
After the last round of discussion, I was left with the idea that the best thing we could do to help destructive iteration is to introduce a {}.popitem() that returns an arbitrary (key, value) pair and deletes it. I wrote about this:
One more concern: if you repeatedly remove the *first* item, the hash table will start looking lopsided. Since we don't resize the hash table on deletes, maybe picking an item at random (but not using an expensive random generator!) would be better.
and Tim replied:
Which is the reason SETL doesn't specify *which* set item is removed: if you always start looking at "the front" of a dict that's being consumed, the dict fills with turds without shrinking, you skip over them again and again, and consuming the entire dict is still quadratic time.
Unfortunately, while using a random start point is almost always quicker than that, the expected time for consuming the whole dict remains quadratic.
The clearest way around that is to save a per-dict search finger, recording where the last search left off. Start from its current value. Failure if it wraps around. This is linear time in non-pathological cases (a pathological case is one in which it isn't linear time <wink>).
I've implemented this, except I use a static variable for the finger instead of a per-dict finger. I'm concerned about adding 4-8 extra bytes to each dict object for a feature that most dictionaries never need. So, instead, I use a single shared finger. This works just as well as long as this is used for a single dictionary. For multiple dictionaries (either used by the same thread or in different threads), it'll work almost as well, although it's possible to make up a pathological example that would work quadratically.
An easy example of such a pathological case is to call popitem() for two identical dictionaries in lock step.
Comments please! We could:
- Live with the pathological cases.
- Forget the whole thing; and then also forget about firstkey() etc. which has the same problem only worse.
- Fix the algorithm. Maybe jumping criss-cross through the hash table like lookdict does would improve that; but I don't understand the math used for that ("Cycle through GF(2^n)-{0}" ???).
That algorithm is really a gem which you should know, so let me try to explain it.

Intro: A little story about finite field theory (very basic).
-------------------------------------------------------------

For every prime p and every power p^n, there exists a Galois Field ( GF(p^n) ), which is a finite field. The additive group is called "elementary Abelian"; it is commutative, and it looks a little like a vector space, since addition works in cycles modulo p in every cell.

The multiplicative group is cyclic, and it never touches 0. Cyclic groups are generated by a single primitive element. The powers of that element make up all the other elements. For all elements of the multiplicative group GF(p^n)* the equality x^(p^n - 1) == 1 holds. A generator element is therefore a primitive (p^n - 1)th root of unity.
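A small Python illustration of the property being described, using one of the irreducible polynomials from dictobject.c (32 + 5, i.e. x^5 + x^2 + 1, for a table of size 32); repeatedly doubling and reducing is multiplication by a generator, so it visits every nonzero 5-bit pattern exactly once before coming back to the start:

    def cycle(poly, size):
        # One step is what the dict probe sequence does between slots:
        # shift left, and xor with the polynomial on overflow.
        mask = size - 1
        incr = 1
        seen = []
        while 1:
            seen.append(incr)
            incr = incr << 1
            if incr > mask:
                incr = incr ^ poly
            if incr == 1:
                break
        return seen

    visited = cycle(32 + 5, 32)
    print len(visited)              # 31 == 2**5 - 1
    check = visited[:]
    check.sort()
    print check == range(1, 32)     # 1: every nonzero pattern, exactly once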
christian wrote:
That algorithm is really a gem which you should know, so let me try to explain it.
I think someone just won the "brain exploder 2000" award ;-) to paraphrase Bertrand Russell, "Mathematics may be defined as the subject where I never know what you are talking about, nor whether what you are saying is true" cheers /F
On Mon, Dec 11, 2000 at 12:36:53PM +0100, Fredrik Lundh wrote:
christian wrote:
That algorithm is really a gem which you should know, so let me try to explain it.
I think someone just won the "brain exploder 2000" award ;-)
By acclamation, I'd expect. I know it was the best laugh I had since last week's Have I Got News For You, even though trying to understand it made me glad I had boring meetings to recuperate in ;) Highschool-dropout-ly y'rs, -- Thomas Wouters <thomas@xs4all.net> Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
Fredrik Lundh wrote:
christian wrote:
That algorithm is really a gem which you should know, so let me try to explain it.
I think someone just won the "brain exploder 2000" award ;-)
to paraphrase Bertrand Russell,
"Mathematics may be defined as the subject where I never know what you are talking about, nor whether what you are saying is true"
Hmm, I must have missed that one... care to repost ? -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
Hmm, I must have missed that one... care to repost ?
doesn't everyone here read the daily URL? here's a link: http://mail.python.org/pipermail/python-dev/2000-December/010913.html </F>
Fredrik Lundh wrote:
Hmm, I must have missed that one... care to repost ?
doesn't everyone here read the daily URL?
No time for pull logic... only push logic ;-)
here's a link: http://mail.python.org/pipermail/python-dev/2000-December/010913.html
Thanks. A very nice introduction indeed. The only thing which didn't come through in the first reading: why do we need GF(p^n)'s in the first place ? The second reading then made this clear: we need to assure that by iterating through the set of possible coefficients we can actually reach all slots in the dictionary... a gem indeed. Now if we could only figure out an equally simple way of producing perfect hash functions on-the-fly we could eliminate the need for the PyObject_Compare()s... ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
Fredrik Lundh wrote:
christian wrote:
That algorithm is really a gem which you should know, so let me try to explain it.
I think someone just won the "brain exploder 2000" award ;-)
to paraphrase Bertrand Russell,
"Mathematics may be defined as the subject where I never know what you are talking about, nor whether what you are saying is true"
:-))

Well, I was primarily targeting Guido, who said that he came from math, and one cannot study math without passing a basic algebra course, I think. I tried my best to explain it for those who know at least how groups, fields, rings and automorphisms work. Going into more details of the theory would be off-topic for python-dev, but I will try it in an upcoming DDJ article.

As you might have guessed, I didn't do this just for fun. It is the old game of explaining what is there, convincing everybody that you at least know what you are talking about, and then three days later coming up with an improved application of the theory.

Today is Monday, 2 days left. :-)

ciao - chris

-- Christian Tismer :^) <mailto:tismer@tismer.com> Mission Impossible 5oftware : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com
Fredrik Lundh wrote:
christian wrote:
That algorithm is really a gem which you should know, so let me try to explain it.
I think someone just won the "brain exploder 2000" award ;-)
to paraphrase Bertrand Russell,
"Mathematics may be defined as the subject where I never know what you are talking about, nor whether what you are saying is true"
:-))
Well, I was primarily targeting Guido, who said that he came from math, and one cannot study math without passing a basic algebra course, I think. I tried my best to explain it for those who know at least how groups, fields, rings and automorphisms work. Going into more details of the theory would be off-topic for python-dev, but I will try it in an upcoming DDJ article.
I do have a math degree, but it is 18 years old and I had to give up after the first paragraph of your explanation. It made me vividly recall the first and only class on Galois Theory that I ever took -- after one hour I realized that this was not for me and I didn't have a math brain after all. I went back to the basement where the software development lab was (i.e. a row of card punches :-).
As you might have guessed, I didn't do this just for fun. It is the old game of explaining what is there, convincing everybody that you at least know what you are talking about, and then three days later coming up with an improved application of the theory.
Today is Monday, 2 days left. :-)
I'm very impressed. --Guido van Rossum (home page: http://www.python.org/~guido/)
Old topic: {}.popitem() (was Re: {}.first[key,value,item] ...)

Christian Tismer wrote:
Fredrik Lundh wrote:
christian wrote:
That algorithm is really a gem which you should know, so let me try to explain it.
I think someone just won the "brain exploder 2000" award ;-)
<snip>
As you might have guessed, I didn't do this just for fun. It is the old game of explaining what is there, convincing everybody that you at least know what you are talking about, and then three days later coming up with an improved application of the theory.
Today is Monday, 2 days left. :-)
Ok, today is Sunday, I had no time to finish this. But now it is here.

===========================
=====      Claim:     =====
===========================

- Dictionary access time can be improved with a minimal change

- On the hash() function:

All Objects are supposed to provide a hash function which is as good as possible. Good means to provide a wide range of different keys for different values.

Problem: There are hash functions which are "good" in this sense, but they do not spread their randomness uniformly over the 32 bits.

Example: Integers use their own value as hash. This is ok, as far as the integers are uniformly distributed. But if they all contain a high power of two, for instance, the low bits give a very bad hash function.

Take a dictionary with integers range(1000) as keys and access all entries. Then use a dictionary with the integers shifted left by 16. Access time is slowed down by a factor of 100, since every access is a linear search now.

This is not an urgent problem, although applications exist where this can play a role (memory addresses for instance can have high factors of two when people do statistics on page accesses...)

While this is not a big problem, it is ugly enough to think of a solution.

Solution 1:
-------------
Try to involve more bits of the hash value by doing extra shuffling, either
a) in the dictlook function, or
b) in the hash generation itself.

I believe both can't really be justified for a rare problem. But how about changing the existing solution in a way that an improvement is gained without extra cost?

Solution 2: (*the* solution)
----------------------------
Some people may remember what I wrote about re-hashing functions through the multiplicative group GF(2^n)*, and I don't want to repeat this here. The simple idea can be summarized quickly:

The original algorithm uses multiplication by polynomials, and it is guaranteed that these re-hash values are jittering through all possible nonzero patterns of the n bits.

Observation: We are using an operation of a finite field. This means that the inverse of multiplication also exists!

Old algorithm (multiplication):
    shift the index left by 1
    if index > mask:
        xor the index with the generator polynomial

New algorithm (division):
    if low bit of index set:
        xor the index with the generator polynomial
    shift the index right by 1

What does this mean? Not so much, we are just cycling through our bit patterns in reverse order.

But now for the big difference.
First change:  We change from multiplication to division.
Second change: We do not mask the hash value before!

The second change is what I was after: By not masking the hash value when computing the initial index, all the existing bits in the hash come into play. This can be seen like a polynomial division, but the initial remainder (the hash value) was not normalized. After a number of steps, all the extra bits are wheeled into our index, but not wasted by masking them off. That gives our re-hash some more randomness. When all the extra bits are sucked in, the guaranteed single-visit cycle begins. There cannot be more than 27 extra cycles in the worst case (dict size = 32, so there are 27 bits to consume). I do not expect any bad effect from this modification.

Here are some results, dictionaries have 1000 entries:

timing for strings             old=  5.097  new=  5.088
timing for bad integers (<<10) old=101.540  new= 12.610
timing for bad integers (<<16) old=571.210  new= 19.220

On strings, both algorithms behave the same. On numbers, they differ dramatically.
While the current algorithm is 110 times slower on a worst case dict (quadratic behavior), the new algorithm accounts a little for the extra cycle, but is only 4 times slower.

Alternative implementation:

The above approach is conservative in the sense that it tries not to slow down the current implementation in any way. An alternative would be to consume all of the extra bits at once. But this would add an extra "warmup" loop like this to the algorithm:

    while index > mask:
        if low bit of index set:
            xor the index with the generator polynomial
        shift the index right by 1

This is of course a very good digest of the higher bits, since it is a polynomial division and not just some bit xor-ing which might give quite predictable cancellations, therefore it is "the right way" in my sense. It might be cheap, but would add over 20 cycles to every small dict. I therefore don't think it is worth doing. Personally, I prefer the solution to merge the bits during the actual lookup, since it suffices to get access time from quadratic down to logarithmic.

Attached is a direct translation of the relevant parts of dictobject.c into Python, with both algorithms implemented.

cheers - chris

-- Christian Tismer :^) <mailto:tismer@tismer.com> Mission Impossible 5oftware : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com

## dictest.py
## Test of a new rehash algorithm
## Chris Tismer
## 2000-12-17
## Mission Impossible 5oftware Team

# The following is a partial re-implementation of
# Python dictionaries in Python.
# The original algorithm was literally turned
# into Python code.

##/*
##Table of irreducible polynomials to efficiently cycle through
##GF(2^n)-{0}, 2<=n<=30.
##*/
polys = [
    4 + 3, 8 + 3, 16 + 3, 32 + 5, 64 + 3, 128 + 3,
    256 + 29, 512 + 17, 1024 + 9, 2048 + 5, 4096 + 83,
    8192 + 27, 16384 + 43, 32768 + 3, 65536 + 45, 131072 + 9,
    262144 + 39, 524288 + 39, 1048576 + 9, 2097152 + 5,
    4194304 + 3, 8388608 + 33, 16777216 + 27, 33554432 + 9,
    67108864 + 71, 134217728 + 39, 268435456 + 9, 536870912 + 5,
    1073741824 + 83, 0
]

class NULL: pass

class Dictionary:
    dummy = "<dummy key>"

    def __init__(mp, newalg=0):
        mp.ma_size = 0
        mp.ma_poly = 0
        mp.ma_table = []
        mp.ma_fill = 0
        mp.ma_used = 0
        mp.oldalg = not newalg

    def lookdict(mp, key, _hash):
        me_hash, me_key, me_value = range(3) # rec slots
        dummy = mp.dummy
        mask = mp.ma_size-1
        ep0 = mp.ma_table
        i = (~_hash) & mask
        ep = ep0[i]
        if ep[me_key] is NULL or ep[me_key] == key:
            return ep
        if ep[me_key] == dummy:
            freeslot = ep
        else:
            if (ep[me_hash] == _hash and cmp(ep[me_key], key) == 0) :
                return ep
            freeslot = NULL
        ###### FROM HERE
        if mp.oldalg:
            incr = (_hash ^ (_hash >> 3)) & mask
        else:
            # note that we do not mask!
            # even the shifting may not be worth it.
            incr = _hash ^ (_hash >> 3)
        ###### TO HERE
        if (not incr):
            incr = mask
        while 1:
            ep = ep0[(i+incr)&mask]
            if (ep[me_key] is NULL) :
                if (freeslot != NULL) :
                    return freeslot
                else:
                    return ep
            if (ep[me_key] == dummy) :
                if (freeslot == NULL):
                    freeslot = ep
            elif (ep[me_key] == key or
                  (ep[me_hash] == _hash and
                   cmp(ep[me_key], key) == 0)) :
                return ep
            # Cycle through GF(2^n)-{0}
            ###### FROM HERE
            if mp.oldalg:
                incr = incr << 1
                if (incr > mask):
                    incr = incr ^ mp.ma_poly
            else:
                # new algorithm: do a division
                if (incr & 1):
                    incr = incr ^ mp.ma_poly
                incr = incr >> 1
            ###### TO HERE

    def insertdict(mp, key, _hash, value):
        me_hash, me_key, me_value = range(3) # rec slots
        ep = mp.lookdict(key, _hash)
        if (ep[me_value] is not NULL) :
            old_value = ep[me_value]
            ep[me_value] = value
        else :
            if (ep[me_key] is NULL):
                mp.ma_fill=mp.ma_fill+1
            ep[me_key] = key
            ep[me_hash] = _hash
            ep[me_value] = value
            mp.ma_used = mp.ma_used+1

    def dictresize(mp, minused):
        me_hash, me_key, me_value = range(3) # rec slots
        oldsize = mp.ma_size
        oldtable = mp.ma_table
        MINSIZE = 4
        newsize = MINSIZE
        for i in range(len(polys)):
            if (newsize > minused) :
                newpoly = polys[i]
                break
            newsize = newsize << 1
        else:
            return -1
        _nullentry = range(3)
        _nullentry[me_hash] = 0
        _nullentry[me_key] = NULL
        _nullentry[me_value] = NULL
        newtable = map(lambda x,y=_nullentry:y[:], range(newsize))

        mp.ma_size = newsize
        mp.ma_poly = newpoly
        mp.ma_table = newtable
        mp.ma_fill = 0
        mp.ma_used = 0

        for ep in oldtable:
            if (ep[me_value] is not NULL):
                mp.insertdict(ep[me_key],ep[me_hash],ep[me_value])
        return 0

    # PyDict_GetItem
    def __getitem__(op, key):
        me_hash, me_key, me_value = range(3) # rec slots
        if not op.ma_table:
            raise KeyError, key
        _hash = hash(key)
        return op.lookdict(key, _hash)[me_value]

    # PyDict_SetItem
    def __setitem__(op, key, value):
        mp = op
        _hash = hash(key)
        ## /* if fill >= 2/3 size, double in size */
        if (mp.ma_fill*3 >= mp.ma_size*2) :
            if (mp.dictresize(mp.ma_used*2) != 0):
                if (mp.ma_fill+1 > mp.ma_size):
                    raise MemoryError
        mp.insertdict(key, _hash, value)

    # more interface functions
    def keys(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append( _key)
        return res

    def values(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append( _value)
        return res

    def items(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append( (_key, _value) )
        return res

    def __cmp__(self, other):
        mine = self.items()
        others = other.items()
        mine.sort()
        others.sort()
        return cmp(mine, others)

######################################################
## tests

def timing(func, args=None, n=1, **keywords) :
    import time
    time=time.time
    appl=apply
    if args is None: args = ()
    if type(args) != type(()) : args=(args,)
    rep=range(n)
    dummyarg = ("",)
    dummykw = {}
    dummyfunc = len
    if keywords:
        before=time()
        for i in rep: res=appl(dummyfunc, dummyarg, dummykw)
        empty = time()-before
        before=time()
        for i in rep: res=appl(func, args, keywords)
    else:
        before=time()
        for i in rep: res=appl(dummyfunc, dummyarg)
        empty = time()-before
        before=time()
        for i in rep: res=appl(func, args)
    after = time()
    return round(after-before-empty,4), res

def test(lis, dic):
    for key in lis: dic[key]

def nulltest(lis, dic):
    for key in lis: dic

def string_dicts():
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    for i in range(1000):
        s = str(i) * 5
        d1[s] = d2[s] = i
    return d1, d2

def badnum_dicts():
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    for i in range(1000):
        bad = i << 16
        d1[bad] = d2[bad] = i
    return d1, d2

def do_test(dict, keys, n):
    t0 = timing(nulltest, (keys, dict), n)[0]
    t1 = timing(test, (keys, dict), n)[0]
    return t1-t0

if __name__ == "__main__":
    sdold, sdnew = string_dicts()
    bdold, bdnew = badnum_dicts()
    print "timing for strings old=%.3f new=%.3f" % (
        do_test(sdold, sdold.keys(), 100),
        do_test(sdnew, sdnew.keys(), 100) )
    print "timing for bad integers old=%.3f new=%.3f" % (
        do_test(bdold, bdold.keys(), 10) *10,
        do_test(bdnew, bdnew.keys(), 10) *10)

"""
D:\crml_doc\platf\py>python dictest.py
timing for strings old=5.097 new=5.088
timing for bad integers old=101.540 new=12.610
"""
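A companion check for the claim in the message above that the division variant just walks the same cycle backwards; the polynomial (32 + 5) is taken from the polys table in the script:

    def cycle_mul(poly, size):
        mask = size - 1
        incr = 1
        seen = []
        while 1:
            seen.append(incr)
            incr = incr << 1            # old algorithm: multiply
            if incr > mask:
                incr = incr ^ poly
            if incr == 1:
                break
        return seen

    def cycle_div(poly):
        incr = 1
        seen = []
        while 1:
            seen.append(incr)
            if incr & 1:                # new algorithm: divide
                incr = incr ^ poly
            incr = incr >> 1
            if incr == 1:
                break
        return seen

    m = cycle_mul(32 + 5, 32)
    d = cycle_div(32 + 5)
    print len(m), len(d)                # 31 31: both hit every nonzero pattern
    d.reverse()
    print d == m[1:] + m[:1]            # 1: same cycle, opposite direction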
Christian Tismer wrote:
... (my timings)

Attached is the updated script with the timings mentioned in the last posting. Sorry, I posted an older version before.

-- Christian Tismer :^) <mailto:tismer@tismer.com> Mission Impossible 5oftware : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com

## dictest.py
## Test of a new rehash algorithm
## Chris Tismer
## 2000-12-17
## Mission Impossible 5oftware Team

# The following is a partial re-implementation of
# Python dictionaries in Python.
# The original algorithm was literally turned
# into Python code.

##/*
##Table of irreducible polynomials to efficiently cycle through
##GF(2^n)-{0}, 2<=n<=30.
##*/
polys = [
    4 + 3, 8 + 3, 16 + 3, 32 + 5, 64 + 3, 128 + 3,
    256 + 29, 512 + 17, 1024 + 9, 2048 + 5, 4096 + 83,
    8192 + 27, 16384 + 43, 32768 + 3, 65536 + 45, 131072 + 9,
    262144 + 39, 524288 + 39, 1048576 + 9, 2097152 + 5,
    4194304 + 3, 8388608 + 33, 16777216 + 27, 33554432 + 9,
    67108864 + 71, 134217728 + 39, 268435456 + 9, 536870912 + 5,
    1073741824 + 83, 0
]

class NULL: pass

class Dictionary:
    dummy = "<dummy key>"

    def __init__(mp, newalg=0):
        mp.ma_size = 0
        mp.ma_poly = 0
        mp.ma_table = []
        mp.ma_fill = 0
        mp.ma_used = 0
        mp.oldalg = not newalg

    def lookdict(mp, key, _hash):
        me_hash, me_key, me_value = range(3) # rec slots
        dummy = mp.dummy
        mask = mp.ma_size-1
        ep0 = mp.ma_table
        i = (~_hash) & mask
        ep = ep0[i]
        if ep[me_key] is NULL or ep[me_key] == key:
            return ep
        if ep[me_key] == dummy:
            freeslot = ep
        else:
            if (ep[me_hash] == _hash and cmp(ep[me_key], key) == 0) :
                return ep
            freeslot = NULL
        ###### FROM HERE
        if mp.oldalg:
            incr = (_hash ^ (_hash >> 3)) & mask
        else:
            # note that we do not mask!
            # even the shifting may not be worth it.
            incr = _hash ^ (_hash >> 3)
        ###### TO HERE
        if (not incr):
            incr = mask
        while 1:
            ep = ep0[(i+incr)&mask]
            if (ep[me_key] is NULL) :
                if (freeslot != NULL) :
                    return freeslot
                else:
                    return ep
            if (ep[me_key] == dummy) :
                if (freeslot == NULL):
                    freeslot = ep
            elif (ep[me_key] == key or
                  (ep[me_hash] == _hash and
                   cmp(ep[me_key], key) == 0)) :
                return ep
            # Cycle through GF(2^n)-{0}
            ###### FROM HERE
            if mp.oldalg:
                incr = incr << 1
                if (incr > mask):
                    incr = incr ^ mp.ma_poly
            else:
                # new algorithm: do a division
                if (incr & 1):
                    incr = incr ^ mp.ma_poly
                incr = incr >> 1
            ###### TO HERE

    def insertdict(mp, key, _hash, value):
        me_hash, me_key, me_value = range(3) # rec slots
        ep = mp.lookdict(key, _hash)
        if (ep[me_value] is not NULL) :
            old_value = ep[me_value]
            ep[me_value] = value
        else :
            if (ep[me_key] is NULL):
                mp.ma_fill=mp.ma_fill+1
            ep[me_key] = key
            ep[me_hash] = _hash
            ep[me_value] = value
            mp.ma_used = mp.ma_used+1

    def dictresize(mp, minused):
        me_hash, me_key, me_value = range(3) # rec slots
        oldsize = mp.ma_size
        oldtable = mp.ma_table
        MINSIZE = 4
        newsize = MINSIZE
        for i in range(len(polys)):
            if (newsize > minused) :
                newpoly = polys[i]
                break
            newsize = newsize << 1
        else:
            return -1
        _nullentry = range(3)
        _nullentry[me_hash] = 0
        _nullentry[me_key] = NULL
        _nullentry[me_value] = NULL
        newtable = map(lambda x,y=_nullentry:y[:], range(newsize))

        mp.ma_size = newsize
        mp.ma_poly = newpoly
        mp.ma_table = newtable
        mp.ma_fill = 0
        mp.ma_used = 0

        for ep in oldtable:
            if (ep[me_value] is not NULL):
                mp.insertdict(ep[me_key],ep[me_hash],ep[me_value])
        return 0

    # PyDict_GetItem
    def __getitem__(op, key):
        me_hash, me_key, me_value = range(3) # rec slots
        if not op.ma_table:
            raise KeyError, key
        _hash = hash(key)
        return op.lookdict(key, _hash)[me_value]

    # PyDict_SetItem
    def __setitem__(op, key, value):
        mp = op
        _hash = hash(key)
        ## /* if fill >= 2/3 size, double in size */
        if (mp.ma_fill*3 >= mp.ma_size*2) :
            if (mp.dictresize(mp.ma_used*2) != 0):
                if (mp.ma_fill+1 > mp.ma_size):
                    raise MemoryError
        mp.insertdict(key, _hash, value)

    # more interface functions
    def keys(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append( _key)
        return res

    def values(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append( _value)
        return res

    def items(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append( (_key, _value) )
        return res

    def __cmp__(self, other):
        mine = self.items()
        others = other.items()
        mine.sort()
        others.sort()
        return cmp(mine, others)

######################################################
## tests

def timing(func, args=None, n=1, **keywords) :
    import time
    time=time.time
    appl=apply
    if args is None: args = ()
    if type(args) != type(()) : args=(args,)
    rep=range(n)
    dummyarg = ("",)
    dummykw = {}
    dummyfunc = len
    if keywords:
        before=time()
        for i in rep: res=appl(dummyfunc, dummyarg, dummykw)
        empty = time()-before
        before=time()
        for i in rep: res=appl(func, args, keywords)
    else:
        before=time()
        for i in rep: res=appl(dummyfunc, dummyarg)
        empty = time()-before
        before=time()
        for i in rep: res=appl(func, args)
    after = time()
    return round(after-before-empty,4), res

def test(lis, dic):
    for key in lis: dic[key]

def nulltest(lis, dic):
    for key in lis: dic

def string_dicts():
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    for i in range(1000):
        s = str(i) * 5
        d1[s] = d2[s] = i
    return d1, d2

def badnum_dicts():
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    shift = 10
    if EXTREME:
        shift = 16
    for i in range(1000):
        bad = i << shift
        d1[bad] = d2[bad] = i
    return d1, d2

def do_test(dict, keys, n):
    t0 = timing(nulltest, (keys, dict), n)[0]
    t1 = timing(test, (keys, dict), n)[0]
    return t1-t0

EXTREME=1

if __name__ == "__main__":
    sdold, sdnew = string_dicts()
    bdold, bdnew = badnum_dicts()
    print "timing for strings old=%.3f new=%.3f" % (
        do_test(sdold, sdold.keys(), 100),
        do_test(sdnew, sdnew.keys(), 100) )
    print "timing for bad integers old=%.3f new=%.3f" % (
        do_test(bdold, bdold.keys(), 10) *10,
        do_test(bdnew, bdnew.keys(), 10) *10)

"""
Results with a shift of 10 (EXTREME=0):

D:\crml_doc\platf\py>python dictest.py
timing for strings old=5.097 new=5.088
timing for bad integers old=101.540 new=12.610

Results with a shift of 16 (EXTREME=1):

D:\crml_doc\platf\py>python dictest.py
timing for strings old=5.218 new=5.147
timing for bad integers old=571.210 new=19.220
"""
Here some results, dictionaries have 1000 entries:
timing for strings old= 5.097 new= 5.088 timing for bad integers (<<10) old=101.540 new=12.610 timing for bad integers (<<16) old=571.210 new=19.220
Even though I think concentrating on string keys would provide more performance boost for Python in general, I think you have a point there. +1 from here. BTW, would changing the hash function on strings from the simple XOR scheme to something a little smarter help improve the performance too (e.g. most strings used in programming never use the 8-th bit) ? I also think that we could inline the string compare function in dictobject:lookdict_string to achieve even better performance. Currently it uses a function which doesn't trigger compiler inlining. And finally: I think a generic PyString_Compare() API would be useful in a lot of places where strings are being compared (e.g. dictionaries and keyword parameters). Unicode already has such an API (along with dozens of other useful APIs which are not available for strings). -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
"M.-A. Lemburg" wrote:
Here some results, dictionaries have 1000 entries:
timing for strings old= 5.097 new= 5.088 timing for bad integers (<<10) old=101.540 new=12.610 timing for bad integers (<<16) old=571.210 new=19.220
Even though I think concentrating on string keys would provide more performance boost for Python in general, I think you have a point there. +1 from here.
BTW, would changing the hash function on strings from the simple XOR scheme to something a little smarter help improve the performance too (e.g. most strings used in programming never use the 8-th bit) ?
Yes, it would. I spent the rest of last night doing more accurate tests, also refined the implementation (using longs for the shifts etc), and turned from timing over to trip counting, i.e. a dict counts every round through the re-hash. That showed two things:

- The bits used from the string hash are not well distributed
- using a "warmup wheel" on the hash to suck all bits in gives the same quality of hashes as random numbers.

I will publish some results later today.
I also think that we could inline the string compare function in dictobject:lookdict_string to achieve even better performance. Currently it uses a function which doesn't trigger compiler inlining.
Sure! ciao - chris -- Christian Tismer :^) <mailto:tismer@tismer.com> Mission Impossible 5oftware : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com
Problem: There are hash functions which are "good" in this sense, but they do not spread their randomness uniformly over the 32 bits.
Example: Integers use their own value as hash. This is ok, as far as the integers are uniformly distributed. But if they all contain a high power of two, for instance, the low bits give a very bad hash function.
Take a dictionary with integers range(1000) as keys and access all entries. Then use a dictionary with the integers shifted left by 16. Access time is slowed down by a factor of 100, since every access is a linear search now.
Ai. I think what happened is this: long ago, the hash table sizes were primes, or at least not powers of two! I'll leave it to the more mathematically-inclined to judge your solution... --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote: [me, expanding on hashes, integers,and how to tame them cheaply]
Ai. I think what happened is this: long ago, the hash table sizes were primes, or at least not powers of two!
At some time I will wake up and they tell me that I'm reducible :-)
I'll leave it to the more mathematically-inclined to judge your solution...
I love small lists! - ciao - chris +1 (being a member, hopefully) -- Christian Tismer :^) <mailto:tismer@tismer.com> Mission Impossible 5oftware : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com
Sounds good to me! It's a very cheap way to get the high bits into play.
i = (~_hash) & mask
The ~ here seems like pure superstition to me (and the comments in the C code don't justify it at all -- I added a nag of my own about that the last time I checked in dictobject.c -- and see below for a bad consequence of doing ~).
# note that we do not mask!
# even the shifting may not be worth it.
incr = _hash ^ (_hash >> 3)
The shifting was just another cheap trick to get *some* influence from the high bits. It's very limited, though. Toss it (it appears to be from the "random operations yield random results" <wink/sigh> matchbook school of design). [MAL]
BTW, would changing the hash function on strings from the simple XOR scheme to something a little smarter help improve the performance too (e.g. most strings used in programming never use the 8-th bit) ?
Don't understand -- the string hash uses multiplication:

    x = (1000003*x) ^ *p++;

in a loop. Replacing "^" there by "+" should yield slightly better results.

As is, string hashes are a lot like integer hashes, in that "consecutive" strings

    J001 J002 J003 J004 ...

yield hashes very close together in value. But, because the current dict algorithm uses ~ on the full hash but does not use ~ on the initial increment, (~hash)+incr too often yields the same result for distinct hashes (i.e., there's a systematic (but weak) form of clustering).

Note that Python is doing something very unusual: hashes are *usually* designed to yield an approximation to randomness across all bits. But Python's hashes never achieve that. This drives theoreticians mad (like the fellow who originally came up with the GF idea), but tends to work "better than random" in practice (e.g., a truly random hash function would almost certainly produce many collisions when fed a fat range of consecutive integers but still less than half the table size; but Python's trivial "identity" integer hash produces no collisions in that common case).

[Christian]
- The bits used from the string hash are not well distributed
- using a "warmup wheel" on the hash to suck all bits in gives the same quality of hashes as random numbers.
See above and be very cautious: none of Python's hash functions produce well-distributed bits, and-- in effect --that's why Python dicts often perform "better than random" on common data. Even what you've done so far appears to provide marginally worse statistics for Guido's favorite kind of test case ("worse" in two senses: total number of collisions (a measure of amortized lookup cost), and maximum collision chain length (a measure of worst-case lookup cost)):

    d = {}
    for i in range(N):
        d[repr(i)] = i

check-in-one-thing-then-let-it-simmer-ly y'rs  - tim
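To see the clustering Tim describes without opening the C source, here is a toy model of just the inner loop he quotes (the real string hash also seeds x from the first character and mixes in the length, which is left out here, so the absolute values are not Python's):

    def loop_hash(s):
        x = 0
        for c in s:
            x = (1000003 * x) ^ ord(c)
            x = x & 0xffffffffL        # mimic 32-bit C wraparound
        return x

    for name in ('J001', 'J002', 'J003', 'J004'):
        print name, loop_hash(name)

The four results differ only in their low bits, which is the "hashes very close together in value" effect for consecutive strings.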
Tim Peters wrote:
Sounds good to me! It's a very cheap way to get the high bits into play.
That's what I wanted to hear. It's also the reason why I try to stay conservative: Just do an obviously useful bit, but do not break any of the inherent benefits, like those "better than random" amenities. Python's dictionary algorithm appears to be "near perfect" and of the "never touch but very carefully or redo it completely" kind. I tried the tightrope walk of just adding a tiny topping.
i = (~_hash) & mask
Yes that stuff was 2 hours last nite :-) I just decided to not touch it. Arbitrary crap! Although an XOR with hash >> number of mask bits would perform much better (in many cases but not all). Anyway, simple shifting cannot solve general bit distribution problems. Nor can I :-)
The ~ here seems like pure superstition to me (and the comments in the C code don't justify it at all -- I added a nag of my own about that the last time I checked in dictobject.c -- and see below for a bad consequence of doing ~).
# note that we do not mask!
# even the shifting may not be worth it.
incr = _hash ^ (_hash >> 3)
The shifting was just another cheap trick to get *some* influence from the high bits. It's very limited, though. Toss it (it appears to be from the "random operations yield random results" <wink/sigh> matchbook school of design).
Now, comment it out, and you see my new algorithm perform much worse. I just kept it since it had an advantage on "my case" (bad guy, I know). And I wanted to have an argument for my change to get accepted. "No cost, just profit, nearly the same" was what I tried to sell.
[MAL]
BTW, would changing the hash function on strings from the simple XOR scheme to something a little smarter help improve the performance too (e.g. most strings used in programming never use the 8-th bit) ?
Don't understand -- the string hash uses multiplication:
x = (1000003*x) ^ *p++;
in a loop. Replacing "^" there by "+" should yield slightly better results.
For short strings, this prime has a bad influence on the low bits, making it perform suboptimally for small dicts. See the new2 algo which funnily corrects for that. The reason is obvious: Just look at the bit pattern of 1000003: '0xf4243'

Without giving proof, this smells like bad bit distribution on small strings to me. You smell it too, right?
As is, string hashes are a lot like integer hashes, in that "consecutive" strings
J001 J002 J003 J004 ...
yield hashes very close together in value.
A bad generator in that case. I'll look for a better one.
But, because the current dict algorithm uses ~ on the full hash but does not use ~ on the initial increment, (~hash)+incr too often yields the same result for distinct hashes (i.e., there's a systematic (but weak) form of clustering).
You name it.
Note that Python is doing something very unusual: hashes are *usually* designed to yield an approximation to randomness across all bits. But Python's hashes never achieve that. This drives theoreticians mad (like the fellow who originally came up with the GF idea), but tends to work "better than random" in practice (e.g., a truly random hash function would almost certainly produce many collisions when fed a fat range of consecutive integers but still less than half the table size; but Python's trivial "identity" integer hash produces no collisions in that common case).
A good reason to be careful with changes(ahem).
[Christian]
- The bits used from the string hash are not well distributed
- using a "warmup wheel" on the hash to suck all bits in gives the same quality of hashes as random numbers.
See above and be very cautious: none of Python's hash functions produce well-distributed bits, and-- in effect --that's why Python dicts often perform "better than random" on common data. Even what you've done so far appears to provide marginally worse statistics for Guido's favorite kind of test case ("worse" in two senses: total number of collisions (a measure of amortized lookup cost), and maximum collision chain length (a measure of worst-case lookup cost)):
d = {}
for i in range(N):
    d[repr(i)] = i
Nah, I did quite a lot of tests, and the trip number shows a variation of about 10%, without favoring either the old or the new algorithm. This is just the randomness inside.
check-in-one-thing-then-let-it-simmer-ly y'rs - tim
This is why I think I should be even more conservative: try to use a division wheel, but with the inverses of the original primitive roots, just in order to get at Guido's results :-)

making-perfect-hashes-of-interneds-still-looks-promising - ly y'rs - chris

-- Christian Tismer :^) <mailto:tismer@tismer.com> Mission Impossible 5oftware : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com
[Christian Tismer]
... For short strings, this prime has a bad influence on the low bits, making it perform suboptimally for small dicts. See the new2 algo which funnily corrects for that. The reason is obvious: Just look at the bit pattern of 1000003: '0xf4243'
Without giving proof, this smells like bad bit distribution on small strings to me. You smell it too, right? ...
[Tim]
As is, string hashes are a lot like integer hashes, in that "consecutive" strings
J001 J002 J003 J004 ...
yield hashes very close together in value.
[back to Christian]
A bad generator in that case. I'll look for a better one.
Not necessarily! It's for that same reason "consecutive strings" can have "better than random" behavior today. And consecutive strings-- like consecutive ints --are a common case.

Here are the numbers for the synthesized string cases:

N=1000
trips for strings old=293 new=302 new2=221
trips for bin strings old=0 new=0 new2=0
N=2000
trips for strings old=1093 new=1109 new2=786
trips for bin strings old=0 new=0 new2=0
N=3000
trips for strings old=810 new=839 new2=609
trips for bin strings old=0 new=0 new2=0
N=4000
trips for strings old=1850 new=1843 new2=1375
trips for bin strings old=0 new=0 new2=0

Here they are again, after doing nothing except changing the "^" to "+" in the string hash, i.e. replacing

    x = (1000003*x) ^ *p++;

by

    x = (1000003*x) + *p++;

N=1000
trips for strings old=140 new=127 new2=108
trips for bin strings old=0 new=0 new2=0
N=2000
trips for strings old=480 new=434 new2=411
trips for bin strings old=0 new=0 new2=0
N=3000
trips for strings old=821 new=857 new2=631
trips for bin strings old=0 new=0 new2=0
N=4000
trips for strings old=1892 new=1852 new2=1476
trips for bin strings old=0 new=0 new2=0

The first two sizes are dramatically better, the last two a wash.

If you want to see a real disaster, replace the "+" with "*" <wink>:

N=1000
trips for strings old=71429 new=6449 new2=2048
trips for bin strings old=81187 new=41117 new2=41584
N=2000
trips for strings old=26882 new=9300 new2=6103
trips for bin strings old=96018 new=46932 new2=42408

I got tired of waiting at that point ...

suspecting-a-better-string-hash-is-hard-to-find-ly y'rs - tim
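To make concrete what that one-character change does, here is a rough Python model (mine, not the benchmark above) of the old string hash with both combining steps; the starting value and the final length XOR follow my recollection of the era's C code, so treat the details as approximate:

    def string_hash(s, use_xor=True):
        # use_xor=True models "x = (1000003*x) ^ *p++"; use_xor=False is the "+" variant.
        if not s:
            return 0
        x = ord(s[0]) << 7
        for ch in s:
            x = (1000003 * x) ^ ord(ch) if use_xor else (1000003 * x) + ord(ch)
            x &= 0xFFFFFFFF          # simulate 32-bit unsigned C arithmetic
        return x ^ len(s)

    keys = ["J%03d" % i for i in range(1, 5)]    # the "consecutive" strings from above
    print([hex(string_hash(k)) for k in keys])                   # ^ variant
    print([hex(string_hash(k, use_xor=False)) for k in keys])    # + variant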
Again... Tim Peters wrote:
Sounds good to me! It's a very cheap way to get the high bits into play.
...
[Christian]
- The bits used from the string hash are not well distributed
- using a "warmup wheel" on the hash to suck all bits in gives the same quality of hashes as random numbers.
See above and be very cautious: none of Python's hash functions produce well-distributed bits, and-- in effect --that's why Python dicts often perform "better than random" on common data. Even what you've done so far appears to provide marginally worse statistics for Guido's favorite kind of test case ("worse" in two senses: total number of collisions (a measure of amortized lookup cost), and maximum collision chain length (a measure of worst-case lookup cost)):
    d = {}
    for i in range(N):
        d[repr(i)] = i
I will look into this.
check-in-one-thing-then-let-it-simmer-ly y'rs - tim
Are you saying I should check the thing in? Really?

In another reply to this message I was saying
"""
This is why I'm thinking of being even more conservative: try to use a division wheel, but with the inverses of the original primitive roots, just in order to get at Guido's results :-)
"""
This was a religious desire, but such an inverse cannot exist. Well, all inverses exist, but it is an error to think that they can produce similar bit patterns. Changing the root means changing the whole system, since we have just a *representation* of a group, via polynomial coefficients. A simple example which renders my thought useless is this: there is no general pattern that can turn a physical right shift into a left shift, for all bit combinations.

Anyway, how can I produce a nearly complete scheme like today's with the same "cheaper than random" properties?

Ok, we have to stick with the given polynomials to stay compatible, and we also have to shift left. How do we then rotate the random bits in? Well, we can in fact do a rotation of the whole index, moving the highest bit into the lowest. Too bad that this isn't supported in C. It is a native machine instruction on X86 machines.

We would then have:

    incr = ROTATE_LEFT(incr, 1)
    if (incr > mask):
        incr = incr ^ mp.ma_poly

The effect is similar to the "old" algorithm: bits are shifted left. Only if the hash happens to have high bits do they appear in the modulus. On the current "faster than random" cases, I assume that high bits in the hash are less likely than low bits, so it is more likely that an entry finds its good place in the dict, before bits are rotated in. Hence the "good" cases would be kept.

I did all tests again, now including maximum trip length, and added a "rotate-left" version as well:

D:\crml_doc\platf\py>python dictest.py
N=1000
trips for strings old=293/9 new=302/7 new2=221/7 rot=278/5
trips for bad integers old=499500/999 new=13187/31 new2=999/1 rot=16754/31
trips for random integers old=360/8 new=369/8 new2=358/6 rot=356/7
trips for windows names old=230/5 new=207/7 new2=200/5 rot=225/5
N=2000
trips for strings old=1093/11 new=1109/10 new2=786/6 rot=1082/8
trips for bad integers old=0/0 new=26455/32 new2=1999/1 rot=33524/34
trips for random integers old=704/7 new=686/8 new2=685/7 rot=693/7
trips for windows names old=503/8 new=542/9 new2=564/6 rot=529/7
N=3000
trips for strings old=810/5 new=839/6 new2=609/5 rot=796/5
trips for bad integers old=0/0 new=38681/36 new2=2999/1 rot=49828/38
trips for random integers old=708/5 new=723/7 new2=724/5 rot=722/6
trips for windows names old=712/6 new=711/5 new2=691/5 rot=738/9
N=4000
trips for strings old=1850/9 new=1843/8 new2=1375/11 rot=1848/10
trips for bad integers old=0/0 new=52994/39 new2=3999/1 rot=66356/38
trips for random integers old=1395/9 new=1397/8 new2=1435/9 rot=1394/13
trips for windows names old=1449/8 new=1434/8 new2=1457/11 rot=1513/9
D:\crml_doc\platf\py>

Concerning trip length, rotate is better than old in most cases. Random integers seem to withstand any of these procedures. For bad integers, rot naturally takes more trips than new, since the path to the bits is longer.

All in all I don't see more than marginal differences between the approaches, and I tend to stick with "new", since it is the cheapest to implement (it does not cost anything and might instead be a little cheaper for some compilers, since it does not reference the mask variable).

I'd say let's do the patch --

ciao - chris

-- Christian Tismer :^) <mailto:tismer@tismer.com> Mission Impossible 5oftware : Have a break!
Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com

## dictest.py
## Test of a new rehash algorithm
## Chris Tismer
## 2000-12-17
## Mission Impossible 5oftware Team

# The following is a partial re-implementation of
# Python dictionaries in Python.
# The original algorithm was literally turned
# into Python code.

##/*
##Table of irreducible polynomials to efficiently cycle through
##GF(2^n)-{0}, 2<=n<=30.
##*/
polys = [
    4 + 3, 8 + 3, 16 + 3, 32 + 5, 64 + 3, 128 + 3, 256 + 29,
    512 + 17, 1024 + 9, 2048 + 5, 4096 + 83, 8192 + 27, 16384 + 43,
    32768 + 3, 65536 + 45, 131072 + 9, 262144 + 39, 524288 + 39,
    1048576 + 9, 2097152 + 5, 4194304 + 3, 8388608 + 33, 16777216 + 27,
    33554432 + 9, 67108864 + 71, 134217728 + 39, 268435456 + 9,
    536870912 + 5, 1073741824 + 83,
    0
]
polys = map(long, polys)

class NULL: pass

class Dictionary:
    dummy = "<dummy key>"

    def __init__(mp, newalg=0):
        mp.ma_size = 0
        mp.ma_poly = 0
        mp.ma_table = []
        mp.ma_fill = 0
        mp.ma_used = 0
        mp.oldalg = not newalg
        mp.warmup = newalg==2
        mp.rotleft = newalg==3
        mp.trips = 0
        mp.tripmax = 0

    def getTrips(self):
        trips, tripmax = self.trips, self.tripmax
        self.trips = self.tripmax = 0
        return trips, tripmax

    def lookdict(mp, key, _hash):
        me_hash, me_key, me_value = range(3) # rec slots
        dummy = mp.dummy
        mask = mp.ma_size-1
        ep0 = mp.ma_table
        i = (~_hash) & mask
        ep = ep0[i]
        if ep[me_key] is NULL or ep[me_key] == key:
            return ep
        if ep[me_key] == dummy:
            freeslot = ep
        else:
            if (ep[me_hash] == _hash and cmp(ep[me_key], key) == 0) :
                return ep
            freeslot = NULL
        ###### FROM HERE
        if mp.oldalg:
            incr = (_hash ^ (_hash >> 3)) & mask
        else:
            # note that we do not mask!
            # the shifting is worth it in the incremental case.

            ## added after posting to python-dev:
            uhash = _hash & 0xffffffffl
            if mp.warmup:
                incr = uhash
                mask2 = 0xffffffffl ^ mask
                while mask2 > mask:
                    if (incr & 1):
                        incr = incr ^ mp.ma_poly
                    incr = incr >> 1
                    mask2 = mask2>>1
                # this loop *can* be sped up by tables
                # with precomputed multiple shifts.
                # But I'm not sure if it is worth it at all.
            else:
                incr = uhash ^ (uhash >> 3)
        ###### TO HERE
        if (not incr):
            incr = mask
        triplen = 0
        while 1:
            mp.trips = mp.trips+1
            triplen = triplen+1
            if triplen > mp.tripmax:
                mp.tripmax = triplen
            ep = ep0[int((i+incr)&mask)]
            if (ep[me_key] is NULL) :
                if (freeslot is not NULL) :
                    return freeslot
                else:
                    return ep
            if (ep[me_key] == dummy) :
                if (freeslot == NULL):
                    freeslot = ep
            elif (ep[me_key] == key or (ep[me_hash] == _hash and cmp(ep[me_key], key) == 0)) :
                return ep
            # Cycle through GF(2^n)-{0}
            ###### FROM HERE
            if mp.oldalg:
                incr = incr << 1
                if (incr > mask):
                    incr = incr ^ mp.ma_poly
            elif mp.rotleft:
                if incr & 0x80000000L:
                    incr = (incr << 1) | 1
                else:
                    incr = incr << 1
                if (incr > mask):
                    incr = incr ^ mp.ma_poly
            else:
                # new algorithm: do a division
                if (incr & 1):
                    incr = incr ^ mp.ma_poly
                incr = incr >> 1
            ###### TO HERE

    def insertdict(mp, key, _hash, value):
        me_hash, me_key, me_value = range(3) # rec slots
        ep = mp.lookdict(key, _hash)
        if (ep[me_value] is not NULL) :
            old_value = ep[me_value]
            ep[me_value] = value
        else :
            if (ep[me_key] is NULL):
                mp.ma_fill=mp.ma_fill+1
            ep[me_key] = key
            ep[me_hash] = _hash
            ep[me_value] = value
            mp.ma_used = mp.ma_used+1

    def dictresize(mp, minused):
        me_hash, me_key, me_value = range(3) # rec slots
        oldsize = mp.ma_size
        oldtable = mp.ma_table
        MINSIZE = 4
        newsize = MINSIZE
        for i in range(len(polys)):
            if (newsize > minused) :
                newpoly = polys[i]
                break
            newsize = newsize << 1
        else:
            return -1
        _nullentry = range(3)
        _nullentry[me_hash] = 0
        _nullentry[me_key] = NULL
        _nullentry[me_value] = NULL
        newtable = map(lambda x,y=_nullentry:y[:], range(newsize))
        mp.ma_size = newsize
        mp.ma_poly = newpoly
        mp.ma_table = newtable
        mp.ma_fill = 0
        mp.ma_used = 0
        for ep in oldtable:
            if (ep[me_value] is not NULL):
                mp.insertdict(ep[me_key],ep[me_hash],ep[me_value])
        return 0

    # PyDict_GetItem
    def __getitem__(op, key):
        me_hash, me_key, me_value = range(3) # rec slots
        if not op.ma_table:
            raise KeyError, key
        _hash = hash(key)
        return op.lookdict(key, _hash)[me_value]

    # PyDict_SetItem
    def __setitem__(op, key, value):
        mp = op
        _hash = hash(key)
        ## /* if fill >= 2/3 size, double in size */
        if (mp.ma_fill*3 >= mp.ma_size*2) :
            if (mp.dictresize(mp.ma_used*2) != 0):
                if (mp.ma_fill+1 > mp.ma_size):
                    raise MemoryError
        mp.insertdict(key, _hash, value)

    # more interface functions
    def keys(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append( _key)
        return res

    def values(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append( _value)
        return res

    def items(self):
        me_hash, me_key, me_value = range(3) # rec slots
        res = []
        for _hash, _key, _value in self.ma_table:
            if _value is not NULL:
                res.append( (_key, _value) )
        return res

    def __cmp__(self, other):
        mine = self.items()
        others = other.items()
        mine.sort()
        others.sort()
        return cmp(mine, others)

######################################################
## tests

def test(lis, dic):
    for key in lis: dic[key]

def nulltest(lis, dic):
    for key in lis: dic

def string_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    d4 = Dictionary(3)  # rotleft
    for i in range(n):
        s = str(i) #* 5
        #s = chr(i%256) + chr(i>>8)##
        d1[s] = d2[s] = d3[s] = d4[s] = i
    return d1, d2, d3, d4

def istring_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    d4 = Dictionary(3)  # rotleft
    for i in range(n):
        s = chr(i%256) + chr(i>>8)
        d1[s] = d2[s] = d3[s] = d4[s] = i
    return d1, d2, d3, d4

def random_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    d4 = Dictionary(3)  # rotleft
    from whrandom import randint
    import sys
    keys = []
    for i in range(n):
        keys.append(randint(0, sys.maxint-1))
    for i in keys:
        d1[i] = d2[i] = d3[i] = d4[i] = i
    return d1, d2, d3, d4

def badnum_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    d4 = Dictionary(3)  # rotleft
    shift = 10
    if EXTREME:
        shift = 16
    for i in range(n):
        bad = i << 16
        d2[bad] = d3[bad] = d4[bad] = i
        if n <= 1000: d1[bad] = i
    return d1, d2, d3, d4

def names_dicts(n):
    d1 = Dictionary()   # original
    d2 = Dictionary(1)  # other rehash
    d3 = Dictionary(2)  # with warmup
    d4 = Dictionary(3)  # rotleft
    import win32con
    keys = win32con.__dict__.keys()
    if len(keys) < n:
        keys = []
    for s in keys[:n]:
        d1[s] = d2[s] = d3[s] = d4[s] = s
    return d1, d2, d3, d4

def do_test(dict):
    keys = dict.keys()
    dict.getTrips() # reset
    test(keys, dict)
    return "%d/%d" % dict.getTrips()

EXTREME=1

if __name__ == "__main__":

    for N in (1000,2000,3000,4000):

        sdold, sdnew, sdnew2, sdrot = string_dicts(N)
        #idold, idnew, idnew2, idrot = istring_dicts(N)
        bdold, bdnew, bdnew2, bdrot = badnum_dicts(N)
        rdold, rdnew, rdnew2, rdrot = random_dicts(N)
        ndold, ndnew, ndnew2, ndrot = names_dicts(N)
        fmt = "old=%s new=%s new2=%s rot=%s"
        print "N=%d" %N
        print ("trips for strings "+fmt) % tuple(
            map(do_test, (sdold, sdnew, sdnew2, sdrot)) )
        #print ("trips for bin strings "+fmt) % tuple(
        #    map(do_test, (idold, idnew, idnew2, idrot)) )
        print ("trips for bad integers "+fmt) % tuple(
            map(do_test, (bdold, bdnew, bdnew2, bdrot)))
        print ("trips for random integers "+fmt) % tuple(
            map(do_test, (rdold, rdnew, rdnew2, rdrot)))
        print ("trips for windows names "+fmt) % tuple(
            map(do_test, (ndold, ndnew, ndnew2, ndrot)))

"""
Results with a shift of 10 (EXTREME=0):

D:\crml_doc\platf\py>python dictest.py
timing for strings old=5.097 new=5.088
timing for bad integers old=101.540 new=12.610

Results with a shift of 16 (EXTREME=1):

D:\crml_doc\platf\py>python dictest.py
timing for strings old=5.218 new=5.147
timing for bad integers old=571.210 new=19.220
"""
Christian Tismer wrote: ... When talking about left rotation, an error crept in. Sorry!
We would then have:
    incr = ROTATE_LEFT(incr, 1)
    if (incr > mask):
        incr = incr ^ mp.ma_poly
If incr contains the high bits of the hash, then the above must be replaced by

    incr = ROTATE_LEFT(incr, 1)
    if (incr & (mask+1)):
        incr = incr ^ mp.ma_poly

or the multiplicative group is not guaranteed to be generated, obviously.

This doesn't change my results; rotating right is still my choice.

ciao - chris

D:\crml_doc\platf\py>python dictest.py
N=1000
trips for strings old=293/9 new=302/7 new2=221/7 rot=272/8
trips for bad integers old=499500/999 new=13187/31 new2=999/1 rot=16982/27
trips for random integers old=339/9 new=337/7 new2=343/10 rot=342/8
trips for windows names old=230/5 new=207/7 new2=200/5 rot=225/6
N=2000
trips for strings old=1093/11 new=1109/10 new2=786/6 rot=1090/9
trips for bad integers old=0/0 new=26455/32 new2=1999/1 rot=33985/31
trips for random integers old=747/10 new=733/7 new2=734/7 rot=728/8
trips for windows names old=503/8 new=542/9 new2=564/6 rot=521/11
N=3000
trips for strings old=810/5 new=839/6 new2=609/5 rot=820/6
trips for bad integers old=0/0 new=38681/36 new2=2999/1 rot=50985/26
trips for random integers old=709/4 new=728/5 new2=767/5 rot=711/6
trips for windows names old=712/6 new=711/5 new2=691/5 rot=727/7
N=4000
trips for strings old=1850/9 new=1843/8 new2=1375/11 rot=1861/9
trips for bad integers old=0/0 new=52994/39 new2=3999/1 rot=67986/26
trips for random integers old=1584/9 new=1606/8 new2=1505/9 rot=1579/8
trips for windows names old=1449/8 new=1434/8 new2=1457/11 rot=1476/7

-- Christian Tismer :^) <mailto:tismer@tismer.com> Mission Impossible 5oftware : Have a break! Take a ride on Python's Kaunstr. 26 : *Starship* http://starship.python.net 14163 Berlin : PGP key -> http://wwwkeys.pgp.net PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF where do you want to jump today? http://www.stackless.com
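As an aside, since C has no rotate operator (x86 does), the 32-bit left rotation being discussed is easy to model for experiments; this helper is purely illustrative, not part of any patch:

    def rotate_left_32(x, r=1):
        # Rotate a 32-bit value left by r bits; the high bit re-enters at bit 0.
        x &= 0xFFFFFFFF
        return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF

    assert rotate_left_32(0x80000001) == 0x00000003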
[Christian Tismer]
Are you saying I should check the thing in? Really?
Of course. The first thing you talked about showed a major improvement in some bad cases, did no harm in the others, and both results were more than just plausible -- they made compelling sense and were backed by simulation. So why not check it in? It's a clear net win! Stuff since then has been a spattering of maybe-good maybe-bad maybe-neutral ideas that hasn't gotten anywhere conclusive. What I want to avoid is another "Unicode compression" scenario, where we avoid grabbing a clear win for months just because it may not be the best possible of all conceivable compression schemes -- and then mistakes get made in a last-second rush to get *any* improvement. Checking in a clear improvement today does not preclude checking in a better one next week <wink>.
... Ok, we have to stick with the given polynomials to stay compatible,
Na, feel free to explore that too, if you like. It really should get some study! The polys there now are utterly arbitrary: of all polys that happen to be irreducible and that have x as a primitive root in the induced multiplicative group, these are simply the smallest when viewed as binary integers. That's because they were *found* by trying all odd binary ints with odd parity (even ints and ints with even parity necessarily correspond to reducible polys), starting with 2**N+3 and going up until finding the first one that was both irreducible and had x as a primitive root. There's no theory at all that I know of to say that any such poly is any better for this purpose than any other. And they weren't tested for that either -- they're simply the first ones "that worked at all" in a brute search. Besides, Python's "better than random" dict behavior-- when it obtains! --is almost entirely due to that its hash functions produce distinct starting indices more often than a random hash function would. The contribution of the GF-based probe sequence in case of collision is to avoid the terrible behavior most other forms of probe sequence would cause given that Python's hash functions also tend to fill solid contiguous slices of the table more often than would a random hash function. [stuff about rotation]
... Too bad that this isn't supported in C. It is a native machine instruction on X86 machines.
Guido long ago rejected hash functions based on rotation for this reason; he's not likely to approve of rotation in the probe sequence either <wink>. A similar frustration is that almost all modern CPUs have a fast instruction to get at the high 32 bits of a 32x32->64 bit multiply: another way to get the high bits of the hash code into play is to multiply the 32-bit hash code by a 32-bit constant (see Knuth for "Fibonacci hashing" details), and take the least-significant N bits of the *upper* 32 bits of the 64-bit product as the initial table index. If the constant is chosen correctly, this defines a permutation on the space of 32-bit unsigned ints, and can be very effective at "scrambling" arithmetic progressions (which Python's hash functions often produce). But C doesn't give a decent way to get at that either.
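For what it's worth, a sketch of the multiplicative idea Tim describes (this is not anything in dictobject.c; the constant is Knuth's suggestion, roughly 2**32 divided by the golden ratio):

    KNUTH = 2654435769                    # ~ 2**32 / golden ratio, an odd constant

    def fib_index(h, n_bits):
        h &= 0xFFFFFFFF                   # treat the hash as an unsigned 32-bit int
        product = h * KNUTH               # the 32x32 -> 64-bit multiply
        return (product >> 32) & ((1 << n_bits) - 1)   # low n_bits of the high word

    # A coarse arithmetic progression like the "bad integers" (i << 16) gets spread out:
    print([fib_index(i << 16, 10) for i in range(8)])

Because the index is taken from the high word of the product, a large stride ends up exercising bits that a plain mask would throw away.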
... On the current "faster than random" cases, I assume that high bits in the hash are less likely than low bits,
I'm not sure what this means. As the comment in dictobject.c says, it's common for Python's hash functions to return a result with lots of leading zeroes. But the lookup currently applies ~ to those first (which is a bad idea -- see earlier msgs), so the actual hash that gets *used* often has lots of leading ones.
so it is more likely that an entry finds its good place in the dict, before bits are rotated in. Hence the "good" cases would be kept.
I can agree with this easily if I read the above as asserting that in the very good cases today, the low bits of hashes (whether or not ~ is applied) vary more than the high bits.
... Random integers seem to withstand any of these procedures.
If you wanted to, you could *define* random this way <wink>.
... I'd say let's do the patch -- ciao - chris
full-circle-ly y'rs - tim
[/F, on Christian's GF tutorial]
I think someone just won the "brain exploder 2000" award ;-)
Well, anyone can play. When keys collide, what we need is a function f(i) such that repeating

    i = f(i)

visits every int in (0, 2**N) exactly once before setting i back to its initial value, for a fixed N and where the first i is in (0, 2**N). This is the quickest:

    def f(i):
        i -= 1
        if i == 0:
            i = 2**N-1
        return i

Unfortunately, this leads to performance-destroying "primary collisions" (see Knuth, or any other text w/ a section on hashing). Other *good* possibilities include a pseudo-random number generator of maximal period, or viewing the ints in (0, 2**N) as bit vectors indicating set membership and generating all subsets of an N-element set in Gray code order. The *form* of the function dictobject.c actually uses is:

    def f(i):
        i <<= 1
        if i >= 2**N:
            i ^= MAGIC_CONSTANT_DEPENDING_ON_N
        return i

which is suitably non-linear and as fast as the naive method.

Given the form of the function, you don't need any theory at all to find a value for MAGIC_CONSTANT_DEPENDING_ON_N that simply works. In fact, I verified all the magic values in dictobject.c via brute force, because the person who contributed the original code botched the theory slightly and gave us some values that didn't work. I'll rely on the theory if and only if we have to extend this to 64-bit machines someday: I'm too old to wait for a brute search of a space with 2**64 elements <wink>.

mathematics-is-a-battle-against-mortality-ly y'rs - tim
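To see the full-period property concretely, here is a tiny check (my own harness, not from the thread) for N=5, using the degree-5 entry from the polys table in Christian's dictest.py above:

    N = 5
    POLY = 32 + 5          # x**5 + x**2 + 1, the degree-5 entry from the polys table

    def f_naive(i):
        # The trivial full-period function: count down, wrapping 1 -> 2**N - 1.
        return i - 1 if i > 1 else 2**N - 1

    def f_gf(i):
        # The form dictobject.c uses: shift left, fold back with the magic constant.
        i <<= 1
        if i >= 2**N:
            i ^= POLY
        return i

    def cycle_length(f, start=1):
        # Count distinct values seen before the sequence returns to its start.
        seen, i = set(), start
        while i not in seen:
            seen.add(i)
            i = f(i)
        return len(seen)

    print(cycle_length(f_naive), cycle_length(f_gf))   # 31 31: both visit all of 1..2**N-1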
[Guido]
... I've implemented this, except I use a static variable for the finger instead of a per-dict finger. I'm concerned about adding 4-8 extra bytes to each dict object for a feature that most dictionaries never need.
It's a bit ironic that dicts are guaranteed to be at least 1/3 wasted space <wink>. Let's pick on Christian's idea to reclaim a few bytes of that.
So, instead, I use a single shared finger. This works just as well as long as it is used for a single dictionary. For multiple dictionaries (either used by the same thread or in different threads), it'll work almost as well, although it's possible to make up a pathological example that would work quadratically.
An easy example of such a pathological example is to call popitem() for two identical dictionaries in lock step.
Please see my later comments attached to the patch: http://sourceforge.net/patch/?func=detailpatch&patch_id=102733&group_id=5470 In short, for me (truly) identical dicts perform well with or without my suggestion, while dicts cloned via dict.copy() perform horribly with or without my suggestion (their internal structures differ); still curious as to whether that's also true for you (am I looking at a Windows bug? I don't see how, but it's possible ...). In any case, my suggestion turned out to be worthless on my box. Playing around via simulations suggests that a shared finger is going to be disastrous when consuming more than one dict unless they have identical internal structure (not just compare equal). As soon as they get a little out of synch, it just gets worse with each succeeding probe.
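For concreteness, the lock-step consumption pattern at issue looks roughly like this (a sketch against today's dict API, not the 2.0 patch; as discussed above, whether it actually degenerates depends on whether the two dicts end up with the same internal structure):

    n = 100000
    d1 = dict((i, i) for i in range(n))
    d2 = dict(d1)            # equal items, but possibly a different internal layout
    while d1:
        d1.popitem()         # with one finger shared by all dicts, each call may have
        d2.popitem()         # to rescan slots that the other dict's call just skipped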
Comments please! We could:
- Live with the pathological cases.
How boring <wink>.
- Forget the whole thing; and then also forget about firstkey() etc. which has the same problem only worse.
I don't know that this is an important idea for dicts in general (it is important for sets) -- it's akin to an xrange for dicts. But then I've had more than one real-life program that built giant dicts then ran out of memory trying to iterate over them! I'd like to fix that.
- Fix the algorithm. Maybe jumping criss-cross through the hash table like lookdict does would improve that; but I don't understand the math used for that ("Cycle through GF(2^n)-{0}" ???).
Christian explained that well (thanks!). However, I still don't see any point to doing that business in .popitem(): when inserting keys, the jitterbug probe sequence has the crucial benefit of preventing primary clustering when keys collide. But when we consume a dict, we just want to visit every slot as quickly as possible. [Christian]
Appendix, on the use of finger: -------------------------------
Instead of using a global finger variable, you can do the following (involving a cast from object to int) :
- if the 0'th slot of the dict is non-empty: return this element and insert the dummy element as key. Set the value field to what the Dictionary Algorithm (DA) would give for the removed object's hash. This is the next finger.
- else: treat the value field of the 0'th slot as the last finger. If it is zero, initialize it with 2^n-1. Repetitively use the DA until you find an entry. Save the finger in slot 0 again.
This doesn't cost an extra slot, and even when the dictionary is written between removals, the chance to lose the finger is just 1:(2^n-1) on every insertion.
I like that, except:

1) As above, I don't believe the GF business buys anything over a straightforward search when consuming a dict.

2) Overloading the value field bristles with problems, in part because it breaks the invariant that a slot is unused if and only if the value field is NULL, in part because C doesn't guarantee that you can get away with casting an arbitrary int to a pointer and back again.

None of the problems in #2 arise if we abuse the me_hash field instead, so the attached does that. Here's a typical run of Guido's test case using this (on an 866MHz machine w/ 256Mb RAM -- the early values jump all over the place from run to run):

run = 0
log2size = 10 size = 1024
    7.4 usec per item to build (total 0.008 sec)
    3.4 usec per item to destroy twins (total 0.003 sec)
log2size = 11 size = 2048
    6.7 usec per item to build (total 0.014 sec)
    3.4 usec per item to destroy twins (total 0.007 sec)
log2size = 12 size = 4096
    7.0 usec per item to build (total 0.029 sec)
    3.7 usec per item to destroy twins (total 0.015 sec)
log2size = 13 size = 8192
    7.1 usec per item to build (total 0.058 sec)
    5.9 usec per item to destroy twins (total 0.048 sec)
log2size = 14 size = 16384
    14.7 usec per item to build (total 0.241 sec)
    6.4 usec per item to destroy twins (total 0.105 sec)
log2size = 15 size = 32768
    12.2 usec per item to build (total 0.401 sec)
    3.9 usec per item to destroy twins (total 0.128 sec)
log2size = 16 size = 65536
    7.8 usec per item to build (total 0.509 sec)
    4.0 usec per item to destroy twins (total 0.265 sec)
log2size = 17 size = 131072
    7.9 usec per item to build (total 1.031 sec)
    4.1 usec per item to destroy twins (total 0.543 sec)

The last one is over 100 usec per item using the original patch (with or without my first suggestion).

if-i-were-a-betting-man-i'd-say-"bingo"-ly y'rs - tim

Drop-in replacement for the popitem in the patch:

static PyObject *
dict_popitem(dictobject *mp, PyObject *args)
{
    int i = 0;
    dictentry *ep;
    PyObject *res;

    if (!PyArg_NoArgs(args))
        return NULL;
    if (mp->ma_used == 0) {
        PyErr_SetString(PyExc_KeyError,
                        "popitem(): dictionary is empty");
        return NULL;
    }
    /* Set ep to "the first" dict entry with a value.  We abuse the hash
     * field of slot 0 to hold a search finger:
     * If slot 0 has a value, use slot 0.
     * Else slot 0 is being used to hold a search finger,
     * and we use its hash value as the first index to look.
     */
    ep = &mp->ma_table[0];
    if (ep->me_value == NULL) {
        i = (int)ep->me_hash;
        /* The hash field may be uninitialized trash, or it
         * may be a real hash value, or it may be a legit
         * search finger, or it may be a once-legit search
         * finger that's out of bounds now because it
         * wrapped around or the table shrunk -- simply
         * make sure it's in bounds now.
         */
        if (i >= mp->ma_size || i < 1)
            i = 1;  /* skip slot 0 */
        while ((ep = &mp->ma_table[i])->me_value == NULL) {
            i++;
            if (i >= mp->ma_size)
                i = 1;
        }
    }
    res = PyTuple_New(2);
    if (res != NULL) {
        PyTuple_SET_ITEM(res, 0, ep->me_key);
        PyTuple_SET_ITEM(res, 1, ep->me_value);
        Py_INCREF(dummy);
        ep->me_key = dummy;
        ep->me_value = NULL;
        mp->ma_used--;
    }
    assert(mp->ma_table[0].me_value == NULL);
    mp->ma_table[0].me_hash = i + 1;  /* next place to start */
    return res;
}
participants (16)
- barry@digicool.com
- Christian Tismer
- Fred L. Drake, Jr.
- Fredrik Lundh
- Fredrik Lundh
- Greg Stein
- Greg Wilson
- Guido van Rossum
- M.-A. Lemburg
- Michael Hudson
- Moshe Zadka
- Moshe Zadka
- pf@artcom-gmbh.de
- Thomas Wouters
- Tim Peters
- Tim Peters