Reducing collisions in small dicts/sets

Short course: the average number of probes needed when searching small dicts/sets can be reduced, in both successful ("found") and failing ("not found") cases. But I'm not going to pursue this. This is a brain dump for someone who's willing to endure the interminable pain of arguing about benchmarks ;-)

Background: http://bugs.python.org/issue30671 raised some questions about how dict collisions are handled. While the analysis there didn't make sense to me, I wrote enough code to dig into it.

As detailed in that bug report, the current implementation appeared to meet the theoretical performance of "uniform hashing", meaning there was no room left for improvement. However, that missed something: the simple expressions for expected probes under uniform hashing are upper bounds, and while they're excellent approximations for modest load factors in sizable tables, for small tables they're significantly overstated.

For example, for a table with 5 items in 8 slots, the load factor is

    a = 5/8 = 0.625

and

    avg probes when found     = log(1/(1-a))/a = 1.57
               when not found = 1/(1-a)        = 2.67

However, exact analysis gives 1.34 and 2.25 instead. The current code achieves the upper bounds, but not the exact values. As a sanity check, a painfully slow implementation of uniform hashing does achieve the exact values.

Code for all this is attached, in a framework that allows you to easily plug in any probe sequence strategy. The current strategy is implemented by generator "current". There are also implementations of "linear" probing, "quadratic" probing, "pre28201" probing (the current strategy before bug 28201 repaired an error in its coding), "uniform" probing, and ... "double". The last is one form of "double hashing" that gets very close to "uniform". Its loop guts are significantly cheaper than the current scheme, just 1 addition and 1 mask. However, it requires a non-trivial modulus to get started, and that's expensive. Is there a cheaper way to get close to "uniform"? I don't know - this was just the best I came up with so far.

Does it matter? See above ;-) If you want to pursue this, take these as given:

1. The first probe must be simply the last `nbits` bits of the hash code. The speed of the first probe is supremely important, that's the fastest possible first probe, and it guarantees no collisions at all for a dict indexed by a contiguous range of integers (an important use case).

2. The probe sequence must (at least eventually) depend on every bit in the hash code. Else it's waaay too easy to stumble into quadratic-time behavior for "bad" sets of keys, even by accident.

Have fun :-)
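As a rough sketch (mine, not the attached code) of what a prober generator in that framework looks like, here is the `current` scheme as the rest of this thread describes it - first probe from the low `nbits` bits, then repeatedly shift 5 fresh bits out of `perturb` and compute 5*i + perturb + 1 - together with a quick check of the upper-bound formulas above:

    from math import log

    def current(h, nbits):
        # Sketch only: follows the recurrence discussed in this thread,
        # not necessarily the exact C code in dictobject.c.
        mask = (1 << nbits) - 1
        i = h & mask
        yield i                  # given #1: first probe is the low `nbits` bits
        perturb = h              # given #2: every hash bit eventually feeds in
        while True:
            perturb >>= 5
            i = (5 * i + perturb + 1) & mask
            yield i

    # Upper-bound formulas for 5 items in 8 slots (a = 0.625):
    a = 5 / 8
    print(log(1 / (1 - a)) / a)  # ~1.57 probes, successful ("found") search
    print(1 / (1 - a))           # ~2.67 probes, failing ("not found") search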

Two bits of new info: first, it's possible to get the performance of "double" without division, at least via this way:

"""
# Double hashing using Fibonacci multiplication for the increment.  This
# does about as well as `double`, but doesn't require division.
#
# The multiplicative constant depends on the word size W, and is the
# nearest odd integer to 2**W/((1 + sqrt(5))/2).  So for a 64-bit box:
#
# >>> 2**64 / ((1 + decimal.getcontext().sqrt(5))/2)
# Decimal('11400714819323198485.95161059')
#
# For a 32-bit box, it's 2654435769.
#
# The result is the high-order `nbits` bits of the low-order W bits of
# the product.  In C, the "& M" part isn't needed (unsigned * in C
# returns only the low-order bits to begin with).
#
# Annoyance:  I don't think Python dicts store `nbits`, just 2**nbits.

def dfib(h, nbits, M=M64):
    mask = (1 << nbits) - 1
    i = h & mask
    yield i
    inc = (((h * 11400714819323198485) & M) >> (64 - nbits)) | 1
    while True:
        i = (i + inc) & mask
        yield i
"""

Second, the program I posted uses string objects as test cases. The current string hash acts like a good-quality pseudo-random number generator: change a character in the string, and the hash typically changes "a lot".

It's also important to look at hashes with common patterns, because, e.g., all "sufficiently small" integers are their _own_ hash codes (e.g., hash(3) == 3). Indeed, long ago Python changed its dict implementation to avoid quadratic-time behavior for unfortunate sets of real-life integer keys.

Example: change `objgen()` to `yield i << 12` instead of `yield str(i)`. Then we generate integers all of whose final 12 bits are zeroes. For all sufficiently small tables, then, these ints _all_ map to the same initial table slot (since the initial probe is taken from the last `nbits` bits). The collision resolution scheme is needed to prevent disaster.

Here's a chunk of output from that, for dicts of size 2,730:

"""
bits 12 nslots 4,096 dlen 2,730 alpha 0.67 # built 37
theoretical avg probes for uniform hashing when found 1.65 not found 3.00
crisp when found 1.65 not found 3.00
prober linear
    found min 1:0.04% max 2730 mean 1365.50
    fail min 2731:100.00% max 2731 mean 2731.00
prober quadratic
    found min 1:0.04% max 2730 mean 1365.50
    fail min 2731:100.00% max 2731 mean 2731.00
prober pre28201
    found min 1:0.04% max 61 mean 6.17
    fail min 5:28.94% max 68 mean 9.78
prober current
    found min 1:0.04% max 58 mean 5.17
    fail min 4:29.30% max 70 mean 8.62
prober double
    found min 1:0.04% max 5 mean 2.73
    fail min 2:41.47% max 9 mean 3.94
prober dfib
    found min 1:0.04% max 9 mean 2.53
    fail min 2:10.87% max 17 mean 4.48
prober uniform
    found min 1:66.52% max 21 mean 1.65
    fail min 1:33.41% max 30 mean 2.99
"""

It's a worst-case disaster for linear and quadratic probing. The `current` scheme does fine, but `double` and `dfib` do significantly better on all measures shown. `uniform` is oblivious, still achieving its theoretical average-case performance.

That last isn't surprising "in theory", since it theoretically picks a probe sequence uniformly at random from the set of all table slot permutations. It's perhaps a bit surprising that this approximate _implementation_ also achieves it. But `random.seed(h)` does a relatively enormous amount of work to spray the bits of `h` all over the Twister's internal state, and then shuffle them around.

In any case, the "double hashing" variants ("double" and "dfib") look very promising, doing better than the current scheme for both small tables and pathological key sets.
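For comparison with `dfib`, the division-based `double` prober can be sketched in the same generator style (again my sketch, not the attached code); the increment form shown, `(h % mask) | 1`, is the one quoted later in this thread, and the modulus is exactly the startup cost `dfib` avoids:

    def double(h, nbits):
        # Sketch of the `double` prober, assuming the increment quoted
        # later in the thread: a non-trivial modulus up front, forced odd
        # so the probe sequence eventually visits every slot.
        mask = (1 << nbits) - 1
        i = h & mask
        yield i
        inc = (h % mask) | 1
        while True:
            i = (i + inc) & mask
            yield i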

Some explanations and cautions. An advantage of sticking with pure Python instead of C is that spare moments are devoted to investigating the problem instead of thrashing with micro-optimizations ;-)

Why does the current scheme suffer for small tables? With hindsight it's pretty obvious: it can visit table slots more than once (the other schemes cannot), and while the odds of that happening are tiny in large tables, they're high for small tables.

For example, in a 3-bit table (8 slots), suppose the initial index is 1, and there's a collision. The 5*i+1 recurrence _would_ visit slot 6 next, but we also shift in 5 new bits and add them. If those bits are "random", there's a 1-in-8 chance that we'll visit index 1 _again_. If we don't, and have another collision, shifting in the next 5 bits has a 2-in-8 chance of repeating one of the two slots we already visited. And so on.

Here using the `str(i)` generator (which yields random-ish hash codes):

"""
bits 3 nslots 8 dlen 5 alpha 0.62 # built 20,000
...
prober current
    found min 1:74.80% max 15 mean 1.42
    fail min 1:37.62% max 18 mean 2.66
"""

So even though only 5 slots are filled, in at least one case it took 15 probes to find an existing key, and 18 probes to realize a key was missing. In the other schemes, it takes at most 5 probes for success and 6 probes for failure.

This is worse for 64-bit hash codes than for 32-bit ones, because we can loop around twice as often before `perturb` permanently becomes 0 (at which point the pure 5*i+1 recurrence visits each slot at most once).

The larger the table, the less likely a repeat due to `perturb` becomes. For example, suppose we have an 8-bit table and again visit index 1 first. We may visit any index in range(6, 6+32) next (depending on the 5 fresh bits shifted in), but repeating 1 is _not_ a possibility.

Why do the double hashing methods do better for the `i << 12` generator? In any case where the hash codes have "a lot" of trailing bits in common, they all map to the same table index at first, and the probe sequence remains the same for all until the loop shifts `perturb` far enough right that the rightmost differing bits finally show up in the addition. In the double hashing methods, _all_ the bits of the hash code affect the value of `inc` computed before the loop starts, so they have a good chance of differing already on the second probe. This is again potentially worse for the current scheme with 64-bit hash codes, since the sheer number of common trailing bits _can_ be up to 63.

However, the double-hashing schemes have pathological cases too, cases that the current scheme avoids. The first I tried was a `yield i * 1023` generator. These are spectacularly _good_ values for all schemes except `uniform` for successful searches, because

    i*1023 = j*1023 (mod 2**k)  implies  i = j (mod 2**k)

That is, there are no first-probe collisions at all across a contiguous range of `i` with no more than 2**k values.

But the `double` scheme for a 10-bit table degenerates to linear probing in this case, because

    inc = (h % mask) | 1    # force it odd

is always 1 when h is divisible by 1023 (== mask for a 10-bit table). This is terrible for a failing search; e.g., for a 20-bit table (note that 2**20-1 is divisible by 1023):

"""
bits 20 nslots 1,048,576 dlen 699,050 alpha 0.67 # built 1
theoretical avg probes for uniform hashing when found 1.65 not found 3.00
...
prober current
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 34 mean 3.04
prober double
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 699049 mean 1867.51
prober dfib
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 427625 mean 8.09
prober uniform
    found min 1:66.65% max 24 mean 1.65
    fail min 1:33.35% max 35 mean 3.00
"""

While that's a failing-search disaster for `double`, it's also bad for `dfib` (& I don't have a slick explanation for that).

So where does that leave us? I don't know. All schemes have good & bad points ;-) I haven't yet thought of a cheap way to compute an `inc` for double-hashing that isn't vulnerable to bad behavior for _some_ easily constructed set of int keys. If you forget "cheap", it's easy; e.g.,

    random.seed(h)
    inc = random.choice(range(1, mask + 1, 2))

Regardless, I'll attach the current version of the code.
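A quick way to see the degenerate increment (my snippet, not part of the attached program): for a 10-bit table, mask is 1023, so every hash that's a multiple of 1023 produces inc == 1, i.e. stride-1 (linear) probing; and since 2**20 - 1 is divisible by 1023, the 20-bit table keeps feeding multiples of 1023 into that `| 1` as well:

    nbits = 10
    mask = (1 << nbits) - 1                       # 1023
    incs = {((i * 1023) % mask) | 1 for i in range(1, 10_000)}
    print(incs)                                   # {1}: every such key probes with stride 1

    print((2 ** 20 - 1) % 1023)                   # 0: so at 20 bits, h % mask is
                                                  # still always a multiple of 1023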

[MRAB <python@mrabarnett.plus.com>]
If the current scheme suffers only for small tables, couldn't you use an alternative scheme only for small tables?
Sure. But whether that's desirable partly depends on timing actual C code. Try it ;-) For maintenance sanity, it's obviously better to have only one scheme to wrestle with.

Note that "may visit a slot more than once" isn't the only thing in play, just one of the seemingly important things. For example, the current scheme requires 3 adds, 2 shifts, and a mask on each loop iteration (or 1 add and 1 shift of those could be replaced by 1 multiplication). The double-hashing schemes only require 1 add and 1 mask per iteration. In cases of collision, that difference is probably swamped by waiting for cache misses.

But, as I said in the first msg:

"""
This is a brain dump for someone who's willing to endure the
interminable pain of arguing about benchmarks ;-)
"""
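To make those operation counts concrete, here are the two inner-loop updates written out as tiny Python helpers (an illustration only - the names are stand-ins, and the real comparison is at the C level):

    def current_step(i, perturb, mask):
        # 3 adds, 2 shifts, 1 mask (5*i spelled as (i << 2) + i)
        perturb >>= 5
        i = ((i << 2) + i + perturb + 1) & mask
        return i, perturb

    def double_step(i, inc, mask):
        # 1 add, 1 mask
        return (i + inc) & mask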

[Tim]
Heh. I always tell people not to believe what they think: run code to make sure! So I did ;-)

    def duni(h, nbits):
        from random import seed, choice
        mask = (1 << nbits) - 1
        i = h & mask
        yield i
        seed(h)
        inc = choice(range(1, mask + 1, 2))
        while True:
            i = (i + inc) & mask
            yield i

On the terrible-for-good-reasons-for-failing-`double`-1023*i-searches case, it turned out `duni` did just as bad as `dfib`:

"""
bits 20 nslots 1,048,576 dlen 699,050 alpha 0.67 # built 1
theoretical avg probes for uniform hashing when found 1.65 not found 3.00
...
crisp ... skipping (slow!)
prober current
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 34 mean 3.04
prober double
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 699049 mean 1867.51
prober dfib
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 427625 mean 8.09
prober duni
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 433846 mean 9.84
prober uniform
    found min 1:66.65% max 24 mean 1.65
    fail min 1:33.35% max 35 mean 3.00
"""

Yikes! That really surprised me. So - are integers of the form 1023*i so magical they also manage to provoke random.seed() into amazingly regular behavior? Or is there something about the whole idea of "double hashing" that leaves it vulnerable no matter how "random" an odd increment it picks?

It's not the former. More code showed that a great number of distinct `inc` values are in fact used. And double hashing is widely & successfully used in other contexts, so it's not a problem with the idea _in general_.

What does appear to be the case: it doesn't always play well with taking the last `nbits` bits as the initial table index. In other double-hashing contexts, they strive to pick a pseudo-random initial table index too.

Why this matters: for any odd integer N,

    N*i = N*j (mod 2**k)  if and only if  i = j (mod 2**k)

So, when integers are their own hash codes, for any `i` all the values in range(i*N, i*N + N*2**k, N) hash to unique slots in a table with 2**k slots (actually any table with 2**m slots for any m >= k). In particular, mod 2**k they map to the slots

    i*N + 0*N
    i*N + 1*N
    i*N + 2*N
    i*N + 3*N
    ...

So, on a failing search for an integer of the form j*N, whenever double-hashing picks an increment that happens to be a multiple of N, it can bump into a huge chain of collisions, because double-hashing also uses a fixed increment between probes. And this is so for any odd N. For example, using a "yield i * 11" generator:

"""
bits 20 nslots 1,048,576 dlen 699,050 alpha 0.67 # built 1
theoretical avg probes for uniform hashing when found 1.65 not found 3.00
...
prober current
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 34 mean 3.00
prober double
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 667274 mean 34.10
prober dfib
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 539865 mean 9.30
prober duni
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 418562 mean 8.65
prober uniform
    found min 1:66.63% max 25 mean 1.65
    fail min 1:33.31% max 32 mean 3.00
"""

All three double-hashing variants have horrible max collision chains, for the reasons just explained.

In "normal" double-hashing contexts, the initial table indices are scrambled; it's my fault they follow a (mod 2**k) arithmetic progression in Python. Which fault I gladly accept ;-) It's valuable in practice. But, so long as that stays, it kills the double-hashing idea for me: while it's great for random-ish hash codes, the worst cases are horribly bad and very easily provoked (even by accident).
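To make the N*i argument concrete, here's a small simulation (mine, not the attached program) of a failing search against keys of the form i*N: the occupied slots form exactly the arithmetic progression described above, so an increment that's a multiple of N can crawl through an enormous collision chain, while a typical other increment escapes after a few probes:

    N, k = 11, 16
    nslots = 1 << k
    mask = nslots - 1
    dlen = 2 * nslots // 3                           # ~2/3 load, as in the runs above
    occupied = {(i * N) & mask for i in range(dlen)} # keys i*N hash to themselves

    def probes_until_empty(start, inc):
        i, count = start, 1
        while i in occupied:
            i = (i + inc) & mask
            count += 1
        return count

    # Simulate failing searches for a key like (5 + 2**k) * N: its hash maps
    # to the same initial slot as the stored key 5*N.
    start = (5 * N) & mask
    print(probes_until_empty(start, 3 * N))          # increment a multiple of N: huge chain
    print(probes_until_empty(start, 3 * N + 2))      # typical other increment: a few probes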
