[Python-Dev] More compact dictionaries with faster iteration

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Thu Dec 13 08:26:47 CET 2012


On 12/13/2012 06:11 AM, PJ Eby wrote:
> On Wed, Dec 12, 2012 at 5:06 PM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no> wrote:
>> Perfect hashing already separates "hash table" from "contents" (sort of),
>> and saves the memory in much the same way.
>>
>> Yes, you can repeat the trick and have 2 levels of indirection, but that
>> then requires an additional table of small ints which is pure overhead
>> present for the sorting; in short, it's no longer an optimization but just
>> overhead for the sortability.
>
> I'm confused.  I understood your algorithm to require repacking,
> rather than it being a suitable algorithm for incremental change to an
> existing dictionary.  ISTM that that would mean you still pay some
> sort of overhead (either in time or space) while the dictionary is
> still being mutated.

As is, the algorithm just assumes all key-value pairs are available at 
creation time.

So yes, if you don't reallocate when making the dict perfect, then it 
could make sense to combine it with the scheme discussed in this thread.

If one does leave some free slots open, there's some probability of an 
insertion working without a complete repacking, but that probability is 
smaller than with a normal dict. Hybrid schemes and trade-offs in this 
direction could be possible.
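
(For concreteness, here's a toy Python sketch of that build-at-creation /
repack-on-insert idea. It is NOT the algorithm from the paper -- the
"perfect" placement here is just found by growing the table until no two
keys collide -- but it shows how a later insertion only stays cheap if it
happens to land in a free slot. The class and method names are made up.)

    # Toy sketch only (not the paper's scheme): placement is fixed when the
    # table is built, so a later insertion is only cheap if its slot happens
    # to be free; otherwise everything is repacked.  Assumes distinct keys.

    class PackedTable:
        def __init__(self, items):
            self._build(list(items))

        def _build(self, items):
            size = max(8, 2 * len(items))
            while True:
                slots = [None] * size
                if all(self._place(slots, k, v) for k, v in items):
                    self.slots, self.items = slots, items
                    return
                size *= 2                  # collision: repack at a larger size

        @staticmethod
        def _place(slots, key, value):
            i = hash(key) % len(slots)
            if slots[i] is not None:
                return False               # slot already taken
            slots[i] = (key, value)
            return True

        def insert(self, key, value):
            # assumes 'key' is not already present
            if self._place(self.slots, key, value):
                self.items.append((key, value))           # lucky: no repack
            else:
                self._build(self.items + [(key, value)])  # unlucky: full repack

        def lookup(self, key):
            entry = self.slots[hash(key) % len(self.slots)]
            if entry is not None and entry[0] == key:
                return entry[1]
            raise KeyError(key)

The hybrid trade-off mentioned above would then be about how many free
slots _build leaves around, trading memory for repack frequency.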

>
> Also, I'm not sure how 2 levels of indirection come into it.   The
> algorithm you describe has, as I understand it, only up to 12
> perturbation values ("bins"), so it's a constant space overhead, not a
> variable one.  What's more, you can possibly avoid the extra memory
> access by using a different perfect hashing algorithm, at the cost of
> a slower optimization step or using a little more memory.

I said there are k perturbation values; you need an additional array

    some_int_t d[k]

where some_int_t is large enough to hold n (the number of entries). Just 
like what's proposed in this thread.

The paper recommends k > 2*n, but in my experiments I could get away 
with k = n in 99.9% of the cases (given perfect entropy in the 
hashes...). So the overhead is roughly the same as what's proposed here.
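
(To make the shape of that overhead concrete, here is a rough Python
sketch of a lookup through such a k-element table into a dense entries
array. The exact way the perturbation value enters the index computation
below -- add-and-mod, "hash and displace" style, with the two hash values
faked by splitting one 64-bit hash -- is my own stand-in, not necessarily
the paper's formula, and building d is the hard part and isn't shown.)

    # d is the k-element table of small ints, entries the dense array of
    # (key, value) pairs.  The displacement formula is a stand-in.

    def perfect_lookup(key, d, entries):
        h = hash(key) & 0xFFFFFFFFFFFFFFFF           # treat hash as unsigned 64-bit
        bucket = h % len(d)                          # pick one of the k values
        i = ((h >> 32) + d[bucket]) % len(entries)   # displace into the dense array
        found_key, value = entries[i]
        if found_key != key:
            raise KeyError(key)
        return value

Memory-wise this has the same shape as the split layout proposed in this
thread: one array of roughly n small integers plus the dense entry array.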

I think the most promising approach would be to always have a single 
integer table and either use it for indirection (the usual case) or for 
perfect hash function parameters (say, after a pack() method has been 
called and before any new insertions).
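
(A hedged sketch of what that dual use could look like, with a made-up
HybridDict class: the same int table is read either as slot indices into
the dense entries -- ordinary compact-dict mode, open addressing -- or as
perfect-hash displacements after a hypothetical pack(). pack() itself,
resizing, and deletion are omitted.)

    # Sketch only: 'table' is one integer array per dict.  Before pack() it
    # holds indices into 'entries' (compact-dict style); after pack() it
    # would hold perfect-hash parameters.  pack() and resizing not shown.

    class HybridDict:
        def __init__(self):
            self.entries = []            # dense list of (hash, key, value)
            self.table = [-1] * 8        # indices, or displacements once packed
            self.packed = False

        def _probe(self, key):
            # compact-dict mode: linear probing for an index into 'entries'
            mask = len(self.table) - 1
            i = hash(key) & mask
            while True:
                j = self.table[i]
                if j == -1 or self.entries[j][1] == key:
                    return i, j
                i = (i + 1) & mask

        def insert(self, key, value):
            assert not self.packed
            assert len(self.entries) < len(self.table) - 1   # no resizing here
            i, j = self._probe(key)
            if j == -1:
                self.table[i] = len(self.entries)
                self.entries.append((hash(key), key, value))
            else:
                self.entries[j] = (hash(key), key, value)

        def lookup(self, key):
            if self.packed:
                # perfect-hash mode: same table, read as displacements
                h = hash(key) & 0xFFFFFFFFFFFFFFFF
                j = ((h >> 32) + self.table[h % len(self.table)]) % len(self.entries)
            else:
                _, j = self._probe(key)
            if j != -1 and self.entries[j][1] == key:
                return self.entries[j][2]
            raise KeyError(key)

Iteration would just walk 'entries' in insertion order in both modes, 
which is the point of the layout discussed in this thread.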

>> Note: I'm NOT suggesting the use of perfect hashing, just making
>> sure its existence is mentioned and that one is aware that if
>> always-ordered dicts become the language standard it precludes
>> this option far off in the future.
>
> Not really.  It means that some forms of perfect hashing might require
> adding a few more ints worth of overhead for the dictionaries that use
> it.  If it's really a performance benefit for very-frequently-used
> dictionaries, that might still be worthwhile.
>

As mentioned above, the overhead is larger than a few ints: you need the 
whole k-element table.

I think the main challenge is switching to a hashing scheme with more 
entropy for strings, like murmurhash3. Having lots of zero bits in the 
hashes of short strings will kill the scheme, since it then needs several 
attempts to succeed (the "r" parameter). So string hashing is slowed 
down a bit (given that string hashes are cached, I don't know how 
important this is).

Ideally one should also make sure the hashes are 64-bit on 64-bit 
platforms (IIUC long is 32-bit on Windows but I don't know Windows well).
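
(For illustration only: a generic splitmix64-style 64-bit finalizer --
not murmurhash3 -- showing the kind of cheap extra mixing step one might
bolt onto the existing string hash to get well-spread 64-bit values even
where the C long is 32 bits. The constants are the standard splitmix64
ones, not something tuned for this use.)

    # Illustration only: a splitmix64-style finalizer, spreading entropy
    # over all 64 bits, which is what the r-parameter search needs.

    MASK64 = (1 << 64) - 1

    def mix64(h):
        h &= MASK64
        h = ((h ^ (h >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        h = ((h ^ (h >> 27)) * 0x94D049BB133111EB) & MASK64
        return h ^ (h >> 31)

    # Even hashes of short, similar strings come out well spread:
    for s in ("a", "b", "ab"):
        print(s, format(mix64(hash(s)), "016x"))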

But the main reason I say I'm not proposing it is that I don't have time 
to code it up for a demonstration, and people like to have something to 
look at when they get proposals :-)

Dag Sverre

