[Python-Dev] More compact dictionaries with faster iteration
Dag Sverre Seljebotn
d.s.seljebotn at astro.uio.no
Thu Dec 13 08:26:47 CET 2012
On 12/13/2012 06:11 AM, PJ Eby wrote:
> On Wed, Dec 12, 2012 at 5:06 PM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no> wrote:
>> Perfect hashing already separates "hash table" from "contents" (sort of),
>> and saves the memory in much the same way.
>> Yes, you can repeat the trick and have 2 levels of indirection, but that
>> then requires an additional table of small ints which is pure overhead
>> present for the sorting; in short, it's no longer an optimization but just
>> overhead for the sortability.
> I'm confused. I understood your algorithm to require repacking,
> rather than it being a suitable algorithm for incremental change to an
> existing dictionary. ISTM that that would mean you still pay some
> sort of overhead (either in time or space) while the dictionary is
> still being mutated.
As-is the algorithm just assumes all key-value-pairs are available at
creation time.
So yes, if you don't reallocate when making the dict perfect then it
could make sense to combine it with the scheme discussed in this thread.
If one does leave some free slots open there's some probability of an
insertion working without complete repacking, but the probability is
smaller than with a normal dict. Hybrid schemes and trade-offs in this
direction could be possible.
> Also, I'm not sure how 2 levels of indirection come into it. The
> algorithm you describe has, as I understand it, only up to 12
> perturbation values ("bins"), so it's a constant space overhead, not a
> variable one. What's more, you can possibly avoid the extra memory
> access by using a different perfect hashing algorithm, at the cost of
> a slower optimization step or using a little more memory.
I said there are k perturbation values; you need an additional array of
k entries of some_int_t, where some_int_t is large enough to hold n (the
number of entries). Just like what's proposed in this thread.
The paper recommends k > 2*n, but in my experiments I could get away
with k = n in 99.9% of the cases (given perfect entropy in the
hashes...). So the overhead is roughly the same as what's proposed here.
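To make the scheme being discussed concrete, here is a toy sketch of the
"hash and displace" family of perfect hashing: k bins each get a
perturbation value r, and a key's final slot is derived from the key
together with its bin's r. This is only an illustration of the general
idea, not the exact algorithm from the paper; all names are made up.

```python
# Toy sketch of hash-and-displace perfect hashing (illustrative only).
# Build retries r per bin until all keys in that bin land in distinct
# free slots; lookup then costs one extra table read (r_values[bin]).

def _slot(key, r, n):
    # Re-hash the key together with the perturbation value r.
    return hash((r, key)) % n

def build_perfect_hash(keys, k=None):
    n = len(keys)
    k = k or n          # the paper recommends k > 2*n; k == n often works
    bins = [[] for _ in range(k)]
    for key in keys:
        bins[hash(key) % k].append(key)
    r_values = [0] * k  # the additional array of k small ints
    slots = [None] * n
    # Largest bins are hardest to place, so handle them first.
    for b in sorted(range(k), key=lambda i: -len(bins[i])):
        if not bins[b]:
            continue
        for r in range(1, 10000):   # the repeated "attempts to succeed"
            targets = [_slot(key, r, n) for key in bins[b]]
            if (len(set(targets)) == len(targets)
                    and all(slots[t] is None for t in targets)):
                break
        else:
            raise RuntimeError("no displacement found; retry with larger k")
        r_values[b] = r
        for key, t in zip(bins[b], targets):
            slots[t] = key
    return r_values

def perfect_lookup(key, r_values, n):
    # Two memory accesses: r_values[bin], then the entry table itself.
    return _slot(key, r_values[hash(key) % len(r_values)], n)
```

Note how the success of the bounded search for r depends on the hashes
being well distributed, which is exactly the entropy concern raised
below.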
I think the most promising thing would be to always have a single
integer table and either use it for indirection (usual case) or for
perfect hash function parameters (say, after a pack() method has been
called and before new insertions).
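For reference, the indirection case is the split layout proposed in this
thread: a sparse table of small ints pointing into a dense,
insertion-ordered entries array. A minimal sketch (assuming linear
probing and ignoring resizing and deletion, both of which a real
implementation needs):

```python
# Sketch of the compact-dict split layout (simplified; no resizing or
# deletion, so it must not be filled close to capacity).

class CompactDict:
    def __init__(self, size=8):
        self.index = [-1] * size   # the single integer table
        self.entries = []          # dense (hash, key, value) triples

    def _probe(self, key):
        # Linear probing for brevity; CPython uses perturbed probing.
        i = hash(key) % len(self.index)
        while True:
            j = self.index[i]
            if j == -1 or self.entries[j][1] == key:
                return i
            i = (i + 1) % len(self.index)

    def __setitem__(self, key, value):
        i = self._probe(key)
        j = self.index[i]
        if j == -1:
            self.index[i] = len(self.entries)
            self.entries.append((hash(key), key, value))
        else:
            self.entries[j] = (hash(key), key, value)

    def __getitem__(self, key):
        j = self.index[self._probe(key)]
        if j == -1:
            raise KeyError(key)
        return self.entries[j][2]

    def __iter__(self):
        # Iteration walks the dense array: fast and insertion-ordered.
        return (key for _, key, _ in self.entries)
```

Switching the same integer table to hold perfect-hash parameters after a
pack() would keep the memory layout but change how `index` is
interpreted.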
>> Note: I'm NOT suggesting the use of perfect hashing, just making
>> sure its existence is mentioned and that one is aware that if
>> always-ordered dicts become the language standard it precludes
>> this option far off in the future.
> Not really. It means that some forms of perfect hashing might require
> adding a few more ints worth of overhead for the dictionaries that use
> it. If it's really a performance benefit for very-frequently-used
> dictionaries, that might still be worthwhile.
As mentioned above, the overhead is larger.
I think the main challenge is to switch to a hashing scheme with larger
entropy for strings, like murmurhash3. Having lots of zero bits in the
hash for short strings will kill the scheme, since it then needs several
attempts to succeed (the "r" parameter). So string hashing is slowed
down a bit (given that string hashes are cached, I don't know how
important this is).
Ideally one should make sure the hashes are 64-bit on 64-bit platforms
too (IIUC long is 32-bit on Windows but I don't know Windows well).
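To illustrate what "larger entropy" buys: an avalanche finalizer such as
the 64-bit fmix64 step from Austin Appleby's public-domain MurmurHash3
spreads every input bit across the whole hash, so raw hashes that differ
only in a few low bits come out thoroughly mixed. This is just an
illustration of the kind of mixing meant, not part of any proposal here.

```python
MASK64 = (1 << 64) - 1

def fmix64(h):
    # Finalization (avalanche) step of MurmurHash3, 64-bit variant.
    # Constants are the published fmix64 multipliers.
    h ^= h >> 33
    h = (h * 0xff51afd7ed558ccd) & MASK64
    h ^= h >> 33
    h = (h * 0xc4ceb9fe1a85ec53) & MASK64
    h ^= h >> 33
    return h
```

With such a finalizer, nearby inputs like 1 and 2 differ in roughly half
of their 64 output bits, which is what the displacement search needs.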
But the main reason I say I'm not proposing it is that I don't have time
to code it up for demonstration, and people like to have something to
look at when they get proposals :-)