[Python-Dev] A new dict for Xmas?

Fri Dec 23 12:21:26 CET 2011

Martin v. Löwis wrote:
>>> - it would be useful to have a specialized representation for
>>>   all-keys-are-strings. In that case, me_hash could be dropped
>>>   from the representation. You would get savings compared to
>>>   the status quo even in the non-shared case.
>> It might tricky switching key tables and I dont think it would save much
>> memory as keys that are widely shared take up very little memory anyway,
>> and not many other dicts are long-lived.
> 
> Why do you say that? In a plain 3.3 interpreter, I counted 595 dict
> objects (see script below). Of these, 563 (so nearly of them) had
> only strings as keys. Among those, I found 286 different key sets,
> where 231 key sets occurred only once (i.e. wouldn't be shared).
> 
> Together, the string dictionaries had 13282 keys, and you could save
> as many pointers (actually more, because there will be more key slots
> than keys).

The question is how much memory needs to be saved to be worth adding the 
complexity, 10kb: No, 100Mb: yes.
So data from "real" benchmarks would be useful.

Also, I'm assuming that it would be tricky to implement correctly due to 
implicit assumptions in the rest of the code.
If I'm wrong and its easy to implement then please do.

> 
> I'm not sure why you think the string dicts with unshared keys would be
> short-lived. 
Not all, but most. Most dicts with unshared keys would most likely be
for keyword parameters. Explicit dicts tend to be few in number.
(When I say few I mean up to 1k, not 100k or 1M).

Module dicts are very likely to have unshared keys; they number in the 
10s or 100s, but they do tend to be large.

> But even if they were, what matters is the steady-state
> number of dictionaries - if for every short-lived dictionary that
> gets released another one is created, any memory savings from reducing
> the dict size would still materialize. 
But only a few kb?

> 
>>> - I wonder whether the shared keys could be computed at compile
>>>   time, considering all attribute names that get assigned for
>>>   self. The compiler could list those in the code object, and
>>>   class creation could iterate over all methods (taking base
>>>   classes into account).
>>>
>> It probably wouldn't be that hard to make a guess at compile time as to
>> what the shared keys would be, but it doesn't really matter.
>> The generation of intermediate shared keys will only happen once per
>> class, so the overhead would be negligible.
> 
> I'm not so much concerned about overhead, but about correctness/
> effectiveness of the heuristics. For a class with dynamic attributes,
> you may well come up with a very large key set. With source analysis,
> you wouldn't attempt to grow the keyset beyond what likely is being
> shared.
I agree some sort of heuristic is required to limit excessive growth
and prevent pathological behaviour.
The current implementation just has a cut off at a certain size;
it could definitely be improved.

As I said, I'll update the code soon and then, well what's the phase...
Oh yes, "patches welcome" ;)

Thanks for the feedback.

Cheers,
Mark.

> 
> Regards,
> Martin
> 
> import sys
> d = sys.getobjects(0,dict)
> print(len(d), "dicts")
> d2 = []
> for o in d:
>     keys = o.keys()
>     if not keys:continue
>     types = tuple(set(type(k) for k in keys))
>     if types != (str,):
>         continue
>     d2.append(tuple(sorted(keys)))
> print(len(d2), "str dicts")
> freq = {}
> for keys in d2:
>     freq[keys] = freq.get(keys,0)+1
> print(len(freq), "different key sets")
> freq = sorted(freq.items(), key=lambda t:t[1])
> print(len([o for o in freq if o[1]==1]), "unsharable")
> print(sum(len(o[0]) for o in freq), "keys")
> print(freq[-10:])