[Python-Dev] PEP for new dictionary implementation

Thu Feb 16 19:24:22 CET 2012

PEP author Mark Shannon wrote
(in http://mail.python.org/pipermail/python-dev/attachments/20120208/05be469a/attachment.txt):

> ... allows ... (the ``__dict__`` attribute of an object) to share
> keys with other attribute dictionaries of instances of the same class.

Is "the same class" a deliberate restriction, or just a convenience
of implementation?  I have often created subclasses (or even families
of subclasses) where instances (as opposed to the type) aren't likely
to have additional attributes.  These would benefit from key-sharing
across classes, but I grant that it is a minority use case that isn't
worth optimizing if it complicates the implementation.

> By separating the keys (and hashes) from the values it is possible
> to share the keys between multiple dictionaries and improve memory use.

Have you timed not storing the hash (in the dict) at all, at least for
(unicode) str-only dicts?  Going to the string for its own cached hash
breaks locality a bit more, but saves 1/3 of the memory for combined
tables, and may make a big difference for classes that have relatively
few instances.
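
A rough sketch of the arithmetic behind the 1/3 figure. It assumes each combined-table entry is three pointer-sized words (hash, key, value) and ignores per-object headers; the helper name `combined_table_bytes` is made up for illustration.

```python
import struct

# Size of a pointer on this platform, in bytes.  Assumption: the cached
# hash is also pointer-sized, so every field of an entry costs one word.
WORD = struct.calcsize("P")


def combined_table_bytes(slots, store_hash=True):
    """Approximate size of the entry array of a combined table.

    Each slot holds (hash, key, value), or just (key, value) when the
    hash is fetched from the string object instead of being stored.
    """
    words_per_slot = 3 if store_hash else 2
    return slots * words_per_slot * WORD


with_hash = combined_table_bytes(8, store_hash=True)
without_hash = combined_table_bytes(8, store_hash=False)
saving = 1 - without_hash / with_hash
print(with_hash, without_hash, round(saving, 3))
```

The saving is independent of the table size in this model; whether it pays for the extra pointer chase to the string is exactly what timing would have to show.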

> Reduction in memory use is directly related to the number of dictionaries
> with shared keys in existence at any time. These dictionaries are typically
> half the size of the current dictionary implementation.

How do you measure that?  The limit for huge N across huge numbers
of dicts should be 1/3 (because both hashes and keys are shared); I
assume that gets swamped by object overhead in typical small dicts.
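
A back-of-the-envelope model of that limit. It assumes three words per slot in a combined table, a shared keys table that still carries three words per slot (with the value word unused), and one word per slot in each per-instance values array. `header_words` is a hypothetical fixed per-dict overhead, not a measured figure.

```python
def combined_words(slots, n_dicts, header_words=0):
    # Every dict carries its own (hash, key, value) table.
    return n_dicts * (3 * slots + header_words)


def split_words(slots, n_dicts, header_words=0):
    # One shared keys table, plus a values array per dict.
    shared = 3 * slots
    per_dict = slots + header_words
    return shared + n_dicts * per_dict


def ratio(slots, n_dicts, header_words=0):
    """Memory of n split dicts relative to n combined dicts."""
    return (split_words(slots, n_dicts, header_words)
            / combined_words(slots, n_dicts, header_words))


for n in (1, 10, 1000, 10**6):
    # Second column: no fixed overhead; third: a made-up 6-word header.
    print(n, round(ratio(8, n), 4), round(ratio(8, n, header_words=6), 4))
```

With no fixed overhead the ratio tends to 1/3 as the number of dicts grows; any nonzero `header_words` pulls the limit up, which is consistent with the "typically half" claim for small dicts.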

> If a table is split the values in the keys table are ignored,
> instead the values are held in a separate array.

If they're just dead weight, then why not use them to hold indices
into the array, so that values arrays only have to be as long as
the number of keys, rather than rounding them up to a large-enough
power-of-two?  (On average, this should save half the slots.)
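
A minimal sketch of the index idea in pure Python, not the PEP's actual layout. The class names `SharedKeys` and `SplitDict` are hypothetical, probing is linear for brevity, and there is no resizing or deletion.

```python
_MISSING = object()  # marks a key this particular instance has not set


class SharedKeys:
    """Keys table shared by many dicts; the otherwise unused value
    field of each slot stores an index into the values arrays."""

    def __init__(self, size=8):
        self.size = size            # must be a power of two
        self.slots = [None] * size  # each slot: (hash, key, index) or None
        self.used = 0

    def index_of(self, key, insert=False):
        h = hash(key)
        i = h & (self.size - 1)
        while self.slots[i] is not None:
            slot_hash, slot_key, index = self.slots[i]
            if slot_hash == h and slot_key == key:
                return index
            i = (i + 1) & (self.size - 1)
        if not insert:
            return None
        if self.used >= self.size - 1:
            raise RuntimeError("sketch does not implement resizing")
        self.slots[i] = (h, key, self.used)
        self.used += 1
        return self.used - 1


class SplitDict:
    """Per-instance values array, only as long as the highest index used."""

    def __init__(self, keys):
        self.keys = keys
        self.values = []

    def __setitem__(self, key, value):
        index = self.keys.index_of(key, insert=True)
        while len(self.values) <= index:
            self.values.append(_MISSING)
        self.values[index] = value

    def __getitem__(self, key):
        index = self.keys.index_of(key)
        if (index is None or index >= len(self.values)
                or self.values[index] is _MISSING):
            raise KeyError(key)
        return self.values[index]


shared = SharedKeys(8)
a, b = SplitDict(shared), SplitDict(shared)
a["x"] = 1
a["y"] = 2
b["y"] = 20
print(len(shared.slots), len(a.values), len(b.values))
```

Here the keys table has eight slots but each values array holds only two entries, at the cost of one extra indirection per lookup.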

> A combined-table dictionary never becomes a split-table dictionary.

I thought it did (at least temporarily) as part of resizing; are you
saying that it will be re-split by the time another thread is
allowed to see it, so that it is never observed as combined?

Given that this optimization is limited to class instances, I think
there should be some explanation of why you didn't just automatically
add slots for each variable assigned (by hard-coded name) within a
method.  The keys would still be stored on the type, and array storage
could still be used for the values.  The __dict__ slot could initially
be a NULL pointer, and instance dicts could be added exactly when they
were needed, covering only the oddball keys.
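
Today's hand-written equivalent looks roughly like the sketch below: the slot names are listed manually (the suggestion is that they could be derived automatically), and including `__dict__` in `__slots__` keeps a per-instance dict available for anything else. The class `Point` and its attributes are made up for illustration.

```python
class Point:
    # Names assigned by hard-coded name in __init__, listed by hand here.
    # "__dict__" keeps a per-instance dict for unexpected attributes.
    __slots__ = ("x", "y", "__dict__")

    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)
p.label = "origin-ish"  # oddball attribute goes to the instance dict

print(p.x, p.y)
print(p.__dict__)  # holds only the oddball key, not x or y
```

The slot names live on the type as descriptors and the values sit in fixed positions in the instance, so only the oddball keys pay for a dictionary.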

I would reword (or at least reformat) the Cons section; at the
moment, it looks like there are four separate objections, and it seems
to be a bit dismissive towards backwards compatibility.  Perhaps
something like:

While this PEP does not change any documented APIs or invariants,
it does break some de facto invariants.

C extension modules may be relying on the current physical layout
of a dictionary.  That said, extensions which rely on internals may
already need to be recompiled with each feature release; there are
already changes planned for both Unicode (for efficiency) and dicts
(for security) that would require authors of these extensions to
at least review their code.

Because iteration (and repr) order can depend on the order in which
keys are inserted, it will be possible to construct instances that
iterate in a different order than they would under the current
implementation.  Note, however, that this will happen very rarely
in code which does not deliberately trigger the differences, and
that test cases which rely on a particular iteration order will
already need to be corrected in order to take advantage of the
security enhancements being discussed under hash randomization, or
for use with Jython and PyPy.

-jJ

-- 

If there are still threading problems with my replies, please 
email me with details, so that I can try to resolve them.  -jJ