[Python-Dev] Idea: Dictionary references

Thu Dec 17 13:59:32 EST 2015

On Thu, Dec 17, 2015 at 11:50 AM, Victor Stinner
<victor.stinner at gmail.com> wrote:
> I don't understand how you plan to avoid guards. The purpose of guards
> is to respect the Python semantic by falling back to the "slow"
> bytecode if something changes. So I don't understand your idea of
> avoiding completly guards. Again, that's why I guess that it's
> unrelated to FAT Python...

Yeah, I guess it is. Maybe we could've moved to python-ideas. As far
as I understand, this concept can be put into CPython.

(Note to self: Look up how PyPy handles repeated lookups.)

>> Guards would also have an issue
>> with nested scopes. You have a note on your website about it:
>> (https://faster-cpython.readthedocs.org/fat_python.html#call-pure-builtins)
>>
>>     "The optimization is disabled when the builtin function is
>> modified or if a variable with the same name is added to the global
>> namespace of the function."
>
> FAT Python doesn't emit a specialized version if it requires a builtin
> function, but a local variable with the same name is found.
>
> The check is done in the current function but also in the parent
> namespaces, up to the global namespace. I'm talking about the static
> analysis of the code.
>
> If the specialized version is built, a guard is created on the builtin
> namespace and another on the global namespace.
>
> Sorry, I don't understand the "problem" you are referring to. Can you
> maybe show an example of code where FAT Python doesn't work or where
> you idea can help?

If we have to look in scopes A, B, C in order, where the name is in C
but never in B, and there's no nesting relationship between B and C.
In that case, I do not create a refcell in B chained to C (because
there's no relationship), so I keep doing lookups in B. That's the
problem. For that, guards and versioning can prevent unnecessary
lookups in B. Though I think a better solution might be: If a name is
found in C, create empty refcells in B and A (i.e. in all previous
dicts).

(There's a nesting relationship between globals() and __builtin__, so
that's fine.)

>> RefCells "know" whether they're part of a living dict. (The dict marks
>> them as such upon its death.) If they are not, they will decref their
>> value upon their death. They do not hold a reference to their owner
>> dict.
>
> The dict contains the list all of its "own" RefCell objects, right?

It contains a table of pointers. The pointers are to RefCell objects or NULL.

The refs table is exactly the same size as the internal hash table.
This makes indexing it efficient: to find the pointer to a refcell,
find the index of the key in the hash table, then use that SAME index
on the refs table. You never need to find a refcell without also
finding its hash index, so this is cheap.

> In short, RefCell is a view on a single dict entry, to be able to
> "watch" a dict entry without the cost of a lookup? And a RefCell can
> be created, even if the dict entry doesn't exist right?

My "implementation", which had nesting and recursion in mind, had a
"create_if_none" parameter, which meant that the requester can ask for
it to be created even if the key didn't exist in the table.
Pre-creation is useful for functions which refer to globals() names
before they're defined. No-creation is useful in... I can only think
of nesting as a use (globals() -> __builtin__ shouldn't create empty
cells in __builtin__).

See `getref` in here:
https://mail.python.org/pipermail/python-dev/2015-December/142489.html

> Hum, creating many RefCell introduces an overhead somewhere. For
> example, deleting a key has to update N RefCell objects linked to this
> key, right? So del dict[x] takes O(number of RefCell), right?

There are no "outside" updates, except when a dict moves to a
different internal table or deletes its internal table. In that case,
the dict has to move and update each exposed RefCell.

For each dict, for each key, there is at most one RefCell. As long as
the dict is alive, that RefCell will hold a pointer to the pointer to
the value (enforced by the dict). When the dict dies, it makes the
Refcell point to the object, and tells the RefCell it's free (so it's
in charge of cleaning up its value).

Dict entries look like this.

    typedef struct {
        /* Cached hash code of me_key. */
        Py_hash_t me_hash;
        PyObject *me_key;
        PyObject *me_value; /* This field is only meaningful for
combined tables */
    } PyDictKeyEntry;

The internal table (which the ref table will sync with) is an array of
PyDictKeyEntrys.

(Raymond Hettinger made a design with a smaller table, and the hash
lookup would be into an array of indices. This would make synced
metadata tables both easier and smaller. See
    https://mail.python.org/pipermail/python-dev/2012-December/123028.html
and latest relevant discussion
    https://mail.python.org/pipermail/python-ideas/2015-December/037468.html
)

The refcell will hold this:

    RefCell(&PyDictKeyEntry.me_value)

A pointer to the field, not to the value itself. This means NO extra
updates are necessary, and NO O(n) anything anywhere (except on
resizing and destruction, and nesting can be O(n) in the number of
parents).

>> If a dict key is removed, the inner dict will still have the key
>> (which is true in the current implementation), but the value will be
>> decref'd and the value pointer will be NULL'd. The RefCell will not
>> need to be updated, since (as part of a living dict) it's pointing at
>> the pointer to the value, not the object itself.
>
> Ok, so deleting a dict key always destroy the value, but the key may
> stay alive longer than expected (until all RefCell objects are
> destroyed). Usually, dict keys are constants, so keeping them alive
> doesn't matter so much. It's rare to have large keys.

It MIGHT stay alive longer than the current implementation, yes. But
that's not necessarily a bad thing.

If there's no RefCell:
The current dict implementation removes the key on deletion. In my
"implementation", it doesn't remove the key on deletion. This isn't
necessary: it can safely remove the key if there's no ref. (This adds
an extra pointer check to delitem.)

If there's a RefCell:
Current dict removes key. A refdict should not remove a key with an
exposed RefCell, because that key marks the spot as "used" (so that no
key can be put inside). This is okay, because the refs are made to
avoid lookups, so while we're keeping an extra reference to the key,
the owner of the RefCell does NOT need to keep a reference to the key.
There can be fewer total references to the key than with the current
implementation, even with the extra reference. Especially in function
objects, which is what I'm trying to solve.

>> If it is detached from its owner dict (due to owner death), it will
>> own a strong reference to the value. This is necessarily the case,
>> since things that have a (strong) reference to the RefCell expect to
>> find the value (or lack of a value) there.
>
> I don't  understand this part.
>
> You said that deleting a key destroys the value. Destroying a dict
> means clearing all keys, so destroying all values. No?
>
> What is the use case of having a RefCell no more connected to the dict?

Honestly, I can only think of closures. But it's the right thing to
do. (RefCells with refcount == 1 will be deleted upon dict
destruction.)

Consider: If thing X holds a RefCell into a dict, and the dict is
destroyed, what should happen?

Without RefCells:
    Thing X would be looking things up in the dict. That means thing X
would have had a strong reference to the dict.

With RefCells:
    The dict might be deleted where it wasn't deleted before. So thing
X should have a disconnected RefCell to the value. (If it looked
things up via a weakref to the dict, it should create a weakref to the
RefCell. But if I understood an earlier message correctly, you can't
weakref a dict.)