[Python-ideas] Optimizing global names via dict references

Sat Dec 19 09:10:53 EST 2015

Dictionary References: An Idea in Three Emails

Act 2: More stuff

Table of Contents:

    Act 1: Stuff
    https://mail.python.org/pipermail/python-ideas/2015-December/037511.html
    - Definitions
    - Problem statement
    - The idea
    - The benefits
    - The costs
    - What will hold a RefCell?
    - Add `getref` to `dict`
    - Implementation issues
    - Interpreters other than CPython

    Act 2: More stuff
    https://mail.python.org/pipermail/python-ideas/2015-December/037512.html
    - Extension: Scope ancestry
    - Aside: Classes and scopes do not have the same rules
    - Extension: Read/write refs
    - Extension: Read-only refs
    - Option: Alternative RefCell structure
    - Option: Replace the methods when detaching a RefCell
    - Extension: Dynamic scoping
    - Extension: Thread safety

    Act 3: Sample pseudocode
    https://mail.python.org/pipermail/python-ideas/2015-December/037513.html
    - ScopeDict
    - Scope ancestry
    - Alternative RefCell structure
    - Thread safety

== Extension: Scope ancestry ==

It's possible for lookups in one scope to fall back into another. In
other words, any failed lookup in scope A will result in a lookup in
scope B. I will call B a parent scope of A.

For example:
- Names in functions will be local if defined there, and global otherwise.
- If function f creates function g, then g's scope includes f's.
- Object attribute lookups go through instance -> class -> [superclasses].
- Global scope includes builtins.

The first two are irrelevant. Those decisions are made at
compile-time, so there's no need for optimization.

I don't want to touch the third one. I'm told that it's already
cached, and I'm not sure refcells would help.

We're left with globals and builtins. Any lookup in globals() can mean
a lookup in __builtins__. So consider __builtins__ as a parent of
globals().

Say we have a parentScope = ScopeDict() and a childScope =
ChainScopeDict(). The childScope will have a list of its parents:
childScope.parents = [parentScope].

When a ref = ChainRefCell() is created, it will also be given a
ref.parents = [parentScope]. On resolution, it will ask its parent
dicts for values. This requires holding a PyRef to the key. It might
also request RefCells from the parents.

[See "Pseudocode: Scope ancestry"]

== Extension: Replacing a parent ==

A parent scope might be replaced after ChainRefCells have been
created. In particular, globals()['__builtins__'] can be replaced.

Fortunately, globals() will know when that particular key is changed.
I suggest that globals() goes through all of its ChainRefCells and
updates their .parents to the new __builtins__ (which invalidates
parent refcells). This will mean iterating through the whole refs
array every time __builtins__ is replaced. So if you keep replacing
__builtins__, you're going to have a bad time.

In other cases where parents can be replaced, we can do the same kind
of notification. Say ChainScopeDicts A < B < C is a relation (B is a
child scope of A, so B is bigger than A), and we replace B's parent A
with D. Then B notifies its RefCells, and C doesn't need to be
notified directly.

(This does NOT work for class hierarchies. See next section.)

== Aside: Classes and scopes do not have the same rules ==

Take this class hierarchy:

    class A(object): pass

    class B(object): pass

    class C(A, B): pass

    obj = C()

Then a lookup in C will NOT cause recursive lookups in A and B. In
other words, the family tree look like this:

    obj:
       obj -> C -> A
                -> B
                -> object
    C:
      C -> A
        -> B
        -> object
    A:
      A -> object
    B:
      B -> object

No matter how deep the class hierarchy goes, the scope hierarchy for
classes and objects is at most two levels deep: instance -> class ->
[superclasses]. This is due to Method Resolution Order. It means that
a class's direct scope parents are all of its superclasses.

This also means that the notification idea in the previous section
won't work. If a class changes its superclass, it doesn't just need to
notify its own cells. It needs to notify all of its subclasses (though
not instances), because they hold pyreferences to its original
superclass.

This can be solved by versioning: But let's not get into that when
it's not even clear that attribute lookup needs this optimization.

== Extension: Read/write refs ==

Python functions can write to globals by declaring, for example,
`global x`. It's straightforward to allow myRefCell.set(val) and
myRefCell.delete().

As a method of `dict`, this would slightly improve the memoization idiom.

Before:

    memo = {}
    def fib(n):
        if n < 2:
            return n
        try:
            return memo[n]
        except:
            result = memo[n] = fib(n-1) + fib(n-2)
                       # ^Second lookup
            return result

After:

    memo = {}
    def fib(n):
        if n < 2:
            return n

        ref = memo.getref(n)
        try:
            return ref.get()
        except:
            ref.set(fib(n-1) + fib(n-2)) # No second lookup
            return ref.get()

Also, "It's better to ask for forgiveness" might no longer be true
with a method that checks for existence.

    memo = {}
    def fib(n):
        if n < 2:
            return n

        ref = memo.getref(n)
        if ref.empty():
            ref.set(fib(n-1) + fib(n-2))

        return ref.get()

This allows modification of entries even after the scope dies. It's
like closures, except exactly the same as closures.

== Extension: Read-only refs ==

Some optimizations might be possible if a dict knows that it will be
notified for every change, through setitem/delitem. For example, if we
make dict1 = dict0.copy(), then this can be a shallow copy, and
whichever dict modifies first will split off from the other. This is
not possible with the existence of RefCells that can .set(). (See
previous section.)

Two non-exclusive solutions:

1. Distinguish read-only RefCells and read-write RefCells. Have a dict
keep track of whether it has read-write RefCells.

2. Make Refcell.set notify its owner dict. This requires a pointer to
the owner dict. (NOT a pyreference, since the pointer is only useful
as long as the owner is alive, and the owner can notify its RefCells
when it dies.)

== Option: Alternative RefCell structure ==

The RefCell I defined owns a pointer to the dict entry or a pointer to
the value, and a flag to determine whether it's part of a "living"
dict.

Instead, we can hold two pointers, one which MIGHT point to `key`. The
`key` pointer will do double duty by telling the RefCell whether it's
part of a living dict (by being NULL).

With this, when we call .get(), it will raise a KeyError with the
correct key. Otherwise, it would have to raise a generic KeyError,
since it doesn't know the key.

[See "Pseudocode: Alternative RefCell structure"]

== Option: Replace the methods when detaching a RefCell ==

Instead of figuring out whether it's pointing to a table entry or a
value itself, we can simply replace the RefCell's member functions.
They will only change once, since the owner can only die once.

== Extension: Dynamic scoping ==

In (C?)Python 3, `exec` can't change scoping. It can't even change
local variables.

    x = 'global x'
    y = 'global y'
    def f():
        x = 10
        exec("x = 'local x'")
        exec("y = 'local y'")
        print(x) # prints 10
        print(y) # prints 'global y'

    f()
    print(x) #prints 'global x'

The docs say:
    """
    The default locals act as described for function locals() below:
modifications to the default locals dictionary should not be
attempted. Pass an explicit locals dictionary if you need to see
effects of the code on locals after function exec() returns.
    """

It's possible, if we wanted, to change how this works, and have nested
function scoping behave like globals > builtins (that is,
dynamically). I'm not sure if it's even desirable, though.

(Function scope ancestry is linear, fortunately. No multiple
inheritance, so no diamond inheritance problem, so there's no need for
Python's MRO, so we wouldn't have the same issues as with classes.)

Steven D'Aprano talks more about it here:
https://mail.python.org/pipermail/python-dev/2015-December/142511.html

== Extension: Thread safety ==

(Note: Needs an expert to double-check.)

RefCells might need locks, because they have to determine whether the
dict is alive to determine where their value is. The owner dict
doesn't need to care about that, since it knows it's alive.

When a dict deletes, it only needs to hold a lock on one RefCell at a time.

[See "Pseudocode: Thread safety"]