Optimizing global names via dict references

(Previous threads from python-dev:
"Re: [Python-Dev] Third milestone of FAT Python"
https://mail.python.org/pipermail/python-dev/2015-December/142437.html
"[Python-Dev] Idea: Dictionary references" (forked from the above thread)
https://mail.python.org/pipermail/python-dev/2015-December/142490.html )

Dictionary References: An Idea in Three Emails
Act 1: Stuff

Table of Contents:

Act 1: Stuff
https://mail.python.org/pipermail/python-ideas/2015-December/037511.html
- Definitions
- Problem statement
- The idea
- The benefits
- The costs
- What will hold a RefCell?
- Add `getref` to `dict`
- Implementation issues
- Interpreters other than CPython

Act 2: More stuff
https://mail.python.org/pipermail/python-ideas/2015-December/037512.html
- Extension: Scope ancestry
- Aside: Classes and scopes do not have the same rules
- Extension: Read/write refs
- Extension: Read-only refs
- Option: Alternative RefCell structure
- Option: Replace the methods when detaching a RefCell
- Extension: Dynamic scoping
- Extension: Thread safety

Act 3: Sample pseudocode
https://mail.python.org/pipermail/python-ideas/2015-December/037513.html
- ScopeDict
- Scope ancestry
- Alternative RefCell structure
- Thread safety

== Definitions ==

"lookup": A dictionary lookup.

"exposed RefCell": A RefCell with a PyRef held by something other than its owner; that is, a RefCell with a refcount > 1.

"pyreference": A CPython ref-counted reference.

== Problem statement ==

I propose a CPython interpreter optimization for global name lookups. When a function uses a global name, it requires a lookup during execution.

    def maxlen(lists):
        return sum(len(lst)        # globals['len']
                   for lst in lists)

This is necessary because:
1. The name might not exist yet.
2. The value might change.

Repeated dict lookups are more expensive than not doing them: local variables are accessed by array index, rather than by lookup.
(See http://stackoverflow.com/questions/12590058/python-performance-with-global-v...)

Pythonistas often get around this lookup by saving the name in a local variable:

    def maxlen(lists):
        len_ = len              # one lookup per call
        return sum(len_(lst)
                   for lst in lists)

They also use a default argument for that variable, which avoids even the per-call lookup:

    def maxlen(lists, len=len):     # one lookup, at definition time
        return sum(len(lst)
                   for lst in lists)

I think these tricks are theoretically unnecessary. We only need a single lookup, at compile-time. There should be no change to Python's semantics. In fact, this should reinforce the semantics, because the above tricks don't allow `len` to change, so if this idea is implemented, the tricks would only be used to explicitly disallow changes.

== The idea ==

We use a level of indirection, and allow empty values in the internal dict (scope.__inner__[key] = NULL).

Let ScopeDict be a subclass of dict. (It doesn't have to be a subclass. See "Add `getref` to `dict`".) ScopeDict owns a second dict of RefCells. A normal dict owns pyreferences to its values. A RefCell will have a pointer to that reference. This means that a RefCell does not need to be notified when a value is set, since it knows where the dict stores its pointer to the value. Instead, when the dict resizes, the dict has to update all of its RefCells.

Upon deletion of the dict, ScopeDict will pass ownership of the value pyreferences to the RefCells.

That was a simplified view. We can make these optimizations:

1. The RefCells will be held in an array, not a dict. This array will be synced with the dict's internal hash table, and resizes with it. This allows a single lookup for both the value and its RefCell (or lack thereof), because they will be at the same index in different arrays.
2. RefCells won't be created unless they're requested. Obviously.
   2a. The array doesn't need to be created unless a RefCell is created.
3. Any time a ScopeDict would touch a RefCell, it would first check that the RefCell is not exposed (that is, that it has no outside references). If it isn't exposed, that RefCell can be safely deleted.
   3a. The refs array can also be deleted at this time if it doesn't hold anything.
4. If a key has no corresponding RefCell, it doesn't need to be saved. This can be checked on resize and in ScopeDict.__del__, and also on __delitem__.

[See "Pseudocode: ScopeDict"]

== The benefits ==

Every global lookup is done at compile-time. No need to save a local copy of a global variable.

Guaranteed to have the most updated value, even if it didn't exist at compile-time. In particular, it works even for recursive functions and wrapped functions, such as:

    from functools import lru_cache

    @lru_cache()
    def fib(n):
        if n < 2:
            return n
        return fib(n-1) + fib(n-2)

Access to a global variable during a call is a few C dereferences, plus a few checks and C function calls.

No change to the semantics of the language.

Possible change in the semantics of CPython: there might be cases where keys live longer than they used to, given the following:
1. `del scope[key]` was called.
2. There was an exposed RefCell for that key.
3. Later, the RefCell stopped being exposed.
4. `scope` has not resized since then (and thus hasn't cleaned up its dead cells yet).

The above is already possible in a garbage-collected interpreter. We could make ScopeDict weakref its RefCells, and have RefCells remove keys with no value when they die, but it would mean extra overhead from wrapping each RefCell in a weakref. Is that worth it? (`getref` will return a strong reference.)

== The costs ==

Memory:
- An extra pointer to the refs table in each ScopeDict, and an extra function pointer for the new function.
- An extra array for each ScopeDict, 1/3 to 1/4 the size of the internal dict table (depending on how big pointers are on the platform).
  Note: Since the refs table is the same size as the internal hash table, Raymond Hettinger's compact dict idea will also decrease the size of the refs table.
  (compact dict idea: https://mail.python.org/pipermail/python-dev/2012-December/123028.html)
- Extra memory for each RefCell (a Python object). (There will be at most one RefCell per dict per key.)
- The dict will be bigger because it holds empty entries.

Execution:
- Extra compile time. Functions will do lookups when they're compiled. But we're saving lookups during execution, so it's only extra if the function never gets called or never uses the RefCell.
- `getitem`: Must check whether the value is NULL. Currently, it only checks the key.
- `delitem`: Instead of removing the key, just set the value to NULL. In fact, this is probably cheaper.
- Resizing the dict: It will have to copy over all the RefCells (and delete the unused ones) to a new array, too, and update them.
- Deleting the dict: The dict will have to DecRef its RefCells, and pass on its value pyreferences to the ones that survive.

Python refs:
- We're potentially holding keys even when values have been deleted. This is okay, because most pyreferences to a RefCell replace a pyreference to a key.

== What will hold a RefCell? ==

Functions will request RefCells for each global variable they use. They will hold them until they die.

Code not in functions (e.g. at module scope), or using eval/exec, will request RefCells during compilation, and they will be used during execution. Such code will discard the RefCells when it is done executing.

== Add `getref` to `dict` ==

I propose that all dicts should have the ability to return RefCells.

As Victor Stinner pointed out in his FAT Python message, using a subclass for scopes would disallow regular dicts being used as scopes by eval and exec. (See the end of https://mail.python.org/pipermail/python-dev/2015-December/142397.html)

But I think refs can be useful enough that they could be added to the dict interface. Meaning `getref` (maybe under another name... `item`?) can be a new dict method. Upon first use, the dict will generate an empty refs table and replace its methods with ones that know how to deal with refs. This prevents O(n) memory overhead for dicts that don't need it.

== Implementation issues ==

There are performance _concerns_. I think it can be done with only a few performance _issues_. In my experience arguing out the idea, the overhead for both memory and execution will cost less than the lookups it replaces.[*] So while I'm not promising a free lunch, I'm believing in it.

There are potential issues with dict subclasses that also change the C function pointers, since I'm proposing that we replace those functions dynamically. I think these issues are solvable, but it would require understanding what's allowed so I would know how to properly wrap that functionality.

[*] (Except that there's a tricky case in deep multiple inheritance of scope, which I'd have to figure out. But I don't think it's possible to do that except in classes, and I don't want to touch class MRO until I understand how the resolution works in C.)

[*] (And you can pay for a lot of lookups if you keep creating functions that won't get executed, but it's impossible to do such a thing in a loop, since the lookups will be done once by the parent. Unless you loop an eval/exec to keep making functions that will never be called, in which case you're a bad person.)

== Interpreters other than CPython ==

It should be possible to use the same idea in IronPython, Jython, and PyPy. I think it's a simple enough idea that I am surprised it's not already there. In fact, I would not be surprised if PyPy already uses it. In fact, I would not be (very) surprised if I had gotten the idea from PyPy.
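To make the intended semantics concrete, here is a minimal runnable pure-Python sketch. It is only a simulation: values here live inside cell objects (roughly how PyPy's celldict works), whereas the C proposal keeps values in the dict's own entry table and only hands out RefCells on request. `CellDict` and `_MISSING` are names invented for this sketch; `RefCell` and `getref` follow the names above.

    _MISSING = object()   # stands in for the C-level NULL value slot

    class RefCell:
        __slots__ = ('key', 'value')

        def __init__(self, key):
            self.key = key
            self.value = _MISSING

        def get(self):
            if self.value is _MISSING:
                raise KeyError(self.key)
            return self.value

    class CellDict:
        """Scope-like mapping; getref() returns a live cell for a key."""

        def __init__(self):
            self._cells = {}                    # the single lookup happens here

        def getref(self, key):
            cell = self._cells.get(key)
            if cell is None:
                cell = self._cells[key] = RefCell(key)   # empty cell ("NULL" value)
            return cell

        def __getitem__(self, key):
            cell = self._cells.get(key)
            if cell is None:
                raise KeyError(key)
            return cell.get()                   # raises KeyError if value was deleted

        def __setitem__(self, key, value):
            self.getref(key).value = value

        def __delitem__(self, key):
            cell = self._cells.get(key)
            if cell is None or cell.value is _MISSING:
                raise KeyError(key)
            cell.value = _MISSING               # keep the cell; holders see the deletion

    # Usage: capture the cell once (the "compile-time lookup"), then each call
    # is just a dereference -- yet rebinding the global is still seen immediately.
    g = CellDict()
    g['len'] = len
    len_cell = g.getref('len')

    def maxlen(lists):
        return sum(len_cell.get()(lst) for lst in lists)

    print(maxlen([[1, 2], [3]]))        # 3
    g['len'] = lambda seq: 100          # simulate `len` being rebound
    print(maxlen([[1, 2], [3]]))        # 200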

Dictionary References: An Idea in Three Emails
Act 2: More stuff

Table of Contents:

Act 1: Stuff
https://mail.python.org/pipermail/python-ideas/2015-December/037511.html
- Definitions
- Problem statement
- The idea
- The benefits
- The costs
- What will hold a RefCell?
- Add `getref` to `dict`
- Implementation issues
- Interpreters other than CPython

Act 2: More stuff
https://mail.python.org/pipermail/python-ideas/2015-December/037512.html
- Extension: Scope ancestry
- Aside: Classes and scopes do not have the same rules
- Extension: Read/write refs
- Extension: Read-only refs
- Option: Alternative RefCell structure
- Option: Replace the methods when detaching a RefCell
- Extension: Dynamic scoping
- Extension: Thread safety

Act 3: Sample pseudocode
https://mail.python.org/pipermail/python-ideas/2015-December/037513.html
- ScopeDict
- Scope ancestry
- Alternative RefCell structure
- Thread safety

== Extension: Scope ancestry ==

It's possible for lookups in one scope to fall back to another. In other words, any failed lookup in scope A will result in a lookup in scope B. I will call B a parent scope of A. For example:

- Names in functions will be local if defined there, and global otherwise.
- If function f creates function g, then g's scope includes f's.
- Object attribute lookups go through instance -> class -> [superclasses].
- Global scope includes builtins.

The first two are irrelevant: those decisions are made at compile-time, so there's no need for optimization. I don't want to touch the third one. I'm told that it's already cached, and I'm not sure refcells would help.

We're left with globals and builtins. Any lookup in globals() can mean a lookup in __builtins__, so consider __builtins__ a parent of globals().

Say we have parentScope = ScopeDict() and childScope = ChainScopeDict(). The childScope will have a list of its parents: childScope.parents = [parentScope]. When a ref = ChainRefCell() is created, it will also be given ref.parents = [parentScope]. On resolution, it will ask its parent dicts for values. This requires holding a PyRef to the key. It might also request RefCells from the parents.

[See "Pseudocode: Scope ancestry"]

== Extension: Replacing a parent ==

A parent scope might be replaced after ChainRefCells have been created. In particular, globals()['__builtins__'] can be replaced. Fortunately, globals() will know when that particular key is changed. I suggest that globals() goes through all of its ChainRefCells and updates their .parents to the new __builtins__ (which invalidates parent refcells).

This will mean iterating through the whole refs array every time __builtins__ is replaced. So if you keep replacing __builtins__, you're going to have a bad time.

In other cases where parents can be replaced, we can do the same kind of notification. Say ChainScopeDicts A < B < C is a relation (B is a child scope of A, so B is bigger than A), and we replace B's parent A with D. Then B notifies its RefCells, and C doesn't need to be notified directly. (This does NOT work for class hierarchies. See the next section.)

== Aside: Classes and scopes do not have the same rules ==

Take this class hierarchy:

    class A(object): pass
    class B(object): pass
    class C(A, B): pass
    obj = C()

Then a lookup in C will NOT cause recursive lookups in A and B. In other words, the family tree looks like this:

    obj: obj -> C -> A -> B -> object
    C:   C -> A -> B -> object
    A:   A -> object
    B:   B -> object

No matter how deep the class hierarchy goes, the scope hierarchy for classes and objects is at most two levels deep: instance -> class -> [superclasses]. This is due to Method Resolution Order. It means that a class's direct scope parents are all of its superclasses.

This also means that the notification idea in the previous section won't work. If a class changes its superclass, it doesn't just need to notify its own cells. It needs to notify all of its subclasses (though not instances), because they hold pyreferences to its original superclass. This can be solved by versioning, but let's not get into that when it's not even clear that attribute lookup needs this optimization.

== Extension: Read/write refs ==

Python functions can write to globals by declaring, for example, `global x`. It's straightforward to allow myRefCell.set(val) and myRefCell.delete().

As a method of `dict`, this would slightly improve the memoization idiom. Before:

    memo = {}
    def fib(n):
        if n < 2:
            return n
        try:
            return memo[n]
        except KeyError:
            result = memo[n] = fib(n-1) + fib(n-2)   # ^Second lookup
            return result

After:

    memo = {}
    def fib(n):
        if n < 2:
            return n
        ref = memo.getref(n)
        try:
            return ref.get()
        except KeyError:
            ref.set(fib(n-1) + fib(n-2))             # No second lookup
            return ref.get()

Also, "It's better to ask for forgiveness" might no longer be true with a method that checks for existence:

    memo = {}
    def fib(n):
        if n < 2:
            return n
        ref = memo.getref(n)
        if ref.empty():
            ref.set(fib(n-1) + fib(n-2))
        return ref.get()

(A runnable simulation of this idiom appears at the end of this Act.)

This allows modification of entries even after the scope dies. It's like closures, except exactly the same as closures.

== Extension: Read-only refs ==

Some optimizations might be possible if a dict knows that it will be notified of every change, through setitem/delitem. For example, if we make dict1 = dict0.copy(), then this can be a shallow copy, and whichever dict modifies first will split off from the other. This is not possible in the presence of RefCells that can .set(). (See the previous section.)

Two non-exclusive solutions:
1. Distinguish read-only RefCells from read-write RefCells. Have a dict keep track of whether it has read-write RefCells.
2. Make RefCell.set notify its owner dict. This requires a pointer to the owner dict. (NOT a pyreference, since the pointer is only useful as long as the owner is alive, and the owner can notify its RefCells when it dies.)

== Option: Alternative RefCell structure ==

The RefCell I defined owns a pointer to the dict entry or a pointer to the value, plus a flag saying whether it's part of a "living" dict. Instead, we can hold two pointers, one of which MIGHT point to `key`. The `key` pointer will do double duty by telling the RefCell whether it's part of a living dict (by being NULL). With this, when we call .get(), it will raise a KeyError with the correct key. Otherwise, it would have to raise a generic KeyError, since it doesn't know the key.

[See "Pseudocode: Alternative RefCell structure"]

== Option: Replace the methods when detaching a RefCell ==

Instead of figuring out whether it's pointing to a table entry or to a value itself, we can simply replace the RefCell's member functions. They will only change once, since the owner can only die once.

== Extension: Dynamic scoping ==

In (C?)Python 3, `exec` can't change scoping. It can't even change local variables.

    x = 'global x'
    y = 'global y'

    def f():
        x = 10
        exec("x = 'local x'")
        exec("y = 'local y'")
        print(x)  # prints 10
        print(y)  # prints 'global y'

    f()
    print(x)      # prints 'global x'

The docs say:
"""
The default locals act as described for function locals() below: modifications to the default locals dictionary should not be attempted. Pass an explicit locals dictionary if you need to see effects of the code on locals after function exec() returns.
"""

It's possible, if we wanted, to change how this works, and have nested function scoping behave like globals > builtins (that is, dynamically). I'm not sure it's even desirable, though. (Function scope ancestry is linear, fortunately: no multiple inheritance, so no diamond inheritance problem, so there's no need for Python's MRO, so we wouldn't have the same issues as with classes.)

Steven D'Aprano talks more about it here: https://mail.python.org/pipermail/python-dev/2015-December/142511.html

== Extension: Thread safety ==

(Note: Needs an expert to double-check.)

RefCells might need locks, because they have to determine whether the dict is alive in order to determine where their value is. The owner dict doesn't need to care about that, since it knows it's alive. When a dict is deleted, it only needs to hold a lock on one RefCell at a time.

[See "Pseudocode: Thread safety"]
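As a runnable illustration of the "Read/write refs" memoization idiom above: `Ref` and the free-standing `getref` helper below are stand-ins invented for this sketch, since today's dicts have no getref method; the real proposal would return a cell tied to the dict's own entry rather than a boxed value.

    _EMPTY = object()

    class Ref:
        __slots__ = ('value',)

        def __init__(self):
            self.value = _EMPTY

        def empty(self):
            return self.value is _EMPTY

        def get(self):
            if self.value is _EMPTY:
                raise KeyError
            return self.value

        def set(self, value):
            self.value = value

    def getref(d, key):
        """Simulates the proposed memo.getref(key) with force_create."""
        ref = d.get(key)
        if ref is None:
            ref = d[key] = Ref()
        return ref

    memo = {}       # maps n -> Ref; a real ScopeDict would hold plain values

    def fib(n):
        if n < 2:
            return n
        ref = getref(memo, n)           # the only lookup for this n...
        if ref.empty():
            ref.set(fib(n - 1) + fib(n - 2))
        return ref.get()                # ...storing and rereading skip the dict

    print(fib(30))                      # 832040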

Dictionary References: An Idea in Three Emails
Act 3: Sample pseudocode

Table of Contents:

Act 1: Stuff
https://mail.python.org/pipermail/python-ideas/2015-December/037511.html
- Definitions
- Problem statement
- The idea
- The benefits
- The costs
- What will hold a RefCell?
- Add `getref` to `dict`
- Implementation issues
- Interpreters other than CPython

Act 2: More stuff
https://mail.python.org/pipermail/python-ideas/2015-December/037512.html
- Extension: Scope ancestry
- Aside: Classes and scopes do not have the same rules
- Extension: Read/write refs
- Extension: Read-only refs
- Option: Alternative RefCell structure
- Option: Replace the methods when detaching a RefCell
- Extension: Dynamic scoping
- Extension: Thread safety

Act 3: Sample pseudocode
https://mail.python.org/pipermail/python-ideas/2015-December/037513.html
- ScopeDict
- Scope ancestry
- Alternative RefCell structure
- Thread safety

== Pseudocode: ScopeDict ==

The CPython dict implementation looks like this (simplified):

    class KVPair:
        __slots__ = {
            'key': object | DUMMY | NULL,
            'value': object | NULL,
        }

    class dict:
        __slots__ = {
            'table': List[KVPair],
            'size': int,
        }

        def lookup(self, key):
            # returns the entry, or an empty entry
            ...

        def __getitem__(self, key):
            entry = self.lookup(key)
            if entry.key in [NULL, DUMMY]:
                raise KeyError
            return entry.value

        def __setitem__(self, key, value):
            entry = self.lookup(key)
            if entry.key in [NULL, DUMMY]:
                entry.key = key
            entry.value = value
            self.maybe_resize()

        def __delitem__(self, key):
            entry = self.lookup(key)
            if entry.key in [NULL, DUMMY]:
                raise KeyError
            entry.key = DUMMY
            entry.value = NULL
            self.maybe_resize()

        def resize(self):
            old_table = self.table
            self.table = [KVPair() for i in range(self.predict_size())]
            for k, v in old_table:
                self[k] = v

I want to add a `refs` member to hold the RefCells, and a `getref` function to acquire RefCells.

(I make redundant lookups for demonstration. They are unnecessary in C.)
(In C, self.table.index(entry) is a simple pointer subtraction.)
(In the actual implementation, we could save a direct pointer to the KVPair's value pointer.)

    class ScopeDict(dict):
        __slots__ = {
            'refs': List[RefCell | NULL],
        }

        def getref(self, key, force_create: bool):
            entry = self.lookup(key)
            if entry.key in [NULL, DUMMY]:
                if force_create:
                    entry.key = key
                    entry.value = NULL
                else:
                    raise KeyError
            # Get the index.
            i = self.table.index(entry)
            cell = self.refs[i] = RefCell()
            cell.indirect = True
            cell.pointer = entry
            return cell

        def __getitem__(self, key):
            entry = self.lookup(key)
            if entry.key in [NULL, DUMMY] or entry.value is NULL:
                raise KeyError
            return entry.value

        def __delitem__(self, key):
            entry = self.lookup(key)
            if entry.key in [NULL, DUMMY] or entry.value is NULL:
                raise KeyError
            entry.value = NULL

        def resize(self):
            old_table = self.table
            self.table = [KVPair() for i in range(self.predict_size())]
            old_refs = self.refs
            self.refs = [NULL] * len(self.table)
            for ref, (key, value) in zip(old_refs, old_table):
                self[key] = value
                if ref is NULL:
                    continue
                # Update the ref.
                entry = self.lookup(key)
                index = self.table.index(entry)
                self.refs[index] = ref
                ref.pointer = entry

        def __del__(self):
            # with illustrative deletes
            for ref, entry in zip(self.refs, self.table):
                delete entry.key
                if ref is not NULL:
                    ref.pointer = entry.value
                    ref.indirect = False
                    delete ref
                else:
                    delete entry.value

    class RefCell:
        __slots__ = {
            'indirect': bool,
            'pointer': (KVPair | object),
        }

        def get(self):
            if self.indirect:
                value = self.pointer.value
            else:
                value = self.pointer
            if value is NULL:
                raise KeyError
            return value

        def __del__(self):
            if not self.indirect:
                delete self.pointer

== Pseudocode: Scope ancestry ==

Algorithm for ref creation:
(This is complex, because of the decision-making process of whether to create a RefCell.)

    def ChainScopeDict.getref(self, key, force_create: bool,
                              recursive: bool = True):
        # Try the normal getref.
        try:
            ref = super().getref(key, force_create=force_create)
        except KeyError:
            # We don't have this key.
            if not recursive:
                raise  # give up
            else:
                ref = None

        # We now know:
        assert recursive or ref is not None

        if recursive:
            # Try to find a parent with a refcell.
            for i, parent in enumerate(self.parents):
                try:
                    parent_ref = parent.getref(key,
                                               force_create=False,
                                               recursive=True)
                    good_parent = i
                    break
                except:
                    continue
            else:
                # No parent has a RefCell for this key.
                if ref is None:
                    assert not force_create
                    raise KeyError(key)
                else:
                    parent_ref = None
        else:
            assert ref is not None
            parent_ref = None

        assert parent_ref is not None or ref is not None

        if ref is None:
            assert parent_ref is not None and force_create is False
            ref = super().getref(key, force_create=True)

        # Hack to save on pseudocode:
        ref.__class__ = ChainRefCell

        if parent_ref is None:
            assert force_create or key in self
            # Create no parent refs. It will look up parents later.
            ref.parents = self.parents.copy()
            return ref

        # Create all refs up to the found parent_ref.
        # (Don't create refs for the later parents.)
        ref.parents = ([p.getref(key, force_create=True)
                        for p in self.parents[:good_parent]]
                       + [parent_ref]
                       + self.parents[good_parent + 1:])
        return ref

Algorithm for chain resolution:

    def ChainRefCell.get(self, recursive: bool):
        key = self.key
        try:
            # Own value.
            return super().get()
        except KeyError:
            if not recursive:
                # Don't search parents.
                raise

        # Index of the latest parent which is/has a RefCell.
        # We want to create intermediate RefCells for all things in between.
        last_parent_cell = 0
        for i, parent in enumerate(self.parents):
            # We want a parent RefCell, not a parent dict.
            if isinstance(parent, ScopeDict):
                try:
                    parent = parent.getref(key, force_create=False)
                except KeyError:
                    continue
                # Don't need the parent dict anymore.
                self.parents[i] = parent
            last_parent_cell = i  # This parent is now a refcell.

            # Try to get the parent to resolve.
            try:
                # `recursive=False` for class-like hierarchy; see below.
                value = parent.get(recursive=True)
            except KeyError:
                continue
            break
        else:
            # No parent has a value.
            value = NULL

        # Create refs in the parents which come before a refcell.
        # This prevents repeated failed lookups.
        for i, parent in enumerate(self.parents[:last_parent_cell]):
            if isinstance(parent, ScopeDict):
                self.parents[i] = parent.getref(key, force_create=True)

        if value is NULL:
            raise KeyError
        return value

== Pseudocode: Alternative RefCell structure ==

This allows us to have a reference to the key (for KeyError(key)) without taking up an extra bool for checking the dictionary.

    class RefCell:
        __slots__ = {
            '_key': (NULL | object),
            '_value': (KVPair | object),
        }
        """
        Two possibilities:
        1. _key == NULL and isinstance(_value, KVPair)
           => the dict is alive.
        2. isinstance(_key, object) and isinstance(_value, object)
           => this RefCell has been released from its dict.
        (KVPair and NULL are not (py-)objects.)
        So we use _key as a way to tell whether the dict is alive,
        and to tell whether _value points to the value or to the table entry.
        """

        def get(self):
            if self._key is NULL:
                # The pointer is to the KVPair.
                key = self._value.key
                value = self._value.value
            else:
                # The pointer is to the value.
                key = self._key
                value = self._value
            if value is NULL:
                raise KeyError(key)
            return value

        def __del__(self):
            if self._key is not NULL:
                delete self._key
                delete self._value
            # Otherwise, it doesn't own the references.

        def key(self):
            if self._key is NULL:
                # Look it up in the dict.
                return self._value.key
            else:
                return self._key

== Pseudocode: Thread safety ==

    for i, (ref, (k, v)) in enumerate(zip(oldrefs, oldtable)):
        # Remove refcells which only have one ref,
        # then remove keys without either value or refcell.
        # No need to lock a refcell with only one ref.
        isdead = self.remove_if_dead(i)
        if isdead:
            continue
        if ref is not NULL:
            lock(ref)
            ref._value = v
            ref._key = k
            unlock(ref)
        else:
            delete k
            delete v

On Saturday, 19 December 2015, Franklin? Lee <leewangzhong+python@gmail.com> wrote:
== Problem statement ==
I propose a CPython interpreter optimization for global name lookups.
Fans of micro-optimization are probably already using various hacks like `def func(len=len): ...` to avoid the lookup at runtime. There is also the option of rewriting bytecode to replace load_global with load_const (I don't recall the name of the PyPI project; it's a decorator). Serhiy also proposed to implement a new syntax to make the lookup when the function is defined.

It would be interesting to measure the cost of these lookups (e.g. the number of nanoseconds per lookup) and to get an idea of how often load_global lookups are used in the wild (as a ratio of the overall number of instructions).

Since the goal is a speedup, a working proof of concept is required to show that it works and that it's faster (on macro benchmarks?). Do you feel able to implement it?

As I already wrote, I'm not convinced that it's worth it. Your code looks more complex, will use more memory, etc. I don't think that load_global is common in hot code (the 10% taking 90% of the runtime). I expect effects on object lifetime which could be annoying.

I implemented an optimization in FAT Python that replaces builtin lookups with load_const. I have to enhance the code to make it safe, but it works and it doesn't require deep changes in the dict type.
http://faster-cpython.readthedocs.org/en/latest/fat_python.html#copy-builtin...

In short, it only adds a single integer per dict, incremented at each modification (create, modify, delete).

Victor
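For reference, both numbers Victor asks about can be approximated from pure Python. This is only a rough sketch: loop overhead is included in the timings (so compare the relative values only), the `globals=` argument to timeit needs Python 3.5+, and `json.dumps` is just an arbitrary example target for the opcode count.

    import dis
    import json      # an arbitrary example module to inspect
    import timeit
    import types

    # Rough nanoseconds per access.
    N = 10**6
    timings = [
        ('builtin', timeit.timeit('len', number=N)),                    # LOAD_GLOBAL: misses globals, hits builtins
        ('global',  timeit.timeit('g', number=N, globals={'g': len})),  # LOAD_GLOBAL: hits module globals
        ('local',   timeit.timeit('f', setup='f = len', number=N)),     # LOAD_FAST
    ]
    for name, t in timings:
        print('%-8s %6.1f ns per access' % (name, t / N * 1e9))

    # "In the wild" ratio: LOAD_GLOBAL vs. all instructions in a code object,
    # recursing into nested code objects.
    def opcode_stats(code):
        ops = [ins.opname for ins in dis.get_instructions(code)]
        loads, total = ops.count('LOAD_GLOBAL'), len(ops)
        for const in code.co_consts:
            if isinstance(const, types.CodeType):
                sub_loads, sub_total = opcode_stats(const)
                loads += sub_loads
                total += sub_total
        return loads, total

    loads, total = opcode_stats(json.dumps.__code__)
    print('json.dumps: %d LOAD_GLOBAL out of %d instructions' % (loads, total))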

On 19.12.2015 16:06, Victor Stinner wrote:
The effects are minimal and only show up in overall performance if the functions in question are used a lot.

I gave a talk about such optimizations last year at PyCon UK that you might want to have a look at:
http://www.egenix.com/library/presentations/PyCon-UK-2014-When-performance-m...

Slide 38 has the details about such lookup optimization tricks:
https://downloads.egenix.com/python/PyCon-UK-2014-When-performance-matters-T...
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Dec 19 2015)
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/

I messed up the links. Well, I learned something about mailman. On Sat, Dec 19, 2015 at 10:06 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
I mentioned the hack, and said this would make it unnecessary. I also mentioned how the hack bypasses dynamic lookup for the sake of performance, while this proposal keeps the effect of dynamic lookup. It could possibly be faster than the hack (since you don't have to load a default arg).

Rewriting bytecode has the same issues, and adds overhead during compilation, too. My idea might have less overhead at compilation, since you'd have to do the lookup in either case. It would improve all functions which use globals, and doesn't require the programmer to bypass dynamic lookup. No new syntax > new syntax. Not having to use a hack > using a hack.
It would be a project. I've studied some of the dict code, but I'd still have to look at OrderedDict and how it differs. I'd have to hack at the interpreter level. Worse, I'm not very experienced with C, and have no real "development" experience, so I'd struggle with just compiling CPython. But I'll definitely try, because I won't get anywhere by hoping someone else implements what I think of.

Theoretically, using a RefCell is definitely faster than actually doing the lookups (what currently happens). The extra level of indirection might result in CPU cache misses, but the dictionary lookup it replaces is much more likely to do so.
At the least, it would relieve Python's mental burden: "Don't use global names in tight loops." That would be worth it. (It would also mean that you wouldn't need to implement guards, or have a "slow path" with lookups.) Even if the gains are minimal, they're gains that people work for, so this would save them work.

The idea is not very complex: "Know where the value would live, and you will always be able to find it." The complexity comes from:
- The complexity of dicts. Much of the code is simply managing the syncing between the dict and the refs table.
- The complexity of scope relationships.
- The complexity of allowing scopes to be replaced. (`globals()['__builtins__'] = {}`)
- The complexity of cells which will live on after their owner dict.
- Exceptions and error messages.
- Preserving dict performance. For example, if normal dicts weren't a concern, the ref table would be built into dict's internal table.

Most of the complexity is in trying to replicate the existing complexity. Otherwise, we could just implement it as, "The inner dict holds a length-1 list which holds the value, or NULL. Make sure to check that it's not NULL."

The cost of resolving a (not-chained) RefCell: call a function, which dereferences a pointer and checks whether it's NULL. If it's not, return it. It's pretty cheap. I don't know why you think it'd possibly be slower than the current way.

Memory... That will have to be seen, but it would be a function of the combined sizes of the module dicts, which shouldn't be a big part of the program. Implementing compact dicts will also make the refs table smaller.

Module loading will gain costs, since the names used in each module require lookups at compile-time instead of runtime, and many functions in a module can go unused. This needs to be explored. It's feasible to have the interpreter delay the lookups to first call.
By contrast: I don't require any safety checks, I can replace `print` without incurring additional lookups, and I am insensitive to other changes in the dict (unless the dict resizes). I add a few pointers to the dict (probably smaller than a PyIntObject), and the refs table won't exist unless it's needed.

On Dec 19, 2015, at 08:33, Franklin? Lee <leewangzhong+python@gmail.com> wrote:
It can't possibly be faster, unless you never actually use the len function (in which case it would be pretty stupid to optimize with len=len). Loading a default value is just copying from one array to another at call time, with pseudocode like "frame.localsplus[3] = function.defaults[1]". That takes about as much time as two locals lookups (and then turns every builtin lookup into a local lookup). Meanwhile, your idea replaces each of those lookups with a RefCell lookup plus a dereference, something like "frame.code.consts[1].value" instead of "frame.localsplus[3]". Loading off consts is about as fast as loading off locals, but that extra cell dereference step can't possibly be free, and in fact will be on the same order as the array lookup.

As a rough guess, since your RefCell system is pretty close to what already happens for closure cells, it'll probably be about as fast as closure lookups--which are significantly faster than global lookups, meaning the optimization would help, but not as fast as local lookups, meaning the optimization would not help as much as the default value hack. And that seems like an almost unavoidable cost of keeping it dynamic.

The FAT idea avoids that cost by faking the dynamic nature: it caches the value as a const, but the cache is wiped out if the value changes and the guard is tripped. So in cases where you don't ever change len, it's nearly as fast as the default value hack, but if you do change len, it's as slow as a traditional global lookup (because, in fact, it's just falling back to the unoptimized code that does a traditional global lookup).

And, since you asked about PyPy, I'd be willing to bet that what it does is a lot closer to FAT than to your idea.
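A rough way to sanity-check that ordering without touching the interpreter is sketched below; call overhead dominates the timings, so treat the numbers as relative only, and use the disassembly to confirm which load opcode each variant actually compiles to.

    import dis
    import timeit

    def use_global():
        return len                    # LOAD_GLOBAL (falls through to builtins)

    def make_closure():
        len_ = len
        def use_closure():
            return len_               # LOAD_DEREF: a closure-cell dereference
        return use_closure

    def use_default(len=len):
        return len                    # LOAD_FAST: the default-argument hack

    use_closure = make_closure()

    for f in (use_global, use_closure, use_default):
        dis.dis(f)                    # confirm which load opcode is used
        t = timeit.timeit(f, number=10**6)
        print('%-12s %.3f s per million calls' % (f.__name__, t))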
But more complex under-the-covers implementation < simple implementation. Unless it actually significantly speeds up a reasonable amount of real-life code whose speed matters, that's not worth it. I suppose one way you could estimate the potential benefit is to try to figure out how common the default hack is in real life. If people need to use it frequently in production code (which presumably means there's other production code that could be benefitting from it but the programmer didn't realize it), then making it unnecessary is a real win.
Something you might want to consider doing first: write a pure-Python RefCell, RefCellDict, and RefCellChainDict. Write some code that explicitly uses a ChainMap to look things up instead of using globals and builtins. Then rewrite it as equivalent code that manually sets up a list of RefCells out of a RefCellChainDict and then uses the list. You can simulate "global" and "builtin" lookups and mutations, and see when the optimization makes things faster. If it often makes a big difference, that should inspire you to do the hard work in C. Also, it means you have real code instead of pseudocode to show off and let people play with.

In fact, you could even write a pure-Python optimizer out of this without inspecting the bytecode: for each def/lambda/etc. node, compile the AST normally, get the co_names, then rewrite the AST to build and use a list of RefCells, then compile the module and replace its dict with a RefCellDict, and now all globals in the module use RefCell lookups. (Making that work for builtins might be trickier, because other modules can change them... But for a first pass, you can either leave builtins out, or assume that no code changes them after startup and throw in some asserts for tests that break that assumption.) That would allow you to run realistic benchmarks. (If you're worried that the pure-Python cell and dict objects are too slow, Cython should help there.) And it might even be a useful project on its own if it could optimize real-life code in CPython 3.5 and 2.7, PyPy, Jython, etc.
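A first pass at that experiment might look like the following sketch. `Cell` is a stand-in for the proposed RefCell, and because it caches values rather than live dict entries, it only models the lookup cost, not rebinding; the names `builtins_ns`, `globals_ns`, and the functions below are all invented for the sketch.

    import timeit
    from collections import ChainMap

    class Cell:
        __slots__ = ('value',)
        def __init__(self, value):
            self.value = value

    builtins_ns = {'len': len, 'sum': sum}
    globals_ns = {'data': list(range(100))}
    scope = ChainMap(globals_ns, builtins_ns)       # simulated globals -> builtins

    # "Compiled" once, the way the proposal would fetch RefCells at definition time:
    cells = [Cell(scope[name]) for name in ('sum', 'len', 'data')]

    def via_chainmap():
        return scope['sum'](scope['len'](scope['data']) for _ in range(10))

    def via_cells(cells=cells):
        c_sum, c_len, c_data = cells
        return c_sum.value(c_len.value(c_data.value) for _ in range(10))

    assert via_chainmap() == via_cells()
    for f in (via_chainmap, via_cells):
        print('%-13s %.3f s' % (f.__name__, timeit.timeit(f, number=100000)))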

On Sat, Dec 19, 2015 at 9:10 AM, Franklin? Lee <leewangzhong+python@gmail.com> wrote:
Dictionary References: An Idea in Three Emails *snip*
I found some earlier threads while looking for info on PyPy's "celldicts.py". (I knew it couldn't have been a new idea!) I'm taking a break from reading them, but I was writing some notes, and the fact that they exist should be known early.

====
Subject: [Python-Dev] Accessing globals without dict lookup
From: Guido van Rossum
Date: Fri, 08 Feb 2002 11:50:31 -0500
Link: https://mail.python.org/pipermail/python-dev/2002-February/019846.html
PEP: https://www.python.org/dev/peps/pep-0280/

It goes into many of the same details I had worked out this week, including
- pre-creation of empty cells
- globals() > __builtins__ relationship
- can't figure out a use for deeper relationships
- Tim Peters lays out an idea of "celldict pointing to a realdict", which is similar to how my refs table is a dict parallel to the original dict, except that it being part of the same object means that I can do a single lookup to find both an entry and its cell.
  (https://mail.python.org/pipermail/python-dev/2002-February/019893.html)

It also has a basic implementation in Python. (Victor, maybe you'd like reading that one better.)

The PEP, along with PEP 266 and PEP 267, was deferred because "no-one has yet emerged to do the work of hashing out the differences between [them.]"

Key differences:
- He stores cells as values. I store them parallel to values, to reduce overhead on normal dict operations.
- He uses a subclass. I want to make it the default, and dynamically convert from normal dict to celldict and back.
- He packs __builtins__ cells into module cells when the module is created. I am lazier: a module cell will only request a __builtins__ cell when it fails to resolve to a value.
- He converts the lookups to indexed lookups, with new corresponding bytecode ops. I... don't actually know how to get the interpreter to use the cells.
- Since he uses a subclass, he doesn't have functions load cells from the module if the functions have a regular dict as a globals() scope. I want a function to always be able to request cells from its globals(), even if it is a user dict.
- He uses a single-parent relationship: a cell either has a parent or it doesn't. I allow for multiple parents, but since I've already realized that scopes don't have multiple direct parents, let's take away the loop and go with a single optional parent.

On semantics:
"""
I think this faithfully implements the current semantics (where a global can shadow a builtin), but without the need for any dict lookups when accessing globals, except in cases where an explicit dict is passed to exec or eval().
"""

Tim Peters points out how it might be more expensive than a lookup.
(https://mail.python.org/pipermail/python-dev/2002-February/019874.html)
"""
Note that a chain of 4 test+branches against NULL in "the usual case" for builtins may not be faster on average than inlining the first few useful lines of lookdict_string twice (the expected path in this routine became fat-free for 2.2):

    i = hash;
    ep = &ep0[i];
    if (ep->me_key == NULL || ep->me_key == key)
        return ep;

Win or lose, that's usually the end of a dict lookup. That is, I'm certain we're paying significantly more for layers of C-level function call overhead today than for what the dict implementation actually does now (in the usual cases).
"""

He also wanted to optimize the globals() -> __builtins__ relationship by loading the __builtins__ value into the globals() cell as a fast path, with an expensive notification to all modules' cells when a name in __builtins__ changes.
(https://mail.python.org/pipermail/python-dev/2002-February/019904.html)

====
Subject: [Python-ideas] Fast global cacheless lookup
From: Neil Toronto
Date: Thu Nov 22 16:40:49 CET 2007
Link: https://mail.python.org/pipermail/python-ideas/2007-November/001212.html

A proposal to modify `dict` instead of subclassing.
"""
What if a frame could maintain an array of pointers right into a dictionary's entry table? A global lookup would then consist of a couple of pointer dereferences, and any value change would show up immediately to the frame.
"""

Unfortunately, he makes the dict responsible for notifying all registered functions, instead of using a ref-counted indirection. So his resizes are an additional O(len(functions)), while mine are an additional O(len(table) + #(refs)).

Subject: [Python-Dev] PATCH: Fast globals/builtins lookups for 2.6
From: Neil Toronto
Date: Thu Nov 29 11:26:37 CET 2007
Link: http://mail.python.org/pipermail/python-ideas/2007-November/001212.html

The patch. He ended up using versioning, because the notification was expensive. (*Possibly* not as much of an issue for me.) The patch slightly changed the behavior of the __builtins__ dict itself.
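The "single integer per dict, incremented at each modification" guard that Victor's FAT Python and Neil Toronto's patch describe can be simulated in a few lines. This is only a sketch: `VersionedDict` and `CachedLookup` are names invented here, it ignores mutators like update() and pop(), and the real implementations hook the C dict type rather than subclassing it.

    class VersionedDict(dict):
        """dict with a version counter bumped on (some) mutations."""
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.version = 0
        def __setitem__(self, key, value):
            super().__setitem__(key, value)
            self.version += 1
        def __delitem__(self, key):
            super().__delitem__(key)
            self.version += 1

    class CachedLookup:
        """Caches d[key]; revalidates only when the dict's version changes."""
        def __init__(self, d, key):
            self.d, self.key = d, key
            self.cached_version = -1
            self.cached_value = None
        def get(self):
            if self.d.version != self.cached_version:   # guard tripped: redo the lookup
                self.cached_value = self.d[self.key]
                self.cached_version = self.d.version
            return self.cached_value

    ns = VersionedDict(len=len)
    cached_len = CachedLookup(ns, 'len')
    assert cached_len.get() is len        # first call does the lookup
    assert cached_len.get() is len        # later calls only compare integers
    ns['len'] = max                       # any write bumps the version...
    assert cached_len.get() is max        # ...so the cache revalidates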

Dictionary References: An Idea in Three Emails Act 2: More stuff Table of Contents: Act 1: Stuff https://mail.python.org/pipermail/python-ideas/2015-December/037511.html - Definitions - Problem statement - The idea - The benefits - The costs - What will hold a RefCell? - Add `getref` to `dict` - Implementation issues - Interpreters other than CPython Act 2: More stuff https://mail.python.org/pipermail/python-ideas/2015-December/037512.html - Extension: Scope ancestry - Aside: Classes and scopes do not have the same rules - Extension: Read/write refs - Extension: Read-only refs - Option: Alternative RefCell structure - Option: Replace the methods when detaching a RefCell - Extension: Dynamic scoping - Extension: Thread safety Act 3: Sample pseudocode https://mail.python.org/pipermail/python-ideas/2015-December/037513.html - ScopeDict - Scope ancestry - Alternative RefCell structure - Thread safety == Extension: Scope ancestry == It's possible for lookups in one scope to fall back into another. In other words, any failed lookup in scope A will result in a lookup in scope B. I will call B a parent scope of A. For example: - Names in functions will be local if defined there, and global otherwise. - If function f creates function g, then g's scope includes f's. - Object attribute lookups go through instance -> class -> [superclasses]. - Global scope includes builtins. The first two are irrelevant. Those decisions are made at compile-time, so there's no need for optimization. I don't want to touch the third one. I'm told that it's already cached, and I'm not sure refcells would help. We're left with globals and builtins. Any lookup in globals() can mean a lookup in __builtins__. So consider __builtins__ as a parent of globals(). Say we have a parentScope = ScopeDict() and a childScope = ChainScopeDict(). The childScope will have a list of its parents: childScope.parents = [parentScope]. When a ref = ChainRefCell() is created, it will also be given a ref.parents = [parentScope]. On resolution, it will ask its parent dicts for values. This requires holding a PyRef to the key. It might also request RefCells from the parents. [See "Pseudocode: Scope ancestry"] == Extension: Replacing a parent == A parent scope might be replaced after ChainRefCells have been created. In particular, globals()['__builtins__'] can be replaced. Fortunately, globals() will know when that particular key is changed. I suggest that globals() goes through all of its ChainRefCells and updates their .parents to the new __builtins__ (which invalidates parent refcells). This will mean iterating through the whole refs array every time __builtins__ is replaced. So if you keep replacing __builtins__, you're going to have a bad time. In other cases where parents can be replaced, we can do the same kind of notification. Say ChainScopeDicts A < B < C is a relation (B is a child scope of A, so B is bigger than A), and we replace B's parent A with D. Then B notifies its RefCells, and C doesn't need to be notified directly. (This does NOT work for class hierarchies. See next section.) == Aside: Classes and scopes do not have the same rules == Take this class hierarchy: class A(object): pass class B(object): pass class C(A, B): pass obj = C() Then a lookup in C will NOT cause recursive lookups in A and B. 
In other words, the family tree look like this: obj: obj -> C -> A -> B -> object C: C -> A -> B -> object A: A -> object B: B -> object No matter how deep the class hierarchy goes, the scope hierarchy for classes and objects is at most two levels deep: instance -> class -> [superclasses]. This is due to Method Resolution Order. It means that a class's direct scope parents are all of its superclasses. This also means that the notification idea in the previous section won't work. If a class changes its superclass, it doesn't just need to notify its own cells. It needs to notify all of its subclasses (though not instances), because they hold pyreferences to its original superclass. This can be solved by versioning: But let's not get into that when it's not even clear that attribute lookup needs this optimization. == Extension: Read/write refs == Python functions can write to globals by declaring, for example, `global x`. It's straightforward to allow myRefCell.set(val) and myRefCell.delete(). As a method of `dict`, this would slightly improve the memoization idiom. Before: memo = {} def fib(n): if n < 2: return n try: return memo[n] except: result = memo[n] = fib(n-1) + fib(n-2) # ^Second lookup return result After: memo = {} def fib(n): if n < 2: return n ref = memo.getref(n) try: return ref.get() except: ref.set(fib(n-1) + fib(n-2)) # No second lookup return ref.get() Also, "It's better to ask for forgiveness" might no longer be true with a method that checks for existence. memo = {} def fib(n): if n < 2: return n ref = memo.getref(n) if ref.empty(): ref.set(fib(n-1) + fib(n-2)) return ref.get() This allows modification of entries even after the scope dies. It's like closures, except exactly the same as closures. == Extension: Read-only refs == Some optimizations might be possible if a dict knows that it will be notified for every change, through setitem/delitem. For example, if we make dict1 = dict0.copy(), then this can be a shallow copy, and whichever dict modifies first will split off from the other. This is not possible with the existence of RefCells that can .set(). (See previous section.) Two non-exclusive solutions: 1. Distinguish read-only RefCells and read-write RefCells. Have a dict keep track of whether it has read-write RefCells. 2. Make Refcell.set notify its owner dict. This requires a pointer to the owner dict. (NOT a pyreference, since the pointer is only useful as long as the owner is alive, and the owner can notify its RefCells when it dies.) == Option: Alternative RefCell structure == The RefCell I defined owns a pointer to the dict entry or a pointer to the value, and a flag to determine whether it's part of a "living" dict. Instead, we can hold two pointers, one which MIGHT point to `key`. The `key` pointer will do double duty by telling the RefCell whether it's part of a living dict (by being NULL). With this, when we call .get(), it will raise a KeyError with the correct key. Otherwise, it would have to raise a generic KeyError, since it doesn't know the key. [See "Pseudocode: Alternative RefCell structure"] == Option: Replace the methods when detaching a RefCell == Instead of figuring out whether it's pointing to a table entry or a value itself, we can simply replace the RefCell's member functions. They will only change once, since the owner can only die once. == Extension: Dynamic scoping == In (C?)Python 3, `exec` can't change scoping. It can't even change local variables. 
x = 'global x' y = 'global y' def f(): x = 10 exec("x = 'local x'") exec("y = 'local y'") print(x) # prints 10 print(y) # prints 'global y' f() print(x) #prints 'global x' The docs say: """ The default locals act as described for function locals() below: modifications to the default locals dictionary should not be attempted. Pass an explicit locals dictionary if you need to see effects of the code on locals after function exec() returns. """ It's possible, if we wanted, to change how this works, and have nested function scoping behave like globals > builtins (that is, dynamically). I'm not sure if it's even desirable, though. (Function scope ancestry is linear, fortunately. No multiple inheritance, so no diamond inheritance problem, so there's no need for Python's MRO, so we wouldn't have the same issues as with classes.) Steven D'Aprano talks more about it here: https://mail.python.org/pipermail/python-dev/2015-December/142511.html == Extension: Thread safety == (Note: Needs an expert to double-check.) RefCells might need locks, because they have to determine whether the dict is alive to determine where their value is. The owner dict doesn't need to care about that, since it knows it's alive. When a dict deletes, it only needs to hold a lock on one RefCell at a time. [See "Pseudocode: Thread safety"]

Dictionary References: An Idea in Three Emails Act 3: Sample pseudocode Table of Contents: Act 1: Stuff https://mail.python.org/pipermail/python-ideas/2015-December/037511.html - Definitions - Problem statement - The idea - The benefits - The costs - What will hold a RefCell? - Add `getref` to `dict` - Implementation issues - Interpreters other than CPython Act 2: More stuff https://mail.python.org/pipermail/python-ideas/2015-December/037512.html - Extension: Scope ancestry - Aside: Classes and scopes do not have the same rules - Extension: Read/write refs - Extension: Read-only refs - Option: Alternative RefCell structure - Option: Replace the methods when detaching a RefCell - Extension: Dynamic scoping - Extension: Thread safety Act 3: Sample pseudocode https://mail.python.org/pipermail/python-ideas/2015-December/037513.html - ScopeDict - Scope ancestry - Alternative RefCell structure - Thread safety == Pseudocode: ScopeDict == The CPython dict implementation looks like this (simplified): class KVPair: __slots__ = { 'key': object | DUMMY | NULL, 'value': object | NULL, } class dict: __slots__ = { 'table': List[KVPair], 'size': int, } def lookup(self, key): # returns the entry, or an empty entry ... def __getitem__(self, key): entry = self.lookup(key) if entry.key in [NULL, DUMMY]: raise KeyError return entry.value def __setitem__(self, key, value): entry = self.lookup(key) if entry.key in [NULL, DUMMY]: entry.key = key entry.value = value self.maybe_resize() def __delitem__(self, key): entry = self.lookup(key) if entry.key in [NULL, DUMMY]: raise KeyError entry.key = DUMMY entry.value = NULL self.maybe_resize() def resize(self): old_table = self.table self.table = [KVPair() for i in range(self.predict_size())] for k, v in old_table: self[k] = v I want to add a `refs` member to hold the RefCells, and a `getref` function to acquire RefCells. (I make redundant lookups for demonstration. They are unnecessary in C.) (In C, self.table.index(entry) is a simple pointer subtraction.) (In the actual implementation, we could save a direct pointer to the KVPair's value pointer.) class ScopeDict(dict): __slots__ = { 'refs': List[RefCell | NULL], } def getref(self, key, force_create: bool): entry = self.lookup(key) if entry.key in [NULL, DUMMY]: if force_create: entry.key = key entry.value = NULL else: raise KeyError # Get the index. 
i = self.table.index(entry) cell = self.refs[i] = RefCell() cell.indirect = True cell.pointer = entry def __getitem__(self, key): entry = self.lookup(key) if entry.key in [NULL, DUMMY] or entry.value is NULL: raise KeyError return entry.value def __delitem__(self, key): entry = self.lookup(key) if entry.key in [NULL, DUMMY] or entry.value is NULL: raise KeyError entry.value = NULL def resize(self): old_table = self.table self.table = [KVPair() for i in range(self.predict_size())] old_refs = self.refs self.refs = [NULL] * len(self.table) for ref, (key, value) in zip(old_refs, old_table): self[key] = value if ref is NULL: continue # Update the ref entry = self.lookup(key) index = self.table.index(entry) self.refs[index] = ref ref.pointer = entry def __del__(self): # with illustrative deletes for ref, entry in zip(self.refs, self.table): delete entry.key if ref is not NULL: ref.pointer = entry.value ref.indirect = False delete ref else: delete entry.value class RefCell: __slots__ = { 'indirect': bool, 'pointer': (KVPair | object), } def get(self): if self.indirect: value = self.pointer.value else: value = self.pointer if value is NULL: raise KeyError return value def __del__(self): if not self.indirect: delete self.pointer == Pseudocode: Scope ancestry == Algorithm for ref creation: (This is complex, because of the decision-making process of whether to create a RefCell.) def ChainScopeDict.getref(self, key, force_create: bool, recursive: bool = True): # Try the normal getref. try: ref = super().getref(key, force_create=force_create) except KeyError: # We don't have this key. if not recursive: raise #give up else: ref = None # We now know: assert recursive or ref is not None if recursive: # Try to find a parent with a refcell for i, parent in enumerate(self.parents): try: parent_ref = parent.getref(key, force_create=False, recursive=True) good_parent = i break except: continue else: # No parent has a RefCell for this key if ref is None: assert not force_create raise KeyError(key) else: parent_ref = None else: assert ref is not None parent_ref = None assert parent_ref is not None or ref is not None if ref is None: assert parent_ref is not None and force_create is False ref = super().getref(key, force_create=True) # Hack to save on pseudocode: ref.__class__ = ChainRefCell if parent_ref is None: assert force_create or key in self # Create no parent refs. It will look up parents later. ref.parents = self.parents.copy() return ref # Create all refs up to the found parent_ref. # (Don't create refs for the later parents.) ref.parents = ([p.getref(key, force_create=True) for p in self.parents[:good_parent]] + [parent_ref] + self.parents[good_parent + 1:]) return ref Algorithm for chain resolution: def ChainRefCell.get(self, recursive: bool): key = self.key try: # Own value. return super().get() except KeyError: if not recursive: # Don't search parents. raise pass # Index of latest parent which is/has a RefCell. # We want to create intermediate RefCells for all things in between. last_parent_cell = 0 for i, parent in enumerate(self.parents): # We want a parent RefCell, not a parent dict if isinstance(parent, ScopeDict): try: parent = parent.getref(key, force_create=False) except KeyError: continue # Don't need the parent dict anymore. self.parents[i] = parent last_parent_cell = i # This parent is now a refcell. # Try to get the parent to resolve. try: # `recursive=False` for class-like hierarchy; # see below value = parent.get(recursive=True) except KeyError: continue else: # No parent has a value. 
value = NULL # Create refs in the parents which come before a refcell. # This prevents repeated failed lookups. for i, parent in enumerate(self.parents[:last_parent_cell]): if isinstance(parent, ScopeDict): self.parents[i] = parent.getref(key, force_create=True) if value is NULL: raise KeyError return value == Pseudocode: Alternative RefCell structure == This allows us to have a reference to the key (for KeyError(key)) without taking up an extra bool for checking the dictionary. class RefCell: __slots__ = { '_key': (NULL | object), '_value': (KVPair | object), } """ Two possibilities: 1. _key == NULL and isinstance(_value, KVPair) => the dict is alive. 2. isinstance(_key, object) and isinstance(_value, object) => this RefCell has been released from its dict. (KVPair and NULL are not (py-)objects.) So we use _key as a way to tell whether the dict is alive, and to tell whether _value points to the value or to the table entry. """ def get(self): if self._key is NULL: # The pointer is to the KVPair. key = self._value.key value = self._value.value else: # The pointer is to the value. key = self._key value = self._value if value is NULL: raise KeyError(key) return value def __del__(self): if self._key is not NULL: delete self._key delete self._value # Otherwise, it doesn't own the references. def key(self): if self._key is NULL: # Look it up in the dict. return self._value.key else: return self._key == Pseudocode: Thread safety == for i, (ref, (k, v)) in enumerate(zip(oldrefs, oldtable)): # Remove refcells which only have one ref, # then remove keys without either value or refcell. # No need to lock a refcell with only one ref. isdead = self.remove_if_dead(i) if isdead: continue if ref is not NULL: lock(ref) ref._value = v ref._key = k unlock(ref) else: delete k delete v

Le samedi 19 décembre 2015, Franklin? Lee <leewangzhong+python@gmail.com <javascript:_e(%7B%7D,'cvml','leewangzhong%2Bpython@gmail.com');>> a écrit :
== Problem statement ==
I propose a CPython interpreter optimization for global name lookups.
Fans of micro optimisation are probably already using various hacks like func(len=len): ... to avoid the lookup at runtime. There is an option rewriting bytecode to replace load_global with load_const (I don't recall the name of the PyPI project, it's a decorator). Serhiy also proposed to implement a new syntax to make the lookup when the function is defined. It would be interesting to mesure the cost of these lookups(ex: number of nanoseconds per lookup) and have an idea on how much load_global lookups are used in the wild (ratio on the overall number of instructions). Since the goal is a speedup, a working proof of concept is required to show that it works and it's faster (on macro benchmarks?). Do you feel able to implement it? As I already wrote, I'm not convinced that it's worth it. Your code looks more complex, will use more memory, etc. I don't think that load_global is common in hot code (the 10% taking 90% of the runtime). I expect effetcs on object lifetime which could be annoying. I implemented an optimization in FAT Python that replaces builtin lookups with load_const. I have to enhance the code to make it safe, but it works and it doesn't require deep changes in dict type. http://faster-cpython.readthedocs.org/en/latest/fat_python.html#copy-builtin... In short, it only adds a single integer per dict, incremented at each modification (create, modify, delete). Victor

On 19.12.2015 16:06, Victor Stinner wrote:
The effects are minimal and only show up in overall performance if the functions in question are used a lot.

I gave a talk about such optimizations at PyCon UK last year which you might want to have a look at: http://www.egenix.com/library/presentations/PyCon-UK-2014-When-performance-m...

Slide 38 has the details about such lookup optimization tricks: https://downloads.egenix.com/python/PyCon-UK-2014-When-performance-matters-T...
-- Marc-Andre Lemburg, eGenix.com

I messed up the links. Well, I learned something about mailman. On Sat, Dec 19, 2015 at 10:06 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
I mentioned the hack, and said this idea would make it unnecessary. I also mentioned how the hack bypasses dynamic lookup for the sake of performance, while this idea keeps dynamic lookup. It could possibly be faster (since you don't have to load a default arg).

Rewriting bytecode has the same issues, and adds overhead during compilation, too. My idea might add less overhead at compile time, since you'd have to do the lookup in either case. It would improve all functions which use globals, and doesn't require the programmer to bypass dynamic lookup.

No new syntax > new syntax. Not having to use a hack > using a hack.
It would be a project. I've studied some of the dict code, but I'd still have to look at OrderedDict and how it differs. I'd have to hack at the interpreter level. Worse, I'm not very experienced at C, and have no real "development" experience, so I'd struggle with just compiling CPython. But I'll definitely try, because I won't get anywhere by hoping someone else implements what I think of.

Theoretically, using a RefCell is definitely faster than actually doing the lookups (which is what currently happens). The extra level of indirection might result in CPU cache misses, but the dictionary lookup it replaces is much more likely to do so.
At the least, it would relieve Python programmers of a mental burden: "Don't use global names in tight loops." That would be worth it. (It would also mean that you wouldn't need to implement guards, or have a "slow path" with lookups.) Even if the gains are minimal, they're gains that people currently work for, so this would save them work.

The idea is not very complex: "Know where the value would live, and you will always be able to find it." The complexity comes from:
- The complexity of dicts. Much of the code is simply managing the syncing between the dict and the refs table.
- The complexity of scope relationships.
- The complexity of allowing scopes to be replaced. (`globals()['__builtins__'] = {}`)
- The complexity of cells which will live on after their owner dict.
- Exceptions and error messages.
- Preserving dict performance. For example, if normal dicts weren't a concern, the ref table would be built into the dict's internal table.

Most of the complexity is in trying to replicate the existing complexity. Otherwise, we could just implement it as, "The inner dict holds a length-1 list which holds the value, or NULL. Make sure to check that it's not NULL." (A rough sketch of that simplified form appears at the end of this message.)

The cost of resolving a (not-chained) RefCell: call a function, which dereferences a pointer and checks whether it's NULL. If it's not, return it. It's pretty cheap. I don't know why you think it'd possibly be slower than the current way.

Memory: that will have to be seen, but it would be a function of the combined sizes of the module dicts, which shouldn't be a big part of the program. Implementing compact dicts would also make the refs table smaller.

Module loading will gain costs, since the names used in each module would require lookups at compile time instead of runtime, and many functions in a module can go unused. This needs to be explored. It's feasible to have the interpreter delay the lookups until the first call.
To contrast: I don't require any safety checks, I can replace `print` without incurring additional lookups, and I am insensitive to other changes in the dict (unless the dict resizes). I add a few pointers to dict (probably smaller than a PyIntObject), and the refs table won't exist unless it's needed.
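Here is that rough sketch of the simplified "length-1 list as a cell" form in plain Python. CellDict, getref, and _NULL are illustrative names rather than the proposal's API; the point is only that a cell taken once keeps tracking the current binding.

_NULL = object()  # stand-in for the C-level NULL

class CellDict:
    def __init__(self):
        self._cells = {}            # name -> [value or _NULL]

    def getref(self, key):
        # Always returns a cell; the name may not be bound yet.
        return self._cells.setdefault(key, [_NULL])

    def __setitem__(self, key, value):
        self.getref(key)[0] = value

    def __getitem__(self, key):
        cell = self._cells.get(key)
        if cell is None or cell[0] is _NULL:
            raise KeyError(key)
        return cell[0]

    def __delitem__(self, key):
        cell = self._cells.get(key)
        if cell is None or cell[0] is _NULL:
            raise KeyError(key)
        cell[0] = _NULL             # keep the cell so held refs stay valid


g = CellDict()
len_ref = g.getref('len')           # taken once, e.g. at compile time
g['len'] = len
assert len_ref[0]([1, 2, 3]) == 3   # later reads see the current binding
g['len'] = lambda x: 0
assert len_ref[0]([1, 2, 3]) == 0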

On Dec 19, 2015, at 08:33, Franklin? Lee <leewangzhong+python@gmail.com> wrote:
It can't possibly be faster, unless you never actually use the len function (in which case it would be pretty stupid to optimize with len=len). Loading a default value is just copying from one array to another at call time, with pseudocode like "frame.localsplus[3] = function.defaults[1]". That takes about as much time as two locals lookups (and then turns every builtin lookup into a local lookup). Meanwhile, your idea replaces that local lookup with a RefCell load plus a dereference, something like "frame.code.consts[1].value" instead of "frame.localsplus[3]". Loading off consts is about as fast as loading off locals, but that extra cell dereference step can't possibly be free, and in fact will be on the same order as the array lookup.

As a rough guess, since your RefCell system is pretty close to what already happens for closure cells, it'll probably be about as fast as closure lookups, which are significantly faster than global lookups, meaning the optimization would help, but not as fast as local lookups, meaning the optimization would not help as much as the default value hack. And that seems like an almost unavoidable cost of keeping it dynamic.

The FAT idea avoids that cost by faking the dynamic nature: it caches the value as a const, but the cache is wiped out if the value changes and the guard is tripped, so in cases where you don't ever change len it's nearly as fast as the default value hack, but if you do change len it's as slow as a traditional global lookup (because, in fact, it's just falling back to the unoptimized code that does a traditional global lookup).

And, since you asked about PyPy, I'd be willing to bet that what it does is a lot closer to FAT than to your idea.
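For anyone who wants rough numbers on that comparison, a small timeit script along these lines would do. The absolute times depend on the machine and Python version, and the three functions below are just stand-ins for "global lookup", "default-arg hack", and "closure-cell dereference".

import timeit

def use_global(lists):
    return sum(len(lst) for lst in lists)        # LOAD_GLOBAL on each call

def use_default(lists, len=len):
    return sum(len(lst) for lst in lists)        # fast local lookup

def make_closure():
    _len = len
    def use_closure(lists):
        return sum(_len(lst) for lst in lists)   # closure-cell dereference
    return use_closure

use_closure = make_closure()

data = [[0] * 10 for _ in range(100)]
for fn in (use_global, use_default, use_closure):
    t = timeit.timeit(lambda: fn(data), number=20000)
    print('%-12s %.3fs' % (fn.__name__, t))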
But more complex under-the-covers implementation < simple implementation. Unless it actually significantly speeds up a reasonable amount of real-life code whose speed matters, that's not worth it. I suppose one way you could estimate the potential benefit is to try to figure out how common the default hack is in real life. If people need to use it frequently in production code (which presumably means there's other production code that could be benefitting from it but the programmer didn't realize it), then making it unnecessary is a real win.
Something you might want to consider doing first: write a pure-Python RefCell, RefCellDict, and RefCellChainDict. Write some code that explicitly uses a ChainMap to look things up instead of using globals and builtins. Then rewrite it as equivalent code that manually sets up a list of RefCells out of a RefCellChainDict and then uses the list. You can simulate "global" and "builtin" lookups and mutations, and see when the optimization makes things faster. If it often makes a big difference, that should inspire you to do the hard work in C. Also, it means you have real code instead of pseudocode to show off and let people play with.

In fact, you could even write a pure-Python optimizer out of this without inspecting the bytecode: for each def/lambda/etc. node, compile the AST normally, get the co_names, then rewrite the AST to build and use a list of RefCells, then compile the module and replace its dict with a RefCellDict, and now all globals in the module use RefCell lookups. (Making that work for builtins might be trickier, because other modules can change that... But for a first pass, you can either leave builtins out, or assume that no code changes it after startup and throw in some asserts for tests that break that assumption.) That would allow you to run realistic benchmarks. (If you're worried that the pure-Python cells and dicts are too slow, Cython should help there.)

And it might even be a useful project on its own if it could optimize real-life code in CPython 3.5 and 2.7, PyPy, Jython, etc.
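One possible starting shape for such a prototype, under the assumption that plain Python objects are close enough for a first benchmark. The class names mirror the suggestion above, but everything here is a placeholder sketch rather than anything from the earlier pseudocode.

_MISSING = object()

class RefCell:
    __slots__ = ('value',)
    def __init__(self):
        self.value = _MISSING

class RefCellDict:
    def __init__(self, **items):
        self._cells = {}
        for k, v in items.items():
            self[k] = v

    def getref(self, key):
        return self._cells.setdefault(key, RefCell())

    def __setitem__(self, key, value):
        self.getref(key).value = value

class ChainRefCell:
    """Resolves through a chain of RefCells (e.g. globals, then builtins)."""
    __slots__ = ('cells',)
    def __init__(self, cells):
        self.cells = cells

    def get(self):
        for cell in self.cells:
            if cell.value is not _MISSING:
                return cell.value
        raise KeyError

class RefCellChainDict:
    def __init__(self, *scopes):
        self.scopes = scopes            # e.g. (module_globals, builtins)

    def getref(self, key):
        return ChainRefCell([scope.getref(key) for scope in self.scopes])


builtins_scope = RefCellDict(len=len)
module_globals = RefCellDict()
chain = RefCellChainDict(module_globals, builtins_scope)

len_cell = chain.getref('len')      # resolved once, used many times
assert len_cell.get() is len        # falls through to builtins
module_globals['len'] = lambda x: -1
assert len_cell.get()([1]) == -1    # a global now shadows the builtin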

On Sat, Dec 19, 2015 at 9:10 AM, Franklin? Lee <leewangzhong+python@gmail.com> wrote:
Dictionary References: An Idea in Three Emails *snip*
I found some earlier threads while looking for info on PyPy's
"celldicts.py". (I knew it couldn't have been a new idea!) I'm taking a
break from reading them, but I was writing some notes, and the fact that
they exist should be known early.

====
Subject: [Python-Dev] Accessing globals without dict lookup
From: Guido van Rossum
Date: Fri, 08 Feb 2002 11:50:31 -0500
Link: https://mail.python.org/pipermail/python-dev/2002-February/019846.html
PEP: https://www.python.org/dev/peps/pep-0280/

It goes into many of the same details I had worked out this week, including
- pre-creation of empty cells
- globals() > __builtins__ relationship
- can't figure out a use for deeper relationships
- Tim Peters lays out an idea of "celldict pointing to a realdict" which is
  similar to how my refs table is a dict parallel to the original dict,
  except that it being part of the same object means that I can do a single
  lookup to find both an entry and its cell.
  (https://mail.python.org/pipermail/python-dev/2002-February/019893.html)

It also has a basic implementation in Python. (Victor, maybe you'd like
reading that one better.)

The PEP, along with PEP 266 and PEP 267, was deferred because "no-one has
yet emerged to do the work of hashing out the differences between [them.]"

Key differences:
- He stores cells as values. I store them parallel to values, to reduce
  overhead on normal dict operations.
- He uses a subclass. I want to make it the default, and dynamically
  convert from normal dict to celldict and back.
- He packs __builtins__ cells into module cells when the module is created.
  I am lazier: a module cell will only request a __builtins__ cell when it
  fails to resolve to a value.
- He converts the lookups to indexed lookups, with new corresponding
  bytecode ops. I... don't actually know how to get the interpreter to use
  the cells.
- Since he uses a subclass, he doesn't have functions load cells from the
  module if the functions have a regular dict as a globals() scope. I want
  a function to always be able to request cells from its globals(), even if
  it is a user dict.
- He uses a single-parent relationship: a cell either has a parent or it
  doesn't. I allow for multiple parents, but since I've already realized
  that scopes don't have multiple direct parents, let's take away the loop
  and go with a single optional parent.

On semantics:
"""
I think this faithfully implements the current semantics (where a global
can shadow a builtin), but without the need for any dict lookups when
accessing globals, except in cases where an explicit dict is passed to
exec or eval().
"""

Tim Peters points out how it might be more expensive than a lookup.
(https://mail.python.org/pipermail/python-dev/2002-February/019874.html)
"""
Note that a chain of 4 test+branches against NULL in "the usual case" for
builtins may not be faster on average than inlining the first few useful
lines of lookdict_string twice (the expected path in this routine became
fat-free for 2.2):

    i = hash;
    ep = &ep0[i];
    if (ep->me_key == NULL || ep->me_key == key)
        return ep;

Win or lose, that's usually the end of a dict lookup. That is, I'm certain
we're paying significantly more for layers of C-level function call
overhead today than for what the dict implementation actually does now (in
the usual cases).
"""

He also wanted to optimize the globals() -> __builtins__ relationship by
loading the __builtins__ value into the globals() cell as a fast path, with
an expensive notification to all modules' cells when a name in __builtins__
changes.
(https://mail.python.org/pipermail/python-dev/2002-February/019904.html)

====
Subject: [Python-ideas] Fast global cacheless lookup
From: Neil Toronto
Date: Thu Nov 22 16:40:49 CET 2007
Link: https://mail.python.org/pipermail/python-ideas/2007-November/001212.html

A proposal to modify `dict` instead of subclassing.
"""
What if a frame could maintain an array of pointers right into a
dictionary's entry table? A global lookup would then consist of a couple of
pointer dereferences, and any value change would show up immediately to the
frame.
"""

Unfortunately, he makes the dict responsible for notifying all registered
functions, instead of using a ref-counted indirection. So his resizes are
an additional O(len(functions)), while mine are an additional
O(len(table) + #(refs)).

====
Subject: [Python-Dev] PATCH: Fast globals/builtins lookups for 2.6
From: Neil Toronto
Date: Thu Nov 29 11:26:37 CET 2007
Link: http://mail.python.org/pipermail/python-ideas/2007-November/001212.html

The patch. He ended up using versioning, because the notification was
expensive. (*Possibly* not as much of an issue for me.) The patch slightly
changed the behavior of the __builtins__ dict itself.
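To make the first "key difference" above concrete, here is a toy illustration of the "cells stored as the dict's values" layout, in contrast with the parallel refs table. This is not PEP 280's actual implementation; Cell and CellsAsValuesDict are names invented for the sketch.

class Cell:
    __slots__ = ('value',)
    def __init__(self, value):
        self.value = value

class CellsAsValuesDict(dict):
    def __setitem__(self, key, value):
        cell = super().get(key)
        if cell is None:
            super().__setitem__(key, Cell(value))
        else:
            cell.value = value          # existing cells stay valid

    def __getitem__(self, key):
        return super().__getitem__(key).value   # extra hop on every read

    def getcell(self, key):
        return super().__getitem__(key)

d = CellsAsValuesDict()
d['len'] = len
cell = d.getcell('len')
d['len'] = lambda x: 0
assert cell.value([1, 2]) == 0      # the held cell tracks rebinding
assert d['len']([1, 2]) == 0

Because the values themselves are cells, every ordinary __getitem__ pays the extra dereference, which is exactly the overhead the parallel refs-table layout tries to keep off the normal dict path.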