[Python-ideas] Optimizing global names via dict references

Franklin Lee leewangzhong+python at gmail.com
Sat Dec 19 11:33:34 EST 2015


I messed up the links. Well, I learned something about mailman.

On Sat, Dec 19, 2015 at 10:06 AM, Victor Stinner
<victor.stinner at gmail.com> wrote:
> On Saturday, December 19, 2015, Franklin Lee <leewangzhong+python at gmail.com>
> wrote:
>>
>> == Problem statement ==
>>
>> I propose a CPython interpreter optimization for global name lookups.
>
>
> Fans of micro-optimisation are probably already using various hacks like
> func(len=len): ... to avoid the lookup at runtime. There is also the option
> of rewriting bytecode to replace LOAD_GLOBAL with LOAD_CONST (I don't
> recall the name of the PyPI project; it's a decorator).
>
> Serhiy also proposed a new syntax that performs the lookup when the
> function is defined.

I mentioned the hack, and said this proposal would make it unnecessary.
I also noted that the hack bypasses dynamic lookup for the sake of
performance, while this proposal preserves the effect of dynamic lookup.
It could possibly be faster (since you don't have to load a default arg).
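
For readers who haven't seen it, the hack looks something like this
(a minimal sketch; the function is mine, for illustration):

    # The default-argument hack: bind the builtin at definition time,
    # so each call reads a fast local instead of doing a
    # globals-then-builtins lookup.
    def total_length(seqs, len=len):
        return sum(len(s) for s in seqs)

    # Downside: if the builtin `len` is later replaced, this function
    # won't see the change -- dynamic lookup has been bypassed.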

Rewriting bytecode has the same issues, and adds overhead during
compilation as well. My idea might add less overhead at compile time,
since you'd have to do the lookup in either case.

It would improve all functions that use globals, and it doesn't require
the programmer to bypass dynamic lookup. No new syntax > new syntax.
Not having to use a hack > using a hack.


> It would be interesting to measure the cost of these lookups (e.g. number
> of nanoseconds per lookup) and to get an idea of how often LOAD_GLOBAL
> lookups occur in the wild (as a ratio of the overall number of
> instructions).
>
> Since the goal is a speedup, a working proof of concept is required to show
> that it works and it's faster (on macro benchmarks?). Do you feel able to
> implement it?

It would be a project. I've studied some of the dict code, but I'd
still have to look at OrderedDict and how it differs. I'd have to hack
at the interpreter level. Worse, I'm not very experienced with C, and
have no real "development" experience, so I'd struggle with just
compiling CPython. But I'll definitely try, because I won't get
anywhere by hoping someone else implements what I think of.

Theoretically, resolving a RefCell is definitely faster than actually
doing the lookups (which is what currently happens). The extra level of
indirection might cause CPU cache misses, but the dictionary lookup it
replaces is much more likely to do so.
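
To make the comparison concrete, here is a rough Python model of what a
global load pays today versus under the proposal (the function name is
mine, for illustration):

    # Simplified model of today's LOAD_GLOBAL: up to two full dict
    # lookups (module globals first, then builtins).
    def load_global_today(name, globals_dict, builtins_dict):
        try:
            return globals_dict[name]     # hash + probe
        except KeyError:
            return builtins_dict[name]    # second hash + probe

    # With a pre-resolved RefCell, each use is a single indirection:
    #     value = cell[0]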


> As I already wrote, I'm not convinced that it's worth it. Your code looks
> more complex, will use more memory, etc. I don't think that LOAD_GLOBAL is
> common in hot code (the 10% taking 90% of the runtime). I expect effects on
> object lifetimes, which could be annoying.

At the least, it would relieve a mental burden on Python programmers:
"Don't use global names in tight loops." That alone would be worth it.
(It would also mean that you wouldn't need to implement guards, or have
a "slow path" with lookups.) Even if the gains are minimal, they're
gains that people currently work for, so this would save them work.

The idea is not very complex. "Know where the value would live, and
you will always be able to find it."

The complexity comes from:
- The complexity of dicts. Much of the code is simply keeping the
dict and the refs table in sync.
- The complexity of scope relationships.
- The complexity of allowing scopes to be replaced
(`globals()['__builtins__'] = {}`; see the sketch after this list).
- The complexity of cells that live on after their owner dict.
- Exceptions and error messages.
- Preserving dict performance. For example, if normal dicts weren't a
concern, the refs table would be built into the dict's internal table.
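
To illustrate the scope-replacement problem with a toy model (the names
here are mine, not part of the proposal):

    # A cell resolved through the old builtins dict must not keep
    # serving values after the whole scope is swapped out.
    scopes = {'builtins': {'len': len}}
    cell = [scopes['builtins']['len']]   # resolved against old builtins

    scopes['builtins'] = {}              # scope replaced wholesale
    # `cell` still holds the old `len`; the implementation has to
    # detect the swap and re-chain or invalidate affected cells.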

Most of the complexity is in trying to replicate the existing
complexity. Otherwise, we could just implement it as, "The inner dict
holds a length-1 list which holds the value, or NULL. Make sure to
check that it's not NULL."

The cost of resolving a (non-chained) RefCell: call a function that
dereferences a pointer and checks whether it's NULL. If it's not,
return it. That's pretty cheap. I don't see why you think it could
possibly be slower than the current way.
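
In Python terms, the whole fast path is about this much work (a sketch,
not the real C; None stands in for C's NULL):

    refs = {'print': [print]}     # refs table: name -> one-element cell

    def resolve(cell):
        value = cell[0]           # dereference the pointer
        if value is None:         # NULL check: currently unbound
            raise NameError("name 'print' is not defined")
        return value

    resolve(refs['print'])('hello')   # behaves like a normal lookup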

Memory use will have to be seen, but it would be a function of the
combined sizes of the module dicts, which shouldn't be a big part of
the program. Implementing compact dicts would also make the refs table
smaller.

Module loading will gain costs, since the names used in each module
would require lookups at compile time instead of at run time, and many
functions in a module may never be called. This needs to be explored.
It's feasible to have the interpreter delay the lookups until first
call.
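
One possible shape for that first-call delay (all names here are
illustrative, not part of the proposal):

    # Bind the cell on first use instead of at module load time.
    def make_lazy_loader(scope_refs, name):
        cell = None
        def load():
            nonlocal cell
            if cell is None:              # first call: pay the lookup
                cell = scope_refs[name]
            return cell[0]                # later calls: one deref
        return load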


> I implemented an optimization in FAT Python that replaces builtin lookups
> with LOAD_CONST. I have to enhance the code to make it safe, but it works,
> and it doesn't require deep changes to the dict type.
> http://faster-cpython.readthedocs.org/en/latest/fat_python.html#copy-builtin-functions-to-constants
>
> In short, it only adds a single integer per dict, incremented at each
> modification (create, modify, delete).
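
For concreteness, the guard approach amounts to something like this
(a toy model; `dict_version` is my stand-in for the per-dict counter,
not FAT Python's actual API):

    builtins_dict = {'len': len}
    _version = 0           # would be bumped on every dict mutation

    def dict_version(d):   # hypothetical accessor for the counter
        return _version

    cached_len, cached_ver = builtins_dict['len'], dict_version(builtins_dict)

    def guarded_len():
        if dict_version(builtins_dict) == cached_ver:
            return cached_len            # fast path: guard holds
        return builtins_dict['len']      # slow path: redo the lookup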

To contrast: I don't require any safety checks, I can replace `print`
without incurring additional lookups, and I am insensitive to other
changes in the dict (unless the dict resizes).
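
Concretely, replacing a name just writes through the same cell (the toy
model again; names are mine):

    refs = {'print': [print]}
    cell = refs['print']              # a function holds on to this cell

    def fancy_print(*args):
        print('[log]', *args)

    refs['print'][0] = fancy_print    # the dict update writes the cell
    cell[0]('hello')                  # callers see the new binding,
                                      # with no extra lookup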

I add a few pointers to each dict (probably smaller than a
PyIntObject), and the refs table won't exist unless it's needed.

