[Python-ideas] Add specialized bytecode with guards to functions

Wed Oct 21 17:43:57 CEST 2015

Hi,

I would like to share with you an idea to try to optimize CPython.

My previous attempts to optimize CPython failed because they changed the
language semantic. For example, it's not possible to replace len('abc')
with 3 because it is technically possible to override the builtin len()
function.

I propose to add a new "FAT" mode to Python to allow to add specialized
bytecode to function with guards. Example:

    >>> import builtins
    >>> def f(): return len("abc")
    ...
    >>> f()
    3

    >>> def g(): return 3
    ...
    >>> i=f.add(g)   # i is the index of the new specialized bytecode
    >>> f.add_dict_key_guard(i, builtins.__dict__, 'len')
    >>> f()
    3

    >>> builtins.len = lambda obj: "len"
    >>> f()
    'len'

In this example, the f() function gets a fast version (g()) returning
directly 3 instead of calling len("abc"). The optimization is disabled
when the builtin len() function is modified (when
builtins.__dict__['len'] is modified). (The example is not complete, we
need a second guard on the current global namespace, but I wanted to
write a short example.)

With such machinery, it becomes possible to implement interesting
optimizations. Some examples:

* inline a function instead of calling it: calling a Python function is
  "expensive". For example, if you inline the _get_sep() call in
  posixpath.isabs(), the function becomes 20% faster.

* move invariant out of the loop: classic micro-optimization done
  manually. For example, set "append = obj.append" before the loop and
  then call "append(item)" instead of "obj.append(item)" in the loop
  body. On a microbenchmark, it's between 5% (1 item) and 30% faster (10
  items or more)

* call pure functions at the compilation. A pure function has no side
  effect. Example: len(str).

Technical Challenges
====================

The problem is the cost of "guards". A guard is a check done before
calling a specialized function. Example of guards:

* Type of a function parameter
* Watch a dictionary key
* Watch a function (especially it's __code__ attribute)

Watching a dictionary key is a very important guard to disable an
optimization when a builtin function is modified, when a function is
replaced in a namespace, etc. The problem is that a dictionary lookup is
not cheap: we have to get the hash value of the key, browse the hash
table to find the bucket which may require multiple iterations, then we
have to compare the key value, etc.

To have faster guard, I propose to create a subtype of dict which
provides a global "version" of the dictionary, incremented at each
modification (create a new key, modify a key, delete a key) and a
"version" per key=>value mapping, incremented each time
that the mapping is modified. The lookup can be avoided when the
dictionary is not modified. If a different key is modified, we need a
lookup, but only once (we store the new global version to avoid a lookup
at the next guard check).

Limitations
===========

Specialized bytecode is build at compilation, *not* at runtime: it's not a
just-in-time (JIT) compiler. A JIT can implement even more efficient
optimizations and can use better guards. Please continue to use PyPy for
best performances! ;-)

The name "FAT" comes from the fact that multiple bytecode versions of a
function are stored in the memory, so a larger memory footprint
and larger .pyc files on disk can be expected.

Checking guards can also be more expensive than the optimization of the
specialized bytecode. The optimizer should use an heuristic to decide if
it's worth to use a specialized bytecode or not depending on the theoric
speeup and the cost of guards.

My motivation to write FAT Python is that CPython remains the reference
implementation where new features are implemented, and other Python
implementations still have issues with C extensions. JIT also has some
issues, like longer startup time, slow warmup and memory footprint (this
one may also be an issue of FAT Python!).

Implementation
==============

I wrote a proof-of-concept of my idea:

    https://hg.python.org/sandbox/fatpython/

It's a fork of CPython 3.6. Try it with:

    hg clone https://hg.python.org/sandbox/fatpython/
    cd fatpython
    ./configure && make
    ./python -F

See bench.py, Lib/posixpath.py (isabs) and Lib/test/test_fat.py
for examples of optimizations.

The command line -F flag enables the FAT mode. By default, *all*
optimizations are disabled, there is a negligible overhead when the
FAT mode is not used.

In the FAT mode, the dictionary for modules, classes and instances
becomes "versionned". Functions gets new methods to support adding
specialized bytecode with guards.

You can add manually a specialized bytecode for microbenchmarks, but
functions are *not* optimized automatically. (See the Roadmap below.)

How to write an optimizer?
==========================

Currently, my proof-of-oncept is only the machinery to support adding
specialized bytecodes with guards. The next step is to generate
automatically these specialized versions.

IMHO an optimizer should be implemented in Python and not in C, it's
easier to write Python code. We can begin with simple optimizations on
AST. See for example my astoptimizer which implements basic
optimizations:

    https://bitbucket.org/haypo/astoptimizer/

At the beginning, we may use type hints (PEP 484) and other manual
hints to help the optimizer. Later, a profiler learning the most common
types of function parameters can be written for automatic optimizations.
Example of hints: a list of constants which should not be modified in the
application. A global "DEBUG" flag is common in applications, relying on
DEBUG=False helps to remove dead code.

Roadmap
=======

* Finish the proof-of-concept: implement new guards, optimize method
  calls. For example, guards on module imports are required. Checking
  if a module is the expected mode can be tricky.
* Write an optimizer based on manual hints
* Write a profiler to generate hints

Right now, my main question is if it will be possible and efficient to
optimize classes, not only simple functions, especially classes defined
in two different modules. I wrote a proof-of-concept of optimized
method (with inlining), but I'm not sure yet that it's safe (don't
change Python semantic).

For more information on FAT Python, read:

   https://hg.python.org/sandbox/fatpython/file/tip/FATPYTHON.rst

So, what do you think? Do you expect real speedup on applications, not
only on microbechmarks? Do you know similar optimizations in other
scripting languages?

Victor Stinner