[Python-ideas] Add specialized bytecode with guards to functions

Kevin Modzelewski kmod at dropbox.com
Wed Oct 21 22:55:27 CEST 2015


I don't think there will be that many cases where len() gets called on a
constant object, especially in performance-sensitive code, since that would
be easy to spot and get rid of.  And without some good knowledge of the
argument, I would guess that it's only marginally helpful to know that
"len" is the builtin len.  I'm not saying that CPython couldn't add
techniques like this, but I think it might need to go decently far into JIT
territory to really make use of it.  For example, which functions would
have multiple versions generated?  There could be startup+memory overheads
if it gets applied to all functions, so maybe there needs to be some sort
of light profiling to determine when to produce the optimized version.  I'm
biased but I think this should be left to the JITs :)

On the other hand, I think it would be pretty interesting for the core
python community to come up with source-level constructs that could help
with this sort of thing.  For example, one idea is to make it canonical to
do something like:
def f():
  from __builtin__ import len
  return len("abc")

and then let implementations pattern-match this to statically resolve the
import.  This doesn't quite work since you can mutate __builtin__ (or
override __import__); maybe a new module name could be used, or maybe there
could be some sort of "bind_globals()" decorator.  I think these are
interesting since they could improve name-resolution in a targeted way, and
also give more information to the JITs.
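A "bind_globals()" decorator along those lines can already be approximated in
pure Python by rebuilding the function with a frozen copy of its namespace.
This is a minimal sketch; the name and the approach are hypothetical, and it
skips details a real version would need (e.g. copying __kwdefaults__):

```python
import builtins
import types

def bind_globals(func):
    """Rebuild func with a frozen snapshot of builtins and module globals,
    so later mutation of builtins no longer affects it."""
    frozen = dict(vars(builtins))      # snapshot the builtins right now
    frozen.update(func.__globals__)    # plus the current module globals
    return types.FunctionType(func.__code__, frozen, func.__name__,
                              func.__defaults__, func.__closure__)

@bind_globals
def f():
    return len("abc")

real_len = builtins.len
builtins.len = lambda obj: "len"   # mutate builtins after binding
try:
    print(f())                     # still 3: f only sees the snapshot
finally:
    builtins.len = real_len        # restore the real builtin
```

An implementation could pattern-match the decorator and resolve the names
statically, since the snapshot guarantees they can no longer change.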

kmod

On Wed, Oct 21, 2015 at 8:43 AM, Victor Stinner <victor.stinner at gmail.com>
wrote:

> Hi,
>
> I would like to share with you an idea to try to optimize CPython.
>
> My previous attempts to optimize CPython failed because they changed the
> language semantics. For example, it's not possible to replace len('abc')
> with 3, because it is technically possible to override the builtin len()
> function.
>
> I propose to add a new "FAT" mode to Python which allows adding
> specialized bytecode with guards to functions. Example:
>
>     >>> import builtins
>     >>> def f(): return len("abc")
>     ...
>     >>> f()
>     3
>
>     >>> def g(): return 3
>     ...
>     >>> i=f.add(g)   # i is the index of the new specialized bytecode
>     >>> f.add_dict_key_guard(i, builtins.__dict__, 'len')
>     >>> f()
>     3
>
>     >>> builtins.len = lambda obj: "len"
>     >>> f()
>     'len'
>
> In this example, the f() function gets a fast version (g()) that returns
> 3 directly instead of calling len("abc"). The optimization is disabled
> when the builtin len() function is modified (when
> builtins.__dict__['len'] is modified). (The example is not complete, we
> need a second guard on the current global namespace, but I wanted to
> write a short example.)
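The add()/guard machinery above can be sketched in pure Python (the real fork
implements it in C at the bytecode level; `specialize` and `dict_key_guard`
here are illustrative names, not the fork's actual API):

```python
import builtins

def dict_key_guard(d, key, expected):
    # Guard holds as long as d[key] is still the expected object.
    return lambda: d.get(key) is expected

def specialize(func, fast, guards):
    # Call the fast version only while every guard still holds.
    def dispatcher(*args, **kwargs):
        if all(guard() for guard in guards):
            return fast(*args, **kwargs)
        return func(*args, **kwargs)       # fall back to the original code
    return dispatcher

f = specialize(lambda: len("abc"),         # original version
               lambda: 3,                  # specialized version
               [dict_key_guard(builtins.__dict__, 'len', len)])

print(f())                        # 3: guard holds, fast path taken
real_len = builtins.len
builtins.len = lambda obj: "len"
print(f())                        # 'len': guard fails, original code runs
builtins.len = real_len           # restore the real builtin
```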
>
> With such machinery, it becomes possible to implement interesting
> optimizations. Some examples:
>
> * inline a function instead of calling it: calling a Python function is
>   "expensive". For example, if you inline the _get_sep() call in
>   posixpath.isabs(), the function becomes 20% faster.
>
> * move invariant out of the loop: classic micro-optimization done
>   manually. For example, set "append = obj.append" before the loop and
>   then call "append(item)" instead of "obj.append(item)" in the loop
>   body. On a microbenchmark, it's between 5% (1 item) and 30% faster (10
>   items or more)
>
> * call pure functions at compile time. A pure function has no side
>   effects. Example: len(str).
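The second bullet is a transformation anyone can apply by hand today; the
specialized bytecode would simply automate it:

```python
def build_slow(items):
    out = []
    for item in items:
        out.append(item)      # attribute lookup repeated every iteration
    return out

def build_fast(items):
    out = []
    append = out.append       # invariant hoisted out of the loop
    for item in items:
        append(item)
    return out
```

Both functions return the same list; the fast one just resolves the bound
method once instead of on every iteration.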
>
>
> Technical Challenges
> ====================
>
> The problem is the cost of "guards". A guard is a check done before
> calling a specialized function. Example of guards:
>
> * Type of a function parameter
> * Watch a dictionary key
> * Watch a function (especially its __code__ attribute)
>
> Watching a dictionary key is a very important guard to disable an
> optimization when a builtin function is modified, when a function is
> replaced in a namespace, etc. The problem is that a dictionary lookup is
> not cheap: we have to compute the hash of the key, probe the hash table
> to find the bucket (which may require multiple iterations), and then
> compare the key values, etc.
>
> To make guards faster, I propose to create a subtype of dict which
> provides a global "version" of the dictionary, incremented at each
> modification (create a new key, modify a key, delete a key), and a
> "version" per key=>value mapping, incremented each time the mapping is
> modified. The lookup can be avoided while the dictionary is not
> modified. If a different key is modified, we need a lookup, but only
> once (we store the new global version to avoid a lookup at the next
> guard check).
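A pure-Python sketch of such a versioned dict and its per-key guard (FAT
Python implements the equivalent as a C dict subtype; the class names, and
the incomplete method coverage, are purely illustrative):

```python
class VersionedDict(dict):
    """Dict that counts every mutation in a global version number.
    (A real implementation would also cover update(), pop(), clear(), ...)"""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.version = 0
    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.version += 1
    def __delitem__(self, key):
        super().__delitem__(key)
        self.version += 1

class KeyGuard:
    """Check that d[key] is unchanged, usually without any dict lookup."""
    def __init__(self, d, key):
        self.d, self.key = d, key
        self.seen_version = d.version
        self.value = d[key]
    def check(self):
        if self.d.version == self.seen_version:
            return True                  # dict untouched: no lookup at all
        if self.d.get(self.key) is self.value:
            self.seen_version = self.d.version   # some other key changed
            return True
        return False                     # our key was replaced: guard fails

ns = VersionedDict(len=len)
guard = KeyGuard(ns, 'len')
assert guard.check()        # nothing changed: cheap version comparison
ns['other'] = 1
assert guard.check()        # other key changed: one lookup, then cheap again
ns['len'] = str
assert not guard.check()    # watched key changed: guard fails
```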
>
>
> Limitations
> ===========
>
> Specialized bytecode is built at compile time, *not* at runtime: it's not a
> just-in-time (JIT) compiler. A JIT can implement even more efficient
> optimizations and can use better guards. Please continue to use PyPy for
> best performances! ;-)
>
> The name "FAT" comes from the fact that multiple bytecode versions of a
> function are stored in the memory, so a larger memory footprint
> and larger .pyc files on disk can be expected.
>
> Checking the guards can also cost more than the specialized bytecode
> saves. The optimizer should use a heuristic to decide whether a
> specialized bytecode is worth it, depending on the theoretical speedup
> and the cost of its guards.
>
> My motivation to write FAT Python is that CPython remains the reference
> implementation where new features are implemented, and other Python
> implementations still have issues with C extensions. JITs also have some
> issues, like longer startup time, slow warmup and memory footprint (the
> last may also be an issue for FAT Python!).
>
>
> Implementation
> ==============
>
> I wrote a proof-of-concept of my idea:
>
>     https://hg.python.org/sandbox/fatpython/
>
> It's a fork of CPython 3.6. Try it with:
>
>     hg clone https://hg.python.org/sandbox/fatpython/
>     cd fatpython
>     ./configure && make
>     ./python -F
>
> See bench.py, Lib/posixpath.py (isabs) and Lib/test/test_fat.py
> for examples of optimizations.
>
> The command line -F flag enables the FAT mode. By default, *all*
> optimizations are disabled, and there is negligible overhead when the
> FAT mode is not used.
>
> In the FAT mode, the dictionaries of modules, classes and instances
> become "versioned". Functions get new methods to support adding
> specialized bytecode with guards.
>
> You can manually add a specialized bytecode for microbenchmarks, but
> functions are *not* optimized automatically. (See the Roadmap below.)
>
>
> How to write an optimizer?
> ==========================
>
> Currently, my proof-of-concept is only the machinery for adding
> specialized bytecode with guards. The next step is to generate these
> specialized versions automatically.
>
> IMHO an optimizer should be implemented in Python, not in C: it's
> easier to write Python code. We can begin with simple optimizations on
> the AST. See for example my astoptimizer, which implements basic
> optimizations:
>
>     https://bitbucket.org/haypo/astoptimizer/
>
> At the beginning, we may use type hints (PEP 484) and other manual
> hints to help the optimizer. Later, a profiler that learns the most
> common types of function parameters could be written for automatic
> optimization.
> Example of hints: a list of constants which should not be modified in the
> application. A global "DEBUG" flag is common in applications, relying on
> DEBUG=False helps to remove dead code.
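As a toy illustration of the DEBUG=False hint, the stdlib ast module can
already strip such dead branches at the source level (this is not
astoptimizer's actual code, just a sketch of the idea):

```python
import ast

class StripDebug(ast.NodeTransformer):
    """Remove `if DEBUG:` blocks, assuming the application guarantees
    that DEBUG is always False."""
    def visit_If(self, node):
        self.generic_visit(node)
        if isinstance(node.test, ast.Name) and node.test.id == "DEBUG":
            return node.orelse or None   # keep only the else branch, if any
        return node

src = """
def work(x):
    if DEBUG:
        log(x)
    return x * 2
"""
tree = StripDebug().visit(ast.parse(src))
ast.fix_missing_locations(tree)
print(ast.unparse(tree))    # the `if DEBUG:` block is gone
```

The same transformation, done on the specialized bytecode behind a guard on
DEBUG, would keep the semantics safe even if the hint turned out wrong.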
>
>
> Roadmap
> =======
>
> * Finish the proof-of-concept: implement new guards, optimize method
>   calls. For example, guards on module imports are required; checking
>   that a module is the expected module can be tricky.
> * Write an optimizer based on manual hints
> * Write a profiler to generate hints
>
> Right now, my main question is whether it will be possible and efficient
> to optimize classes, not only simple functions, especially classes
> defined in two different modules. I wrote a proof-of-concept of an
> optimized method (with inlining), but I'm not yet sure that it's safe
> (that it doesn't change Python semantics).
>
> For more information on FAT Python, read:
>
>    https://hg.python.org/sandbox/fatpython/file/tip/FATPYTHON.rst
>
>
> So, what do you think? Do you expect real speedups on applications, not
> only on microbenchmarks? Do you know of similar optimizations in other
> scripting languages?
>
> Victor Stinner
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
>

