Add specialized bytecode with guards to functions

Hi,

I would like to share with you an idea for trying to optimize CPython. My previous attempts to optimize CPython failed because they changed the language semantics. For example, it's not possible to replace len('abc') with 3, because it is technically possible to override the builtin len() function.

I propose to add a new "FAT" mode to Python which allows adding specialized bytecode with guards to functions. Example:

    >>> import builtins
    >>> def f(): return len("abc")
    ...
    >>> f()
    3
    >>> def g(): return 3
    ...
    >>> i = f.add(g)  # i is the index of the new specialized bytecode
    >>> f.add_dict_key_guard(i, builtins.__dict__, 'len')
    >>> f()
    3
    >>> builtins.len = lambda obj: "len"
    >>> f()
    'len'

In this example, the f() function gets a fast version (g()) returning 3 directly instead of calling len("abc"). The optimization is disabled when the builtin len() function is modified (when builtins.__dict__['len'] is modified). (The example is not complete: we also need a second guard on the current global namespace, but I wanted to keep the example short.)

With such machinery, it becomes possible to implement interesting optimizations. Some examples:

* Inline a function instead of calling it: calling a Python function is "expensive". For example, if you inline the _get_sep() call in posixpath.isabs(), the function becomes 20% faster.
* Move invariants out of a loop: a classic micro-optimization done manually. For example, set "append = obj.append" before the loop and then call "append(item)" instead of "obj.append(item)" in the loop body. On a microbenchmark, this is between 5% faster (1 item) and 30% faster (10 items or more).
* Call pure functions at compile time. A pure function has no side effects. Example: len(str).

Technical Challenges
====================

The problem is the cost of "guards". A guard is a check done before calling a specialized function. Examples of guards:

* Type of a function parameter
* Watch a dictionary key
* Watch a function (especially its __code__ attribute)

Watching a dictionary key is a very important guard: it disables an optimization when a builtin function is modified, when a function is replaced in a namespace, etc. The problem is that a dictionary lookup is not cheap: we have to get the hash value of the key, browse the hash table to find the bucket (which may require multiple iterations), and then compare the key value, etc.

To get faster guards, I propose to create a subtype of dict which provides a global "version" of the dictionary, incremented at each modification (creating a new key, modifying a key, deleting a key), and a "version" per key=>value mapping, incremented each time that the mapping is modified. The lookup can be avoided when the dictionary has not been modified. If a different key was modified, we need a lookup, but only once (we store the new global version to avoid a lookup at the next guard check). A sketch of this check follows.
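(A minimal pure-Python sketch of the versioned-dict guard described above; the class and method names are illustrative, not the PoC's actual fat.verdict API, which is implemented in C:)

    class VersionedDict(dict):
        """Toy versioned dict: one global counter plus one counter per key.
        (Deletion handling is omitted for brevity.)"""
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.version = 0
            self._key_versions = {key: 0 for key in self}

        def __setitem__(self, key, value):
            super().__setitem__(key, value)
            self.version += 1
            self._key_versions[key] = self._key_versions.get(key, 0) + 1

    class DictKeyGuard:
        """Check a key without a hash lookup while nothing has changed."""
        def __init__(self, d, key):
            self.d, self.key = d, key
            self.dict_version = d.version
            self.key_version = d._key_versions.get(key, 0)

        def check(self):
            if self.d.version == self.dict_version:
                return True            # fast path: one integer comparison
            if self.d._key_versions.get(self.key, 0) == self.key_version:
                self.dict_version = self.d.version  # another key changed
                return True
            return False               # our key changed: deoptimize

    ns = VersionedDict(len=len)
    guard = DictKeyGuard(ns, 'len')
    assert guard.check()               # specialized bytecode may run
    ns['other'] = 1
    assert guard.check()               # unrelated change: still valid
    ns['len'] = lambda obj: 'len'
    assert not guard.check()           # fall back to the regular bytecode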
Limitations
===========

Specialized bytecode is built at compile time, *not* at runtime: it's not a just-in-time (JIT) compiler. A JIT can implement even more efficient optimizations and can use better guards. Please continue to use PyPy for the best performance! ;-)

The name "FAT" comes from the fact that multiple bytecode versions of a function are stored in memory, so a larger memory footprint and larger .pyc files on disk can be expected. Checking guards can also cost more than the specialized bytecode saves. The optimizer should use a heuristic to decide whether a specialized bytecode is worth it, depending on the theoretical speedup and the cost of the guards.

My motivation for writing FAT Python is that CPython remains the reference implementation, where new features are implemented, and other Python implementations still have issues with C extensions. A JIT also has some issues, like longer startup time, slow warmup and memory footprint (this last one may also be an issue for FAT Python!).

Implementation
==============

I wrote a proof-of-concept of my idea: https://hg.python.org/sandbox/fatpython/

It's a fork of CPython 3.6. Try it with:

    hg clone https://hg.python.org/sandbox/fatpython/
    cd fatpython
    ./configure && make
    ./python -F

See bench.py, Lib/posixpath.py (isabs) and Lib/test/test_fat.py for examples of optimizations.

The command line -F flag enables the FAT mode. By default, *all* optimizations are disabled; there is a negligible overhead when the FAT mode is not used. In the FAT mode, the dictionaries of modules, classes and instances become "versioned", and functions get new methods to support adding specialized bytecode with guards. You can manually add a specialized bytecode for microbenchmarks, but functions are *not* optimized automatically. (See the Roadmap below.)

How to write an optimizer?
==========================

Currently, my proof-of-concept is only the machinery to support adding specialized bytecodes with guards. The next step is to generate these specialized versions automatically. IMHO an optimizer should be implemented in Python and not in C: it's easier to write Python code. We can begin with simple optimizations on the AST. See for example my astoptimizer, which implements basic optimizations: https://bitbucket.org/haypo/astoptimizer/

At the beginning, we may use type hints (PEP 484) and other manual hints to help the optimizer. Later, a profiler learning the most common types of function parameters could be written for automatic optimizations. Example of a hint: a list of constants which should not be modified in the application. A global "DEBUG" flag is common in applications, and relying on DEBUG=False helps to remove dead code (a sketch of such a pass follows this message).

Roadmap
=======

* Finish the proof-of-concept: implement new guards, optimize method calls. For example, guards on module imports are required, and checking whether a module is the expected module can be tricky.
* Write an optimizer based on manual hints.
* Write a profiler to generate hints.

Right now, my main question is whether it will be possible and efficient to optimize classes, not only simple functions, and especially classes defined in two different modules. I wrote a proof-of-concept of an optimized method (with inlining), but I'm not sure yet that it's safe (that it doesn't change Python semantics).

For more information on FAT Python, read: https://hg.python.org/sandbox/fatpython/file/tip/FATPYTHON.rst

So, what do you think? Do you expect real speedups on applications, not only on microbenchmarks? Do you know of similar optimizations in other scripting languages?

Victor Stinner
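(A minimal sketch of the DEBUG-hint pass mentioned above, using the standard ast module; the StripDebug name and the hint convention are invented for the example:)

    import ast

    class StripDebug(ast.NodeTransformer):
        """Assume the global DEBUG is False: delete `if DEBUG:` blocks,
        keeping any else branch (a hypothetical hint-driven pass)."""
        def visit_If(self, node):
            self.generic_visit(node)
            if isinstance(node.test, ast.Name) and node.test.id == 'DEBUG':
                return node.orelse or None   # splice else branch, or drop
            return node

    source = "if DEBUG:\n    dump_state()\nx = compute()\n"
    tree = StripDebug().visit(ast.parse(source))
    print(ast.dump(ast.fix_missing_locations(tree)))
    # only `x = compute()` remains in the module body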

On Thu, Oct 22, 2015 at 2:43 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
* Compare against PyPy. For optimized Python execution, my brain immediately jumps to PyPy. How does FAT Python compare in terms of functionality, performance, CPython compatibility, etc.? What can it claim as its advantage?

ChrisA

On 21 October 2015 at 17:48, Chris Angelico <rosuav@gmail.com> wrote:
* Full CPython C API compatibility
* Reliance on ahead-of-time compilation means it works for command line scripts, not just long-running server processes that allow the PyPy tracing JIT to work its magic

As an ahead-of-time optimisation, it's also possible something like this could speed up applications running on PyPy during the JIT warm-up phase.

An existing example of this kind of guard-based optimisation exists in the ABC machinery: we globally version the ABC registration graph in order to automatically invalidate caches when a new explicit registration is added.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
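(The version-tag idiom Nick describes, reduced to a small sketch; the names are illustrative, and this is not the actual ABC machinery:)

    _graph_version = 0          # bumped on every explicit registration
    _registered = set()

    def register(cls):
        global _graph_version
        _registered.add(cls)
        _graph_version += 1     # lazily invalidates every cache below

    _cache = {}                 # cls -> (version, answer)

    def lookup(cls):
        version, answer = _cache.get(cls, (None, None))
        if version == _graph_version:
            return answer       # cache hit: no graph traversal needed
        answer = cls in _registered   # stand-in for an expensive walk
        _cache[cls] = (_graph_version, answer)
        return answer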

On 21 October 2015 at 17:48, Chris Angelico <rosuav@gmail.com> wrote:
* Compare against PyPy.
As written in my email, don't expect better performance than PyPy. PyPy must remain the fastest implementation of Python ;-) The final goal is to see real speedups on non-trivial applications like Django, Mercurial, or OpenStack.

2015-10-21 18:11 GMT+02:00 Nick Coghlan <ncoghlan@gmail.com>:
* Full CPython C API compatibility
FYI the FAT mode uses the fat.verdict and fat.SpecializedFunction types, which are subtypes of dict and types.FunctionType, so you keep full compatibility with the C API. I noticed one issue with the C API:

    #define PyFunction_Check(op) (Py_TYPE(op) == &PyFunction_Type)

This test fails with fat.SpecializedFunction. It means that C extensions must be recompiled to work with the FAT mode. I modified the macro to:

    #define PyFunction_Check(op) \
        (Py_TYPE(op) == &PyFunction_Type \
         || PyType_IsSubtype(Py_TYPE(op), &PyFunction_Type))
Yes, as I wrote in my email, CPython still beats PyPy for fast command line scripts like many Mercurial commands; the PyPy JIT takes time to warm up. (Hum, in fact I only read about that, I didn't check, sorry.) So yes, I also expect better performance from static compilation in such use cases.
Hum, this is a common cache. CPython has many similar caches, like the cache for methods on types, but these are not guards for function calls. I would like to know if someone else has tried to write specialized functions with guards in other scripting languages (or even in Python?).

Victor

I haven't tried the prototype yet, but I wonder if this mightn't be a useful addition to an existing optimizing Python system, e.g. PyPy or (plug :-) Pyston? On Wed, Oct 21, 2015 at 8:43 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

2015-10-21 18:20 GMT+02:00 Guido van Rossum <guido@python.org>:
I'm not sure that specialized functions with guards would benefit JIT compilers. A better match is Cython, pythran and other projects generating a specialized version of a function (in native code), maybe releasing the GIL. It would be cool to be able to use a specialized version under some conditions. For example, if the parameters are all floats, use the specialized version optimized by Cython; otherwise, fall back to the (slow) regular bytecode. The challenge is to automate everything and make it easy to use. I don't know if Numba would fit this design.

Victor
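(A toy sketch of the dispatch Victor describes: guard on the parameter types and fall back to the regular function otherwise. fast_hypot stands in for a hypothetical Cython-compiled specialization:)

    def specialize(fast_version, *param_types):
        """Use fast_version when all arguments match param_types exactly."""
        def decorator(func):
            def wrapper(*args):
                if len(args) == len(param_types) and all(
                        type(a) is t for a, t in zip(args, param_types)):
                    return fast_version(*args)   # guard passed
                return func(*args)               # fallback: regular bytecode
            return wrapper
        return decorator

    def fast_hypot(x, y):       # imagine native code generated by Cython
        return (x * x + y * y) ** 0.5

    @specialize(fast_hypot, float, float)
    def hypot(x, y):
        return (x * x + y * y) ** 0.5

    hypot(3.0, 4.0)   # floats: takes the specialized path
    hypot(3, 4)       # ints: falls back to the regular version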

On Wed, Oct 21, 2015 at 9:50 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
Pyston is not primarily a JIT compiler (at least that's my understanding). And my point was the opposite: Pyston (or PyPy, less sure) might benefit from your idea.
Admittedly these are also very attractive-looking pairings! -- --Guido van Rossum (python.org/~guido)

On Wed, 21 Oct 2015 at 10:00 Guido van Rossum <guido@python.org> wrote:
Their GitHub project seems to suggest otherwise, based on their saying they are leveraging LLVM's JIT: https://github.com/dropbox/pyston#pyston-- .
And my point was the opposite: Pyston (or PyPy, less sure) might benefit from your idea.
As would Pyjion (which is really early work, being played with in Dino's and my spare time at work): https://github.com/microsoft/pyjion . We would probably love to see at least the dict watcher get into CPython. I know this was something that Unladen Swallow really wanted as well, for easy optimizations of built-in functions. Extend dict watchers to a module's or class's __dict__ and you start to get more of the inlining opportunities Victor is hoping for. But as a starting point, getting some way to cheaply detect when the built-ins have not been modified from their defaults would be nice to have.

-Brett

On 21 October 2015 at 18:12, Brett Cannon <brett@python.org> wrote:
Would it be worth focusing on mechanisms like this, which would be of general use to various projects, and treating Victor's specific optimisation use case as simply that, a motivating example? At least in the first instance - as Victor pointed out, previous attempts to introduce optimisations have always failed because it's not possible to switch off Python's dynamic nature. So maybe if we had tools which allowed code to more easily *detect* when Python's dynamic features have been used (or rather, when they have *not* been used), we'd be in a better position to develop optimisation strategies that know when they are appropriate?

(Another example - Victor mentioned that CPython uses PyFunction_Check, which isn't friendly to function subclasses. A fix to allow function subclasses to work more seamlessly would be another useful general mechanism.)

Paul

On Wed, 21 Oct 2015 at 10:31 Paul Moore <p.f.moore@gmail.com> wrote:
I think so.
Exactly, and especially if the check is cheap. After that you start to potentially get into granularity, e.g. do you care if *any* built-in changed, or do you only care if a *specific* built-in changed, and how does either approach affect performance? The trick is not breaking compatibility and not hurting performance in the general case. But yes, knowing when the uncommon case of overriding a built-in occurs would allow for a bunch of optimization techniques, from inlining to using a faster implementation that is compatible with the standard built-in.

-Brett

Brett Cannon <brett@...> writes:
Exactly, and especially if the check is cheap. After that you start to potentially get into granularity, e.g. do you care if any built-in changed or do you only care if a specific built-in changed, and how does either approach affect performance? The trick is not breaking compatibility and not hurting performance in the general case. But yes, knowing when the uncommon case of overriding a built-in occurs would allow for a bunch of optimization techniques, from inlining to using a faster implementation that is compatible with the standard built-in.

I think adding a version number to dicts would be a very cheap change and would potentially help several CPython-based projects (including perhaps Numba some day, though nowadays it wouldn't benefit). However, I would also like to challenge the idea that an accelerator has to be 100% compatible. If your API is opt-in (which Numba's is), then you can happily relax some constraints (provided it is documented) in exchange for much faster code. Our experience with Numba is that nobody cares that you can't monkeypatch the len() builtin, for example. (Numba actually goes quite far in that direction, since e.g. its integers are fixed-width; the speed of generated code, for supported Python constructs, usually approaches or reaches C, Fortran or even hand-tuned assembly speed.)

Regards

Antoine.

2015-10-21 23:03 GMT+02:00 Antoine Pitrou <antoine@python.org>:
I tried to leave the dict type unchanged (that's not true in the PoC right now, but I'm working on fixing it). Currently, the versioned dictionary type (fat.verdict) is only used when Python starts in FAT mode, and only in a few places: module, class and instance dictionaries. I'm not sure that it's worth adding the versioning feature to the base dict type. It uses a little more memory, and IMHO the dict type is already heavy in terms of memory usage (compared to a compact C array). I'm trying to keep all FAT features optional, to be able to easily compare performance, but also to ensure that the FAT mode can be added to CPython without hurting performance (CPU and memory) when it's disabled.
Sure, it will be possible to skip some guards if you know that builtins are never monkey-patched, that a module variable is constant, that a class will never be modified, etc. This will reduce the cost of guards even more (especially when all guards are skipped :-)). But I prefer to start with something that doesn't change the Python semantics at all, and then provide opt-in options to give more freedom to the optimizer.

Victor

On Wed, 21 Oct 2015 at 14:20 Victor Stinner <victor.stinner@gmail.com> wrote:
I'm not sure if Antoine was truly proposing adding version number support to all dicts, but I know I wasn't. I would propose creating a dict subclass and use it in exactly the places you outlined, Victor: modules, classes, and instance dictionaries (and I would honestly be fine with doing it piecemeal in that order until we have experience as to how truly dynamic people treat instances compared to modules and classes). I would start with a global counter for every time __setitem__() is successful and see where that gets us (both in terms of performance and optimizations).
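(A minimal sketch of the subclass Brett proposes, counting successful __setitem__() calls; note that C code calling PyDict_SetItem directly would bypass a pure-Python override, which is why the real counter belongs in C:)

    class WatchedDict(dict):
        """Global change counter, bumped on each successful __setitem__()."""
        __slots__ = ('version',)

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.version = 0

        def __setitem__(self, key, value):
            super().__setitem__(key, value)   # may raise; count only success
            self.version += 1

    ns = WatchedDict(len=len)
    snapshot = ns.version
    # ... later, the guard is a single integer comparison:
    namespace_unchanged = (ns.version == snapshot)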
I agree that not all optimizations/accelerators need to be 100% compatible. Numba is in a certain kind of position, though, in that it is applied at the function/method level, while what Victor and I are working on applies to all code. So it's just differing levels of granularity, which influence how much one feels like they can break the rules in the name of performance.

Brett Cannon <brett@...> writes:
I'm not sure if Antoine was truly proposing adding version number support to all dicts, but I know I wasn't.
I am. An empty dict is currently 288 bytes on 64-bit Linux. Even an empty dict with key sharing (e.g. an instance dict, which would be versioned anyway in your proposal) is 96 bytes long. Adding a size_t-sized field would not make a significant difference in memory consumption, and is probably much simpler than ensuring you use the right dict subtype in each place, adding type-checking to all the PyDict concrete APIs, etc. And incrementing the version number is cheaper if you don't have a conditional branch based on whether the dict is versioned or non-versioned.

Regards

Antoine.

Victor Stinner <victor.stinner@...> writes:
Numba is primarily a JIT compiler but next release will also feature a limited AOT compilation mode (meaning you can compile extension modules for given explicit signatures of whatever functions you want to compile, and don't need Numba to import or run those functions). Documentation preview here: http://numba.pydata.org/numba-doc/dev/user/pycc.html As a JIT compiler, however, Numba is special as it's opt-in and doesn't claim to support the whole range of Python semantics (far from it). Numba takes your Python code and pretty much assumes it has the same dynamicity as Fortran and C++ code, runs type inference and compiles a very fast version out of it, for the CPU (by default) or for the GPU (with a specific programming model). Regards Antoine.

Victor Stinner <victor.stinner@gmail.com> writes:
I think this is solving the wrong problem... What about adding an explicit way to "bake in" the values of globals?

    freeze_globals(func, values)

values is a dict with the names of globals and their values, so you could, for example:

    def foo():
        return len('abc')

    foo = freeze_globals(foo, {'len': len})
    # or
    foo = freeze_globals(foo, __builtins__.__dict__)

It would replace any LOAD_GLOBAL for an item that is in the given dictionary (maybe excluding anything that is also stored in the same function) with a LOAD_CONST referencing the value from the dictionary. As a second pass, now that the bytecode looks like this:

    LOAD_CONST ('abc')
    LOAD_CONST (<built-in function len>)
    CALL_FUNCTION 1

it is no longer dependent on anything that can be overridden, and can safely be replaced (given the optimizer knows len is a pure function) with the constant 3. This second pass could also take care of e.g. the empty set and frozenset constructors, etc.

Decorators could be included to freeze only built-ins, to freeze all globals and built-ins, or to take a list of names to exclude. So you could have @freeze_builtins or @freeze_all_globals for each function that needs heavy optimization.
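(A rough pure-Python sketch of such a freeze_globals(), assuming CPython 3.8-3.10 wordcode, no EXTENDED_ARG, and fewer than 256 constants; illustrative, not robust:)

    import dis
    import types

    def freeze_globals(func, values):
        """Rewrite LOAD_GLOBAL <name> into LOAD_CONST values[name]."""
        code = func.__code__
        consts = list(code.co_consts)
        raw = bytearray(code.co_code)
        for ins in dis.get_instructions(func):
            if ins.opname == 'LOAD_GLOBAL' and ins.argval in values:
                value = values[ins.argval]
                if value not in consts:
                    consts.append(value)
                raw[ins.offset] = dis.opmap['LOAD_CONST']
                raw[ins.offset + 1] = consts.index(value)
        new_code = code.replace(co_code=bytes(raw), co_consts=tuple(consts))
        return types.FunctionType(new_code, func.__globals__,
                                  func.__name__, func.__defaults__,
                                  func.__closure__)

    def foo():
        return len('abc')

    foo = freeze_globals(foo, {'len': len})
    print(foo())                 # 3
    import builtins
    builtins.len = lambda obj: 'overridden'
    print(foo())                 # still 3: len is baked in as a constant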

Hi,

2015-10-21 18:52 GMT+02:00 Random832 <random832@fastmail.com>:
You are changing the semantics of Python. In Python, you should be able to override builtin functions. Binding a global name to a local name is a known trick to optimize a function, but you have to write it explicitly, and again you lose the ability to override len. Example:

    def f(len=len):
        return len("abc")

Here len becomes a fast local variable.
I cited "previous attempts" which failed. What you describe is close to my "readonly" attempt, which tried to make module and type namespaces read-only: https://hg.python.org/sandbox/readonly/file/tip/READONLY.txt

Making namespaces read-only simply doesn't work: it's very common to modify namespaces. I added callbacks to disable optimizations when a namespace is modified, but that design doesn't scale: it adds an O(n) cost each time a namespace is modified, even if the optimized functions are never called. My new design (guards) avoids this performance issue.

Victor

Victor Stinner <victor.stinner@gmail.com> writes:
You are changing the semantic of Python. In Python, you should be able to override builtin functions.
What I am proposing is a generalized way to add objects that do not have a literal to a function's constants pool. The global name within the first version of the function is simply used as a placeholder for loading the new constant. It could just as easily be something like:

    def foo():
        return _1('abc')

    foo = add_consts(foo, {'_1': len})
This would also have to be written explicitly, and the whole point is to lose the ability to override things. Explicitly.
It'd be even faster as a constant, and that way people couldn't call it as f(len=whatever) [so an optimizer doesn't have to deal with the possibility of people doing that]
This would apply to the function, not the namespace.
Nothing happens when a namespace is modified, because the function is no longer using the namespace in any form. This adds to the cost of defining a function, but in most cases (I'm not 100% sure of the semantics for nested functions, let alone the implementation) this could be done at compile time, so it would be a one-time cost.

You can actually do this already in pure Python. We implemented almost exactly what's proposed here as the asconstants decorator (https://github.com/llllllllll/codetransformer#asconstants) in codetransformer (https://github.com/llllllllll/codetransformer):
This works by making two changes to the function's __code__:

1. We add an entry to foo.__code__.co_consts.
2. We rewrite all LOAD_{GLOBAL|NAME} bytecode instructions for that name to be LOAD_CONSTs instead.

This makes the function itself slightly faster, since LOAD_CONST is generally faster than other loads. But more importantly, it statically freezes the value of a particular name in the function, which makes it possible/sane to implement other, more interesting transformations/optimizations.

On Oct 21, 2015, at 10:35, Random832 <random832@fastmail.com> wrote:
Nested functions are just like top-level functions: at compile time, the function body is compiled to a code object (bytecode, a list of free variables, etc.) that gets stored as a hidden constant; at runtime, all that happens is building a function object out of it (and, if def rather than lambda, binding it to a name). The only way nested functions are more costly at runtime is that they're more likely to have free variables that have to get bound to cells, but even that isn't a big deal.

Anyway, if you just assume that free variables can never be optimized out (which seems fair for a first implementation), you should be able to treat nested functions the same as global functions.

If you want to later optimize out free variables bound to constants, you need to add some way to guard the local namespace (which isn't really a dict, but an array in each frame), although you only have to guard the frames that have cells bound to some closure. That seems doable too, but probably worth skipping in the first version.
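(The "hidden constant" point is easy to verify; the snippet below is runnable as-is:)

    import types

    def outer():
        def inner():
            return 42
        return inner

    # inner's compiled body is stored once, as a constant of outer's
    # code object; calling outer() merely wraps it in a fresh function.
    print([c for c in outer.__code__.co_consts
           if isinstance(c, types.CodeType)])      # [<code object inner ...>]
    print(outer().__code__ is outer().__code__)    # True: the code is shared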

Andrew Barnert via Python-ideas <python-ideas@python.org> writes:
Nested functions are just like top-level functions:
The point is that my suggested feature is a decorator (or decorator-like function) attached to the function, to modify the function to replace free variables with constants. This means if the decorator is attached to the inner function it gets executed every time the outer function is executed, and if it is attached to the outer function we have to decide whether it will walk through that function's inner functions and change them too.

I don't think there will be that many cases where len() gets called on a constant object, especially in performance-sensitive code, since that would be easy to spot and get rid of. And without some good knowledge of the argument, I would guess that it's only marginally helpful to know that "len" is the builtin len.

I'm not saying that CPython couldn't add techniques like this, but I think it might need to go decently far into JIT territory to really make use of it. For example, which functions would have multiple versions generated? There could be startup+memory overheads if it gets applied to all functions, so maybe there needs to be some sort of light profiling to determine when to produce the optimized version. I'm biased, but I think this should be left to the JITs :)

On the other hand, I think it would be pretty interesting for the core Python community to come up with source-level constructs that could help with this sort of thing. For example, one idea is to make it canonical to do something like:

    def f():
        from __builtin__ import len
        return len("abc")

and then let implementations pattern-match this to statically resolve the import. This doesn't quite work since you can mutate __builtin__ (or override __import__); maybe a new module name could be used, or maybe there could be some sort of "bind_globals()" decorator. I think these are interesting since they could improve name resolution in a targeted way, and also give more information to the JITs.

kmod

On Wed, Oct 21, 2015 at 8:43 AM, Victor Stinner <victor.stinner@gmail.com> wrote:

Hi,

2015-10-21 22:55 GMT+02:00 Kevin Modzelewski <kmod@dropbox.com>:
Hum, I guess that almost no code calls len() on a constant string :-) In my experience with astoptimizer, I noticed that this becomes more common when you combine it with the constant folding optimization. Many strings are "constants" in Python and are commonly tested. Example: "if sys.platform.startswith('freebsd'): ...". If sys.platform is "linux", it's dead code and the whole if block can be removed.
I took the len("abc") example just because it's very easy to understand :-) It's just to explain the principle. Later, if you combine multiple optimizations, I bet that it will become really interesting in real applications. But first we need a framework to make these optimizations legitimate in Python (one that doesn't change Python semantics). The goal is to optimize classes, not only builtin functions.
We will need a heuristic to estimate the theoretical performance speedup of specialized bytecode, to then decide whether it's worth keeping. In the long term, we can design a profiler for profile-guided optimizations (PGO). In the short term, we can use hints, such as type hints or manual hints, to decide which functions should be optimized.
As written in other messages, there are already many ways to reduce the "overhead" of the Python language. But I'm trying to write a generic optimizer which would not require modifying the source code. Maybe we can modify Python to make some of these optimizations easier to use, but I'm not sure that it's worth it.

Victor

Participants (11):

- Andrew Barnert
- Antoine Pitrou
- Brett Cannon
- Chris Angelico
- Guido van Rossum
- Kevin Modzelewski
- Nick Coghlan
- Paul Moore
- Random832
- Scott Sanderson
- Victor Stinner