Hi,
Skip Montanaro asked me about my old CPython fork called "registervm"
which replaced stack-based bytecode with register-based bytecode:
https://hg.python.org/sandbox/registervm/
Here are some notes about my various CPython optimization experiments
(starting with registervm) between 2012 and 2019. In short, they all
failed, for different reasons :-) So I finish this email with my advice
for optimizing Python in 2019 ;-)
By the way, I wrote a whole website to share all my notes about making
CPython more efficient :-)
https://faster-cpython.readthedocs.io/
== Faster ==
I created the registervm fork in 2012.
My registervm fork was faster for 2 main reasons:
* Register-based bytecode usually uses *fewer* instructions, and
CPython eval loop performance depends on the number of instructions:
fewer complex instructions ("CISC") are more efficient than many simple
instructions ("RISC"). See also the old WPython project, which reached
the same conclusion.
* It implemented optimizations which change the Python semantics, like
moving invariants out of the loop body. When this was discussed on
python-dev, changing the Python semantics was a no-go from the start.
An example: load a method once, before the loop, and keep only the
method call inside the loop.
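To make the semantic issue concrete, here is a plain-Python illustration (not registervm code) of the method-hoisting optimization described above. It is only valid under the assumption that `result.append` is never rebound while the loop runs, which is exactly the assumption Python semantics do not allow:

```python
# Hypothetical illustration of hoisting a method load out of a loop.
# Only valid if obj.append is not rebound during the loop, which is
# why this transformation changes Python semantics.

def before(items):
    result = []
    for item in items:
        result.append(item * 2)   # loads the "append" method every iteration
    return result

def after(items):
    result = []
    append = result.append        # method loaded once, outside the loop
    for item in items:
        append(item * 2)          # only the call remains in the loop body
    return result

print(before([1, 2, 3]) == after([1, 2, 3]))  # True
```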
You can find some info about registervm at:
* https://hg.python.org/sandbox/registervm/file/tip/REGISTERVM.txt
* https://hg.python.org/sandbox/registervm/file/tip/BENCHMARKS.txt
== registervm implementation ==
I chose not to touch the compiler, which still emits stack-based
bytecode; my registervm module then converts that bytecode to
register-based bytecode:
https://hg.python.org/sandbox/registervm/file/tip/Lib/registervm.py
This made it trivial to compare the speed of stack-based bytecode to
the speed of register-based bytecode.
Constants, registers and local variables all live in the same array in a
frame. That's convenient for the bytecode which can access these 3 kinds
of values. I chose to explicitly copy a constant into a register, even
if technically it could be completely avoided by teaching registervm
that constants are *read-only* registers.
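The unified frame layout can be pictured with a toy model (this is illustrative Python, not CPython internals; the sizes and helper names are made up). Constants, locals and registers share one array, so any instruction operand is just an index into it:

```python
# Toy model of the frame layout described above: constants, local
# variables and temporary registers all live in one array, so an
# instruction operand is simply an index into that array.
N_CONSTS, N_LOCALS, N_REGS = 2, 2, 3  # hypothetical sizes

frame = [None] * (N_CONSTS + N_LOCALS + N_REGS)
frame[0:N_CONSTS] = [1, 2]            # co_consts copied into the array

def const(i): return 0 + i                      # index of constant #i
def local(i): return N_CONSTS + i               # index of local #i
def reg(i):   return N_CONSTS + N_LOCALS + i    # index of register #i

# "LOAD_CONST_REG R0, const#1" becomes a plain copy within the array:
frame[reg(0)] = frame[const(1)]
print(frame[reg(0)])  # 2
```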
The conversion is done in multiple steps.
(*) It starts by adding registers without touching the stack (values
are still pushed and popped). For example, "LOAD_FAST x" is replaced
with "LOAD_CONST_REG reg, x; PUSH_REG reg; CLEAR_REG reg".
This step uses SSA (static single assignment form): each register is
assigned exactly once.
(*) It splits the code into basic blocks and replaces jumps with labels:
https://en.wikipedia.org/wiki/Basic_block
(*) Instructions are reordered. For example, "PUSH_REG;
LOAD_CONST_REG" is exchanged to "LOAD_CONST_REG; PUSH_REG".
(*) Duplicated loads are merged. This step computes the lifetime of
constants, local and global variables. For example, "LOAD_CONST_REG
reg1, var; LOAD_CONST_REG reg2, var" becomes "LOAD_CONST_REG reg1, var;
MOVE_REG reg2, reg1".
(*) Register allocator 1: replace SSA with reused registers. For example
"BINARY_ADD_REG reg3, reg2, reg1; MOVE_REG var, reg3; CLEAR_REG reg3"
becomes "BINARY_ADD_REG var, reg2, reg1": remove reg3.
(*) Register allocator 2. For example, remove temporary registers:
"MOVE_REG reg, var; BINARY_ADD_REG result, var, reg" becomes
"BINARY_ADD_REG result, var, var": remove temporary reg register.
(*) Peephole optimizer. For example, remove "MOVE_REG reg, reg" (does
nothing), or remove duplicated "CLEAR_REG reg" (only keep the last one).
(*) Remove unreachable basic blocks
(*) Assemble basic blocks and replace labels with jumps to concrete offsets
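The peephole step above can be sketched in a few lines, assuming a simple (name, operands...) tuple representation for instructions; registervm's actual data structures are different:

```python
# Minimal sketch of the peephole pass described above, using an assumed
# (opcode, operand, ...) tuple representation, not registervm's real one.

def peephole(instructions):
    result = []
    for instr in instructions:
        # Drop "MOVE_REG reg, reg": copying a register onto itself does nothing.
        if instr[0] == "MOVE_REG" and instr[1] == instr[2]:
            continue
        # Drop a "CLEAR_REG reg" that merely repeats the previous instruction:
        # clearing the same register twice in a row is redundant.
        if (result
                and instr[0] == "CLEAR_REG"
                and instr == result[-1]):
            continue
        result.append(instr)
    return result

code = [("MOVE_REG", "R1", "R1"),
        ("CLEAR_REG", "R2"),
        ("CLEAR_REG", "R2"),
        ("BINARY_ADD_REG", "R0", "R1", "R2")]
print(peephole(code))
# [('CLEAR_REG', 'R2'), ('BINARY_ADD_REG', 'R0', 'R1', 'R2')]
```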
I modified ceval.c to add new "xxx_REG" instructions.
== Full example ==
Simple function computing the factorial of n::

    def fact_iter(n):
        f = 1
        for i in range(2, n+1):
            f *= i
        return f
Stack-based bytecode (20 instructions)::

        0 LOAD_CONST 1 (const#1)
        3 STORE_FAST 'f'
        6 SETUP_LOOP <relative jump to 46 (+37)>
        9 LOAD_GLOBAL 0 (range)
       12 LOAD_CONST 2 (const#2)
       15 LOAD_FAST 'n'
       18 LOAD_CONST 1 (const#1)
       21 BINARY_ADD
       22 CALL_FUNCTION 2 (2 positional, 0 keyword pair)
       25 GET_ITER
    >> 26 FOR_ITER <relative jump to 45 (+16)>
       29 STORE_FAST 'i'
       32 LOAD_FAST 'f'
       35 LOAD_FAST 'i'
       38 INPLACE_MULTIPLY
       39 STORE_FAST 'f'
       42 JUMP_ABSOLUTE <jump to 26>
    >> 45 POP_BLOCK
    >> 46 LOAD_FAST 'f'
       49 RETURN_VALUE
Register-based bytecode (13 instructions)::

        0 LOAD_CONST_REG 'f', 1 (const#1)
        5 LOAD_CONST_REG R0, 2 (const#2)
       10 LOAD_GLOBAL_REG R1, 'range' (name#0)
       15 SETUP_LOOP <relative jump to 57 (+39)>
       18 BINARY_ADD_REG R2, 'n', 'f'
       25 CALL_FUNCTION_REG 4, R1, R1, R0, R2
       36 GET_ITER_REG R1, R1
    >> 41 FOR_ITER_REG 'i', R1, <relative jump to 56 (+8)>
       48 INPLACE_MULTIPLY_REG 'f', 'i'
       53 JUMP_ABSOLUTE <jump to 41>
    >> 56 POP_BLOCK
    >> 57 RETURN_VALUE_REG 'f'
The body of the main loop of this function is composed of 1 instruction
(INPLACE_MULTIPLY_REG) instead of 5.
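For readers who want to inspect the stack-based form themselves, the stdlib dis module prints it. Note that the listing above comes from a 2012-era CPython; opcode names have changed since (for example, SETUP_LOOP was removed in Python 3.8), so the exact output depends on your interpreter version:

```python
import dis

def fact_iter(n):
    f = 1
    for i in range(2, n + 1):
        f *= i
    return f

# Print the stack-based bytecode of the function; the instruction
# stream differs between CPython versions, but it is always stack-based.
dis.dis(fact_iter)

print(fact_iter(5))  # 120
```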
== Abandon registervm ==
When I started the project, I never cleared registers: objects were
kept alive until the function exit, when the frame is cleared. This
changes object lifetimes and so the Python semantics. Again, when this
was discussed on python-dev, it was a no-go.
I introduced the CLEAR_REG instruction to "clear a register" (set its
value to NULL). But I did it late in the development, when the code
base was already quite complex.
I had many bugs in registervm operations which failed to understand
conditional jumps (like loops). My code to track the lifetime of
registers and variables was buggy. I bought a book about compilers
which clearly explains how to implement that correctly, but I lost
interest and never fixed my code.
I decided to give up on this project: once I fixed registervm to no
longer change the Python semantics, the speedup was less impressive,
while the code had become tricky to modify and maintain.
== read-only Python ==
I wanted to experiment with optimizations which rely on assumptions
like "this is not going to change".
In 2014, I created my "read-only" CPython fork which tries to make
module namespaces read-only to implement various optimizations.
https://faster-cpython.readthedocs.io/readonly.html
Quickly, I understood that everything always changes in Python,
especially if you want to support monkey-patching.
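A two-line example shows the problem: even a module attribute that looks like a constant can be rebound at any time, so a "read-only namespace" assumption is unsound in general Python code (everything below is stdlib):

```python
# Why read-only module namespaces conflict with monkey-patching:
# any module attribute, even one that looks constant, can be rebound.
import math

def circumference(r):
    return 2 * math.pi * r

print(circumference(1))      # 6.283185307179586

# Any code (e.g. a test suite mocking things out) may rebind it:
math.pi = 3
print(circumference(1))      # 6

math.pi = 3.141592653589793  # restore the original value
```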
== FAT Python ==
That's how I came up with my FAT Python project in 2016, which
implements guards at runtime. If "something changes" (an assumption
becomes false), the function is de-optimized.
https://faster-cpython.readthedocs.io/fat_python.html
It relies on fatoptimizer, an ahead-of-time compiler which emits
specialized code with guards.
The "fat" project implements guards like "check that a global was not
modified" or "check that the function code was not modified". One core
feature of the fat guards was a new "timestamp" attribute added to the
Python dict type, to efficiently check whether a dictionary was
modified without actually doing a lookup.
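The guard/de-optimize mechanism can be sketched in pure Python. This is only an illustration of the idea: the real "fat" guards are implemented in C and use the dict version tag, which is not accessible from Python, so this sketch re-checks the value instead; all class and function names here are made up:

```python
# Pure-Python sketch of the guard idea: run specialized code only
# while its assumptions hold, otherwise fall back ("de-optimize").
# The real fat guards check the C-level dict version tag instead.

class GlobalGuard:
    """Guard which fails if a namespace entry was rebound."""
    def __init__(self, namespace, name):
        self.namespace = namespace
        self.name = name
        self.expected = namespace[name]

    def check(self):
        return self.namespace.get(self.name) is self.expected

def call_specialized(guards, specialized, fallback, *args):
    # Fast path while every guard still holds; generic path otherwise.
    if all(guard.check() for guard in guards):
        return specialized(*args)
    return fallback(*args)

namespace = {"factor": 10}
guard = GlobalGuard(namespace, "factor")
fast = lambda x: x * 10                  # specialized: constant folded in
slow = lambda x: x * namespace["factor"]  # generic code

print(call_specialized([guard], fast, slow, 5))  # 50 (fast path)
namespace["factor"] = 2                          # assumption broken
print(call_specialized([guard], fast, slow, 5))  # 10 (fallback)
```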
FAT Python was also quite complex to implement. I had to fix multiple
bugs in Python to support negative offsets in code co_lnotab (now
fixed), and fix various bugs in the compiler and AST modules.
In the end, I failed to prove that my design made Python faster.
I abandoned the project just after David Malcolm implemented
experimental support for function inlining.
I stole the function specialization idea from David ;-) He had proposed
something similar, but without guards, in an old issue.
Note: before FAT Python, I created a first "astoptimizer" project which
implemented optimizations without guards. FAT Python reuses the same
idea of modifying the AST tree and implements similar optimizations,
but inserts guards.
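AST-level optimization of this kind can be sketched with the stdlib ast module. This is a generic constant-folding example in the spirit of astoptimizer, not code from either project, and without guards it is only safe for literals:

```python
import ast

# Minimal AST optimizer: fold constant additions without any guard
# (safe only because both operands are literals).
class ConstantFolder(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first (bottom-up)
        if (isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and isinstance(node.op, ast.Add)):
            folded = ast.Constant(node.left.value + node.right.value)
            return ast.copy_location(folded, node)
        return node

tree = ast.parse("x = 1 + 2 + 3")
tree = ast.fix_missing_locations(ConstantFolder().visit(tree))
print(ast.unparse(tree))  # x = 6
```

FAT Python applied the same kind of AST rewriting to non-literal expressions as well, which is exactly why it needed runtime guards.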
I wrote 3 PEPs for FAT Python:
"PEP 509 -- Add a private version to dict" -- implemented in Python 3.6
https://www.python.org/dev/peps/pep-0509
"PEP 510 -- Specialize functions with guards" -- rejected
https://www.python.org/dev/peps/pep-0510/
=> I failed to prove that the FAT Python design (function
specialization, guards) makes Python significantly faster.
"PEP 511 -- API for code transformers" -- rejected
https://www.python.org/dev/peps/pep-0511/
=> importlib hooks or other solutions can be used instead. The key
(rejected) feature was to allow a different .pyc filename depending on
a command line option "-o OPTIMIZER" (ex: os.cpython-36.fat-0.pyc
instead of os.cpython-36.opt-0.pyc).
I gave a talk "FAT Python: a new static optimizer for Python 3.6"
at EuroPython (Bilbao, Spain) in July 2016:
*
https://github.com/vstinner/talks/raw/master/2016-EuroPython-Bilbao/fat_pyt…
* https://www.youtube.com/watch?v=zFl9RAfbSXE
== Fix the benchmark suite ==
In 2017, I wrote pyperf to get reproducible benchmark results. Then I
modified the benchmark suite to use it and renamed the suite
"pyperformance".
* https://pyperf.readthedocs.io/
* https://pyperformance.readthedocs.io/
Articles about my journey:
https://vstinner.readthedocs.io/benchmark.html#my-articles
Talk at FOSDEM (Belgium) in February 2017: "How to run a stable
benchmark", slides + video:
*
https://github.com/vstinner/talks/raw/master/2017-FOSDEM-Brussels/howto_run…
* https://archive.fosdem.org/2017/schedule/event/python_stable_benchmark/
== My advice for 2019 ==
Micro-optimizing the CPython ceval loop is not enough to make Python
competitive with other programming languages like Go or Rust. We need
a different approach.
I gave a keynote "Python Performance: Past, Present and Future"
at EuroPython (Basel, Switzerland) last month. I proposed 3 solutions:
* new PyHandle C API: see https://github.com/pyhandle/hpy
* Tracing garbage collector for CPython
* CPython subinterpreters
Slides + video:
*
https://github.com/vstinner/talks/blob/master/2019-EuroPython/python_perfor…
* https://www.youtube.com/watch?v=T6vC_LOHBJ4&feature=youtu.be&t=1875
Victor