Hi,
Skip Montanaro asked me about my old CPython fork called "registervm"
which replaced stack-based bytecode with register-based bytecode:
https://hg.python.org/sandbox/registervm/
Here are some notes about my various CPython optimization experiments
(starting with registervm) between 2012 and 2019. In short, they all
failed, for different reasons :-) So I finish this email with my advice
for optimizing Python in 2019 ;-)
By the way, I wrote a whole website to share all my notes about making
CPython more efficient :-)
https://faster-cpython.readthedocs.io/
== Faster ==
I created the registervm fork in 2012.
My registervm fork was faster for 2 main reasons:
* Register-based bytecode usually uses *fewer* instructions, and
CPython eval loop performance depends on the number of instructions:
fewer complex instructions ("CISC") are more efficient than many simple
instructions ("RISC"). See also the old WPython project, which reached
the same conclusion.
* It implemented optimizations which change the Python semantics, like
moving invariants out of the loop body. When this was discussed on
python-dev, changing the Python semantics was a no-go from the start.
An example: load a method once, before the loop, and keep only the
method call inside the loop.
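To make the semantic issue concrete, here is a plain-Python illustration (not registervm code) of the method-hoisting optimization described above. It is only valid under the assumption that `result.append` is never rebound while the loop runs, which is exactly the assumption Python semantics do not allow:

```python
# Hypothetical illustration of hoisting a method load out of a loop.
# Only valid if obj.append is not rebound during the loop, which is
# why this transformation changes Python semantics.

def before(items):
    result = []
    for item in items:
        result.append(item * 2)   # loads the "append" method every iteration
    return result

def after(items):
    result = []
    append = result.append        # method loaded once, outside the loop
    for item in items:
        append(item * 2)          # only the call remains in the loop body
    return result

print(before([1, 2, 3]) == after([1, 2, 3]))  # True
```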
You can find some info about registervm at:
* https://hg.python.org/sandbox/registervm/file/tip/REGISTERVM.txt
* https://hg.python.org/sandbox/registervm/file/tip/BENCHMARKS.txt
== registervm implementation ==
I chose not to touch the compiler, which still emits stack-based
bytecode; my registervm module then converts that bytecode to
register-based bytecode:
https://hg.python.org/sandbox/registervm/file/tip/Lib/registervm.py
This made it trivial to compare the speed of stack-based bytecode to
the speed of register-based bytecode.
Constants, registers and local variables all live in the same array in a
frame. That's convenient for the bytecode which can access these 3 kinds
of values. I chose to explicitly copy a constant into a register, even
if technically it could be completely avoided by teaching registervm
that constants are *read-only* registers.
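The unified frame layout can be pictured with a toy model (this is illustrative Python, not CPython internals; the sizes and helper names are made up). Constants, locals and registers share one array, so any instruction operand is just an index into it:

```python
# Toy model of the frame layout described above: constants, local
# variables and temporary registers all live in one array, so an
# instruction operand is simply an index into that array.
N_CONSTS, N_LOCALS, N_REGS = 2, 2, 3  # hypothetical sizes

frame = [None] * (N_CONSTS + N_LOCALS + N_REGS)
frame[0:N_CONSTS] = [1, 2]            # co_consts copied into the array

def const(i): return 0 + i                      # index of constant #i
def local(i): return N_CONSTS + i               # index of local #i
def reg(i):   return N_CONSTS + N_LOCALS + i    # index of register #i

# "LOAD_CONST_REG R0, const#1" becomes a plain copy within the array:
frame[reg(0)] = frame[const(1)]
print(frame[reg(0)])  # 2
```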
The conversion is done in multiple steps.
(*) It starts by adding registers without touching the stack (values
are still pushed and popped). For example, "LOAD_FAST x" is replaced
with "LOAD_CONST_REG reg, x; PUSH_REG reg; CLEAR_REG reg".
This step uses SSA (static single assignment form): each register is
assigned exactly once.
(*) It splits the code into basic blocks and replaces jumps with labels:
https://en.wikipedia.org/wiki/Basic_block
(*) Instructions are reordered. For example, "PUSH_REG;
LOAD_CONST_REG" is exchanged to "LOAD_CONST_REG; PUSH_REG".
(*) Duplicated loads are merged. This step computes the lifetime of
constants, local and global variables. For example, "LOAD_CONST_REG
reg1, var; LOAD_CONST_REG reg2, var" becomes "LOAD_CONST_REG reg1, var;
MOVE_REG reg2, reg1".
(*) Register allocator 1: replace SSA with reused registers. For example
"BINARY_ADD_REG reg3, reg2, reg1; MOVE_REG var, reg3; CLEAR_REG reg3"
becomes "BINARY_ADD_REG var, reg2, reg1": remove reg3.
(*) Register allocator 2. For example, remove temporary registers:
"MOVE_REG reg, var; BINARY_ADD_REG result, var, reg" becomes
"BINARY_ADD_REG result, var, var": remove temporary reg register.
(*) Peephole optimizer. For example, remove "MOVE_REG reg, reg" (does
nothing), or remove duplicated "CLEAR_REG reg" (only keep the last one).
(*) Remove unreachable basic blocks
(*) Assemble basic blocks and replace labels with jumps to concrete offsets
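The peephole step above can be sketched in a few lines, assuming a simple (name, operands...) tuple representation for instructions; registervm's actual data structures are different:

```python
# Minimal sketch of the peephole pass described above, using an assumed
# (opcode, operand, ...) tuple representation, not registervm's real one.

def peephole(instructions):
    result = []
    for instr in instructions:
        # Drop "MOVE_REG reg, reg": copying a register onto itself does nothing.
        if instr[0] == "MOVE_REG" and instr[1] == instr[2]:
            continue
        # Drop a "CLEAR_REG reg" that merely repeats the previous instruction:
        # clearing the same register twice in a row is redundant.
        if (result
                and instr[0] == "CLEAR_REG"
                and instr == result[-1]):
            continue
        result.append(instr)
    return result

code = [("MOVE_REG", "R1", "R1"),
        ("CLEAR_REG", "R2"),
        ("CLEAR_REG", "R2"),
        ("BINARY_ADD_REG", "R0", "R1", "R2")]
print(peephole(code))
# [('CLEAR_REG', 'R2'), ('BINARY_ADD_REG', 'R0', 'R1', 'R2')]
```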
I modified ceval.c to add new "xxx_REG" instructions.
== Full example ==
Simple function computing the factorial of n::

    def fact_iter(n):
        f = 1
        for i in range(2, n+1):
            f *= i
        return f
Stack-based bytecode (20 instructions)::

        0 LOAD_CONST 1 (const#1)
        3 STORE_FAST 'f'
        6 SETUP_LOOP <relative jump to 46 (+37)>
        9 LOAD_GLOBAL 0 (range)
       12 LOAD_CONST 2 (const#2)
       15 LOAD_FAST 'n'
       18 LOAD_CONST 1 (const#1)
       21 BINARY_ADD
       22 CALL_FUNCTION 2 (2 positional, 0 keyword pair)
       25 GET_ITER
    >> 26 FOR_ITER <relative jump to 45 (+16)>
       29 STORE_FAST 'i'
       32 LOAD_FAST 'f'
       35 LOAD_FAST 'i'
       38 INPLACE_MULTIPLY
       39 STORE_FAST 'f'
       42 JUMP_ABSOLUTE <jump to 26>
    >> 45 POP_BLOCK
    >> 46 LOAD_FAST 'f'
       49 RETURN_VALUE
Register-based bytecode (13 instructions)::

        0 LOAD_CONST_REG 'f', 1 (const#1)
        5 LOAD_CONST_REG R0, 2 (const#2)
       10 LOAD_GLOBAL_REG R1, 'range' (name#0)
       15 SETUP_LOOP <relative jump to 57 (+39)>
       18 BINARY_ADD_REG R2, 'n', 'f'
       25 CALL_FUNCTION_REG 4, R1, R1, R0, R2
       36 GET_ITER_REG R1, R1
    >> 41 FOR_ITER_REG 'i', R1, <relative jump to 56 (+8)>
       48 INPLACE_MULTIPLY_REG 'f', 'i'
       53 JUMP_ABSOLUTE <jump to 41>
    >> 56 POP_BLOCK
    >> 57 RETURN_VALUE_REG 'f'
The body of the main loop of this function is composed of 1 instruction
(INPLACE_MULTIPLY_REG) instead of 5.
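For readers who want to inspect the stack-based form themselves, the stdlib dis module prints it. Note that the listing above comes from a 2012-era CPython; opcode names have changed since (for example, SETUP_LOOP was removed in Python 3.8), so the exact output depends on your interpreter version:

```python
import dis

def fact_iter(n):
    f = 1
    for i in range(2, n + 1):
        f *= i
    return f

# Print the stack-based bytecode of the function; the instruction
# stream differs between CPython versions, but it is always stack-based.
dis.dis(fact_iter)

print(fact_iter(5))  # 120
```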
== Abandon registervm ==
When I started the project, I never cleared registers: objects were
kept alive until the function exit, when the frame is cleared. This
changes object lifetimes and so the Python semantics. Again, when this
was discussed on python-dev, it was a no-go.
I introduced the CLEAR_REG instruction to "clear a register" (set its
value to NULL). But I did it late in the development, when the code
base was already quite complex.
I had many bugs in registervm operations which failed to understand
conditional jumps (like loops). My code to track the lifetime of
registers and variables was buggy. I bought a book about compilers
which clearly explains how to implement that correctly, but I lost
interest and never fixed my code.
I decided to give up on this project: once I fixed registervm to no
longer change the Python semantics, the speedup was less impressive,
while the code had become tricky to modify and maintain.
== read-only Python ==
I wanted to experiment with optimizations which rely on assumptions
like "this is not going to change".
In 2014, I created my "read-only" CPython fork which tries to make
module namespaces read-only to implement various optimizations.
https://faster-cpython.readthedocs.io/readonly.html
Quickly, I understood that everything always changes in Python,
especially if you want to support monkey-patching.
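A two-line example shows the problem: even a module attribute that looks like a constant can be rebound at any time, so a "read-only namespace" assumption is unsound in general Python code (everything below is stdlib):

```python
# Why read-only module namespaces conflict with monkey-patching:
# any module attribute, even one that looks constant, can be rebound.
import math

def circumference(r):
    return 2 * math.pi * r

print(circumference(1))      # 6.283185307179586

# Any code (e.g. a test suite mocking things out) may rebind it:
math.pi = 3
print(circumference(1))      # 6

math.pi = 3.141592653589793  # restore the original value
```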
== FAT Python ==
That's how I came up with my FAT Python project in 2016, which
implements guards at runtime. If "something changes" (an assumption
becomes false), the function is de-optimized.
https://faster-cpython.readthedocs.io/fat_python.html
It relies on fatoptimizer, an ahead-of-time compiler which emits
specialized code with guards.
The "fat" project implements guards like "check that a global was not
modified" or "check that the function code was not modified". One core
feature of the fat guards was a new "timestamp" attribute added to the
Python dict type, to efficiently check whether a dictionary was
modified without actually doing a lookup.
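The guard/de-optimize mechanism can be sketched in pure Python. This is only an illustration of the idea: the real "fat" guards are implemented in C and use the dict version tag, which is not accessible from Python, so this sketch re-checks the value instead; all class and function names here are made up:

```python
# Pure-Python sketch of the guard idea: run specialized code only
# while its assumptions hold, otherwise fall back ("de-optimize").
# The real fat guards check the C-level dict version tag instead.

class GlobalGuard:
    """Guard which fails if a namespace entry was rebound."""
    def __init__(self, namespace, name):
        self.namespace = namespace
        self.name = name
        self.expected = namespace[name]

    def check(self):
        return self.namespace.get(self.name) is self.expected

def call_specialized(guards, specialized, fallback, *args):
    # Fast path while every guard still holds; generic path otherwise.
    if all(guard.check() for guard in guards):
        return specialized(*args)
    return fallback(*args)

namespace = {"factor": 10}
guard = GlobalGuard(namespace, "factor")
fast = lambda x: x * 10                  # specialized: constant folded in
slow = lambda x: x * namespace["factor"]  # generic code

print(call_specialized([guard], fast, slow, 5))  # 50 (fast path)
namespace["factor"] = 2                          # assumption broken
print(call_specialized([guard], fast, slow, 5))  # 10 (fallback)
```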
FAT Python was also quite complex to implement. I had to fix multiple
bugs in Python to support negative offsets in code co_lnotab (now
fixed), and fix various bugs in the compiler and AST modules.
In the end, I failed to prove that my design made Python faster.
I abandoned the project just after David Malcolm implemented
experimental support for function inlining.
I stole the function specialization idea from David ;-) He had proposed
something similar, but without guards, in an old issue.
Note: before FAT Python, I created a first "astoptimizer" project which
implemented optimizations without guards. FAT Python reuses the same
idea of modifying the AST tree and implements similar optimizations,
but inserts guards.
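AST-level optimization of this kind can be sketched with the stdlib ast module. This is a generic constant-folding example in the spirit of astoptimizer, not code from either project, and without guards it is only safe for literals:

```python
import ast

# Minimal AST optimizer: fold constant additions without any guard
# (safe only because both operands are literals).
class ConstantFolder(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first (bottom-up)
        if (isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and isinstance(node.op, ast.Add)):
            folded = ast.Constant(node.left.value + node.right.value)
            return ast.copy_location(folded, node)
        return node

tree = ast.parse("x = 1 + 2 + 3")
tree = ast.fix_missing_locations(ConstantFolder().visit(tree))
print(ast.unparse(tree))  # x = 6
```

FAT Python applied the same kind of AST rewriting to non-literal expressions as well, which is exactly why it needed runtime guards.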
I wrote 3 PEPs for FAT Python:
"PEP 509 -- Add a private version to dict" -- implemented in Python 3.6
https://www.python.org/dev/peps/pep-0509
"PEP 510 -- Specialize functions with guards" -- rejected
https://www.python.org/dev/peps/pep-0510/
=> I failed to prove that the FAT Python design (function
specialization, guards) makes Python significantly faster.
"PEP 511 -- API for code transformers" -- rejected
https://www.python.org/dev/peps/pep-0511/
=> importlib hooks or other solutions can be used instead. The key
(rejected) feature was to allow a different .pyc filename depending on
a command line option "-o OPTIMIZER" (ex: os.cpython-36.fat-0.pyc
instead of os.cpython-36.opt-0.pyc).
I gave a talk "FAT Python: a new static optimizer for Python 3.6"
at EuroPython (Bilbao, Spain) in July 2016:
*
https://github.com/vstinner/talks/raw/master/2016-EuroPython-Bilbao/fat_pyt…
* https://www.youtube.com/watch?v=zFl9RAfbSXE
== Fix the benchmark suite ==
In 2017, I wrote pyperf to get reproducible benchmark results. Then I
modified the benchmark suite to use it and renamed the suite
"pyperformance".
* https://pyperf.readthedocs.io/
* https://pyperformance.readthedocs.io/
Articles about my journey:
https://vstinner.readthedocs.io/benchmark.html#my-articles
Talk at FOSDEM (Belgium) in February 2017: "How to run a stable
benchmark", slides + video:
*
https://github.com/vstinner/talks/raw/master/2017-FOSDEM-Brussels/howto_run…
* https://archive.fosdem.org/2017/schedule/event/python_stable_benchmark/
== My advice for 2019 ==
Micro-optimizing the CPython ceval loop is not enough to make Python
competitive with other programming languages like Go or Rust. We need
a different approach.
I gave a keynote "Python Performance: Past, Present and Future"
at EuroPython (Basel, Switzerland) last month. I proposed 3 solutions:
* new PyHandle C API: see https://github.com/pyhandle/hpy
* Tracing garbage collector for CPython
* CPython subinterpreters
Slides + video:
*
https://github.com/vstinner/talks/blob/master/2019-EuroPython/python_perfor…
* https://www.youtube.com/watch?v=T6vC_LOHBJ4&feature=youtu.be&t=1875
Victor