[Python-Dev] Opcode cache in ceval loop
yselivanov.ml at gmail.com
Mon Feb 1 14:10:40 EST 2016
This is the second email thread I've started regarding implementing an
opcode cache in the ceval loop. Since my first post on this topic:
- I've implemented another optimization (LOAD_ATTR);
- I've added detailed statistics mode so that I can "see" how the cache
performs and tune it;
- some macro benchmarks are now 10-20% faster; 2to3 (a real application)
is 7-8% faster;
- and I have some good insights on the memory footprint.
** The purpose of this email is to get a general approval from
python-dev, so that I can start polishing the patches and getting them
reviewed and committed. **
Summary of optimizations
When a code object is executed more than ~1000 times, it's considered
"hot". It gets its opcodes analyzed to initialize caches for
LOAD_METHOD (a new opcode I propose to add in issue 26110), LOAD_ATTR,
and LOAD_GLOBAL.
It's important to only optimize code objects that were executed "enough"
times, to avoid optimizing code objects for modules, classes, and
functions that were imported but never used.
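The "hot" threshold can be sketched in pure Python. This is a hypothetical model, not CPython's actual C implementation; the class and attribute names are made up for illustration:

```python
# Sketch: only allocate a cache once a code object has run "enough"
# times, so module/class bodies executed once at import never pay for it.
HOT_THRESHOLD = 1000  # ~1000 executions, per the post


class CodeObject:
    """Hypothetical stand-in for a CPython code object."""

    def __init__(self, name):
        self.name = name
        self.run_count = 0
        self.cache = None  # allocated lazily, only for hot code

    def run(self):
        self.run_count += 1
        if self.cache is None and self.run_count > HOT_THRESHOLD:
            self.cache = {}  # stands in for the per-opcode cache structs


mod_body = CodeObject("module")  # a module body: executed once at import
mod_body.run()

hot_fn = CodeObject("hot_fn")    # a frequently called function
for _ in range(1500):
    hot_fn.run()
```

After this runs, `mod_body` still has no cache, while `hot_fn` crossed the threshold and got one.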
The cache struct is defined in code.h, and is 32 bytes long. When a
code object becomes hot, it gets a cache offset table allocated for it
(+1 byte for each opcode), plus an array of cache structs.
To measure the max/average memory impact, I tuned my code to optimize
*every* code object on *first* run. Then I ran the entire Python test
suite. The Python test suite + standard library together contain around
72,395 code objects, which required 20 MB of memory for caches. The test
process consumed around 400 MB of memory. Thus, in the absolute
worst-case scenario, the overhead is about 5%.
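The figures above can be sanity-checked with a little arithmetic (the per-object average is my own derived number, not one stated in the post):

```python
# Worst-case overhead: 20 MB of caches in a 400 MB process.
cache_mb = 20
process_mb = 400
overhead = cache_mb / process_mb  # fraction of total memory

# Average cache footprint per code object, derived from the post's numbers.
num_code_objects = 72_395
avg_bytes_per_code_object = cache_mb * 1024 * 1024 / num_code_objects
```

The overhead works out to 5%, and the average is roughly 290 bytes per code object, consistent with a 32-byte struct per cached opcode plus the 1-byte-per-opcode offset table.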
Then I ran the test suite with the patch unmodified. This means that
only code objects that are called frequently enough are optimized. In
this mode, only 2072 code objects were optimized, using less than 1 MB
of memory for the cache.
Damien George mentioned that they optimize a lot of dict lookups in
MicroPython by memoizing the last key/value offset in the dict object,
thus eliminating lots of hash lookups. I've implemented this
optimization in my patch. The results are quite good: a simple
micro-benchmark shows ~30% speed improvement. Here are some debug
stats:
-- Opcode cache LOAD_ATTR hits = 14778415 (83%)
-- Opcode cache LOAD_ATTR misses = 750 (0%)
-- Opcode cache LOAD_ATTR opts = 282
-- Opcode cache LOAD_ATTR deopts = 60
-- Opcode cache LOAD_ATTR total = 17777912
Each "hit" makes LOAD_ATTR about 30% faster.
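The MicroPython-style idea can be illustrated in pure Python. Plain Python dicts don't expose their internal slot offsets, so this sketch (all names hypothetical) uses an explicit version counter as a stand-in for "the dict hasn't changed since the cached lookup":

```python
class VersionedDict:
    """Hypothetical dict wrapper that tracks mutations with a version tag."""

    def __init__(self):
        self._d = {}
        self.version = 0

    def __setitem__(self, key, value):
        self._d[key] = value
        self.version += 1  # any mutation invalidates cached lookups

    def get(self, key):
        return self._d[key]  # the "slow" path: a real hash lookup


class CachedLookup:
    """Per-opcode cache entry: remembers the last key/value it resolved."""

    def __init__(self):
        self.key = self.value = None
        self.version = -1
        self.hits = self.misses = 0

    def lookup(self, d, key):
        if d.version == self.version and key == self.key:
            self.hits += 1       # fast path: skip the hash lookup entirely
            return self.value
        self.misses += 1         # slow path: do the lookup, refill the cache
        self.value = d.get(key)
        self.key, self.version = key, d.version
        return self.value


d = VersionedDict()
d["attr"] = 42
cache = CachedLookup()
first = cache.lookup(d, "attr")    # miss: fills the cache
second = cache.lookup(d, "attr")   # hit: no hash lookup
d["other"] = 1                     # mutation bumps the version...
third = cache.lookup(d, "attr")    # ...so this is a miss again
```

The real patch caches an offset into the dict's internal table rather than the value itself, but the hit/miss/invalidation behavior is the same shape as above.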
This turned out to be a very stable optimization. Here is the debug
output of the 2to3 test:
-- Opcode cache LOAD_GLOBAL hits = 3940647 (100%)
-- Opcode cache LOAD_GLOBAL misses = 0 (0%)
-- Opcode cache LOAD_GLOBAL opts = 252
All benchmarks (and real code) have stats like that. Globals and
builtins are very rarely modified, so the cache works really well. With
the LOAD_GLOBAL opcode cache, a global lookup is very cheap; there is no
hash lookup for it at all. It makes optimizations like "def
foo(len=len)" obsolete.
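For context, the "def foo(len=len)" hack binds a builtin to a default argument so each call reads a fast local instead of doing globals-then-builtins dict lookups; the LOAD_GLOBAL cache aims to make that trick unnecessary. A small timeit sketch of the two spellings (function names are mine):

```python
import timeit


def use_global(data):
    # LOAD_GLOBAL: looks up "len" in globals, then builtins, on every call
    return len(data)


def use_local(data, len=len):
    # The hack: "len" is a LOAD_FAST local, no dict lookups at call time
    return len(data)


data = list(range(100))
t_global = timeit.timeit(lambda: use_global(data), number=100_000)
t_local = timeit.timeit(lambda: use_local(data), number=100_000)
```

Both functions compute the same result; the point of the opcode cache is to close the timing gap between them without the extra default argument.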
LOAD_METHOD is a new opcode I propose to add in issue 26110. The idea
is to substitute LOAD_ATTR with it, and avoid instantiation of
BoundMethod objects.
With the cache, we can store a reference to the method descriptor (I use
type->tp_version_tag for cache invalidation, the same thing
_PyType_Lookup is built around).
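The tp_version_tag scheme can be modeled in Python: cache the resolved method descriptor together with the type's version tag, and treat any class mutation as bumping the tag, which de-optimizes the cache entry. A hypothetical sketch (not CPython's C code; names invented):

```python
class TrackedType:
    """Stand-in for a type object with a tp_version_tag-like counter."""

    def __init__(self, methods):
        self.methods = dict(methods)
        self.version_tag = 0

    def set_method(self, name, fn):
        self.methods[name] = fn
        self.version_tag += 1  # any class mutation invalidates caches


class MethodCache:
    """Per-opcode LOAD_METHOD cache entry."""

    def __init__(self):
        self.tag = -1
        self.descr = None

    def load_method(self, tp, name):
        if tp.version_tag == self.tag:
            return self.descr          # hit: skip the full _PyType_Lookup
        self.descr = tp.methods[name]  # miss: full lookup, then cache
        self.tag = tp.version_tag
        return self.descr


tp = TrackedType({"greet": lambda self: "hi"})
cache = MethodCache()
m1 = cache.load_method(tp, "greet")      # miss, fills the cache
m2 = cache.load_method(tp, "greet")      # hit, same descriptor
tp.set_method("greet", lambda self: "hello")  # bumps version_tag
m3 = cache.load_method(tp, "greet")      # deopt: lookup runs again
```

The real win is that the cached descriptor can be called directly, so no bound-method object needs to be created on the fast path.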
The cache makes LOAD_METHOD really efficient. A simple micro-benchmark
shows that with the cache and LOAD_METHOD,
"s.startswith('abc')" becomes as efficient as "s[:3] == 'abc'".
LOAD_METHOD/CALL_FUNCTION without cache is about 20% faster than
LOAD_ATTR/CALL_FUNCTION. With the cache, it's about 30% faster.
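The comparison above is easy to reproduce with timeit; the two spellings below are the exact pair from the post (absolute numbers will vary by interpreter build):

```python
import timeit

s = "abcdef"

# Method-call spelling: LOAD_METHOD/CALL_FUNCTION with the cache avoids
# both the hash lookup and the BoundMethod allocation.
t_method = timeit.timeit("s.startswith('abc')", globals={"s": s},
                         number=100_000)

# Slice-and-compare spelling: no attribute lookup at all.
t_slice = timeit.timeit("s[:3] == 'abc'", globals={"s": s},
                        number=100_000)
```

The claim in the post is that with the caches these two timings become comparable, so readable method calls stop carrying a penalty over the slicing idiom.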
Here's the debug output of the 2to3 benchmark:
-- Opcode cache LOAD_METHOD hits = 5164848 (64%)
-- Opcode cache LOAD_METHOD misses = 12 (0%)
-- Opcode cache LOAD_METHOD opts = 94
-- Opcode cache LOAD_METHOD deopts = 12
-- Opcode cache LOAD_METHOD dct-chk= 1614801
-- Opcode cache LOAD_METHOD total = 7945954
First, I'd like to merge the new LOAD_METHOD opcode, see issue 26110.
It's a very straightforward optimization; the patch is small and easy
to review.
Second, I'd like to merge the new opcode cache, see issue 26219.
All unittests pass. Memory usage increase is very moderate (<1 MB for
the entire test suite), and the performance increase is significant.
The only potential blocker for this is PEP 509 approval (which I'd be
happy to assist with).
What do you think?