[Python-Dev] Opcode cache in ceval loop
yselivanov.ml at gmail.com
Mon Feb 1 14:10:40 EST 2016
This is the second email thread I've started regarding implementing an
opcode cache in the ceval loop. Since my first post on this topic:
- I've implemented another optimization (LOAD_ATTR);
- I've added detailed statistics mode so that I can "see" how the cache
performs and tune it;
- some macro benchmarks are now 10-20% faster; 2to3 (a real application)
is 7-8% faster;
- and I have some good insights on the memory footprint.
** The purpose of this email is to get a general approval from
python-dev, so that I can start polishing the patches and getting them
reviewed and committed. **
Summary of optimizations
When a code object is executed more than ~1000 times, it's considered
"hot". It gets its opcodes analyzed to initialize caches for
LOAD_METHOD (a new opcode I propose to add in issue 26110), LOAD_ATTR,
and LOAD_GLOBAL.
It's important to only optimize code objects that were executed "enough"
times, to avoid optimizing code objects for modules, classes, and
functions that were imported but never used.
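The "hot" threshold can be sketched in pure Python. This is a hypothetical model, not CPython's actual C implementation; the class and attribute names are made up for illustration:

```python
# Sketch: only allocate a cache once a code object has run "enough"
# times, so module/class bodies executed once at import never pay for it.
HOT_THRESHOLD = 1000  # ~1000 executions, per the post


class CodeObject:
    """Hypothetical stand-in for a CPython code object."""

    def __init__(self, name):
        self.name = name
        self.run_count = 0
        self.cache = None  # allocated lazily, only for hot code

    def run(self):
        self.run_count += 1
        if self.cache is None and self.run_count > HOT_THRESHOLD:
            self.cache = {}  # stands in for the per-opcode cache structs


mod_body = CodeObject("module")  # a module body: executed once at import
mod_body.run()

hot_fn = CodeObject("hot_fn")    # a frequently called function
for _ in range(1500):
    hot_fn.run()
```

After this runs, `mod_body` still has no cache, while `hot_fn` crossed the threshold and got one.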
The cache struct is defined in code.h, and is 32 bytes long. When a
code object becomes hot, it gets a cache offset table allocated for it
(+1 byte for each opcode), plus an array of cache structs.
To measure the max/average memory impact, I tuned my code to optimize
*every* code object on *first* run. Then I ran the entire Python test
suite. The Python test suite + standard library together contain around
72,395 code objects, which required 20 MB of memory for caches. The test
process consumed around 400 MB of memory. Thus, in the absolute
worst-case scenario, the overhead is about 5%.
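The figures above can be sanity-checked with a little arithmetic (the per-object average is my own derived number, not one stated in the post):

```python
# Worst-case overhead: 20 MB of caches in a 400 MB process.
cache_mb = 20
process_mb = 400
overhead = cache_mb / process_mb  # fraction of total memory

# Average cache footprint per code object, derived from the post's numbers.
num_code_objects = 72_395
avg_bytes_per_code_object = cache_mb * 1024 * 1024 / num_code_objects
```

The overhead works out to 5%, and the average is roughly 290 bytes per code object, consistent with a 32-byte struct per cached opcode plus the 1-byte-per-opcode offset table.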
Then I ran the test suite with the patch unmodified. This means that
only code objects that are called frequently enough are optimized. In
this mode, only 2072 code objects were optimized, using less than 1 MB
of memory for the cache.
Damien George mentioned that they optimize a lot of dict lookups in
MicroPython by memoizing the last key/value offset in the dict object,
thus eliminating lots of hash lookups. I've implemented this
optimization in my patch. The results are quite good: a simple
micro-benchmark shows ~30% speed improvement. Here are some debug
stats:
-- Opcode cache LOAD_ATTR hits = 14778415 (83%)
-- Opcode cache LOAD_ATTR misses = 750 (0%)
-- Opcode cache LOAD_ATTR opts = 282
-- Opcode cache LOAD_ATTR deopts = 60
-- Opcode cache LOAD_ATTR total = 17777912
Each "hit" makes LOAD_ATTR about 30% faster.
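The MicroPython-style idea can be illustrated in pure Python. Plain Python dicts don't expose their internal slot offsets, so this sketch (all names hypothetical) uses an explicit version counter as a stand-in for "the dict hasn't changed since the cached lookup":

```python
class VersionedDict:
    """Hypothetical dict wrapper that tracks mutations with a version tag."""

    def __init__(self):
        self._d = {}
        self.version = 0

    def __setitem__(self, key, value):
        self._d[key] = value
        self.version += 1  # any mutation invalidates cached lookups

    def get(self, key):
        return self._d[key]  # the "slow" path: a real hash lookup


class CachedLookup:
    """Per-opcode cache entry: remembers the last key/value it resolved."""

    def __init__(self):
        self.key = self.value = None
        self.version = -1
        self.hits = self.misses = 0

    def lookup(self, d, key):
        if d.version == self.version and key == self.key:
            self.hits += 1       # fast path: skip the hash lookup entirely
            return self.value
        self.misses += 1         # slow path: do the lookup, refill the cache
        self.value = d.get(key)
        self.key, self.version = key, d.version
        return self.value


d = VersionedDict()
d["attr"] = 42
cache = CachedLookup()
first = cache.lookup(d, "attr")    # miss: fills the cache
second = cache.lookup(d, "attr")   # hit: no hash lookup
d["other"] = 1                     # mutation bumps the version...
third = cache.lookup(d, "attr")    # ...so this is a miss again
```

The real patch caches an offset into the dict's internal table rather than the value itself, but the hit/miss/invalidation behavior is the same shape as above.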
This turned out to be a very stable optimization. Here is the debug
output of the 2to3 test:
-- Opcode cache LOAD_GLOBAL hits = 3940647 (100%)
-- Opcode cache LOAD_GLOBAL misses = 0 (0%)
-- Opcode cache LOAD_GLOBAL opts = 252
All benchmarks (and real code) have stats like that. Globals and
builtins are very rarely modified, so the cache works really well. With
the LOAD_GLOBAL opcode cache, a global lookup is very cheap; there is no
hash lookup for it at all. It makes optimizations like "def
foo(len=len)" obsolete.
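For context, the "def foo(len=len)" hack binds a builtin to a default argument so each call reads a fast local instead of doing globals-then-builtins dict lookups; the LOAD_GLOBAL cache aims to make that trick unnecessary. A small timeit sketch of the two spellings (function names are mine):

```python
import timeit


def use_global(data):
    # LOAD_GLOBAL: looks up "len" in globals, then builtins, on every call
    return len(data)


def use_local(data, len=len):
    # The hack: "len" is a LOAD_FAST local, no dict lookups at call time
    return len(data)


data = list(range(100))
t_global = timeit.timeit(lambda: use_global(data), number=100_000)
t_local = timeit.timeit(lambda: use_local(data), number=100_000)
```

Both functions compute the same result; the point of the opcode cache is to close the timing gap between them without the extra default argument.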
LOAD_METHOD is a new opcode I propose to add in issue 26110. The idea
is to substitute LOAD_ATTR with it, and avoid instantiation of
BoundMethod objects.
With the cache, we can store a reference to the method descriptor (I use
type->tp_version_tag for cache invalidation, the same thing
_PyType_Lookup is built around).
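The tp_version_tag scheme can be modeled in Python: cache the resolved method descriptor together with the type's version tag, and treat any class mutation as bumping the tag, which de-optimizes the cache entry. A hypothetical sketch (not CPython's C code; names invented):

```python
class TrackedType:
    """Stand-in for a type object with a tp_version_tag-like counter."""

    def __init__(self, methods):
        self.methods = dict(methods)
        self.version_tag = 0

    def set_method(self, name, fn):
        self.methods[name] = fn
        self.version_tag += 1  # any class mutation invalidates caches


class MethodCache:
    """Per-opcode LOAD_METHOD cache entry."""

    def __init__(self):
        self.tag = -1
        self.descr = None

    def load_method(self, tp, name):
        if tp.version_tag == self.tag:
            return self.descr          # hit: skip the full _PyType_Lookup
        self.descr = tp.methods[name]  # miss: full lookup, then cache
        self.tag = tp.version_tag
        return self.descr


tp = TrackedType({"greet": lambda self: "hi"})
cache = MethodCache()
m1 = cache.load_method(tp, "greet")      # miss, fills the cache
m2 = cache.load_method(tp, "greet")      # hit, same descriptor
tp.set_method("greet", lambda self: "hello")  # bumps version_tag
m3 = cache.load_method(tp, "greet")      # deopt: lookup runs again
```

The real win is that the cached descriptor can be called directly, so no bound-method object needs to be created on the fast path.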
The cache makes LOAD_METHOD really efficient. A simple micro-benchmark
shows that with the cache and LOAD_METHOD,
"s.startswith('abc')" becomes as efficient as "s[:3] == 'abc'".
LOAD_METHOD/CALL_FUNCTION without cache is about 20% faster than
LOAD_ATTR/CALL_FUNCTION. With the cache, it's about 30% faster.
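The comparison above is easy to reproduce with timeit; the two spellings below are the exact pair from the post (absolute numbers will vary by interpreter build):

```python
import timeit

s = "abcdef"

# Method-call spelling: LOAD_METHOD/CALL_FUNCTION with the cache avoids
# both the hash lookup and the BoundMethod allocation.
t_method = timeit.timeit("s.startswith('abc')", globals={"s": s},
                         number=100_000)

# Slice-and-compare spelling: no attribute lookup at all.
t_slice = timeit.timeit("s[:3] == 'abc'", globals={"s": s},
                        number=100_000)
```

The claim in the post is that with the caches these two timings become comparable, so readable method calls stop carrying a penalty over the slicing idiom.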
Here's the debug output of the 2to3 benchmark:
-- Opcode cache LOAD_METHOD hits = 5164848 (64%)
-- Opcode cache LOAD_METHOD misses = 12 (0%)
-- Opcode cache LOAD_METHOD opts = 94
-- Opcode cache LOAD_METHOD deopts = 12
-- Opcode cache LOAD_METHOD dct-chk= 1614801
-- Opcode cache LOAD_METHOD total = 7945954
First, I'd like to merge the new LOAD_METHOD opcode, see issue 26110.
It's a very straightforward optimization; the patch is small and easy
to review.
Second, I'd like to merge the new opcode cache, see issue 26219.
All unittests pass. Memory usage increase is very moderate (<1 MB for
the entire test suite), and the performance increase is significant.
The only potential blocker for this is PEP 509 approval (which I'd be
happy to assist with).
What do you think?