<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Mon, 1 Feb 2016 at 11:11 Yury Selivanov <<a href="mailto:yselivanov.ml@gmail.com">yselivanov.ml@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

This is the second email thread I start regarding implementing an opcode<br>

cache in ceval loop.  Since my first post on this topic:<br>

<br>

- I've implemented another optimization (LOAD_ATTR);<br>

<br>

- I've added detailed statistics mode so that I can "see" how the cache<br>

performs and tune it;<br>

<br>

- some macro benchmarks are now 10-20% faster; 2to3 (a real application)<br>

is 7-8% faster;<br>

<br>

- and I have some good insights on the memory footprint.<br>

<br>

** The purpose of this email is to get a general approval from<br>

python-dev, so that I can start polishing the patches and getting them<br>

reviewed/committed. **<br>

<br>

<br>

Summary of optimizations<br>

------------------------<br>

<br>

When a code object is executed more than ~1000 times, it's considered<br>

"hot".  It gets its opcodes analyzed to initialize caches for<br>

LOAD_METHOD (a new opcode I propose to add in [1]), LOAD_ATTR, and<br>

LOAD_GLOBAL.<br>

<br>

It's important to only optimize code objects that were executed "enough"<br>

times, to avoid optimizing code objects for modules, classes, and<br>

functions that were imported but never used.<br>

<br>

The cache struct is defined in code.h [2], and is 32 bytes long. When a<br>

code object becomes hot, it gets an cache offset table allocated for it<br>

(+1 byte for each opcode) + an array of cache structs.<br>

<br>

To measure the max/average memory impact, I tuned my code to optimize<br>

*every* code object on *first* run.  Then I ran the entire Python test<br>

suite.  Python test suite + standard library both contain around 72395<br>

code objects, which required 20Mb of memory for caches.  The test<br>

process consumed around 400Mb of memory.  Thus, the absolute worst case<br>

scenario, the overhead is about 5%.<br>

<br>

Then I ran the test suite without any modifications to the patch. This<br>

means that only code objects that are called frequently enough are<br>

optimized.  In this more, only 2072 code objects were optimized, using<br>

less than 1Mb of memory for the cache.<br>

<br>

<br>

LOAD_ATTR<br>

---------<br>

<br>

Damien George mentioned that they optimize a lot of dict lookups in<br>

MicroPython by memorizing last key/value offset in the dict object, thus<br>

eliminating lots of hash lookups.  I've implemented this optimization in<br>

my patch.  The results are quite good.  A simple micro-benchmark [3]<br>

shows ~30% speed improvement.  Here are some debug stats generated by<br>

2to3 benchmark:<br>

<br>

-- Opcode cache LOAD_ATTR hits     = 14778415 (83%)<br>

-- Opcode cache LOAD_ATTR misses   = 750 (0%)<br>

-- Opcode cache LOAD_ATTR opts     = 282<br>

-- Opcode cache LOAD_ATTR deopts   = 60<br>

-- Opcode cache LOAD_ATTR total    = 17777912<br>

<br>

Each "hit" makes LOAD_ATTR about 30% faster.<br>

<br>

<br>

LOAD_GLOBAL<br>

-----------<br>

<br>

This turned out to be a very stable optimization.  Here is the debug<br>

output of the 2to3 test:<br>

<br>

-- Opcode cache LOAD_GLOBAL hits   = 3940647 (100%)<br>

-- Opcode cache LOAD_GLOBAL misses = 0 (0%)<br>

-- Opcode cache LOAD_GLOBAL opts   = 252<br>

<br>

All benchmarks (and real code) have stats like that.  Globals and<br>

builtins are very rarely modified, so the cache works really well.  With<br>

LOAD_GLOBAL opcode cache, global lookup is very cheap, there is no hash<br>

lookup for it at all.  It makes optimizations like "def foo(len=len)"<br>

obsolete.<br>

<br>

<br>

LOAD_METHOD<br>

-----------<br>

<br>

This is a new opcode I propose to add in [1].  The idea is to substitute<br>

LOAD_ATTR with it, and avoid instantiation of BoundMethod objects.<br>

<br>

With the cache, we can store a reference to the method descriptor (I use<br>

type->tp_version_tag for cache invalidation, the same thing<br>

_PyType_Lookup is built around).<br>

<br>

The cache makes LOAD_METHOD really efficient.  A simple micro-benchmark<br>

like [4], shows that with the cache and LOAD_METHOD,<br>

"s.startswith('abc')" becomes as efficient as "s[:3] == 'abc'".<br>

<br>

LOAD_METHOD/CALL_FUNCTION without cache is about 20% faster than<br>

LOAD_ATTR/CALL_FUNCTION.  With the cache, it's about 30% faster.<br>

<br>

Here's the debug output of the 2to3 benchmark:<br>

<br>

-- Opcode cache LOAD_METHOD hits   = 5164848 (64%)<br>

-- Opcode cache LOAD_METHOD misses = 12 (0%)<br>

-- Opcode cache LOAD_METHOD opts   = 94<br>

-- Opcode cache LOAD_METHOD deopts = 12<br>

-- Opcode cache LOAD_METHOD dct-chk= 1614801<br>

-- Opcode cache LOAD_METHOD total  = 7945954<br>

<br>

<br>

What's next?<br>

------------<br>

<br>

First, I'd like to merge the new LOAD_METHOD opcode, see issue 26110<br>

[1].  It's a very straightforward optimization, the patch is small and<br>

easy to review.</blockquote><div><br></div><div>+1 from me.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Second, I'd like to merge the new opcode cache, see issue 26219 [5].<br>

All unittests pass.  Memory usage increase is very moderate (<1mb for<br>

the entire test suite), and the performance increase is significant.<br>

The only potential blocker for this is PEP 509 approval (which I'd be<br>

happy to assist with).<br></blockquote><div><br></div><div>I think the fact that it improves performance across the board as well as eliminates the various tricks people use to cache global and built-ins, a big +1 from me. I guess that means Victor needs to ask for pronouncement on PEP 509.</div><div> </div><div><br></div><div>BTW, where does LOAD_ATTR fit into all of this?</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

What do you think?<br></blockquote><div><br></div><div>It all looks great to me!</div><div><br></div><div>-Brett</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Thanks,<br>

Yury<br>

<br>

<br>

[1] <a href="http://bugs.python.org/issue26110" rel="noreferrer" target="_blank">http://bugs.python.org/issue26110</a><br>

[2] <a href="https://github.com/1st1/cpython/blob/opcache5/Include/code.h#L10" rel="noreferrer" target="_blank">https://github.com/1st1/cpython/blob/opcache5/Include/code.h#L10</a><br>

[3] <a href="https://gist.github.com/1st1/37d928f1e84813bf1c44" rel="noreferrer" target="_blank">https://gist.github.com/1st1/37d928f1e84813bf1c44</a><br>

[4] <a href="https://gist.github.com/1st1/10588e6e11c4d7c19445" rel="noreferrer" target="_blank">https://gist.github.com/1st1/10588e6e11c4d7c19445</a><br>

[5] <a href="http://bugs.python.org/issue26219" rel="noreferrer" target="_blank">http://bugs.python.org/issue26219</a><br>

<br>

_______________________________________________<br>

Python-Dev mailing list<br>

<a href="mailto:Python-Dev@python.org" target="_blank">Python-Dev@python.org</a><br>

<a href="https://mail.python.org/mailman/listinfo/python-dev" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/python-dev</a><br>

Unsubscribe: <a href="https://mail.python.org/mailman/options/python-dev/brett%40python.org" rel="noreferrer" target="_blank">https://mail.python.org/mailman/options/python-dev/brett%40python.org</a><br>

</blockquote></div></div>