[Python-Dev] Opcode cache in ceval loop
Brett Cannon
brett at python.org
Mon Feb 1 14:30:44 EST 2016
On Mon, 1 Feb 2016 at 11:11 Yury Selivanov <yselivanov.ml at gmail.com> wrote:
> Hi,
>
> This is the second email thread I start regarding implementing an opcode
> cache in ceval loop. Since my first post on this topic:
>
> - I've implemented another optimization (LOAD_ATTR);
>
> - I've added detailed statistics mode so that I can "see" how the cache
> performs and tune it;
>
> - some macro benchmarks are now 10-20% faster; 2to3 (a real application)
> is 7-8% faster;
>
> - and I have some good insights on the memory footprint.
>
> ** The purpose of this email is to get a general approval from
> python-dev, so that I can start polishing the patches and getting them
> reviewed/committed. **
>
>
> Summary of optimizations
> ------------------------
>
> When a code object is executed more than ~1000 times, it's considered
> "hot". It gets its opcodes analyzed to initialize caches for
> LOAD_METHOD (a new opcode I propose to add in [1]), LOAD_ATTR, and
> LOAD_GLOBAL.
>
> It's important to only optimize code objects that were executed "enough"
> times, to avoid optimizing code objects for modules, classes, and
> functions that were imported but never used.
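[To make the hotness heuristic concrete, here is a minimal Python sketch of the idea. It is illustrative only: the real implementation is C code in ceval.c, and the threshold, class, and method names below are invented.]

```python
# Toy model of the "hot code object" heuristic described above. A code
# object runs normally until it crosses a call-count threshold; only then
# is a cache allocated, so module/class bodies that run once at import
# time never pay the memory cost.

HOTNESS_THRESHOLD = 1000  # roughly the ~1000 runs mentioned in the post

class CodeObject:
    def __init__(self, name):
        self.name = name
        self.run_count = 0
        self.cache = None            # allocated lazily, only when hot

    def on_execute(self):
        self.run_count += 1
        if self.cache is None and self.run_count > HOTNESS_THRESHOLD:
            self.cache = {}          # stand-in for the per-opcode cache structs

hot = CodeObject("busy_function")
for _ in range(HOTNESS_THRESHOLD + 1):
    hot.on_execute()
assert hot.cache is not None         # crossed the threshold: optimized

cold = CodeObject("module_body")
cold.on_execute()
assert cold.cache is None            # ran once at import: never optimized
```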
>
> The cache struct is defined in code.h [2], and is 32 bytes long. When a
> code object becomes hot, a cache offset table is allocated for it (one
> byte per opcode), plus an array of cache structs.
>
> To measure the maximum/average memory impact, I tuned my code to optimize
> *every* code object on *first* run. Then I ran the entire Python test
> suite. The Python test suite and standard library together contain around
> 72395 code objects, which required 20 MB of memory for caches. The test
> process consumed around 400 MB of memory, so in the absolute worst-case
> scenario the overhead is about 5%.
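[As a sanity check on those worst-case figures, using only the numbers quoted above:]

```python
# Back-of-the-envelope check of the worst-case memory numbers above.
code_objects = 72395
cache_bytes = 20 * 1024 * 1024       # ~20 MB of caches, every object optimized
process_bytes = 400 * 1024 * 1024    # ~400 MB total for the test process

per_object = cache_bytes / code_objects   # average cache footprint per code object
overhead = cache_bytes / process_bytes    # fraction of total process memory

assert 280 < per_object < 300             # roughly 290 bytes per code object
assert abs(overhead - 0.05) < 1e-12       # i.e. about 5%
```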
>
> Then I ran the test suite with the patch unmodified, so that only code
> objects that are called frequently enough are optimized. In this mode,
> only 2072 code objects were optimized, using less than 1 MB of memory
> for the cache.
>
>
> LOAD_ATTR
> ---------
>
> Damien George mentioned that they optimize a lot of dict lookups in
> MicroPython by memoizing the last key/value offset in the dict object,
> thus eliminating many hash lookups. I've implemented this optimization in
> my patch. The results are quite good. A simple micro-benchmark [3]
> shows ~30% speed improvement. Here are some debug stats generated by
> 2to3 benchmark:
>
> -- Opcode cache LOAD_ATTR hits = 14778415 (83%)
> -- Opcode cache LOAD_ATTR misses = 750 (0%)
> -- Opcode cache LOAD_ATTR opts = 282
> -- Opcode cache LOAD_ATTR deopts = 60
> -- Opcode cache LOAD_ATTR total = 17777912
>
> Each "hit" makes LOAD_ATTR about 30% faster.
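[The MicroPython-style trick can be modelled in pure Python roughly as follows. This is only a sketch: the real code works on the C-level dict entries and has proper invalidation, and the class here is invented for illustration.]

```python
# Toy model of the LOAD_ATTR trick: remember the slot index where a key
# was found last time, and on the next lookup probe that slot first,
# skipping the full lookup (a hash probe in a real dict) when the slot
# still holds the same key.

class OffsetCachedDict:
    def __init__(self):
        self.entries = []        # [key, value] slots, in insertion order
        self.last_offset = 0     # memoized slot of the last successful lookup
        self.probes_saved = 0

    def __setitem__(self, key, value):
        for slot in self.entries:
            if slot[0] == key:
                slot[1] = value
                return
        self.entries.append([key, value])

    def __getitem__(self, key):
        # Fast path: check the memoized slot before doing a full lookup.
        if self.last_offset < len(self.entries):
            slot = self.entries[self.last_offset]
            if slot[0] == key:
                self.probes_saved += 1
                return slot[1]
        # Slow path: ordinary lookup, then memoize the offset.
        for i, slot in enumerate(self.entries):
            if slot[0] == key:
                self.last_offset = i
                return slot[1]
        raise KeyError(key)

ns = OffsetCachedDict()
ns["x"] = 1
ns["y"] = 2
ns["y"]                          # slow path: memoizes offset of "y"
ns["y"]                          # fast path: hits the memoized slot
assert ns.probes_saved == 1
```

In attribute-heavy code the same LOAD_ATTR instruction tends to resolve the same key in the same dict layout over and over, which is why the hit rate in the stats above is so high.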
>
>
> LOAD_GLOBAL
> -----------
>
> This turned out to be a very stable optimization. Here is the debug
> output of the 2to3 test:
>
> -- Opcode cache LOAD_GLOBAL hits = 3940647 (100%)
> -- Opcode cache LOAD_GLOBAL misses = 0 (0%)
> -- Opcode cache LOAD_GLOBAL opts = 252
>
> All benchmarks (and real code) show stats like these. Globals and
> builtins are very rarely modified, so the cache works really well. With
> the LOAD_GLOBAL opcode cache, a global lookup is very cheap; there is no
> hash lookup for it at all. It makes optimizations like "def foo(len=len)"
> obsolete.
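[A hedged model of why the LOAD_GLOBAL cache hits so often: the globals dict carries a version tag (the PEP 509 mechanism the post mentions as a potential blocker), and the per-opcode cache stores a (version, value) pair, skipping the dict lookup entirely while the tag is unchanged. The class and method names below are invented for illustration.]

```python
# Toy model of the LOAD_GLOBAL cache built on a dict version tag.

class VersionedDict:
    """Dict with a version tag, modelling PEP 509's ma_version_tag."""
    def __init__(self):
        self._d = {}
        self.version = 0

    def __setitem__(self, key, value):
        self._d[key] = value
        self.version += 1        # any mutation bumps the tag

    def __getitem__(self, key):
        return self._d[key]

class LoadGlobalCache:
    """Cache slot attached to one LOAD_GLOBAL instruction."""
    def __init__(self):
        self.version = -1
        self.value = None
        self.hits = self.misses = 0

    def load(self, globals_dict, name):
        if globals_dict.version == self.version:
            self.hits += 1                  # hit: no dict lookup at all
            return self.value
        self.misses += 1                    # miss: real lookup, then refill
        self.value = globals_dict[name]
        self.version = globals_dict.version
        return self.value

g = VersionedDict()
g["len"] = len                              # set once, at "import time"
cache = LoadGlobalCache()
cache.load(g, "len")                        # first run: miss, fills the cache
cache.load(g, "len")                        # every later run: hit
assert (cache.hits, cache.misses) == (1, 1)
```

Module globals are typically written once at import time and rarely mutated afterwards, so the version tag almost never changes; that is why the stats above show an essentially 100% hit rate.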
>
>
> LOAD_METHOD
> -----------
>
> This is a new opcode I propose to add in [1]. The idea is to substitute
> it for LOAD_ATTR in method calls, avoiding the instantiation of
> bound-method objects.
>
> With the cache, we can store a reference to the method descriptor (I use
> type->tp_version_tag for cache invalidation, the same thing
> _PyType_Lookup is built around).
>
> The cache makes LOAD_METHOD really efficient. A simple micro-benchmark
> like [4], shows that with the cache and LOAD_METHOD,
> "s.startswith('abc')" becomes as efficient as "s[:3] == 'abc'".
>
> LOAD_METHOD/CALL_FUNCTION without cache is about 20% faster than
> LOAD_ATTR/CALL_FUNCTION. With the cache, it's about 30% faster.
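[What LOAD_METHOD buys can be illustrated from pure Python. This is a model, not the opcode's implementation: "o.m()" compiled as LOAD_ATTR + CALL_FUNCTION first materializes a bound-method object, while LOAD_METHOD pushes the underlying function and the instance and calls function(o, ...) directly, skipping that allocation.]

```python
class Greeter:
    def greet(self, name):
        return "hello " + name

g = Greeter()

# LOAD_ATTR path: attribute access creates a fresh bound-method object.
bound = g.greet                      # a new method object is allocated here
assert bound("world") == "hello world"

# LOAD_METHOD path (modelled): fetch the plain function from the type and
# pass the instance explicitly; no bound method is instantiated.
func = type(g).__dict__["greet"]
assert func(g, "world") == "hello world"

# Each attribute access yields a *new* bound-method object, which is the
# per-call allocation LOAD_METHOD avoids:
assert g.greet is not g.greet
```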
>
> Here's the debug output of the 2to3 benchmark:
>
> -- Opcode cache LOAD_METHOD hits = 5164848 (64%)
> -- Opcode cache LOAD_METHOD misses = 12 (0%)
> -- Opcode cache LOAD_METHOD opts = 94
> -- Opcode cache LOAD_METHOD deopts = 12
> -- Opcode cache LOAD_METHOD dct-chk= 1614801
> -- Opcode cache LOAD_METHOD total = 7945954
>
>
> What's next?
> ------------
>
> First, I'd like to merge the new LOAD_METHOD opcode, see issue 26110
> [1]. It's a very straightforward optimization, the patch is small and
> easy to review.
+1 from me.
>
> Second, I'd like to merge the new opcode cache, see issue 26219 [5].
> All unittests pass. Memory usage increase is very moderate (<1 MB for
> the entire test suite), and the performance increase is significant.
> The only potential blocker for this is PEP 509 approval (which I'd be
> happy to assist with).
>
Given that it improves performance across the board and eliminates the
various tricks people use to cache globals and built-ins, a big +1 from me.
I guess that means Victor needs to ask for pronouncement on PEP 509.
BTW, where does LOAD_ATTR fit into all of this?
>
> What do you think?
>
It all looks great to me!
-Brett
>
> Thanks,
> Yury
>
>
> [1] http://bugs.python.org/issue26110
> [2] https://github.com/1st1/cpython/blob/opcache5/Include/code.h#L10
> [3] https://gist.github.com/1st1/37d928f1e84813bf1c44
> [4] https://gist.github.com/1st1/10588e6e11c4d7c19445
> [5] http://bugs.python.org/issue26219
>
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/brett%40python.org
>