
Hi,

This is the second email thread I start regarding implementing an opcode cache in the ceval loop. Since my first post on this topic:

- I've implemented another optimization (LOAD_ATTR);
- I've added a detailed statistics mode so that I can "see" how the cache performs and tune it;
- some macro benchmarks are now 10-20% faster; 2to3 (a real application) is 7-8% faster;
- and I have some good insights on the memory footprint.

** The purpose of this email is to get a general approval from python-dev, so that I can start polishing the patches and getting them reviewed/committed. **

Summary of optimizations ------------------------

When a code object is executed more than ~1000 times, it's considered "hot". It gets its opcodes analyzed to initialize caches for LOAD_METHOD (a new opcode I propose to add in [1]), LOAD_ATTR, and LOAD_GLOBAL.

It's important to only optimize code objects that were executed "enough" times, to avoid optimizing code objects for modules, classes, and functions that were imported but never used.

The cache struct is defined in code.h [2], and is 32 bytes long. When a code object becomes hot, it gets a cache offset table allocated for it (+1 byte for each opcode) + an array of cache structs.

To measure the max/average memory impact, I tuned my code to optimize *every* code object on *first* run. Then I ran the entire Python test suite. The Python test suite + standard library contain around 72395 code objects, which required 20Mb of memory for caches. The test process consumed around 400Mb of memory. Thus, in the absolute worst case scenario, the overhead is about 5%.

Then I ran the test suite without any modifications to the patch. This means that only code objects that are called frequently enough are optimized. In this mode, only 2072 code objects were optimized, using less than 1Mb of memory for the cache.

LOAD_ATTR ---------

Damien George mentioned that they optimize a lot of dict lookups in MicroPython by memorizing the last key/value offset in the dict object, thus eliminating lots of hash lookups. I've implemented this optimization in my patch. The results are quite good. A simple micro-benchmark [3] shows ~30% speed improvement. Here are some debug stats generated by the 2to3 benchmark:

-- Opcode cache LOAD_ATTR hits = 14778415 (83%)
-- Opcode cache LOAD_ATTR misses = 750 (0%)
-- Opcode cache LOAD_ATTR opts = 282
-- Opcode cache LOAD_ATTR deopts = 60
-- Opcode cache LOAD_ATTR total = 17777912

Each "hit" makes LOAD_ATTR about 30% faster.

LOAD_GLOBAL -----------

This turned out to be a very stable optimization. Here is the debug output of the 2to3 test:

-- Opcode cache LOAD_GLOBAL hits = 3940647 (100%)
-- Opcode cache LOAD_GLOBAL misses = 0 (0%)
-- Opcode cache LOAD_GLOBAL opts = 252

All benchmarks (and real code) have stats like that. Globals and builtins are very rarely modified, so the cache works really well. With the LOAD_GLOBAL opcode cache, global lookup is very cheap: there is no hash lookup for it at all. It makes optimizations like "def foo(len=len)" obsolete.

LOAD_METHOD -----------

This is a new opcode I propose to add in [1]. The idea is to substitute LOAD_ATTR with it, and avoid instantiation of BoundMethod objects.

With the cache, we can store a reference to the method descriptor (I use type->tp_version_tag for cache invalidation, the same thing _PyType_Lookup is built around).

The cache makes LOAD_METHOD really efficient. A simple micro-benchmark like [4] shows that with the cache and LOAD_METHOD, "s.startswith('abc')" becomes as efficient as "s[:3] == 'abc'".

LOAD_METHOD/CALL_FUNCTION without the cache is about 20% faster than LOAD_ATTR/CALL_FUNCTION. With the cache, it's about 30% faster.

Here's the debug output of the 2to3 benchmark:

-- Opcode cache LOAD_METHOD hits = 5164848 (64%)
-- Opcode cache LOAD_METHOD misses = 12 (0%)
-- Opcode cache LOAD_METHOD opts = 94
-- Opcode cache LOAD_METHOD deopts = 12
-- Opcode cache LOAD_METHOD dct-chk= 1614801
-- Opcode cache LOAD_METHOD total = 7945954

What's next? ------------

First, I'd like to merge the new LOAD_METHOD opcode, see issue 26110 [1]. It's a very straightforward optimization, the patch is small and easy to review.

Second, I'd like to merge the new opcode cache, see issue 26219 [5]. All unittests pass. Memory usage increase is very moderate (<1Mb for the entire test suite), and the performance increase is significant. The only potential blocker for this is PEP 509 approval (which I'd be happy to assist with).

What do you think?

Thanks,
Yury

[1] http://bugs.python.org/issue26110
[2] https://github.com/1st1/cpython/blob/opcache5/Include/code.h#L10
[3] https://gist.github.com/1st1/37d928f1e84813bf1c44
[4] https://gist.github.com/1st1/10588e6e11c4d7c19445
[5] http://bugs.python.org/issue26219
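For readers who don't open the gist, a rough reconstruction of the kind of comparison micro-benchmark [4] makes (assumed shape, not the actual gist):

    import timeit

    setup = "s = 'abcdefghij'"
    # On a patched interpreter, the cached LOAD_METHOD path should bring the
    # method-call form close to the slice-and-compare form.
    t_method = timeit.timeit("s.startswith('abc')", setup=setup, number=10**7)
    t_slice = timeit.timeit("s[:3] == 'abc'", setup=setup, number=10**7)
    print("startswith: %.3fs   slice+compare: %.3fs" % (t_method, t_slice))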

On Mon, 1 Feb 2016 at 11:11 Yury Selivanov <yselivanov.ml@gmail.com> wrote:
Hi,
This is the second email thread I start regarding implementing an opcode cache in ceval loop. Since my first post on this topic:
- I've implemented another optimization (LOAD_ATTR);
- I've added detailed statistics mode so that I can "see" how the cache performs and tune it;
- some macro benchmarks are now 10-20% faster; 2to3 (a real application) is 7-8% faster;
- and I have some good insights on the memory footprint.
** The purpose of this email is to get a general approval from python-dev, so that I can start polishing the patches and getting them reviewed/committed. **
Summary of optimizations ------------------------
When a code object is executed more than ~1000 times, it's considered "hot". It gets its opcodes analyzed to initialize caches for LOAD_METHOD (a new opcode I propose to add in [1]), LOAD_ATTR, and LOAD_GLOBAL.
It's important to only optimize code objects that were executed "enough" times, to avoid optimizing code objects for modules, classes, and functions that were imported but never used.
The cache struct is defined in code.h [2], and is 32 bytes long. When a code object becomes hot, it gets a cache offset table allocated for it (+1 byte for each opcode) + an array of cache structs.
To measure the max/average memory impact, I tuned my code to optimize *every* code object on *first* run. Then I ran the entire Python test suite. The Python test suite + standard library contain around 72395 code objects, which required 20Mb of memory for caches. The test process consumed around 400Mb of memory. Thus, in the absolute worst case scenario, the overhead is about 5%.
Then I ran the test suite without any modifications to the patch. This means that only code objects that are called frequently enough are optimized. In this mode, only 2072 code objects were optimized, using less than 1Mb of memory for the cache.
LOAD_ATTR ---------
Damien George mentioned that they optimize a lot of dict lookups in MicroPython by memorizing last key/value offset in the dict object, thus eliminating lots of hash lookups. I've implemented this optimization in my patch. The results are quite good. A simple micro-benchmark [3] shows ~30% speed improvement. Here are some debug stats generated by 2to3 benchmark:
-- Opcode cache LOAD_ATTR hits = 14778415 (83%)
-- Opcode cache LOAD_ATTR misses = 750 (0%)
-- Opcode cache LOAD_ATTR opts = 282
-- Opcode cache LOAD_ATTR deopts = 60
-- Opcode cache LOAD_ATTR total = 17777912
Each "hit" makes LOAD_ATTR about 30% faster.
LOAD_GLOBAL -----------
This turned out to be a very stable optimization. Here is the debug output of the 2to3 test:
-- Opcode cache LOAD_GLOBAL hits = 3940647 (100%)
-- Opcode cache LOAD_GLOBAL misses = 0 (0%)
-- Opcode cache LOAD_GLOBAL opts = 252
All benchmarks (and real code) have stats like that. Globals and builtins are very rarely modified, so the cache works really well. With LOAD_GLOBAL opcode cache, global lookup is very cheap, there is no hash lookup for it at all. It makes optimizations like "def foo(len=len)" obsolete.
LOAD_METHOD -----------
This is a new opcode I propose to add in [1]. The idea is to substitute LOAD_ATTR with it, and avoid instantiation of BoundMethod objects.
With the cache, we can store a reference to the method descriptor (I use type->tp_version_tag for cache invalidation, the same thing _PyType_Lookup is built around).
The cache makes LOAD_METHOD really efficient. A simple micro-benchmark like [4], shows that with the cache and LOAD_METHOD, "s.startswith('abc')" becomes as efficient as "s[:3] == 'abc'".
LOAD_METHOD/CALL_FUNCTION without cache is about 20% faster than LOAD_ATTR/CALL_FUNCTION. With the cache, it's about 30% faster.
Here's the debug output of the 2to3 benchmark:
-- Opcode cache LOAD_METHOD hits = 5164848 (64%)
-- Opcode cache LOAD_METHOD misses = 12 (0%)
-- Opcode cache LOAD_METHOD opts = 94
-- Opcode cache LOAD_METHOD deopts = 12
-- Opcode cache LOAD_METHOD dct-chk= 1614801
-- Opcode cache LOAD_METHOD total = 7945954
What's next? ------------
First, I'd like to merge the new LOAD_METHOD opcode, see issue 26110 [1]. It's a very straightforward optimization, the patch is small and easy to review.
+1 from me.
Second, I'd like to merge the new opcode cache, see issue 26219 [5]. All unittests pass. Memory usage increase is very moderate (<1mb for the entire test suite), and the performance increase is significant. The only potential blocker for this is PEP 509 approval (which I'd be happy to assist with).
The fact that it improves performance across the board, as well as eliminating the various tricks people use to cache globals and built-ins, earns a big +1 from me. I guess that means Victor needs to ask for pronouncement on PEP 509. BTW, where does LOAD_ATTR fit into all of this?
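For reference, the kind of trick this would make obsolete (a hypothetical example of the "def foo(len=len)" pattern mentioned above):

    # A typical hand-written cache of a builtin via a default argument,
    # used to skip repeated LOAD_GLOBAL lookups inside a hot loop:
    def count_short(items, len=len):
        return sum(1 for x in items if len(x) < 10)

    # With the LOAD_GLOBAL opcode cache, the plain spelling should be just
    # as fast, making the default-argument hack unnecessary:
    def count_short_plain(items):
        return sum(1 for x in items if len(x) < 10)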
What do you think?
It all looks great to me! -Brett
Thanks, Yury
[1] http://bugs.python.org/issue26110
[2] https://github.com/1st1/cpython/blob/opcache5/Include/code.h#L10
[3] https://gist.github.com/1st1/37d928f1e84813bf1c44
[4] https://gist.github.com/1st1/10588e6e11c4d7c19445
[5] http://bugs.python.org/issue26219

Hi Brett, On 2016-02-01 2:30 PM, Brett Cannon wrote:
On Mon, 1 Feb 2016 at 11:11 Yury Selivanov <yselivanov.ml@gmail.com> wrote:
Hi,
[..]
What's next? ------------
First, I'd like to merge the new LOAD_METHOD opcode, see issue 26110 [1]. It's a very straightforward optimization, the patch is small and easy to review.
+1 from me.
Second, I'd like to merge the new opcode cache, see issue 26219 [5]. All unittests pass. Memory usage increase is very moderate (<1mb for the entire test suite), and the performance increase is significant. The only potential blocker for this is PEP 509 approval (which I'd be happy to assist with).
The fact that it improves performance across the board, as well as eliminating the various tricks people use to cache globals and built-ins, earns a big +1 from me. I guess that means Victor needs to ask for pronouncement on PEP 509.
Great! AFAIK Victor still needs to update the PEP with some changes (globally unique ma_version). My patch includes the latest implementation of PEP 509, and it works fine (no regressions, no broken unittests). I can also assist with reviewing Victor's implementation if the PEP is accepted.
BTW, where does LOAD_ATTR fit into all of this?
The LOAD_ATTR optimization doesn't use any of the PEP 509 new stuff (if I understand your question correctly). It's based on the following assumptions (that really make JITs work so well):

1. Most classes don't implement __getattribute__.

2. A lot of attributes are stored in objects' __dict__s.

3. Most attributes aren't shadowed by descriptors/getters-setters; most code just uses "self.attr".

4. An average method/function works on objects of the same type. Which means that those objects were constructed in a very similar (if not exact) fashion.

For instance:

    class F:
        def __init__(self, name):
            self.name = name
        def say(self):
            print(self.name)   # <- For all F instances,
                               #    offset of 'name' in `F().__dict__`s
                               #    will be the same

If LOAD_ATTR gets too many cache misses (20 in my current patch) it gets deoptimized, and the default implementation is used. So if the code is very dynamic - there's no improvement, but no performance penalty either.

In my patch, I use the cache to store (for LOAD_ATTR specifically):

- pointer to the object's type
- type->tp_version_tag
- the last successful __dict__ offset

The first two fields are used to make sure that we have objects of the same type. If it changes, we deoptimize the opcode immediately. Then we try the offset. If it's successful - we have a cache hit. If not, that's fine, we'll try another few times before deoptimizing the opcode.
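As a rough Python model of that logic (illustrative only; the real code is C in ceval.c, the type check uses tp_version_tag, and a raw __dict__ offset stands in for the dict key used here):

    class LoadAttrCacheEntry:
        # Toy model of a single LOAD_ATTR cache entry, not the C implementation.
        MAX_MISSES = 20                        # deopt threshold from the patch

        def __init__(self, first_obj, name):
            self.cached_type = type(first_obj)
            self.name = name
            self.misses = 0
            self.deoptimized = False

        def load(self, obj):
            if self.deoptimized:
                return getattr(obj, self.name)     # default LOAD_ATTR path
            if type(obj) is not self.cached_type:
                self.deoptimized = True            # type changed: give up immediately
                return getattr(obj, self.name)
            try:
                return obj.__dict__[self.name]     # fast path -- a cache "hit"
            except (AttributeError, KeyError):     # attribute not where expected
                self.misses += 1
                if self.misses >= self.MAX_MISSES:
                    self.deoptimized = True
                return getattr(obj, self.name)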
What do you think?
It all looks great to me!
Thanks! Yury

On Mon, 1 Feb 2016 at 11:51 Yury Selivanov <yselivanov.ml@gmail.com> wrote:
Hi Brett,
On 2016-02-01 2:30 PM, Brett Cannon wrote:
On Mon, 1 Feb 2016 at 11:11 Yury Selivanov <yselivanov.ml@gmail.com> wrote:
Hi,
[..]
What's next? ------------
First, I'd like to merge the new LOAD_METHOD opcode, see issue 26110 [1]. It's a very straightforward optimization, the patch is small and easy to review.
+1 from me.
Second, I'd like to merge the new opcode cache, see issue 26219 [5]. All unittests pass. Memory usage increase is very moderate (<1mb for the entire test suite), and the performance increase is significant. The only potential blocker for this is PEP 509 approval (which I'd be happy to assist with).
The fact that it improves performance across the board, as well as eliminating the various tricks people use to cache globals and built-ins, earns a big +1 from me. I guess that means Victor needs to ask for pronouncement on PEP 509.
Great! AFAIK Victor still needs to update the PEP with some changes (globally unique ma_version). My patch includes the latest implementation of PEP 509, and it works fine (no regressions, no broken unittests). I can also assist with reviewing Victor's implementation if the PEP is accepted.
BTW, where does LOAD_ATTR fit into all of this?
The LOAD_ATTR optimization doesn't use any of the PEP 509 new stuff (if I understand your question correctly). It's based on the following assumptions (that really make JITs work so well):
1. Most classes don't implement __getattribute__.
2. A lot of attributes are stored in objects' __dict__s.
3. Most attributes aren't shadowed by descriptors/getters-setters; most code just uses "self.attr".
4. An average method/function works on objects of the same type. Which means that those objects were constructed in a very similar (if not exact) fashion.
For instance:
    class F:
        def __init__(self, name):
            self.name = name
        def say(self):
            print(self.name)   # <- For all F instances,
                               #    offset of 'name' in `F().__dict__`s
                               #    will be the same
If LOAD_ATTR gets too many cache misses (20 in my current patch) it gets deoptimized, and the default implementation is used. So if the code is very dynamic - there's no improvement, but no performance penalty either.
In my patch, I use the cache to store (for LOAD_ATTR specifically):
- pointer to the object's type
- type->tp_version_tag
- the last successful __dict__ offset
The first two fields are used to make sure that we have objects of the same type. If it changes, we deoptimize the opcode immediately. Then we try the offset. If it's successful - we have a cache hit. If not, that's fine, we'll try another few times before deoptimizing the opcode.
So this is a third "next step" that has its own issue? -Brett
What do you think?
It all looks great to me!
Thanks!
Yury

Brett, On 2016-02-01 3:08 PM, Brett Cannon wrote:
On Mon, 1 Feb 2016 at 11:51 Yury Selivanov <yselivanov.ml@gmail.com> wrote:
Hi Brett,
[..]
The first two fields are used to make sure that we have objects of the same type. If it changes, we deoptimize the opcode immediately. Then we try the offset. If it's successful - we have a cache hit. If not, that's fine, we'll try another few times before deoptimizing the opcode.
So this is a third "next step" that has its own issue?
It's all in issue http://bugs.python.org/issue26219 right now.

My current plan is to implement LOAD_METHOD/CALL_METHOD (just the opcodes, no cache) in 26110.

Then implement caching for LOAD_METHOD, LOAD_GLOBAL, and LOAD_ATTR in 26219. I'm flexible about breaking 26219 down into three separate issues if that helps the review process (but that would take more of my time):

- implement support for opcode caching (general infrastructure) + LOAD_GLOBAL optimization
- LOAD_METHOD optimization
- LOAD_ATTR optimization

Yury

On Mon, 1 Feb 2016 at 12:16 Yury Selivanov <yselivanov.ml@gmail.com> wrote:
Brett,
On 2016-02-01 3:08 PM, Brett Cannon wrote:
On Mon, 1 Feb 2016 at 11:51 Yury Selivanov <yselivanov.ml@gmail.com> wrote:
Hi Brett,
[..]
The first two fields are used to make sure that we have objects of the same type. If it changes, we deoptimize the opcode immediately. Then we try the offset. If it's successful - we have a cache hit. If not, that's fine, we'll try another few times before deoptimizing the opcode.
So this is a third "next step" that has its own issue?
It's all in issue http://bugs.python.org/issue26219 right now.
My current plan is to implement LOAD_METHOD/CALL_METHOD (just opcodes, no cache) in 26110.
Then implement caching for LOAD_METHOD, LOAD_GLOBAL, and LOAD_ATTR in 26219. I'm flexible about breaking 26219 down into three separate issues if that helps the review process (but that would take more of my time):
- implement support for opcode caching (general infrastructure) + LOAD_GLOBAL optimization
- LOAD_METHOD optimization
- LOAD_ATTR optimization
I personally don't care how you break it down, just trying to keep all the moving pieces in my head. :) Anyway, it sounds like PEP 509 is blocking part of it, but the LOAD_METHOD stuff can go in as-is. So are you truly blocked only on getting the latest version of that patch up to http://bugs.python.org/issue26110 and getting a code review?

On 2016-02-01 3:21 PM, Brett Cannon wrote:
On Mon, 1 Feb 2016 at 12:16 Yury Selivanov <yselivanov.ml@gmail.com> wrote:
Brett,
On 2016-02-01 3:08 PM, Brett Cannon wrote:
On Mon, 1 Feb 2016 at 11:51 Yury Selivanov <yselivanov.ml@gmail.com> wrote:
Hi Brett,
[..]
The first two fields are used to make sure that we have objects of the same type. If it changes, we deoptimize the opcode immediately. Then we try the offset. If it's successful - we have a cache hit. If not, that's fine, we'll try another few times before deoptimizing the opcode.
So this is a third "next step" that has its own issue?
It's all in issue http://bugs.python.org/issue26219 right now.
My current plan is to implement LOAD_METHOD/CALL_METHOD (just opcodes, no cache) in 26110.
Then implement caching for LOAD_METHOD, LOAD_GLOBAL, and LOAD_ATTR in 26219. I'm flexible about breaking 26219 down into three separate issues if that helps the review process (but that would take more of my time):
- implement support for opcode caching (general infrastructure) + LOAD_GLOBAL optimization
- LOAD_METHOD optimization
- LOAD_ATTR optimization
I personally don't care how you break it down, just trying to keep all the moving pieces in my head. :)
Anyway, it sounds like PEP 509 is blocking part of it, but the LOAD_METHOD stuff can go in as-is. So are you truly blocked only on getting the latest version of that patch up to http://bugs.python.org/issue26110 and getting a code review?
Yep. The initial implementation of LOAD_METHOD doesn't need PEP 509 / opcode caching. I'll have to focus on something else this week, but early next week I can upload a new patch for 26110. When we have 26110 committed and PEP 509 approved and committed, I can update the opcode cache patch (issue 26219) and we can start reviewing it. Yury

On 01.02.2016 20:51, Yury Selivanov wrote:
If LOAD_ATTR gets too many cache misses (20 in my current patch) it gets deoptimized, and the default implementation is used. So if the code is very dynamic - there's no improvement, but no performance penalty either.
Will you re-try optimizing it?

On 2016-02-01 3:27 PM, Sven R. Kunze wrote:
On 01.02.2016 20:51, Yury Selivanov wrote:
If LOAD_ATTR gets too many cache misses (20 in my current patch) it gets deoptimized, and the default implementation is used. So if the code is very dynamic - there's no improvement, but no performance penalty either.
Will you re-try optimizing it?
No. It's important to understand that if we have a lot of cache misses after the code object was executed 1000 times, it doesn't make sense to keep trying to update that cache. It just means that the code, at that particular point, works with different kinds of objects. FWIW, I experimented with different ideas (one being to never de-optimize), and the current strategy works best on the vast majority of benchmarks. Yury

On 01.02.2016 21:35, Yury Selivanov wrote:
It's important to understand that if we have a lot of cache misses after the code object was executed 1000 times, it doesn't make sense to keep trying to update that cache. It just means that the code, in that particular point, works with different kinds of objects.
So the assumption is that it's the code that makes the difference here, not time. That could be true for production code.
FWIW, I experimented with different ideas (one being to never de-optimize), and the current strategy works best on the vast majority of benchmarks.
Nice. Regarding the magic constants (1000, 20) what is the process of updating them? Best, Sven

On 2016-02-01 4:02 PM, Sven R. Kunze wrote:
On 01.02.2016 21:35, Yury Selivanov wrote:
It's important to understand that if we have a lot of cache misses after the code object was executed 1000 times, it doesn't make sense to keep trying to update that cache. It just means that the code, in that particular point, works with different kinds of objects.
So, the assumption is that the code makes the difference here not time. That could be true for production code.
FWIW, I experimented with different ideas (one being to never de-optimize), and the current strategy works best on the vast majority of benchmarks.
Nice.
Regarding the magic constants (1000, 20) what is the process of updating them?
Right now they are private constants in ceval.c. I will (maybe) expose a private API via the _testcapi module to re-define them (set them to 1 or 0), only to write better unittests. I have no plans to make those constants public or have a public API to tackle them. IMHO, this is something that almost nobody will ever use. Yury

On 01.02.2016 22:27, Yury Selivanov wrote:
Right now they are private constants in ceval.c.
I will (maybe) expose a private API via the _testcapi module to re-define them (set them to 1 or 0), only to write better unittests. I have no plans to make those constants public or have a public API to tackle them. IMHO, this is something that almost nobody will ever use.
Alright. I agree with you on that. What I actually meant was: how can we find the optimal values? I understand that 1000 and 20 are some hand-figured/subjective values for now. Is there a standardized/objective way to find out the best values? What does "best" even mean here? Best, Sven

Sven, On 2016-02-01 4:32 PM, Sven R. Kunze wrote:
On 01.02.2016 22:27, Yury Selivanov wrote:
Right now they are private constants in ceval.c.
I will (maybe) expose a private API via the _testcapi module to re-define them (set them to 1 or 0), only to write better unittests. I have no plans to make those constants public or have a public API to tackle them. IMHO, this is something that almost nobody will ever use.
Alright. I agree with you on that.
What I actually meant was: how can we find the optimal values? I understand that 1000 and 20 are some hand-figured/subjective values for now.
Is there a standardized/objective way to find out the best values? What does "best" even mean here?
Running lots of benchmarks and micro-benchmarks hundreds of times ;) I've done a lot of that, and I noticed that the numbers don't matter too much.

What matters is that we don't want to optimize the code that runs 0 or 1 times. To save some memory we also don't want to optimize the code that runs only 10 times. So 1000 seems to be about right.

We also need to deoptimize the code to avoid having too many cache misses/pointless cache updates. I found that, for instance, LOAD_ATTR is either super stable (hits 100% of the time) or really unstable, so 20 misses, again, seems to be alright.

I'm flexible about tweaking those values, and I encourage you and everyone to experiment, if you have time ;) https://github.com/1st1/cpython/blob/opcache5/Python/ceval.c#L100

Thanks, Yury
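Schematically, the two constants sit in a model like this (illustrative Python only, not the actual ceval.c code):

    OPT_THRESHOLD = 1000   # calls before a code object is treated as "hot"
    DEOPT_MISSES = 20      # cache misses tolerated before giving up

    class CodeObjectModel:
        def __init__(self):
            self.calls = 0
            self.cache = None      # no memory is spent until the code is hot

        def on_call(self):
            self.calls += 1
            if self.cache is None and self.calls > OPT_THRESHOLD:
                self.cache = {}    # allocate the offset table / cache structs here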

On 01.02.2016 22:43, Yury Selivanov wrote:
Sven,
On 2016-02-01 4:32 PM, Sven R. Kunze wrote:
On 01.02.2016 22:27, Yury Selivanov wrote:
Right now they are private constants in ceval.c.
I will (maybe) expose a private API via the _testcapi module to re-define them (set them to 1 or 0), only to write better unittests. I have no plans to make those constants public or have a public API to tackle them. IMHO, this is something that almost nobody will ever use.
Alright. I agree with you on that.
What I actually meant was: how can we find the optimal values? I understand that 1000 and 20 are some hand-figured/subjective values for now.
Is there standardized/objective way to find out the best values? What does best even mean here?
Running lots of benchmarks and micro-benchmarks hundreds of times ;) I've done a lot of that, and I noticed that the numbers don't matter too much.
That's actually pretty interesting. :) Would you consider writing a blog post about this at some point?
What matters is that we don't want to optimize the code that runs 0 or 1 times. To save some memory we don't want to optimize the code that runs 10 times. So 1000 seems to be about right.
We also need to deoptimize the code to avoid having too many cache misses/pointless cache updates. I found that, for instance, LOAD_ATTR is either super stable (hits 100% of the time) or really unstable, so 20 misses, again, seems to be alright.
I'm flexible about tweaking those values, I encourage you and everyone to experiment, if you have time ;) https://github.com/1st1/cpython/blob/opcache5/Python/ceval.c#L100
Right now, I am busy with the heap implementation but I think I can look into it later. Best, Sven

Hi, On 02/01/2016 10:43 PM, Yury Selivanov wrote:
We also need to deoptimize the code to avoid having too many cache misses/pointless cache updates. I found that, for instance, LOAD_ATTR is either super stable (hits 100% of the time) or really unstable, so 20 misses, again, seems to be alright.
Aren't those hits/misses a way to see how dynamic the code is? I mean, can't the current magic values (manually tweaked on a limited set) be self-tweaked/adapted based on those numbers? Thanks in advance, francis

On 2016-02-03 3:53 PM, francismb wrote:
Hi,
On 02/01/2016 10:43 PM, Yury Selivanov wrote:
We also need to deoptimize the code to avoid having too many cache misses/pointless cache updates. I found that, for instance, LOAD_ATTR is either super stable (hits 100% of the time) or really unstable, so 20 misses, again, seems to be alright.
Aren't those hits/misses a way to see how dynamic the code is? I mean, can't the current magic values (manually tweaked on a limited set) be self-tweaked/adapted based on those numbers?
Probably. One way of tackling this is to give each optimized opcode a counter for hit/misses. When we have a "hit" we increment that counter, when it's a miss, we decrement it. I kind of have something like that right now: https://github.com/1st1/cpython/blob/opcache5/Python/ceval.c#L3035 But I only decrement that counter -- the idea is that LOAD_ATTR is allowed to "miss" only 20 times before getting deoptimized. I'll experiment with inc/dec on hit/miss and see how that affects the performance. An ideal way would be to calculate a hit/miss ratio over time for each cached opcode, but that would be an expensive calculation. Yury
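A sketch of the two counter strategies described above (illustrative Python, not the ceval.c code):

    MISS_BUDGET = 20

    def current_strategy(events):
        # Only decrement on a miss; deoptimize once the budget is spent.
        counter = MISS_BUDGET
        for hit in events:
            if not hit:
                counter -= 1
                if counter == 0:
                    return "deoptimized"
        return "still optimized"

    def inc_dec_strategy(events):
        # Increment on a hit (capped), decrement on a miss, so long runs of
        # hits earn back tolerance for the occasional miss.
        counter = MISS_BUDGET
        for hit in events:
            counter = min(counter + 1, 100) if hit else counter - 1
            if counter <= 0:
                return "deoptimized"
        return "still optimized"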

On 03.02.2016 22:22, Yury Selivanov wrote:
One way of tackling this is to give each optimized opcode a counter for hit/misses. When we have a "hit" we increment that counter, when it's a miss, we decrement it.
Within a given range, I suppose. Like: c = min(c+1, 100)
I kind of have something like that right now: https://github.com/1st1/cpython/blob/opcache5/Python/ceval.c#L3035
But I only decrement that counter -- the idea is that LOAD_ATTR is allowed to "miss" only 20 times before getting deoptimized.
I'll experiment with inc/dec on hit/miss and see how that affects the performance.
An ideal way would be to calculate a hit/miss ratio over time for each cached opcode, but that would be an expensive calculation.

On Feb 3, 2016, at 13:22, Yury Selivanov <yselivanov.ml@gmail.com> wrote:
An ideal way would be to calculate a hit/miss ratio over time for each cached opcode, but that would be an expensive calculation.
Do you mean something like a sliding window? Otherwise, if you just want, let's say, a 20% miss threshold, you increment by 1 on hit and decrement by 4 on miss. On Feb 3, 2016, at 13:37, Sven R. Kunze <srkunze@mail.de> wrote:
On 03.02.2016 22:22, Yury Selivanov wrote:
One way of tackling this is to give each optimized opcode a counter for hit/misses. When we have a "hit" we increment that counter, when it's a miss, we decrement it.
Within a given range, I suppose. Like:
c = min(c+1, 100)
Min might be overkill; maybe you can use an or-mask to limit the window range to 256 consecutive calls? -- M

On 04.02.2016 16:57, Matthias Bussonnier wrote:
On Feb 3, 2016, at 13:22, Yury Selivanov <yselivanov.ml@gmail.com> wrote:
An ideal way would be to calculate a hit/miss ratio over time for each cached opcode, but that would be an expensive calculation. Do you mean something like a sliding window? Otherwise, if you just want, let's say, a 20% miss threshold, you increment by 1 on hit and decrement by 4 on miss.
Division is expensive.
On Feb 3, 2016, at 13:37, Sven R. Kunze <srkunze@mail.de> wrote:
On 03.02.2016 22:22, Yury Selivanov wrote:
One way of tackling this is to give each optimized opcode a counter for hit/misses. When we have a "hit" we increment that counter, when it's a miss, we decrement it. Within a given range, I suppose. Like:
c = min(c+1, 100)
Min might be overkill; maybe you can use an or-mask to limit the window range to 256 consecutive calls?
Sure, that is how I would have written it in Python. But I would suggest an AND mask. ;-) Best, Sven

On Feb 4, 2016, at 08:22, Sven R. Kunze <srkunze@mail.de> wrote:
On 04.02.2016 16:57, Matthias Bussonnier wrote:
On Feb 3, 2016, at 13:22, Yury Selivanov <yselivanov.ml@gmail.com> wrote:
An ideal way would be to calculate a hit/miss ratio over time for each cached opcode, but that would be an expensive calculation. Do you mean something like a sliding window? Otherwise, if you just want, let's say, a 20% miss threshold, you increment by 1 on hit and decrement by 4 on miss.
Division is expensive.
I'm not speaking about division here. If you do +M on a hit and -N on a miss, the counter will decrease on average only if the hit ratio is below N/(M+N), but you do not need to do the division. Then you deoptimize only if you get < 0.
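A quick way to sanity-check that arithmetic (hypothetical +1/-4 pair, giving the 20% miss threshold mentioned earlier):

    import random

    def deoptimizes(miss_rate, inc=1, dec=4, start=20, events=100000):
        # +inc on a hit, -dec on a miss: the counter drifts downward on
        # average (and eventually goes negative) once the miss rate
        # exceeds inc / (inc + dec), i.e. 20% for the +1/-4 pair.
        counter = start
        for _ in range(events):
            counter = counter + inc if random.random() > miss_rate else counter - dec
            if counter < 0:
                return True
        return False

    print(deoptimizes(0.10))   # usually False: 10% misses, the cache survives
    print(deoptimizes(0.30))   # usually True: 30% misses, it gets deoptimized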
On Feb 3, 2016, at 13:37, Sven R. Kunze <srkunze@mail.de> wrote:
On 03.02.2016 22:22, Yury Selivanov wrote:
One way of tackling this is to give each optimized opcode a counter for hit/misses. When we have a "hit" we increment that counter, when it's a miss, we decrement it. Within a given range, I suppose. Like:
c = min(c+1, 100)
Min might be overkill; maybe you can use an or-mask to limit the window range to 256 consecutive calls?
Sure, that is how I would have written it in Python. But I would suggest an AND mask. ;-)
Sure, an implementation detail I would say. I should not write emails before breakfast... The other problem with the mask is that if your increment hits 256 you wrap around back to 0, where it deoptimizes (which is not what you want), so you might need to not mask the sign bit and deoptimize only on a certain negative threshold. Does that make sense? -- M
Best, Sven

On 05.02.2016 00:06, Matthias Bussonnier wrote:
On Feb 4, 2016, at 08:22, Sven R. Kunze <srkunze@mail.de> wrote:
On 04.02.2016 16:57, Matthias Bussonnier wrote:
On Feb 3, 2016, at 13:22, Yury Selivanov <yselivanov.ml@gmail.com> wrote:
An ideal way would be to calculate a hit/miss ratio over time for each cached opcode, but that would be an expensive calculation. Do you mean like a sliding windows ? Otherwise if you just want a let's say 20% miss threshold, you increment by 1 on hit, and decrement by 4 on miss. Division is expensive. I'm not speaking about division here. if you +M / -N the counter will decrease in average only if the hit/miss ratio is below N/(M+N), but you do not need to do the division.
Then you deoptimize only if you get < 0.
I see, but it still looks more complicated. :)
On Feb 3, 2016, at 13:37, Sven R. Kunze <srkunze@mail.de> wrote:
On 03.02.2016 22:22, Yury Selivanov wrote:
One way of tackling this is to give each optimized opcode a counter for hit/misses. When we have a "hit" we increment that counter, when it's a miss, we decrement it. Within a given range, I suppose. Like:
c = min(c+1, 100) Min might be overkill, maybe you can use a or mask, to limit the windows range to 256 consecutive call ? Sure, that is how I would have written it in Python. But I would suggest an AND mask. ;-)
Sure, implementation detail I would say. Should not write emails before breakfast...
;-)
The other problem with the mask is that if your increment hits 256 you wrap around back to 0, where it deoptimizes (which is not what you want), so you might need to not mask the sign bit and deoptimize only on a certain negative threshold.
Does that make sense?
Definitely. I am curious about the actual implementation of this idea. Best, Sven

Looking over the thread and the two issues, you've got good arguments for why the improved code will be the most common code, and good benchmarks for various kinds of real-life code, but it doesn't seem like you've tried to stress it on anything that could be made worse. From your explanations and your code, I wouldn't expect that @classmethods, functions stored in the object dict or generated by __getattr__, non-function callables as methods, etc. would go significantly slower, or code that mixes @properties or __getattr__ proxy attributes with real attributes, or uses __slots__, or code that frequently writes to a global, etc. But it would be nice to _know_ that they don't, instead of just expecting it. Sent from my iPhone

Andrew, On 2016-02-01 4:29 PM, Andrew Barnert wrote:
Looking over the thread and the two issues, you've got good arguments for why the improved code will be the most common code, and good benchmarks for various kinds of real-life code, but it doesn't seem like you'd tried to stress it on anything that could be made worse. From your explanations and your code, I wouldn't expect that @classmethods, functions stored in the object dict or generated by __getattr__, non-function callables as methods, etc. would go significantly slower,
Right. The caching, of course, has some overhead, albeit a barely detectable one. The only way the slowdown might become "significant" is if there is a bug in the ceval.c code -- i.e. an opcode doesn't get de-optimized, etc. That should be fixable.
or code that mixes @properties or __getattr__ proxy attributes with real attributes, or uses __slots__,
No performance degradation for __slots__, we have a benchmark for that. I also tried to add __slots__ to every class in the Richards test - no improvement or performance degradation there.
or code that does frequently write to a global, etc. But it would be nice to _know_ that they don't instead of just expecting it.
FWIW, I've just tried to write a micro-benchmark for __getattr__: https://gist.github.com/1st1/22c1aa0a46f246a31515 The opcode cache gets quickly deoptimized with it, but, as expected, CPython with the opcode cache is <1% slower. But that's 1% in a super micro-benchmark; of course the cost of having a cache that isn't used will show up. In real code that doesn't consist only of LOAD_ATTRs, it won't even be possible to see any slowdown. Thanks, Yury
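The shape of such a __getattr__ stress test is roughly the following (a reconstruction, not the linked gist):

    class Proxy:
        def __init__(self, target):
            self._target = target

        def __getattr__(self, name):
            # Every attribute access lands here, so LOAD_ATTR never finds the
            # attribute at the cached __dict__ offset and quickly deoptimizes.
            return getattr(self._target, name)

    class Point:
        def __init__(self, x, y):
            self.x = x
            self.y = y

    p = Proxy(Point(1, 2))
    total = 0
    for _ in range(1000000):
        total += p.x + p.y
    print(total)   # 3000000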

Hi,

Maybe it's worth writing a PEP to summarize all your changes to optimize CPython? It would avoid having to follow different threads on the mailing lists, different issues on the bug tracker, with external links to GitHub gists, etc. Your code changes critical parts of Python: the code object structure and Python/ceval.c.

At least, it would help to document Python internals ;-) The previous "big" change (optimization) like that was the new "type attribute cache": the addition of tp_version_tag to PyTypeObject. I "documented" it in PEP 509 and it was difficult to rebuild the context, understand the design, etc. https://www.python.org/dev/peps/pep-0509/#method-cache-and-type-version-tag

Victor

Hi Victor, On 2016-02-02 4:33 AM, Victor Stinner wrote:
Hi,
Maybe it's worth to write a PEP to summarize all your changes to optimize CPython? It would avoid to have to follow different threads on the mailing lists, different issues on the bug tracker, with external links to GitHub gists, etc. Your code changes critical parts of Python: code object structure and Python/ceval.c.
Not sure about that... PEPs take a LOT of time :( Besides, all my changes are CPython specific and can be considered as an implementation detail.
At least, it would help to document Python internals ;-)
I can write a ceval.txt file explaining what's going on in ceval loop, with details on the opcode cache and other things. I think it's even better than a PEP, to be honest. Yury

On 02.02.2016 20:41, Yury Selivanov wrote:
Hi Victor,
On 2016-02-02 4:33 AM, Victor Stinner wrote:
Hi,
Maybe it's worth to write a PEP to summarize all your changes to optimize CPython? It would avoid to have to follow different threads on the mailing lists, different issues on the bug tracker, with external links to GitHub gists, etc. Your code changes critical parts of Python: code object structure and Python/ceval.c.
Not sure about that... PEPs take a LOT of time :(
True.
Besides, all my changes are CPython specific and can be considered as an implementation detail.
At least, it would help to document Python internals ;-)
I can write a ceval.txt file explaining what's going on in ceval loop, with details on the opcode cache and other things. I think it's even better than a PEP, to be honest.
I would love to see that. :) Best, Sven

Yury Selivanov writes:
Not sure about that... PEPs take a LOT of time :(
Informational PEPs need not take so much time, no more than you would spend on ceval.txt. I'm sure a PEP would get a lot more attention from reviewers, too. Even if you PEP the whole thing, as you say it's a (big ;-) implementation detail. A PEP won't make things more controversial (or less) than they already are. I don't see why it would take that much more time than ceval.txt.
I can write a ceval.txt file explaining what's going on in ceval loop, with details on the opcode cache and other things. I think it's even better than a PEP, to be honest.
Unlikely to be better, since that's a subset of the proposed PEP. Of course it's up to you, since you'd be doing most of the work, but for the rest of us PEPs are a lot more discoverable and easily referenced than a .txt file with a short name.

On 3 February 2016 at 06:49, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Yury Selivanov writes:
Not sure about that... PEPs take a LOT of time :(
Informational PEPs need not take so much time, no more than you would spend on ceval.txt. I'm sure a PEP would get a lot more attention from reviewers, too.
Even if you PEP the whole thing, as you say it's a (big ;-) implementation detail. A PEP won't make things more controversial (or less) than they already are. I don't see why it would take that much more time than ceval.txt.
For a typical PEP, you need to explain both the status quo *and* the state after the changes, as well as provide references to the related discussions. I think in this case the main target audience for the technical details should be future maintainers, so Yury writing a ceval.txt akin to the current dictnotes.txt, listsort.txt, etc would cover the essentials. If someone else wanted to also describe the change in a PEP for ease of future reference, using Yury's ceval.txt as input, I do think that would be a good thing, but I wouldn't want to make the enhancement conditional on someone volunteering to do that. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan writes:
If someone else wanted to also describe the change in a PEP for ease of future reference, using Yury's ceval.txt as input, I do think that would be a good thing, but I wouldn't want to make the enhancement conditional on someone volunteering to do that.
I wasn't suggesting making it conditional, I was encouraging Yury to do it himself as the most familiar with the situation. I may be underestimating the additional cost, but it seems to me explaining both before and after would be very useful to people who've hacked ceval in the past. (Presumably Yury would just be explaining "after" in his ceval.txt.) The important thing is to make it discoverable, though, and I don't care if it's done by PEP or not. In fact, perhaps "let Yury be Yury", plus an informational PEP listing all of the *.txt files in the tree would be more useful? Or in the devguide?

I can write a ceval.txt file explaining what's going on in ceval loop, with details on the opcode cache and other things. I think it's even better than a PEP, to be honest.
I totally agree.
Please include the notes text file. This provides an excellent summary for those of us that haven't yet taken the deep dive into the ceval loop, but still wish to understand its implementation.

On 01.02.16 21:10, Yury Selivanov wrote:
To measure the max/average memory impact, I tuned my code to optimize *every* code object on *first* run. Then I ran the entire Python test suite. The Python test suite + standard library contain around 72395 code objects, which required 20Mb of memory for caches. The test process consumed around 400Mb of memory. Thus, in the absolute worst case scenario, the overhead is about 5%.
The test process consumes so much memory because a few tests create huge objects. If you exclude these tests (note that tests that require more than 1Gb are already excluded by default) and tests that create a number of threads (threads consume much memory too), the rest of the tests need less than 100Mb of memory. The absolute required minimum is about 25Mb. Thus, in the absolute worst case scenario, the overhead is about 100%.

On 2016-02-02 12:41 PM, Serhiy Storchaka wrote:
On 01.02.16 21:10, Yury Selivanov wrote:
To measure the max/average memory impact, I tuned my code to optimize *every* code object on *first* run. Then I ran the entire Python test suite. The Python test suite + standard library contain around 72395 code objects, which required 20Mb of memory for caches. The test process consumed around 400Mb of memory. Thus, in the absolute worst case scenario, the overhead is about 5%.
The test process consumes so much memory because a few tests create huge objects. If you exclude these tests (note that tests that require more than 1Gb are already excluded by default) and tests that create a number of threads (threads consume much memory too), the rest of the tests need less than 100Mb of memory. The absolute required minimum is about 25Mb. Thus, in the absolute worst case scenario, the overhead is about 100%.
Can you give me the exact configuration of tests (command line to run) that would only consume 25mb?
Yury

On 02.02.16 19:45, Yury Selivanov wrote:
On 2016-02-02 12:41 PM, Serhiy Storchaka wrote:
On 01.02.16 21:10, Yury Selivanov wrote:
To measure the max/average memory impact, I tuned my code to optimize *every* code object on *first* run. Then I ran the entire Python test suite. The Python test suite + standard library contain around 72395 code objects, which required 20Mb of memory for caches. The test process consumed around 400Mb of memory. Thus, in the absolute worst case scenario, the overhead is about 5%.
The test process consumes so much memory because a few tests create huge objects. If you exclude these tests (note that tests that require more than 1Gb are already excluded by default) and tests that create a number of threads (threads consume much memory too), the rest of the tests need less than 100Mb of memory. The absolute required minimum is about 25Mb. Thus, in the absolute worst case scenario, the overhead is about 100%.
Can you give me the exact configuration of tests (command line to run) that would only consume 25mb?
I don't remember which exact tests consume the most memory, but the following tests fail when run with less than 30Mb of memory: test___all__ test_asynchat test_asyncio test_bz2 test_capi test_concurrent_futures test_ctypes test_decimal test_descr test_distutils test_docxmlrpc test_eintr test_email test_fork1 test_fstring test_ftplib test_functools test_gc test_gdb test_hashlib test_httplib test_httpservers test_idle test_imaplib test_import test_importlib test_io test_itertools test_json test_lib2to3 test_list test_logging test_longexp test_lzma test_mmap test_multiprocessing_fork test_multiprocessing_forkserver test_multiprocessing_main_handling test_multiprocessing_spawn test_os test_pickle test_poplib test_pydoc test_queue test_regrtest test_resource test_robotparser test_shutil test_smtplib test_socket test_sqlite test_ssl test_subprocess test_tarfile test_tcl test_thread test_threaded_import test_threadedtempfile test_threading test_threading_local test_threadsignals test_tix test_tk test_tools test_ttk_guionly test_ttk_textonly test_tuple test_unicode test_urllib2_localnet test_wait3 test_wait4 test_xmlrpc test_zipfile test_zlib

On 2016-02-02 1:45 PM, Serhiy Storchaka wrote:
On 02.02.16 19:45, Yury Selivanov wrote:
On 2016-02-02 12:41 PM, Serhiy Storchaka wrote:
On 01.02.16 21:10, Yury Selivanov wrote:
To measure the max/average memory impact, I tuned my code to optimize *every* code object on *first* run. Then I ran the entire Python test suite. The Python test suite + standard library contain around 72395 code objects, which required 20Mb of memory for caches. The test process consumed around 400Mb of memory. Thus, in the absolute worst case scenario, the overhead is about 5%.
The test process consumes so much memory because a few tests create huge objects. If you exclude these tests (note that tests that require more than 1Gb are already excluded by default) and tests that create a number of threads (threads consume much memory too), the rest of the tests need less than 100Mb of memory. The absolute required minimum is about 25Mb. Thus, in the absolute worst case scenario, the overhead is about 100%.
Can you give me the exact configuration of tests (command line to run) that would only consume 25mb?
I don't remember which exact tests consume the most memory, but the following tests fail when run with less than 30Mb of memory:
test___all__ test_asynchat test_asyncio test_bz2 test_capi test_concurrent_futures test_ctypes test_decimal test_descr test_distutils test_docxmlrpc test_eintr test_email test_fork1 test_fstring test_ftplib test_functools test_gc test_gdb test_hashlib test_httplib test_httpservers test_idle test_imaplib test_import test_importlib test_io test_itertools test_json test_lib2to3 test_list test_logging test_longexp test_lzma test_mmap test_multiprocessing_fork test_multiprocessing_forkserver test_multiprocessing_main_handling test_multiprocessing_spawn test_os test_pickle test_poplib test_pydoc test_queue test_regrtest test_resource test_robotparser test_shutil test_smtplib test_socket test_sqlite test_ssl test_subprocess test_tarfile test_tcl test_thread test_threaded_import test_threadedtempfile test_threading test_threading_local test_threadsignals test_tix test_tk test_tools test_ttk_guionly test_ttk_textonly test_tuple test_unicode test_urllib2_localnet test_wait3 test_wait4 test_xmlrpc test_zipfile test_zlib
Alright, I modified the code to optimize ALL code objects, and ran unit tests with the above tests excluded:

-- Max process mem (ru_maxrss) = 131858432
-- Opcode cache number of objects = 42109
-- Opcode cache total extra mem = 10901106

And asyncio tests:

-- Max process mem (ru_maxrss) = 57081856
-- Opcode cache number of objects = 4656
-- Opcode cache total extra mem = 1766681

So the absolute worst case for a small asyncio program is 3%, for unit tests (with the above list excluded) - 8%.

I think it'd be very hard to find a real-life program that consists of only code objects, and nothing else (no data to work with/process, no objects with dicts, no threads, basically nothing). Only for such a program would you have a 100% memory overhead for the bytecode cache (when all code objects are optimized).

FWIW, here are stats for asyncio with only hot objects being optimized:

-- Max process mem (ru_maxrss) = 54775808
-- Opcode cache number of objects = 121
-- Opcode cache total extra mem = 43521

Yury

On 02.02.16 21:23, Yury Selivanov wrote:
Alright, I modified the code to optimize ALL code objects, and ran unit tests with the above tests excluded:
-- Max process mem (ru_maxrss) = 131858432
-- Opcode cache number of objects = 42109
-- Opcode cache total extra mem = 10901106
Thank you for doing these tests. Now results are more convincing to me.
And asyncio tests:
-- Max process mem (ru_maxrss) = 57081856
-- Opcode cache number of objects = 4656
-- Opcode cache total extra mem = 1766681
FWIW, here are stats for asyncio with only hot objects being optimized:
-- Max process mem (ru_maxrss) = 54775808
-- Opcode cache number of objects = 121
-- Opcode cache total extra mem = 43521
Interesting: 57081856 - 54775808 = 2306048, but 1766681 - 43521 = 1723160. An additional ~0.5Mb is lost to fragmentation.

2016-02-02 20:23 GMT+01:00 Yury Selivanov <yselivanov.ml@gmail.com>:
Alright, I modified the code to optimize ALL code objects, and ran unit tests with the above tests excluded:
-- Max process mem (ru_maxrss) = 131858432
-- Opcode cache number of objects = 42109
-- Opcode cache total extra mem = 10901106
In my experience, RSS is a coarse measure of the memory usage. I wrote tracemalloc to get a reliable measure of the *Python* memory usage: https://docs.python.org/dev/library/tracemalloc.html#tracemalloc.get_traced_... Run tests with -X tracemalloc -i, and then type in the REPL:
>>> import tracemalloc; print("%.1f kB" % (tracemalloc.get_traced_memory()[1] / 1024.))
10197.7 kB
I expect this value to be (much) lower than RSS. Victor
participants (11): Andrew Barnert, Brett Cannon, francismb, Matthias Bussonnier, Nick Coghlan, Serhiy Storchaka, Stephen J. Turnbull, Sven R. Kunze, Victor Stinner, Yury Selivanov, ƦOB COASTN