[Python-Dev] Speeding up CPython 5-10%

Yury Selivanov yselivanov.ml at gmail.com
Tue Feb 2 09:15:58 EST 2016



On 2016-02-02 4:28 AM, Victor Stinner wrote:
[..]
> I took a first look at your patch and, sorry,

Thanks for the initial code review!

> I'm skeptical about the
> design. I have to play with it a little bit more to check whether there
> is a better design.

So far I see two things you are worried about:


1. The cache is attached to the code object vs function/frame.

I think the code object is the perfect place for such a cache.

The cache must be there (and survive!) "across" the frames.
If you attach it to the function object, you'll have to
re-attach it to a frame object on each PyEval call.
I can't see how that would be better.
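
A quick Python-level illustration (not part of the patch itself):
every call creates a new frame object, but all of those frames share
one code object, which is why the code object is the natural owner of
a persistent cache:

    import sys

    def g():
        return sys._getframe()   # grab the frame object of this call

    f1 = g()
    f2 = g()
    assert f1 is not f2                           # a fresh frame per call
    assert f1.f_code is f2.f_code is g.__code__   # one shared code object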


2. Two levels of indirection in my cache -- offsets table +
cache table.

In my other email thread "Opcode cache in ceval loop" I
explained that optimizing every code object in the standard
library and unittests adds 5% memory overhead.  Optimizing
only those that are called frequently adds less than 1%.

Besides, many functions that you import are never called, or
only called once or twice.  And code objects for modules
and class bodies are called once.

If we don't use an offset table and just allocate a cache
entry for every opcode, then the memory usage will rise
*significantly*.  Right now the overhead of the offset table
is *8 bits* per opcode, and the overhead of the cache table is
*32 bytes* per optimized opcode.  The overhead of using one
extra level of indirection is minimal.
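
A rough back-of-the-envelope illustration in Python, using the
numbers above (the real tables live at the C level):

    def cache_overhead(n_opcodes, n_optimized):
        # 1 byte per opcode for the offset table, plus 32 bytes per
        # optimized opcode for the cache table.
        return n_opcodes * 1 + n_optimized * 32

    # a code object with 200 opcodes, 10 of which get a cache entry:
    print(cache_overhead(200, 10))   # 520 bytes
    # a flat 32-byte cache entry for *every* opcode instead:
    print(200 * 32)                  # 6400 bytes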

[..]


>
>
> 2016-01-27 19:25 GMT+01:00 Yury Selivanov <yselivanov.ml at gmail.com>:
>> tl;dr The summary is that I have a patch that improves CPython performance
>> up to 5-10% on macro benchmarks.  Benchmarks results on Macbook Pro/Mac OS
>> X, desktop CPU/Linux, server CPU/Linux are available at [1].  There are no
>> slowdowns that I could reproduce consistently.
> That's really impressive, great job Yury :-) Getting non-negligible
> speedup on large macrobenchmarks became really hard in CPython.
> CPython is already well optimized in all corners. It looks like the
> overall Python performance still depends heavily on the performance of
> dictionary and attribute lookups. Even if it was well known, I didn't
> expect up to 10% speedup on *macro* benchmarks.

Thanks!

>
>
>> LOAD_METHOD & CALL_METHOD
>> -------------------------
>>
>> We had a lot of conversations with Victor about his PEP 509, and he sent me
>> a link to his amazing compilation of notes about CPython performance [2].
>> One optimization that he pointed out to me was LOAD/CALL_METHOD opcodes, an
>> idea first originated in PyPy.
>>
>> There is a patch that implements this optimization, it's tracked here: [3].
>> There are some low level details that I explained in the issue, but I'll go
>> over the high level design in this email as well.
> Your cache is stored directly in code objects. Currently, code objects
> are immutable.

Code objects are immutable on the Python level.  My cache
doesn't make any previously immutable field mutable.

Adding a few mutable cache structures visible only at the
C level is acceptable, I think.
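
For example, from Python code the code object's fields are read-only
(a small illustration, not from the patch):

    def f():
        return 42

    try:
        f.__code__.co_consts = ()    # code objects reject mutation from Python
    except AttributeError as exc:
        print(exc)                   # "readonly attribute"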

>
> Antoine Pitrou's patch adding a LOAD_GLOBAL cache adds a cache to
> functions with an "alias" in each frame object:
> http://bugs.python.org/issue10401
>
> Andrea Griffini's patch also adding a cache for LOAD_GLOBAL adds a
> cache for code objects too.
> https://bugs.python.org/issue1616125

Those patches are nice, but optimizing just LOAD_GLOBAL
won't give you a big speed-up.  For instance, 2to3 became
7-8% faster once I started to optimize LOAD_ATTR.

The idea of my patch is that it implements caching
in such a way that we can add it to several different
opcodes.
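
To make the quoted fragment below concrete, here's a hypothetical
function (made up for this email, not quoted from my original post)
whose body compiles to two LOAD_GLOBAL opcodes at offsets 0 and 3 on
a 3.5-era interpreter:

    import dis

    def spam():
        print(ham)    # 'print' comes from builtins, 'ham' from globals

    dis.dis(spam)
    # roughly:
    #   4           0 LOAD_GLOBAL              0 (print)
    #               3 LOAD_GLOBAL              1 (ham)
    #               6 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
    #               9 POP_TOP
    #              10 LOAD_CONST               0 (None)
    #              13 RETURN_VALUE
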
>> The opcodes we want to optimize are LOAD_GLOBAL, 0 and 3.  Let's look at the
>> first one, which loads the 'print' function from builtins.  The opcode knows
>> the following bits of information:
> I tested your latest patch. It looks like LOAD_GLOBAL never
> invalidates the cache on cache miss ("deoptimize" the instruction).

Yes, that was a deliberate decision (but we can add the
deoptimization easily).  So far I haven't seen a use case
or benchmark where we really need to deoptimize.

>
> I suggest always invalidating the cache at each cache miss. Not only
> is it common to modify global variables, but there is also the issue of
> a different namespace being used with the same code object. Examples:
>
> * late global initialization. See for example _a85chars cache of
> base64.a85encode.
> * code object created in a temporary namespace and then always run in
> a different global namespace. See for example
> collections.namedtuple(). I'm not sure that it's the best example
> because it looks like the Python code only loads builtins, not
> globals. But it looks like your code keeps a copy of the version of
> the global namespace dict.
>
> I tested with a threshold of 1: always optimize all code objects.
> Maybe with your default threshold of 1024 runs, the issue with
> different namespaces doesn't occur in practice.

Yep. I added a constant in ceval.c that enables collection
of opcode cache stats.

99.9% of all global dicts in benchmarks are stable.

The test suite was a bit different, only ~99% :)  One percent
of cache misses was probably because of unittest.mock.
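
For context, the "late global initialization" pattern Victor mentions
looks roughly like this (a simplified sketch loosely based on
base64.a85encode, not the exact stdlib code):

    _a85chars = None

    def a85encode(data):
        global _a85chars
        if _a85chars is None:
            # The first call rebinds a module-level global; a LOAD_GLOBAL
            # cache entry recorded before this point becomes stale.
            _a85chars = [bytes((i,)) for i in range(33, 118)]
        return b''.join(_a85chars[b % 85] for b in data)   # toy output only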

>
>
>> A straightforward way to implement such a cache is simple, but it consumes a
>> lot of memory that would just be wasted, since we only need such a cache
>> for LOAD_GLOBAL and LOAD_METHOD opcodes.  So we have to be creative about the
>> cache design.
> I'm not sure that it's worth developing complex dynamic logic to
> only enable optimizations after a threshold (a design very close to a
> JIT compiler).

I think it's not even remotely close to what JITs do.

In my design I have a simple counter -- when it reaches
1000, we create the caches in the code objects.  Some opcodes
then start using them.  That's basically it.

JIT compilers trace the code, collect information
about types, think about memory, optimize, deoptimize,
think about memory again, etc, etc :)
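
In pure-Python pseudocode the heuristic is roughly this (the real
counter lives at the C level in ceval.c; the names here are made up):

    OPT_THRESHOLD = 1024   # runs before a code object is considered "hot"

    class FakeCode:
        """Stand-in for a code object, for illustration only."""
        def __init__(self):
            self.run_count = 0
            self.opcode_cache = None

    def on_frame_eval(code):
        code.run_count += 1
        if code.opcode_cache is None and code.run_count >= OPT_THRESHOLD:
            code.opcode_cache = {}   # allocate caches only for hot code

    code = FakeCode()
    for _ in range(2000):
        on_frame_eval(code)
    print(code.opcode_cache is not None)   # True: optimized after 1024 runs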


> What is the overhead (% of RSS memory) on a concrete
> application when all code objects are optimized at startup?

I've mentioned that in my other thread.

When the whole test suite was run with *every* code object
being optimized (threshold = 1), about 73000 code objects
were optimized, requiring >20 MB of memory (the test suite
process consumed ~400 MB of memory).  So 5% looks to be
the worst case.

When I ran the test suite with threshold set to 1024, only
2000 objects were optimized, requiring less than 1% of
the total process memory.
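
(Quick sanity check of the 5% figure, using the exact byte count from
the stats at the end of this email and the approximate ~400 MB RSS:)

    extra = 20925595              # opcode cache memory, from the stats below
    total = 400 * 1024 * 1024     # ~400 MB process memory, approximate
    print(round(100.0 * extra / total, 1))   # -> 5.0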

>
> Maybe we need a global boolean flag to disable the optimization? Or
> even a compilation option?

I'd hate to add such a thing.  Why would you want to disable
the cache?  To save 1% of memory?  TBH I think this would only
add maintenance overhead for us.


>
> I mean that all these new counters have a cost, and the code may be
> even faster without these counters if everything is always optimized,
> no?

Yes, but only marginally.  You'll save one "inc" in the eval
loop, and a couple of "if"s.  Maybe on a micro-benchmark
you can see a difference.

But optimizing everything would require much more memory.
And we shouldn't optimize code objects that are run only
once -- that is, code objects for modules and class bodies.

A threshold of 1024 is big enough to say that the code
object is frequently used and will probably continue
to be frequently used in the future.

>
> I'm not sure that the storage for the cache is really efficient. It's
> a compact data structure, but it looks "expensive" to access (there
> is one level of indirection). I understand that it's compact to reduce
> the memory footprint overhead.
>
> I'm not sure that the threshold of 1000 runs is ok for short scripts.
> It would be nice to also optimize scripts which only call a function
> 900 times :-) Classical memory vs cpu compromise issue :-)

I'd be OK with changing the threshold to 500 or something, but
IMHO it won't change much.  Short/small scripts won't hit
it anyway.  And even if they do, they typically don't run
long enough to get a measurable speedup.


>
> I'm just thinking aloud :-)

Thanks!  I'm happy that you are looking at this thing
with a critical eye.


BTW, here's a debug output of unit tests with every code
object optimized:

-- Opcode cache number of objects  = 72395
-- Opcode cache total extra mem    = 20925595

-- Opcode cache LOAD_METHOD hits   = 64569036 (63%)
-- Opcode cache LOAD_METHOD misses = 23899 (0%)
-- Opcode cache LOAD_METHOD opts   = 104872
-- Opcode cache LOAD_METHOD deopts = 19191
-- Opcode cache LOAD_METHOD dct-chk= 12805608
-- Opcode cache LOAD_METHOD total  = 101735114

-- Opcode cache LOAD_GLOBAL hits   = 123808815 (99%)
-- Opcode cache LOAD_GLOBAL misses = 310397 (0%)
-- Opcode cache LOAD_GLOBAL opts   = 125205

-- Opcode cache LOAD_ATTR hits     = 59089435 (53%)
-- Opcode cache LOAD_ATTR misses   = 33372 (0%)
-- Opcode cache LOAD_ATTR opts     = 73643
-- Opcode cache LOAD_ATTR deopts   = 20276
-- Opcode cache LOAD_ATTR total    = 111049468


Yury

