[Python-Dev] Speeding up CPython 5-10%
yselivanov.ml at gmail.com
Wed Jan 27 15:26:35 EST 2016
On 2016-01-27 3:01 PM, Brett Cannon wrote:
> We can also optimize LOAD_METHOD. There is a high chance that 'obj'
> in 'obj.method()' will be of the same type every time we execute the
> code object. So if we'd have an opcodes cache, LOAD_METHOD could then
> cache a pointer to the resolved unbound method, a pointer to
> obj.__class__, and tp_version_tag of obj.__class__. Then it would
> only need to check if the cached object type is the same (and that
> it wasn't modified) and that obj.__dict__ doesn't override 'method'.
> Long story short, this caching really speeds up method calls on
> types implemented in C. list.append becomes very fast, because list
> doesn't have a __dict__, so the check is very cheap (with cache).
> What would it take to make this work with Python-defined classes?
It already works for Python-defined classes. But it's a bit more
expensive because you still have to check the object's __dict__.
Still, there is a very noticeable performance increase (see the
benchmark results).
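To make the guard logic concrete, here is a rough Python sketch of the
checks described above. The real cache lives in C inside the ceval
loop; MethodCache, lookup(), and class_version() are illustrative
names, with class_version() standing in for tp_version_tag (which has
no Python-level API):

    def class_version(klass):
        # Stand-in for tp_version_tag: CPython bumps the real tag in C
        # whenever the class (or one of its bases) is mutated.  This
        # stub just hashes the class namespace so mutations change it.
        return hash(tuple(sorted((k, id(v))
                                 for k, v in vars(klass).items())))

    class MethodCache:
        def __init__(self):
            self.klass = None      # cached obj.__class__
            self.version = None    # cached version tag of the class
            self.method = None     # cached unbound method

        def lookup(self, obj, name):
            klass = type(obj)
            # Fast path: same class, class unmodified since caching,
            # and no instance attribute shadowing the method.  For
            # types without a __dict__ (e.g. list) the last check is
            # trivially cheap.
            if (klass is self.klass
                    and class_version(klass) == self.version
                    and name not in getattr(obj, '__dict__', ())):
                return self.method
            # Slow path: full lookup, then refill the cache.
            self.klass = klass
            self.version = class_version(klass)
            self.method = getattr(klass, name)
            return self.method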
> I guess that would require knowing the version of the instance's
> __dict__, the instance's __class__ version, the MRO, and where the
> method object was found in the MRO and any intermediary classes to
> know if it was suddenly shadowed? I think that's everything. :)
No, unfortunately we can't use the version of the instance's __dict__,
as it is very volatile. The current implementation of the opcode cache
works because types are much more stable. Remember, the cache is per
*code object*, so it has to stay valid across every execution of that
code object:
    self.ham()  # <- version of self.__dict__ is unstable,
                #    so we'd end up invalidating the cache
                #    too often
__class__ version, MRO changes, etc. are covered by tp_version_tag,
which I use as one of the guards.
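A small example of why the tag has to cover the whole class hierarchy:
mutating a base class must invalidate entries cached through a
subclass. The assertions below just reflect plain attribute-lookup
semantics; the cache bookkeeping itself happens in C:

    class Base:
        def ham(self):
            return 'base'

    class Child(Base):
        pass

    c = Child()
    assert c.ham() == 'base'    # a cache would store Base.ham here

    # Mutating Base invalidates the version tags of Base *and* all of
    # its subclasses, so a cached entry for Child is discarded too.
    Base.ham = lambda self: 'patched'
    assert c.ham() == 'patched'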
> Obviously that's a lot, but I wonder how many classes have a deep
> inheritance model vs. inheriting only from `object`? In that case you
> only have to check self.__dict__.ma_version, self.__class__,
> self.__class__.__dict__.ma_version, and self.__class__.__class__ ==
> `type`. I guess another way to look at this is to get an idea of how
> complex the checks have to get before caching something like this
> is not worth it (probably also depends on how often you mutate
> self.__dict__ thanks to mutating attributes, but you could in that
> instance just decide to always look at self.__dict__ for the method's
> key and then do the ma_version cache check for everything coming from
> the class).
> Otherwise we can consider looking at the caching strategies that
> Self helped pioneer (http://bibliography.selflanguage.org/) that all
> of the various JS engines lifted and consider caching all method lookups.
Yeah, hidden classes are great. But the infrastructure to support them
properly is huge. I think that to make them work you'll need a JIT --
to trace, deoptimize, optimize, and do it all with a reasonable memory
footprint. My patch is much smaller and simpler, something we can
realistically tune and ship in 3.6.
> A straightforward way to implement such a cache is simple, but it
> requires a lot of memory that would just be wasted, since we only
> need such a cache for LOAD_GLOBAL and LOAD_METHOD opcodes. So we
> have to be creative about the cache design. Here's what I came up with:
> 1. We add a few fields to the code object.
> 2. ceval will count how many times each code object is executed.
> 3. When the code object is executed over ~900 times, we mark it as
> "hot".
> What happens if you simply consider all code as hot? Is the overhead
> of building the mapping such that you really need this, or is this
> simply to avoid some memory/startup cost?
That's the first step for this patch. I think we need to profile
several big applications (I'll do it later for some of my code bases)
and see how big the memory impact is if we optimize everything.
In any case, I expect it to be noticeable (which may be acceptable), so
we'll probably try to optimize it.
> We also create an 'unsigned char' array "MAPPING", with length
> set to match the length of the code object. So we have a 1-to-1
> mapping between opcodes and the MAPPING array.
> 4. For the next ~100 calls, while the code object is "hot", LOAD_GLOBAL and
> LOAD_METHOD do "MAPPING[opcode_offset()]++".
> 5. After 1024 calls to the code object, the ceval loop will iterate
> over the MAPPING, counting all opcodes that were executed more than
> 50 times.
> Where did the "50 times" boundary come from? Was this measured somehow
> or did you just guess at a number?
If the number is too low, then you'll optimize code in branches that
are rarely executed. So I picked 50, because I only trace opcodes for
100 calls, and 50 means an opcode was executed in at least half of
them. All of those numbers can (and probably should) be changed, and I
think we should experiment with different heuristics.
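Here's a rough Python sketch of the warm-up bookkeeping described in
this thread. The thresholds mirror the numbers above; names like
CodeObjectStats are illustrative, and the real counters live on the
C-level code object:

    HOT_THRESHOLD = 1024   # ~900 warm-up calls + ~100 tracing calls
    TRACE_CALLS = 100      # calls during which opcode use is counted
    USE_CUTOFF = 50        # opcode must run in >50 traced calls

    class CodeObjectStats:
        def __init__(self, n_opcodes):
            self.calls = 0
            self.mapping = bytearray(n_opcodes)  # 'unsigned char' MAPPING

        def on_call(self):
            self.calls += 1

        def tracing(self):
            # The last ~100 calls before the threshold are traced.
            return (HOT_THRESHOLD - TRACE_CALLS
                    <= self.calls < HOT_THRESHOLD)

        def on_opcode(self, offset):
            # Executed by LOAD_GLOBAL / LOAD_METHOD while tracing.
            if self.tracing() and self.mapping[offset] < 255:
                self.mapping[offset] += 1

        def opcodes_to_cache(self):
            # After the threshold, allocate cache slots only for the
            # opcodes that ran in more than 50 of the traced calls.
            return [i for i, n in enumerate(self.mapping)
                    if n > USE_CUTOFF]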
> If you're interested in these kinds of optimizations, please help
> with code reviews, ideas, profiling and benchmarks. The latter is
> important; I'd never have imagined how hard it is to come up with a
> good benchmark.
> Have you tried hg.python.org/benchmarks?
> Or are you looking for new benchmarks? If the latter, then we should
> probably strike up a discussion on speed@ and start considering a new,
> unified benchmark suite that CPython, PyPy, Pyston, Jython, and
> IronPython can all agree on.
Yes, IMHO we need better benchmarks. Some of the existing ones are very
unstable -- I can run them three times and get three completely
different results. Benchmarking is hard :)
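As a quick illustration of the noise problem, even a trivial
micro-benchmark shows run-to-run spread (a generic timeit example, not
one of the suite's benchmarks):

    import timeit

    # Repeat the same micro-benchmark; the spread between the fastest
    # and slowest run is a rough measure of measurement noise.
    runs = timeit.repeat('sum(range(1000))', repeat=5, number=10000)
    print('fastest: %.4fs  slowest: %.4fs  spread: %.1f%%'
          % (min(runs), max(runs), 100 * (max(runs) / min(runs) - 1)))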
I'll create a few issues on bugs.python.org with new/updated benchmarks,
and will join the speed@ mailing list.