
Hi,

tl;dr The summary is that I have a patch that improves CPython performance by up to 5-10% on macro benchmarks. Benchmark results on Macbook Pro/Mac OS X, desktop CPU/Linux, and server CPU/Linux are available at [1]. There are no slowdowns that I could reproduce consistently. There are two different optimizations that yield this speedup: LOAD_METHOD/CALL_METHOD opcodes and a per-opcode cache in the ceval loop.

LOAD_METHOD & CALL_METHOD
-------------------------

We had a lot of conversations with Victor about his PEP 509, and he sent me a link to his amazing compilation of notes about CPython performance [2]. One optimization that he pointed out to me was the LOAD_METHOD/CALL_METHOD opcodes, an idea that originated in PyPy. There is a patch that implements this optimization; it's tracked here: [3]. There are some low-level details that I explained in the issue, but I'll go over the high-level design in this email as well.

Every time you access a method attribute on an object, a BoundMethod object is created. It is a fairly expensive operation, despite a freelist of BoundMethods (so that memory allocation is generally avoided). The idea is to detect what looks like a method call in the compiler, and emit a pair of specialized bytecodes for that. So instead of LOAD_GLOBAL/LOAD_ATTR/CALL_FUNCTION we will have LOAD_GLOBAL/LOAD_METHOD/CALL_METHOD.

LOAD_METHOD looks at the object on top of the stack and checks whether the name resolves to a method or to a regular attribute. If it's a method, we push the unbound method object and the object onto the stack. If it's an attribute, we push the resolved attribute and NULL. When CALL_METHOD looks at the stack, it knows how to call the unbound method properly (pushing the object as the first argument), or how to call a regular callable.

This idea makes CPython around 2-4% faster. And it surely doesn't make it slower. I think it's a safe bet to at least implement this optimization in CPython 3.6. So far, the patch only optimizes positional-only method calls. It's possible to optimize all kinds of calls, but that will require 3 more opcodes (explained in the issue). We'll need to do some careful benchmarking to see if it's really needed.
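To make the stack discipline concrete, here is a rough, pure-Python model of what the LOAD_METHOD/CALL_METHOD pair is meant to do. This is only an illustration under my own simplifying assumptions -- the real implementation lives in ceval.c, works on the value stack, and has to handle descriptors, shadowing and keyword arguments much more carefully:

    import types

    def load_method(obj, name):
        """Return (callable, receiver); receiver is None for plain attributes."""
        for klass in type(obj).__mro__:
            if name in vars(klass):
                attr = vars(klass)[name]
                # A plain function found on the class is the "method" case:
                # push the unbound function plus the receiver, bind later.
                if (isinstance(attr, types.FunctionType)
                        and name not in getattr(obj, '__dict__', {})):
                    return attr, obj
                break
        # Everything else is a regular attribute: push it plus a None marker.
        return getattr(obj, name), None

    def call_method(func, receiver, *args):
        # CALL_METHOD: if a receiver was pushed, prepend it as 'self'.
        return func(receiver, *args) if receiver is not None else func(*args)

    class Spam:
        def ham(self, x):
            return x * 2

    meth, recv = load_method(Spam(), 'ham')
    print(call_method(meth, recv, 21))    # 42, and no bound method was created

The fast path never instantiates a BoundMethod; CALL_METHOD simply prepends the receiver to the arguments.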
Per-opcode cache in ceval
-------------------------

While reading PEP 509, I was thinking about how we can use dict->ma_version in ceval to speed up globals lookups. One of the key assumptions (and this is what makes JITs possible) is that real-life programs don't modify globals or rebind builtins (often), and that most code paths operate on objects of the same type.

In CPython, all pure Python functions have code objects. When you call a function, ceval executes its code object in a frame. Frames contain contextual information, including pointers to the globals and builtins dicts. The key observation here is that almost all code objects always have the same pointers to the globals (the module they were defined in) and to the builtins. And it's not good programming practice to mutate globals or rebind builtins.

Let's look at this function:

    def spam():
        print(ham)

Here are its opcodes:

  2           0 LOAD_GLOBAL              0 (print)
              3 LOAD_GLOBAL              1 (ham)
              6 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
              9 POP_TOP
             10 LOAD_CONST               0 (None)
             13 RETURN_VALUE

The opcodes we want to optimize are the LOAD_GLOBALs at offsets 0 and 3. Let's look at the first one, which loads the 'print' function from builtins. The opcode knows the following bits of information:

- its offset (0),
- its argument (0 -> 'print'),
- its type (LOAD_GLOBAL).

And these bits of information will *never* change. So if this opcode could resolve the 'print' name (from globals or builtins, likely the latter) and save the pointer to it somewhere, along with globals->ma_version and builtins->ma_version, then on its second call it could just load this cached info back, check that the globals and builtins dicts haven't changed, and push the cached ref onto the stack. That would save it from doing two dict lookups.

We can also optimize LOAD_METHOD. Chances are high that 'obj' in 'obj.method()' will be of the same type every time we execute the code object. So if we had an opcode cache, LOAD_METHOD could cache a pointer to the resolved unbound method, a pointer to obj.__class__, and the tp_version_tag of obj.__class__. Then it would only need to check that the cached object type is the same (and that it wasn't modified) and that obj.__dict__ doesn't override 'method'. Long story short, this caching really speeds up method calls on types implemented in C. list.append becomes very fast, because list doesn't have a __dict__, so the check is very cheap (with the cache).

A straightforward implementation of such a cache is simple, but it consumes a lot of memory that would just be wasted, since we only need the cache for LOAD_GLOBAL and LOAD_METHOD opcodes. So we have to be creative about the cache design. Here's what I came up with:

1. We add a few fields to the code object.
2. ceval will count how many times each code object is executed.
3. When the code object is executed over ~900 times, we mark it as "hot". We also create an 'unsigned char' array "MAPPING", with its length set to match the length of the code object. So we have a 1-to-1 mapping between opcodes and the MAPPING array.
4. For the next ~100 calls, while the code object is "hot", LOAD_GLOBAL and LOAD_METHOD do "MAPPING[opcode_offset()]++".
5. After 1024 calls to the code object, the ceval loop will iterate through the MAPPING, counting all opcodes that were executed more than 50 times.
6. We then create an array of cache structs "CACHE" (here's a link to the updated code.h file: [6]). We update MAPPING to be a mapping between opcode position and position in the CACHE. The code object is now "optimized".
7. When the code object is "optimized", LOAD_METHOD and LOAD_GLOBAL use the CACHE array for the fast path.
8. When there is a cache miss, i.e. the builtins/globals/obj.__dict__ were mutated, the opcode marks its entry in CACHE as deoptimized, and it will never try to use the cache again.

Here's a link to the issue tracker with the first version of the patch: [5]. I'm working on the patch in a github repo here: [4].
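Assuming PEP 509's ma_version is available, the LOAD_GLOBAL fast path boils down to a pair of version checks. Here is a toy, pure-Python rendering of that guard; VersionedDict and its counter are my stand-ins for dict->ma_version, and the real cache entries are C structs attached to the code object, not Python objects:

    class VersionedDict(dict):
        """Stand-in for PEP 509: bump a counter on every mutation."""
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.version = 0
        def __setitem__(self, key, value):
            super().__setitem__(key, value)
            self.version += 1
        def __delitem__(self, key):
            super().__delitem__(key)
            self.version += 1

    class LoadGlobalCache:
        def __init__(self):
            self.obj = None
            self.globals_version = self.builtins_version = -1

    def load_global(cache, name, globals_d, builtins_d):
        if (cache.obj is not None
                and cache.globals_version == globals_d.version
                and cache.builtins_version == builtins_d.version):
            return cache.obj                        # fast path: no dict lookups
        try:                                        # slow path: the usual lookups
            value = globals_d[name]
        except KeyError:
            value = builtins_d[name]
        cache.obj = value                           # (re)fill the cache entry
        cache.globals_version = globals_d.version
        cache.builtins_version = builtins_d.version
        return value

    module_globals = VersionedDict()
    builtins_ns = VersionedDict(print=print)
    entry = LoadGlobalCache()
    load_global(entry, 'print', module_globals, builtins_ns)('fills the cache')
    load_global(entry, 'print', module_globals, builtins_ns)('hits the fast path')

Note that this toy refills the entry on a miss, whereas step 8 above marks the entry as deoptimized; how aggressively to invalidate is debated later in the thread.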
Summary
-------

There are many things about this algorithm that we can improve/tweak. Perhaps we should profile code objects longer, or account for the time they spend executing. Maybe we shouldn't deoptimize opcodes on their first cache miss. Maybe we can come up with better data structures. We also need to profile the memory and see how much more this cache will require.

One thing I'm certain about is that we can get a 5-10% speedup of CPython with relatively low memory impact. And I think it's worth exploring that!

If you're interested in this kind of optimization, please help with code reviews, ideas, profiling and benchmarks. The latter is especially important -- I'd never have imagined how hard it is to come up with a good macro benchmark.

I also want to thank my company MagicStack (magic.io) for sponsoring this work.

Thanks,
Yury

[1] https://gist.github.com/1st1/aed69d63a2ff4de4c7be
[2] http://faster-cpython.readthedocs.org/index.html
[3] http://bugs.python.org/issue26110
[4] https://github.com/1st1/cpython/tree/opcache2
[5] http://bugs.python.org/issue26219
[6] https://github.com/python/cpython/compare/master...1st1:opcache2?expand=1#di...

On Wed, 27 Jan 2016 at 10:26 Yury Selivanov <yselivanov.ml@gmail.com> wrote:
What would it take to make this work with Python-defined classes? I guess that would require knowing the version of the instance's __dict__, the instance's __class__ version, the MRO, and where the method object was found in the MRO and any intermediary classes to know if it was suddenly shadowed? I think that's everything. :) Obviously that's a lot, but I wonder how many classes have a deep inheritance model vs. inheriting only from `object`? In that case you only have to check self.__dict__.ma_version, self.__class__, self.__class__.__dict__.ma_version, and self.__class__.__class__ == `type`. I guess another way to look at this is to get an idea of how complex the checks have to get before caching something like this is not worth it (probably also depends on how often you mutate self.__dict__ thanks to mutating attributes, but you could in that instance just decide to always look at self.__dict__ for the method's key and then do the ma_version cache check for everything coming from the class). Otherwise we can consider looking at the caching strategies that Self helped pioneer (http://bibliography.selflanguage.org/) that all of the various JS engines lifted and consider caching all method lookups.
What happens if you simply consider all code as hot? Is the overhead of building the mapping such that you really need this, or is this simply to avoid some memory/startup cost?
Where did the "50 times" boundary come from? Was this measured somehow or did you just guess at a number?
Great!
Have you tried hg.python.org/benchmarks? Or are you looking for new benchmarks? If the latter then we should probably strike up a discussion on speed@ and start considering a new, unified benchmark suite that CPython, PyPy, Pyston, Jython, and IronPython can all agree on.
I also want to thank my company MagicStack (magic.io) for sponsoring this work.
Yep, thanks to all the companies sponsoring people doing work lately to try and speed things up!

On 2016-01-27 3:01 PM, Brett Cannon wrote:
It already works for Python-defined classes. But it's a bit more expensive because you still have to check the object's __dict__. Still, there is a very noticeable performance increase (see the results of the benchmark runs).
No, unfortunately we can't use the version of the instance's __dict__, as it is very volatile. The current implementation of the opcode cache works because types are much more stable. Remember, the cache is per *code object*, so it has to work across all invocations of that code object.

    class F:
        def spam(self):
            self.ham()   # <- version of self.__dict__ is unstable,
                         #    so we'd end up invalidating the cache
                         #    too often

__class__ version, MRO changes etc. are covered by tp_version_tag, which I use as one of the guards.
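For readers following along, the guard set described here can be sketched in pure Python. This is only an illustration: Python exposes no tp_version_tag, so the metaclass counter below is my stand-in for it, while the real check is a C comparison against the type's version tag:

    class Versioned(type):
        """Toy stand-in for tp_version_tag: bump a counter on class mutation."""
        def __new__(mcls, name, bases, ns):
            cls = super().__new__(mcls, name, bases, ns)
            cls._version_tag = 0
            return cls
        def __setattr__(cls, name, value):
            super().__setattr__(name, value)
            if name != '_version_tag':
                type.__setattr__(cls, '_version_tag', cls._version_tag + 1)

    class F(metaclass=Versioned):
        def spam(self):
            return 'spam'

    cached = (F, F._version_tag, F.__dict__['spam'])     # filled on first run

    def load_method_cached(obj, name):
        klass, tag, func = cached
        if (type(obj) is klass
                and klass._version_tag == tag                   # class unmodified
                and name not in getattr(obj, '__dict__', {})):  # not shadowed
            return 'hit', func
        return 'miss', getattr(obj, name)                       # fall back

    f = F()
    print(load_method_cached(f, 'spam')[0])   # hit
    F.spam = lambda self: 'patched'           # class mutation bumps the tag
    print(load_method_cached(f, 'spam')[0])   # miss

Shadowing the method on the instance (f.spam = ...) would likewise fail the third check and fall back to the normal lookup.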
Yeah, hidden classes are great. But the infrastructure to support them properly is huge. I think that to make them work you'll need a JIT -- to trace, deoptimize, optimize, and do it all with a reasonable memory footprint. My patch is much smaller and simpler, something we can realistically tune and ship in 3.6.
That's the first step for this patch. I think we need to profile several big applications (I'll do it later for some of my code bases) and see how big the memory impact is if we optimize everything. In any case, I expect it to be noticeable (which may be acceptable), so we'll probably try to optimize it.
If the number is too low, then you'll optimize code in branches that are rarely executed. So I picked 50, because I only trace opcodes for 100 calls. All of those numbers can be (should be?) changed, and I think we should experiment with different heuristics.
Yes: https://gist.github.com/1st1/aed69d63a2ff4de4c7be
Yes, IMHO we need better benchmarks. Some of the existing ones are very unstable -- I can run them three times and get three completely different results. Benchmarking is hard :) I'll create a few issues on bugs.python.org with new/updated benchmarks, and will join the speed@ mailing list. Yury

As Brett suggested, I've just run the benchmark suite with memory tracking on. The results are here: https://gist.github.com/1st1/1851afb2773526fd7c58 Looks like the memory increase is around 1%. One synthetic micro-benchmark, unpack_sequence, which consists of hundreds of lines that load a global variable and do nothing else, shows a 5% increase. Yury

BTW, this optimization also makes some old optimization tricks obsolete.

1. No need to write 'def func(len=len)'. Globals lookups will be fast.

2. No need to save bound methods:

    obj = []
    obj_append = obj.append
    for _ in range(10**6):
        obj_append(something)

This hand-optimized code would only be marginally faster, because of LOAD_METHOD and how it's cached.

Yury

Yury Selivanov wrote on 27.01.2016 at 19:25:
I implemented a similar but simpler optimisation in Cython a while back: http://blog.behnel.de/posts/faster-python-calls-in-cython-021.html Instead of avoiding the creation of method objects, as you proposed, it just normally calls getattr and if that returns a bound method object, it uses inlined calling code that avoids re-packing the argument tuple. Interestingly, I got speedups of 5-15% for some of the Python benchmarks, but I don't quite remember which ones (at least raytrace and richards, I think), nor do I recall the overall gain, which (I assume) is what you are referring to with your 2-4% above. Might have been in the same order. Stefan

On 2016-01-29 5:00 AM, Stefan Behnel wrote:
That's great! I'm still working on the patch, but so far it looks like adding just LOAD_METHOD/CALL_METHOD (that avoid instantiating BoundMethods) gives us 10-15% faster method calls. Combining them with my opcode cache makes them 30-35% faster. Yury

On Wed, Jan 27, 2016 at 01:25:27PM -0500, Yury Selivanov wrote:
Have you looked at Cesare Di Mauro's wpython? As far as I know, it's now unmaintained, and the project repo on Google Code appears to be dead (I get a 404), but I understand that it was significantly faster than CPython back in the 2.6 days. https://wpython.googlecode.com/files/Beyond%20Bytecode%20-%20A%20Wordcode-ba... -- Steve

On 2016-01-29 11:28 PM, Steven D'Aprano wrote:
Thanks for bringing this up! IIRC wpython was about using "fat" bytecodes, i.e. using 64 bits per bytecode instead of 8. That allows to minimize the number of bytecodes, thus having some performance increase. TBH, I don't think it was "significantly faster". If I were to do some big refactoring of the ceval loop, I'd probably consider implementing a register VM. While register VMs are a bit faster than stack VMs (up to 20-30%), they would also allow us to apply more optimizations, and even bolt on a simple JIT compiler. Yury

On Mon, 1 Feb 2016 at 09:08 Yury Selivanov <yselivanov.ml@gmail.com> wrote:
If you did tackle the register VM approach that would also settle a long-standing question of whether a certain optimization works for Python. As for bolting on a JIT, the whole point of Pyjion is to see if that's worth it for CPython, so that's already being taken care of (and is actually easier with a stack-based VM since the JIT engine we're using is stack-based itself).

On 01.02.2016 18:18, Brett Cannon wrote:
Are there some resources on why register machines are considered faster than stack machines?
Interesting. Haven't noticed these projects, yet. So, it could be that we will see a jitted CPython when Pyjion appears to be successful? Best, Sven

On Mon, 1 Feb 2016 at 10:21 Sven R. Kunze <srkunze@mail.de> wrote:
A search for [stack vs register based virtual machine] will get you some information.
You aren't really supposed to yet. :) In Pyjion's case we are still working on compatibility, let alone trying to show a speed improvement so we have not said much beyond this mailing list (we have a talk proposal in for PyCon US that we hope gets accepted). We just happened to get picked up on Reddit and HN recently and so interest has spiked in the project.
So, it could be that we will see a jitted CPython when Pyjion appears to be successful?
The ability to plug in a JIT, but yes, that's the hope.

On 01.02.2016 19:28, Brett Cannon wrote:
A search for [stack vs register based virtual machine] will get you some information.
Alright. :) Will go for that.
Exciting. :)
Okay. Not sure what you mean by plugin. One thing I like about Python is that it just works. So, plugin sounds like unnecessary work.

Sven R. Kunze wrote:
Are there some resources on why register machines are considered faster than stack machines?
If a register VM is faster, it's probably because each register instruction does the work of about 2-3 stack instructions, meaning less trips around the eval loop, so less unpredictable branches and less pipeline flushes. This assumes that bytecode dispatching is a substantial fraction of the time taken to execute each instruction. For something like cpython, where the operations carried out by the bytecodes involve a substantial amount of work, this may not be true. It also assumes the VM is executing the bytecodes directly. If there is a JIT involved, it all gets translated into something else anyway, and then it's more a matter of whether you find it easier to design the JIT to deal with stack or register code. -- Greg
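As a purely illustrative sketch of that dispatch-count argument (toy instruction formats of my own, nothing CPython-specific), the same statement c = a + b can be run through a stack-style and a register-style loop:

    def run_stack(code, env):
        """Stack machine: one dispatch per LOAD/ADD/STORE."""
        stack, dispatches = [], 0
        for op, arg in code:
            dispatches += 1
            if op == 'LOAD':
                stack.append(env[arg])
            elif op == 'ADD':
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == 'STORE':
                env[arg] = stack.pop()
        return dispatches

    def run_register(code, regs):
        """Register machine: one three-address instruction does it all."""
        dispatches = 0
        for op, dst, src1, src2 in code:
            dispatches += 1
            if op == 'ADD':
                regs[dst] = regs[src1] + regs[src2]
        return dispatches

    env = {'a': 2, 'b': 3}
    print(run_stack([('LOAD', 'a'), ('LOAD', 'b'),
                     ('ADD', None), ('STORE', 'c')], env))           # 4 dispatches
    print(run_register([('ADD', 'c', 'a', 'b')], {'a': 2, 'b': 3}))  # 1 dispatch

The arithmetic itself is identical in both loops; only the number of trips around the dispatch loop changes, which is exactly the saving (and its limits) described above.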

On 02.02.2016 00:27, Greg Ewing wrote:
That's what I found so far as well.
Interesting point indeed. It makes sense that register machines only save us the bytecode dispatching. How much that is compared to the work each instruction requires, I cannot say. Maybe Yury has a better understanding here.
It seems like Yury thinks so. He hasn't told us so far. Best, Sven

Also, modern compiler technology tends to use "infinite register" machines for the intermediate representation, then uses register coloring to assign the actual registers (and generate spill code if needed). I've seen work on inter-function optimization for avoiding some register loads and stores (combined with tail-call optimization, it can turn recursive calls into loops in the register machine). On 2 February 2016 at 09:16, Sven R. Kunze <srkunze@mail.de> wrote:

On 01/02/2016 16:54, Yury Selivanov wrote:
From https://code.google.com/archive/p/wpython/ <quote> WPython is a re-implementation of (some parts of) Python, which drops support for bytecode in favour of a wordcode-based model (where a word is 16 bits wide). It also implements an hybrid stack-register virtual machine, and adds a lot of other optimizations. </quote> -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence

2016-02-01 17:54 GMT+01:00 Yury Selivanov <yselivanov.ml@gmail.com>:
No, it used 16, 32, or 48 bits per opcode (1, 2, or 3 16-bit words).
That allows to minimize the number of bytecodes, thus having some performance increase. TBH, I don't think it was "significantly faster".
Please, take a look at the benchmarks, or compile it and check yourself. ;-)

If I were to do some big refactoring of the ceval loop, I'd probably [...]
WPython was a hybrid VM: it supported both a stack-based and a register-based approach. I think that's needed, given the nature of Python, because you can have operations with intermixed operands: constants, locals, globals, names. It's quite difficult to handle all possible cases with a register-based VM. Regards, Cesare

Hi,

I'm back from the FOSDEM event in Bruxelles; it was really cool. I gave a talk about FAT Python and got good feedback. But friends told me that people now have expectations for FAT Python. It looks like people care about Python performance :-) FYI the slides of my talk: https://github.com/haypo/conf/raw/master/2016-FOSDEM/fat_python.pdf (a video was recorded, I don't know when it will be online)

I took a first look at your patch and, sorry, I'm skeptical about the design. I have to play with it a little bit more to check whether there is a better design.

To be clear, FAT Python with your work looks more and more like a cheap JIT compiler :-) Guards, specializations, optimizing at runtime after a threshold... all these things come from JIT compilers. I like the idea of a kind-of JIT compiler without having to pay the high cost of a large dependency like LLVM. I like baby steps in CPython: it's faster, and it's possible to implement in a single release cycle (one minor Python release, Python 3.6). Integrating a JIT compiler into CPython already failed with Unladen Swallow :-/

PyPy has a completely different design (and has serious issues with the Python C API), Pyston is restricted to Python 2.7, Pyjion looks specific to Windows (CoreCLR), and Numba is specific to numeric computations (numpy). IMHO none of these projects can easily be merged into CPython "quickly" (again, in a single Python release cycle). By the way, Pyjion still looks very young (I heard that they are still working on compatibility with CPython, not on performance yet).

2016-01-27 19:25 GMT+01:00 Yury Selivanov <yselivanov.ml@gmail.com>:
That's really impressive, great job Yury :-) Getting a non-negligible speedup on large macrobenchmarks has become really hard in CPython. CPython is already well optimized in all corners. It looks like the overall Python performance still depends heavily on the performance of dictionary and attribute lookups. Even though that was well known, I didn't expect up to a 10% speedup on *macro* benchmarks.
Your cache is stored directly in code objects. Currently, code objects are immutable. Antoine Pitrou's patch adding a LOAD_GLOBAL cache adds a cache to functions with an "alias" in each frame object: http://bugs.python.org/issue10401 Andrea Griffini's patch also adding a cache for LOAD_GLOBAL adds a cache for code objects too. https://bugs.python.org/issue1616125 I don't know what is the best place to store the cache. I vaguely recall a patch which uses a single unique global cache, but maybe I'm wrong :-p
I tested your latest patch. It looks like LOAD_GLOBAL never invalidates the cache on a cache miss (never "deoptimizes" the instruction). I suggest always invalidating the cache on each cache miss. Not only is it common to modify global variables, but there is also the issue of different namespaces being used with the same code object. Examples:

* Late global initialization. See for example the _a85chars cache of base64.a85encode.

* A code object created in a temporary namespace and then always run in a different global namespace. See for example collections.namedtuple(). I'm not sure that it's the best example, because it looks like the Python code only loads builtins, not globals. But it looks like your code keeps a copy of the version of the global namespace dict.

I tested with a threshold of 1: always optimize all code objects. Maybe with your default threshold of 1024 runs, the issue with different namespaces doesn't occur in practice.
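The second scenario can be reproduced in a few lines; this is my own sketch rather than anything taken from the patch or the stdlib, but it shows one code object being run against two different globals dicts:

    import types

    src = "def double(): return value * 2"
    ns1 = {'value': 3}
    exec(src, ns1)
    f1 = ns1['double']

    # Rebind the very same code object to a different globals dict:
    f2 = types.FunctionType(f1.__code__, {'value': 10})
    print(f1(), f2())    # 6 20 -- one code object, two global namespaces

A per-code-object cache that only remembers a namespace version, and not which namespace it was, has to be careful with cases like this.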
I'm not sure that it's worth developing complex dynamic logic to only enable optimizations after a threshold (a design very close to a JIT compiler). What is the overhead (% of RSS memory) on a concrete application when all code objects are optimized at startup? Maybe we need a global boolean flag to disable the optimization? Or even a compilation option? I mean that all these new counters have a cost, and the code may be even faster without these counters if everything is always optimized, no?

I'm not sure that the storage for the cache is really efficient. It's a compact data structure, but it looks "expensive" to access (there is one level of indirection). I understand that it's compact to reduce the memory footprint overhead.

I'm not sure that a threshold of 1000 runs is ok for short scripts. It would be nice to also optimize scripts which only call a function 900 times :-) Classical memory vs CPU compromise :-)

I'm just thinking aloud :-) Victor

On 2016-02-02 4:28 AM, Victor Stinner wrote: [..]
I took a first look at your patch and, sorry,
Thanks for the initial code review!
So far I see two things you are worried about:

1. The cache is attached to the code object vs. the function/frame.

I think the code object is the perfect place for such a cache. The cache must be there (and survive!) "across" the frames. If you attach it to the function object, you'll have to re-attach it to a frame object on each PyEval call. I can't see how that would be better.

2. Two levels of indirection in my cache -- offsets table + cache table.

In my other email thread "Opcode cache in ceval loop" I explained that optimizing every code object in the standard library and unittests adds 5% memory overhead. Optimizing only those that are called frequently costs less than 1%. Besides, many functions that you import are never called, or are only called once or twice. And code objects for modules and class bodies are called once.

If we don't use an offset table and just allocate a cache entry for every opcode, then the memory usage will rise *significantly*. Right now the overhead of the offset table is *8 bits* per opcode, and the overhead of the cache table is *32 bytes* per optimized opcode. The overhead of one extra level of indirection is minimal. [..]
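As a rough way to put those numbers in perspective, here is a small helper I sketched (not part of the patch): one byte of offset table per code unit plus a 32-byte cache struct per optimized opcode. LOAD_METHOD is listed even though it only exists with the patch applied, so on a stock interpreter it simply never matches:

    import dis

    OPTIMIZED = {'LOAD_GLOBAL', 'LOAD_ATTR', 'LOAD_METHOD'}

    def rough_cache_overhead(code):
        """Approximate extra bytes if this code object were optimized."""
        offset_table = len(code.co_code)            # one byte per code unit
        entries = sum(1 for ins in dis.get_instructions(code)
                      if ins.opname in OPTIMIZED)
        return offset_table + 32 * entries          # 32 bytes per cache struct

    def sample():
        return sorted(globals())

    print(rough_cache_overhead(sample.__code__), 'bytes (rough estimate)')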
Thanks!
Code objects are immutable on the Python level. My cache doesn't make any previously immutable field mutable. Adding a few mutable cache structures visible only at the C level is acceptable I think.
Those patches are nice, but optimizing just LOAD_GLOBAL won't give you a big speed-up. For instance, 2to3 became 7-8% faster once I started to optimize LOAD_ATTR. The idea of my patch is that it implements caching in such a way, that we can add it to several different opcodes.
Yes, that was a deliberate decision (but we can add the deoptimization easily). So far I haven't seen a use case or benchmark where we really need to deoptimize.
Yep. I added a constant in ceval.c that enables collection of opcode cache stats. 99.9% of all global dicts in benchmarks are stable. The test suite was a bit different, only ~99% :) The one percent of cache misses was probably because of unittest.mock.
I think it's not even remotely close to what JITs do. In my design I have a simple counter -- when it reaches 1000, we create the caches in the code objects. Some opcodes start to use it. That's basically it. JIT compilers trace the code, collect information about types, think about memory, optimize, deoptimize, think about memory again, etc, etc :)
What is the overhead (% of RSS memory) on a concrete application when all code objects are optimized at startup?
I've mentioned that in my other thread. When the whole test suite is run with *every* code object being optimized (threshold = 1), about 73000 code objects were optimized, requiring >20Mb of memory (the test suite process consumed ~400Mb of memory). So 5% looks to be the worst case. When I ran the test suite with threshold set to 1024, only 2000 objects were optimized, requiring less than 1% of the total process memory.
Maybe we need a global boolean flag to disable the optimization? Or even a compilation option?
I'd hate to add such a thing. Why would you want to disable the cache? To save 1% of memory? TBH I think this would only add maintenance overhead for us.
Yes, but only marginally. You'll save one "inc" in eval loop. And a couple of "if"s. Maybe on a micro benchmark you can see a difference. But optimizing everything will require much more memory. And we shouldn't optimize code objects that are run only once -- that's code objects for modules and classes. Threshold of 1024 is big enough to say that the code object is frequently used and will probably continue to be frequently used in the future.
I'd be OK to change the threshold to 500 or something. But IMHO it won't change much. Short/small scripts won't hit it anyways. And even if they do, they typically don't run long enough to get a measurable speedup.
I'm just thinking aloud :-)
Thanks! I'm happy that you are looking at this thing with a critical eye. BTW, here's a debug output of unit tests with every code object optimized:

    -- Opcode cache number of objects  = 72395
    -- Opcode cache total extra mem    = 20925595

    -- Opcode cache LOAD_METHOD hits   = 64569036 (63%)
    -- Opcode cache LOAD_METHOD misses = 23899 (0%)
    -- Opcode cache LOAD_METHOD opts   = 104872
    -- Opcode cache LOAD_METHOD deopts = 19191
    -- Opcode cache LOAD_METHOD dct-chk= 12805608
    -- Opcode cache LOAD_METHOD total  = 101735114

    -- Opcode cache LOAD_GLOBAL hits   = 123808815 (99%)
    -- Opcode cache LOAD_GLOBAL misses = 310397 (0%)
    -- Opcode cache LOAD_GLOBAL opts   = 125205

    -- Opcode cache LOAD_ATTR hits     = 59089435 (53%)
    -- Opcode cache LOAD_ATTR misses   = 33372 (0%)
    -- Opcode cache LOAD_ATTR opts     = 73643
    -- Opcode cache LOAD_ATTR deopts   = 20276
    -- Opcode cache LOAD_ATTR total    = 111049468

Yury

On Tue, 2 Feb 2016 at 01:29 Victor Stinner <victor.stinner@gmail.com> wrote:
We are not ready to have a serious discussion about Pyjion yet as we are still working on compatibility (we have a talk proposal in for PyCon US 2016 and so we are hoping to have something to discuss at the language summit), but Victor's email shows there are some misconceptions about it already and a misunderstanding of our fundamental goal.

First off, Pyjion is very much a work-in-progress. You can find it at https://github.com/microsoft/pyjion (where there is an FAQ), but for this audience the key thing to know is that we are still working on compatibility (see https://github.com/Microsoft/Pyjion/blob/master/Tests/python_tests.txt for the list of tests we do (not) pass from the Python test suite). Out of our roughly 400 tests, we don't pass about 18 of them.

Second, we have not really started work on performance yet. We have done some very low-hanging fruit stuff, but just barely. IOW we are not really ready to discuss performance (ATM we JIT instantly for all code objects, and even being that aggressive with the JIT overhead we are even/slightly slower than an unmodified Python 3.5 VM, so we are hopeful this work will pan out).

Third, the over-arching goal of Pyjion is not to add a JIT into CPython, but to add a C API to CPython that will allow plugging in a JIT. If you simply JIT code objects then the API required to let someone plug in a JIT is basically three functions, maybe as little as two (you can see the exact patch against CPython that we are working with at https://github.com/Microsoft/Pyjion/blob/master/Patches/python.diff). We have no interest in shipping a JIT with CPython, just making it much easier to let others add one if they want to because it makes sense for their workload. We have no plans to suggest shipping a JIT with CPython, just to make it an option for people to add in if they want (and if Yury's caching stuff goes in with an execution counter then even the one bit of true overhead we had will be part of CPython already, which makes it even more of an easy decision to consider the API we will eventually propose).

Fourth, it is not Windows-only by design. CoreCLR is cross-platform on all major OSs, so that is not a restriction (and honestly we are using CoreCLR simply because Dino used to work on the CLR team so he knows the bytecode really well; we easily could have used some other JIT to prove our point). The only reason Pyjion doesn't work with other OSs is momentum/laziness on Dino's and my part; Dino hacked together Pyjion at PyCon US 2015 and he is the most comfortable on Windows, so he just did it in Windows on Visual Studio and didn't bother to start with e.g. CMake to make it build on other OSs. Since we are still trying to work out some compatibility stuff, we would rather do that than worry about Linux or OS X support right now.

Fifth, if we manage to show that a C API can easily be added to CPython to make a JIT something that can simply be plugged in and be useful, then we will also have a basic JIT framework for people to use. As I said, our use of CoreCLR is just for ease of development. There is no reason we couldn't use ChakraCore, v8, LLVM, etc. But since all of these JIT compilers would need to know how to handle CPython bytecode, we have tried to design a framework where JIT compilers just need a wrapper to handle code emission, and the framework that we are building will handle driving the code emission (e.g., the wrapper needs to know how to emit add_integer(), but our framework handles when to do that).
Anyway, as I said, Pyjion is very much a work in progress. We hope to have something more solid to propose/discuss at the language summit at PyCon US 2016. The only reason I keep mentioning it is because what Victor is calling "JIT-like" is really "minimize doing extra work that's not needed" and that benefits everyone trying to do any computational work that takes extra time to speed up CPython (which includes Pyjion). IOW Yury's work combined with Victor's work could quite easily just spill out beyond just local caches and into allowing pluggable JITs in CPython.

On 3 February 2016 at 03:52, Brett Cannon <brett@python.org> wrote:
That could also be really interesting in the context of pymetabiosis [1] if it meant that PyPy could still at least partially JIT the Python code running on the CPython side of the boundary. Cheers, Nick. [1] https://github.com/rguillebert/pymetabiosis -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

2016-02-02 10:28 GMT+01:00 Victor Stinner <victor.stinner@gmail.com>:
It's been a long time since I last took a look at CPython (3.2), but if it hasn't changed a lot, then there might be some corner cases still waiting to be optimized. ;-) Just one thing that comes to my mind: has the stack depth calculation routine changed? It was suboptimal, and calculating a better number decreases stack allocation and improves frame usage.
True, but it might be mitigated in some ways, at least for built-in types. There are ideas about that, but they are a bit complicated to implement. The problem is with functions like len, which IMO should become attribute lookups ('foo'.len) or method calls ('foo'.len()). Then it'll be easier to accelerate their execution with one of the above ideas. However, such changes belong to Guido, who defines the language's structure/philosophy. IMO something like len should be part of the attributes exposed by an object: it's more "object-oriented". Whereas other things like open, file, sum, etc., are "general facilities". Regards, Cesare

On Sun, May 15, 2016 at 2:23 AM, Cesare Di Mauro <cesare.di.mauro@gmail.com> wrote:
This is still a problem and came up again recently: http://bugs.python.org/issue26549 -- Meador

2016-05-16 17:55 GMT+02:00 Meador Inge <meadori@gmail.com>:
I saw the last two comments of the issue: this is what I was talking about (in particular, the issue opened by Armin applies). However, there's another case where the situation is even worse. Let me show a small reproducer:

    def test(self):
        for i in range(self.count):
            with self:
                pass

The stack size reported by Python 2.7.11:

    >>> test.__code__.co_stacksize
    6

Adding another with statement:

    >>> test.__code__.co_stacksize
    7

But unfortunately with Python 3.5.1 the problem is much worse:

    >>> test.__code__.co_stacksize
    10

    >>> test.__code__.co_stacksize
    17
Here the situation is exacerbated by the fact that the WITH_CLEANUP instruction of Python 2.x was split into two (WITH_CLEANUP_START and WITH_CLEANUP_FINISH) in some Python 3 release. I don't know why two different instructions were introduced, but IMO it's better to have one instruction which handles all code finalization of the with statement, at least in this case. If there are other scenarios where two different instructions are needed, then ad-hoc instructions like those can be used. Regards, Cesare

participants (12):
- Brett Cannon
- Cesare Di Mauro
- Greg Ewing
- Mark Lawrence
- Meador Inge
- Nick Coghlan
- Peter Ludemann
- Stefan Behnel
- Steven D'Aprano
- Sven R. Kunze
- Victor Stinner
- Yury Selivanov