PEP 659: Specializing Adaptive Interpreter

Hi everyone, I would like to present PEP 659. This is an informational PEP about a key part of our plan to improve CPython performance for 3.11 and beyond. For those of you aware of the recent releases of Cinder and Pyston, PEP 659 might look similar. It is similar, but I believe PEP 659 offers better interpreter performance and is more suitable to a collaborative, open-source development model. As always, comments and suggestions are welcome. Cheers, Mark. Links: https://www.python.org/dev/peps/pep-0659/ https://github.com/facebookincubator/cinder https://github.com/pyston/pyston

On 5/12/2021 1:40 PM, Mark Shannon wrote:
This is an informational PEP about a key part of our plan to improve CPython performance for 3.11 and beyond.
As always, comments and suggestions are welcome.
The claim that starts the Motivation section, "Python is widely acknowledged as slow.", has multiple problems. While some people believe, or at least claim to believe, "Python is slow", others know that, as stated, the claim is false. Languages do not have a speed; only implementations running code for particular applications have a speed, or a speed relative to equivalent code in another language with a different runtime. The reason I am picking on this is that the meme 'Python is slow' is being morphed into 'Python is destroying the earth' (and should be abandoned, if not banned). Last fall, a science news journal (Nature News?) quoted a 'concerned scientist' saying just this. An internet troll repeated it last week on comp.lang.python (from where it leaked onto python-list). It is true that Python has characteristics that make it *relatively* difficult to write interpreters that are *relatively* fast in certain applications. But the opposite is also true. The language does *not* mandate that objects, their methods, and modules be written in the language. Hence, CPython implements builtin objects and functions and some stdlib modules in C and allows 3rd party modules written in C or C++ or Fortran. I believe the first killer app for Python, in the mid 1990s, was numerical computing with NumericalPython. Rather than being 'slow', CPython *enabled* people, with a few percent of added time, to access fast, heavily optimized C and Fortran libraries and do things they could not do in Fortran and that would have been much more difficult in C. My daughter's PhD thesis work is a recent example of using Python to access C libraries. The concerned scientist mentioned above noted, more or less correctly, that numerical code, such as neural-network code, is, say, 80x slower in pure Python than in compiled C. But he did not mention that serious numerical and scientific work in Python is not done with such code. I have seen this sort of bogus comparison before.
-- Terry Jan Reedy

Hi Terry, On 13/05/2021 5:32 am, Terry Reedy wrote:
I broadly agree, but CPython is largely synonymous with Python and CPython is slower than it could be. The phrase was not meant to upset anyone. How would you rephrase it, bearing in mind that it needs to be short?
It is a legitimate concern that CPython is bad for the environment, and one that I hope we can address by speeding up CPython. Since faster == less energy for the same amount of work, making CPython faster will reduce the amount of CO2 produced to do that work and hopefully make it less of a concern. Of course, compared to the environmental disaster that is BitCoin, it's not a big deal.
Yes, one of the great things about Python is that almost every library of any size has Python bindings. But there is a difference between making code that is already written in C/Fortran available to Python and telling people to write code in C/Fortran because their Python code is too slow. We want people to be able to write code in Python and have it perform at the level they would get from a good Javascript or lua implementation.
It is still important to speed up Python though. If a program does 95% of its work in a C++ library and 5% in Python, it can easily spend the majority of its time in Python because CPython is a lot slower than C++ (in general). Cheers, Mark.

On Thu, 13 May 2021 at 09:23, Mark Shannon <mark@hotpy.org> wrote:
How about simply "The CPython interpreter, while sufficiently fast for much use, could be faster"? Along with the following sentence, this seems to me to state the situation fairly but in a way that motivates this proposal. Paul

On Thu, May 13, 2021 at 9:18 AM Mark Shannon <mark@hotpy.org> wrote:
[...] hopefully make it less of a concern.
Of course, compared to the environmental disaster that is BitCoin, it's not a big deal.
Every little helps. Please switch off the light as you leave the room. [...]
It is still important to speed up Python though.
Agreed.

I suggest we keep it really simple, and name the implementation. Building on Steve Holden’s suggestion: There is broad interest in improving the performance of the cPython runtime. (Interpreter?) -CHB -- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

On Tue, May 18, 2021 at 8:51 PM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
I feel the need to redress the balance of names here. This thread has had a mere two Chrises so far, and I am improving that statistic by 50%. ChrisA (used to style his name as "chrisa" but people complained that I looked like a girl)

On Thu, May 13, 2021 at 09:18:27AM +0100, Mark Shannon wrote:
Work expands to fill the time available: if Python is 10% more efficient, people will use that extra speed to do 10% more work. There will be no nett savings in CO2 and if anything a faster Python will lead to more people using it and hence a nett increase in Python's share of the CO2 emissions. Let's make Python faster, but don't fool ourselves that we're doing it for the environment. Every time people improve the efficiency of some resource, we respond by using more of that resource. -- Steve

Mark Shannon writes:
On 13/05/2021 5:32 am, Terry Reedy wrote:
The claim that starts the Motivation section, "Python is widely acknowledged as slow.", has multiple problems.
How would you rephrase it, bearing in mind that it needs to be short?
We can make CPython run significantly faster, at a reasonable cost in developer time, without otherwise changing the semantics of the language. If you have good justification for saying "as fast as the best JS/Lua implementations" or whatever, feel free to substitute that for "significantly faster". And now this:
It is a legitimate concern that CPython is bad for the environment,
It is not. I do this for a living (5 hours in a research hackathon just this afternoon on a closely related topic[1]), and I assure you that such "concern" is legitimate only as a matter of purely speculative metaphysics. We don't have the data to analyze the possibilities, and we don't even have the models if we did have the data. The implied model that gets you from your tautology to "concern" is just plain wrong -- work to be done is not independent of the cost of doing it[2], not to mention several other relevant variables, and cannot be made so in a useful model.
and hopefully make it less of a concern.
It is only a concern in the Tucker Carlson "just asking questions" mode of "concern". Really -- it's *that* bad.
So say that. Nothing to be ashamed of there! The work you propose to do is valuable for a lot of valid reasons, the most important of which is "because we can and there's no immediate downside".[3] Stick to those. Footnotes: [1] Yoshida, M., Turnbull, S.J. Voluntary provision of environmental offsets under monopolistic competition. Int Tax Public Finance (2021). https://doi.org/10.1007/s10797-020-09630-5. Paywalled, available from the author, rather specialized, though. Two works-in-progress are much more closely related, but I have a paranoid coauthor so can't say more at this time. :-) [2] As Steven d'Aprano points out colorfully, using Parkinson's Law. [3] Look up Braess's Paradox for a classic and mathematically simple example of how reducing cost "with no immediate downside" can increase expense "once everything works itself out."

On 5/13/2021 4:18 AM, Mark Shannon wrote:
Others have given some fine suggestions. Take your pick. [snip]
We want people to be able to write code in Python and have it perform at the level they would get from a good Javascript or lua implementation.
I agree with others that this is a good way to state the goal. It also seems on the face of it reasonable, though not trivial. I get the impression that you are proposing to use python-adapted variations of techniques already used for such implementations.
It is still important to speed up Python though.
I completely agree. Some application areas are amenable to speedup by resorting to C libraries, often already available. Others are not. The latter involve lots of idiosyncratic business logic, individual numbers rather than arrays of numbers, and strings. Numpy-based applications gain firstly from using unboxed arrays of machine ints and floats instead of lists (and lists of lists) of boxed ints and floats, and secondly from C- or assembly-coded routines. Python strings are already arrays of machine ints (codepoints). Basic operations on strings, such as 'substring in string', are already coded in C working on machine values. So the low-hanging fruit has already been picked.
I believe the ratio for the sort of numerical computing getting bogus complaints is sometimes more like 95% of *time* in compiled C and only, say, 5% of *time* in the Python interpreter. So even if the interpreter ran instantly, it would make also no difference -- for such applications. -- Terry Jan Reedy
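[Editor's note: a minimal illustration of Terry's point about boxed lists versus unboxed arrays backed by C routines. The exact timings depend on the machine and Python version; the point is only that the second reduction never touches a boxed int object in the interpreter loop.]

    import time
    import numpy as np  # assumes NumPy is installed

    data_list = list(range(1_000_000))   # one million boxed int objects
    data_array = np.arange(1_000_000)    # one unboxed C array of machine ints

    t0 = time.perf_counter()
    total = 0
    for x in data_list:                  # interpreter loop over boxed objects
        total += x
    t1 = time.perf_counter()

    np_total = int(data_array.sum())     # a single call into an optimized C loop
    t2 = time.perf_counter()

    assert total == np_total
    print(f"pure-Python loop: {t1 - t0:.4f}s   numpy sum: {t2 - t1:.4f}s")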

On Thu, 20 May 2021 at 04:58, Terry Reedy <tjreedy@udel.edu> wrote:
Not necessarily because if the interpreter is faster then it opens up new options that perhaps don't involve the same C routines. The situation right now is that it is often faster to do more "computation" than needed using efficient C routines rather than do precisely what is needed in bare Python. If the bare Python part becomes faster then maybe you don't need the C routine at all. To give a concrete example, in SymPy I have written a pure Python implementation of typed sparse matrices (this is much faster than the public Matrix class so don't compare with that). I would like to use the flint library to speed up some of these matrix calculations and the flint library has a highly optimised C/assembly implementation of dense matrices of arbitrary precision integers. Which of these implementations is faster for e.g. matrix multiplication depends on how sparse the matrix actually is. If I have a large matrix say 1000 x 1000 and only 1% of the elements are nonzero then the pure Python sparse implementation is faster (it can be much faster as the density reduces since it does not have the same big-O characteristics). On the other hand for fully dense matrices where all elements are nonzero the flint implementation is consistently around 100x faster. The break even point where both implementations take equal time is around about 5% density. What that means is that for a 1000 x 1000 matrix with 10% of elements nonzero it is faster to ask flint to construct an enormous dense matrix and perform a huge number of arithmetic operations (mostly involving zeros) than it is to use a pure Python implementation that has more favourable asymptotic complexity and theoretically computes the result with 100x fewer arithmetic "operations". In this situation there is a sliding scale where the faster the Python interpreter gets the less often you benefit from calling the C routine in the first place. Although this is a very specific example it illustrates something that I see very often which is that while the efficient C routines can make things "run at the speed of C" you can often end up optimising things to use an approach that would seem inefficient if you were working in C directly. This happens because it works out faster from the perspective of pure Python code that is encumbered by interpreter overhead and has a limited range of C routines to choose from. If the interpreter overhead is less then the options to choose from are improved. Also for many applications it is much easier for the programmer to write an algorithm directly in loops rather than coming up with a vectorised version based on e.g. numpy arrays. Vectorisation as a way of optimising code is actually work for the programmer. There is another tradeoff here which is not about C speed vs Python speed but about programmer time vs CPU time. If a straight-forward Python implementation is already "fast enough" then you don't need to spend time thinking about how to translate that into something that would possibly run faster (potentially at the expense of code readability). In the case of SymPy/flint if the maximum speed gain of flint was only 10x then I might not bother using it at all to avoid the complexity of having multiple implementations to choose from and external dependencies etc. -- Oscar
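[Editor's note: to show the shape of the trade-off Oscar describes, here is a minimal sketch. It is an illustration only, not SymPy's DomainMatrix or python-flint's actual API: a pure-Python sparse multiply whose cost scales with the number of nonzero entries, whereas a dense C routine always does on the order of n**3 work regardless of density.]

    import random
    import time

    def sparse_matmul(a, b, n):
        """Multiply two n x n matrices stored as {(i, j): value} dicts of nonzeros."""
        # Group b's nonzeros by row so each nonzero of a is paired only with
        # the entries it can actually combine with.
        b_rows = {}
        for (k, j), v in b.items():
            b_rows.setdefault(k, []).append((j, v))
        c = {}
        for (i, k), u in a.items():
            for j, v in b_rows.get(k, ()):
                key = (i, j)
                c[key] = c.get(key, 0) + u * v
        return c

    def random_sparse(n, density, rng):
        nnz = int(n * n * density)
        return {(rng.randrange(n), rng.randrange(n)): rng.randrange(1, 10)
                for _ in range(nnz)}

    if __name__ == "__main__":
        rng = random.Random(0)
        n, density = 300, 0.01          # 1% of entries nonzero
        a = random_sparse(n, density, rng)
        b = random_sparse(n, density, rng)
        t0 = time.perf_counter()
        sparse_matmul(a, b, n)
        print(f"pure-Python sparse multiply: {time.perf_counter() - t0:.3f}s")
        # The sparse loop does roughly density**2 * n**3 multiply-adds, versus
        # n**3 for a dense routine, which is why the crossover density moves
        # whenever the interpreter overhead per operation changes.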

On 5/20/2021 10:49 AM, Oscar Benjamin wrote:
On Thu, 20 May 2021 at 04:58, Terry Reedy <tjreedy@udel.edu> wrote:
'also' was meant to be 'almost'
Not necessarily
In the context I carefully defined, where Python is falsely accused of endangering the earth, by people who set up strawpeople images of how Python is actually used and who care nothing about programmer time and joy, yes, necessarily. However, in the related context you define, faster Python could help save the earth by reducing the need for brute-force C routines when they are grossly inefficient. How ironic that would be.
-- Terry Jan Reedy

Oscar Benjamin writes:
Sure, but what's also happening here is that you're optimizing programmer cost by not writing the sparse algorithm in C, C++, or Rust. So I haven't done the math, but I guess to double the percentage of nonzero matrix elements that constitutes the breakeven point you need to double the speed of the Python runtime, and I don't think that's going to happen any time soon. As far as I can see, any reasonably anticipatable speedup is quite marginal for you (a 10% speedup in arithmetic is, I hope, a dream, but that would get you from 5% to 5.5% -- is that really a big deal?)
Absolutely. But the real problem you face is that nobody is writing routines for sparse matrices in languages that compile to machine code (or worse, not wrapping already available C libraries).
Sure, but my guesstimate is that that would require a 90% speedup in Python arithmetic. Really, is that going to happen? I feel your pain (even though for me it's quite theoretical, my own data is dense, even impenetrable). But I just don't see even potential 10% or 20% speedups in Python overcoming the generic need for programmers to either (1) accept the practical limits to the size of data they can work with in Python or (2) bite the bullet and write C (or ctypes) that can do the calculations 100x as fast as a well-tuned Python program. I'm all for Mark's work, and I'm glad somebody's willing to pay him some approximation to what he's worth, even though I probably won't benefit myself (nor my students). But I really don't see the economics of individual programmers changing very much -- 90% of us will just use the tried-and-true packages (some of which are accelerated like NumPy and Pandas), 9% will think for ten minutes and choose (1) or (2) above, and 1% will do the careful breakeven analysis you do, and write (and deal with the annoyances of) hybrid code. Steve

I find this whole conversation confusing -- does anyone really think a substantial performance boost to cPython is not a "good thing"? Worth the work? Maybe not, but it seems that Mark, Guido, and MS think it is -- more power to them! Anyway: "potential 10% or 20% speedups in Python" -- I believe the folks involved think they may get a factor of two speedup -- but in any case, Oscar has a point -- there is a trade-off of effort vs performance, and increasing the performance of cPython moves that trade-off point, even if just a little. I like Oscar's example, because it's got hard numbers attached to it, but the principle is the same for any time you are considering writing, or even using, a non-Python library.
Oddly missing from this conversation is PyPy -- which can buy you a lot of performance for some types of code in pure Python, and things like Cython or numba, which can buy you a lot with slightly modified Python. All those options are why Python is very useful today -- but none of them make the case that making cPython run faster isn't a worthy goal. -CHB -- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

Christopher Barker writes:
I find this whole conversation confusing -- does anyone really think a substantial performance boost to cPython is not a "good thing"?
I don't understand why you think anybody, except maybe some crank who caused the editors of Science or whatever it was to seriously embarrass themselves, opposes the goal of making cPython run faster. All I want is some sanity when advocating changes to Python. For performance work, tell us how much faster cPython is going to be, explain where you got your numbers, and let us decide how we'll use the cycles saved. There's been a lot of nonsense peddled in support of this proposal by the proponent and third parties, when all anybody needs is: Mark says he can make cPython noticeably faster, and we believe him! More important, Microsoft does. Steve

On 5/12/2021 1:40 PM, Mark Shannon wrote:
This is an informational PEP about a key part of our plan to improve CPython performance for 3.11 and beyond.
What is the purpose of this PEP? It seems in part to be like a Standards Track PEP in that it proposes a new (revised) implementation idea for the CPython bytecode interpreter. Do you intend this to not constitute approval of even the principle? One of the issues in the new project gave formulas for the cost versus benefit calculations underlying specialization. Depending on your purpose, it might be good to include them. They certainly gave some clarity to me. -- Terry Jan Reedy

Hi Terry, On 13/05/2021 8:20 am, Terry Reedy wrote:
I will make it a standards PEP if anyone feels that would be better. We can implement PEP 659 incrementally, without any large changes to the implementation or any to the language or API/ABI, so a standards PEP didn't seem necessary to us. However, because it is a large change to the implementation, it seemed worth documenting and doing so in a clearly public fashion. Hence the informational PEP.
Which ones in particular? I can add something like them to the PEP. Cheers, Mark.

On Thu, May 13, 2021 at 1:38 AM Mark Shannon <mark@hotpy.org> wrote:
I personally think it should be a Standards Track PEP. This PEP isn't documenting some detail like PEP 13 or some release schedule, but is instead proposing a rather major change to the interpreter which a lot of us will need to understand in order to support the code (and I do realize the entire area of "what requires a PEP and what doesn't" is very hazy). -Brett

On Tue, May 25, 2021 at 12:34 PM Brett Cannon <brett@python.org> wrote:
Does that also mean you think the design should be completely hashed out and approved by the SC ahead of merging the implementation? Given the amount of work, that would run into another issue -- many of the details of the design can't be fixed until the implementation has proceeded, and we'd end up with a long-living fork of the implementation followed by a giant merge. My preference (and my promise at the Language Summit) is to avoid mega-PRs and instead work on this incrementally. Now, we've done similar things before (for example, the pattern matching implementation was a long-living branch), but the difference is that for pattern matching, the implementation followed the design, whereas for the changes to the bytecode interpreter that we're undertaking here, much of the architecture will be designed as the implementation proceeds, based on what we learn during the implementation. Or do you think the "Standards Track" PEP should just codify general agreement that we're going to implement a specializing adaptive interpreter, with the level of detail that's currently in the PEP? I don't recall other standards track PEPs that don't also spell out the specification of the proposal in detail. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

On Tue, May 25, 2021 at 1:50 PM Łukasz Langa <lukasz@langa.pl> wrote:
I think it's different -- the problems with the Gilectomy were pretty predictable (slower single-core perf due to way more locking calls), but it was not predictable whether Larry would be able to overcome them (I was rooting for him the whole time). Here, we're looking at something where Mark has prototyped the proposed approach extensively (HotPy, HotPy2), and the question is more whether Python 3.11 is going to be 15% faster or 50%. And some of the ideas have also been prototyped by the existing inline caches (some of the proposal is just to do more of those, and reducing the overhead by specializing opcodes), and further validated by Dino's work at Facebook/Instagram on Shadowcode (part of Cinder), which also specializes opcodes. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
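[Editor's note: the specialization idea Guido refers to can be shown with a toy model in pure Python. This is only an illustration of quickening with a guard and de-optimization; the class and method names are made up for the example and it is not CPython's actual C implementation.]

    class BinaryAddAdaptive:
        """A stand-in for one adaptive BINARY_ADD instruction at one call site."""
        WARMUP = 8

        def __init__(self):
            self.counter = 0
            self.specialized_for = None       # e.g. int once the site looks stable

        def execute(self, left, right):
            if self.specialized_for is not None:
                # Specialized fast path: one cheap guard, then the operation.
                if type(left) is self.specialized_for and type(right) is self.specialized_for:
                    return left + right       # stands in for a direct machine-int add
                self.specialized_for = None   # guard failed: de-optimize
            # Adaptive (generic) path, also used while warming up.
            self.counter += 1
            if self.counter >= self.WARMUP and type(left) is type(right):
                self.specialized_for = type(left)
            return left + right

    add_site = BinaryAddAdaptive()
    for i in range(20):
        add_site.execute(i, i + 1)            # specializes for int after warm-up
    add_site.execute("a", "b")                # guard fails, de-optimizes, still correct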

To potentially help provide a little bit of additional detail around our approach I've spent some time writing up our internal details of the shadow bytecode implementation, and landed that in our Cinder repo here: https://github.com/facebookincubator/cinder/blob/cinder/3.8/CinderDoc/shadow.... That might at least spark discussion or ideas about possible internal implementation details or things which could be different/more efficient in our implementation. I've also had a version of it against 3.10 going for a while (as internally we're still at 3.8) and I've updated it to a relatively recent merge of 3.11 main. I've pushed the latest version of that here: https://github.com/DinoV/cpython/tree/shadowcode_rebase_2021_05_12. The 3.11 version obviously isn't as battle-tested as what we've been running in production for some time now but it is pretty much the same. It is missing our improved global caching which uses dictionary watches though. And it is a rather large PR (almost 7k lines) but over 1/3rd of that is the test cases. Also just to inform the discussion around potential performance benefits, here's how that alone is currently benchmarking versus the base commit:

cpython_310_opt_rig.json
========================
Performance version: 1.0.1
Report on Linux-5.2.9-229_fbk15_hardened_4185_g357f49b36602-x86_64-with-glibc2.28
Number of logical CPUs: 48
Start date: 2021-05-17 21:57:08.095822
End date: 2021-05-17 22:40:33.374232

cpython_ghdino_opt_rig.json
===========================
Performance version: 1.0.1
Report on Linux-5.2.9-229_fbk15_hardened_4185_g357f49b36602-x86_64-with-glibc2.28
Number of logical CPUs: 48
Start date: 2021-05-21 17:25:24.410644
End date: 2021-05-21 18:02:53.524314

Benchmark | cpython_310_opt_rig.json | cpython_ghdino_opt_rig.json | Change | Significance
2to3 | 498 ms | 459 ms | 1.09x faster | Significant (t=15.60)
chameleon | 13.4 ms | 12.6 ms | 1.07x faster | Significant (t=11.10)
chaos | 163 ms | 135 ms | 1.21x faster | Significant (t=33.07)
crypto_pyaes | 171 ms | 147 ms | 1.16x faster | Significant (t=24.93)
deltablue | 11.7 ms | 8.38 ms | 1.40x faster | Significant (t=70.51)
django_template | 73.7 ms | 68.1 ms | 1.08x faster | Significant (t=13.12)
dulwich_log | 108 ms | 98.6 ms | 1.10x faster | Significant (t=18.11)
fannkuch | 734 ms | 731 ms | 1.00x faster | Not significant
float | 166 ms | 140 ms | 1.18x faster | Significant (t=29.38)
go | 345 ms | 305 ms | 1.13x faster | Significant (t=31.29)
hexiom | 14.4 ms | 13.1 ms | 1.10x faster | Significant (t=15.95)
json_dumps | 19.6 ms | 18.1 ms | 1.09x faster | Significant (t=13.85)
json_loads | 37.5 us | 34.8 us | 1.08x faster | Significant (t=16.23)
logging_format | 14.5 us | 10.9 us | 1.33x faster | Significant (t=43.42)
logging_silent | 274 ns | 238 ns | 1.15x faster | Significant (t=23.00)
logging_simple | 13.4 us | 10.2 us | 1.31x faster | Significant (t=46.73)
mako | 23.1 ms | 22.3 ms | 1.04x faster | Significant (t=5.78)
meteor_contest | 151 ms | 152 ms | 1.01x slower | Not significant
nbody | 217 ms | 208 ms | 1.04x faster | Significant (t=6.52)
nqueens | 153 ms | 145 ms | 1.06x faster | Significant (t=10.43)
pathlib | 29.2 ms | 24.5 ms | 1.19x faster | Significant (t=27.86)
pickle | 14.6 us | 14.6 us | 1.00x slower | Not significant
pickle_dict | 36.3 us | 35.4 us | 1.03x faster | Significant (t=6.24)
pickle_list | 5.55 us | 5.44 us | 1.02x faster | Significant (t=3.42)
pickle_pure_python | 708 us | 576 us | 1.23x faster | Significant (t=56.02)
pidigits | 262 ms | 255 ms | 1.03x faster | Significant (t=6.37)
pyflate | 1.02 sec | 919 ms | 1.11x faster | Significant (t=24.26)
python_startup | 13.1 ms | 13.1 ms | 1.01x faster | Not significant
python_startup_no_site | 8.69 ms | 8.56 ms | 1.01x faster | Not significant
raytrace | 758 ms | 590 ms | 1.28x faster | Significant (t=62.09)
regex_compile | 256 ms | 227 ms | 1.13x faster | Significant (t=29.88)
regex_dna | 256 ms | 256 ms | 1.00x faster | Not significant
regex_effbot | 4.29 ms | 4.35 ms | 1.01x slower | Not significant
regex_v8 | 35.7 ms | 35.5 ms | 1.00x faster | Not significant
richards | 117 ms | 98.3 ms | 1.19x faster | Significant (t=31.70)
scimark_fft | 559 ms | 573 ms | 1.02x slower | Significant (t=-6.02)
scimark_lu | 254 ms | 249 ms | 1.02x faster | Not significant
scimark_monte_carlo | 162 ms | 126 ms | 1.29x faster | Significant (t=41.31)
scimark_sor | 305 ms | 281 ms | 1.09x faster | Significant (t=19.82)
scimark_sparse_mat_mult | 7.51 ms | 7.59 ms | 1.01x slower | Not significant
spectral_norm | 218 ms | 220 ms | 1.01x slower | Not significant
telco | 9.65 ms | 9.56 ms | 1.01x faster | Not significant
unpack_sequence | 82.4 ns | 75.5 ns | 1.09x faster | Significant (t=15.12)
unpickle | 21.0 us | 19.9 us | 1.05x faster | Significant (t=8.02)
unpickle_list | 6.49 us | 6.76 us | 1.04x slower | Significant (t=-7.46)
unpickle_pure_python | 494 us | 419 us | 1.18x faster | Significant (t=26.60)
xml_etree_generate | 144 ms | 140 ms | 1.03x faster | Significant (t=3.75)
xml_etree_iterparse | 167 ms | 159 ms | 1.04x faster | Significant (t=7.17)
xml_etree_parse | 212 ms | 209 ms | 1.02x faster | Not significant
xml_etree_process | 114 ms | 102 ms | 1.11x faster | Significant (t=16.92)

Skipped 5 benchmarks only in cpython_310_opt_rig.json: sympy_expand, sympy_integrate, sympy_str, sympy_sum, tornado_http

And here's the almost entirely non-significant memory benchmarks:

cpython_310_mem.json
====================
Performance version: 1.0.1
Report on Linux-5.2.9-229_fbk15_hardened_4185_g357f49b36602-x86_64-with-glibc2.28
Number of logical CPUs: 48
Start date: 2021-05-18 13:09:32.100009
End date: 2021-05-18 13:46:54.655953

cpython_ghdino_mem.json
=======================
Performance version: 1.0.1
Report on Linux-5.2.9-229_fbk15_hardened_4185_g357f49b36602-x86_64-with-glibc2.28
Number of logical CPUs: 48
Start date: 2021-05-19 17:17:30.891269
End date: 2021-05-20 10:44:09.117795

Benchmark | cpython_310_mem.json | cpython_ghdino_mem.json | Change | Significance
2to3 | 21.2 MB | 21.6 MB | 1.02x larger | Not significant
chameleon | 16.5 MB | 16.5 MB | 1.00x smaller | Not significant
chaos | 8303.8 kB | 8170.0 kB | 1.02x smaller | Not significant
crypto_pyaes | 7630.8 kB | 7549.6 kB | 1.01x smaller | Not significant
deltablue | 9620.0 kB | 9839.4 kB | 1.02x larger | Significant (t=-8.20)
django_template | 22.3 MB | 22.6 MB | 1.01x larger | Not significant
dulwich_log | 11.6 MB | 11.7 MB | 1.00x larger | Not significant
fannkuch | 7174.6 kB | 7195.0 kB | 1.00x larger | Not significant
float | 16.7 MB | 18.3 MB | 1.10x larger | Not significant
go | 9132.4 kB | 9170.4 kB | 1.00x larger | Not significant
hexiom | 8311.8 kB | 8372.6 kB | 1.01x larger | Not significant
json_dumps | 9406.6 kB | 9413.0 kB | 1.00x larger | Not significant
json_loads | 7444.0 kB | 7453.0 kB | 1.00x larger | Not significant
logging_format | 11.0 MB | 10.1 MB | 1.08x smaller | Significant (t=17.51)
logging_silent | 7651.0 kB | 7706.2 kB | 1.01x larger | Not significant
logging_simple | 10.3 MB | 10.4 MB | 1.01x larger | Not significant
mako | 13.7 MB | 13.9 MB | 1.02x larger | Not significant
meteor_contest | 9474.6 kB | 9512.0 kB | 1.00x larger | Not significant
nbody | 7365.4 kB | 7461.4 kB | 1.01x larger | Not significant
nqueens | 7471.0 kB | 7487.4 kB | 1.00x larger | Not significant
pathlib | 8682.4 kB | 8732.0 kB | 1.01x larger | Not significant
pickle | 7935.2 kB | 7942.8 kB | 1.00x larger | Not significant
pickle_dict | 7930.6 kB | 7933.2 kB | 1.00x larger | Not significant
pickle_list | 7934.2 kB | 7956.6 kB | 1.00x larger | Not significant
pickle_pure_python | 7962.4 kB | 7971.2 kB | 1.00x larger | Not significant
pidigits | 7396.4 kB | 7435.0 kB | 1.01x larger | Not significant
pyflate | 36.9 MB | 37.2 MB | 1.01x larger | Not significant
python_startup | 9499.6 kB | 9624.0 kB | 1.01x larger | Not significant
python_startup_no_site | 9479.6 kB | 9630.8 kB | 1.02x larger | Not significant
raytrace | 8239.0 kB | 8273.0 kB | 1.00x larger | Not significant
regex_compile | 8602.2 kB | 8662.6 kB | 1.01x larger | Not significant
regex_dna | 15.0 MB | 15.1 MB | 1.01x larger | Not significant
regex_effbot | 8054.6 kB | 8094.8 kB | 1.00x larger | Not significant
regex_v8 | 13.0 MB | 13.0 MB | 1.00x larger | Not significant
richards | 7837.2 kB | 7841.2 kB | 1.00x larger | Not significant
scimark_fft | 8037.0 kB | 8118.8 kB | 1.01x larger | Not significant
scimark_lu | 8059.2 kB | 8107.2 kB | 1.01x larger | Not significant
scimark_monte_carlo | 7968.2 kB | 8020.2 kB | 1.01x larger | Not significant
scimark_sor | 7995.0 kB | 8065.0 kB | 1.01x larger | Not significant
scimark_sparse_mat_mult | 8512.2 kB | 8549.4 kB | 1.00x larger | Not significant
spectral_norm | 7184.4 kB | 7217.8 kB | 1.00x larger | Not significant
telco | 7857.2 kB | 7672.2 kB | 1.02x smaller | Significant (t=38.26)
unpack_sequence | 8809.6 kB | 8835.8 kB | 1.00x larger | Not significant
unpickle | 7943.4 kB | 7965.8 kB | 1.00x larger | Not significant
unpickle_list | 7948.6 kB | 7925.6 kB | 1.00x smaller | Not significant
unpickle_pure_python | 7922.0 kB | 7955.8 kB | 1.00x larger | Not significant
xml_etree_generate | 11.5 MB | 11.7 MB | 1.02x larger | Not significant
xml_etree_iterparse | 12.1 MB | 12.0 MB | 1.01x smaller | Not significant
xml_etree_parse | 11.6 MB | 11.5 MB | 1.01x smaller | Not significant
xml_etree_process | 12.1 MB | 12.5 MB | 1.03x larger | Significant (t=-3.04)

On Tue, May 25, 2021 at 2:05 PM Guido van Rossum <guido@python.org> wrote:

On Tue., May 25, 2021, 12:58 Guido van Rossum, <guido@python.org> wrote:
This. Having this as an informational PEP that's already marked as Active seems off somehow to me. I guess it feels more "we're doing this" (which I know isn't intended) rather than "this is our plan, what do you all think? All good?" I don't recall other standards track PEPs that don't also spell out the
specification of the proposal in detail.
I also am not aware of a PEP that's proposed restructuring the eval loop like this either. 😉 I'm personally fine with the detail and saying details may shift as things move forward and lessons are learned based on the scope and updating the PEP accordingly. But that's just me and I don't know if others agree (hence the reason I'm suggesting this be Standards Track). -Brett

On Tue, May 25, 2021 at 7:56 PM Brett Cannon <brett@python.org> wrote:
Right. I have no power to unilaterally decide that "we're doing this", in the sense of "we're definitely merging this", and neither do Eric and Mark. But given the reactions during and since the Language Summit I had assumed that there isn't much dissent, so that "it's okay to try this" seems a reasonable conclusion. And given *that*, I'm not sure that there's much of a difference between the two positions. But I'm not trying to stifle discussion, and there is plenty of work that we (the three authors) can do before we're at the point of no return. (In fact, even if we were to get called back, all we'd need to do would be to revert some commits -- this has happened before.)
Right. Usually we just discuss ideas to improve the eval loop on bpo or in PRs, occasionally on python-dev, but not in PEPs. Over the years the eval loop has become quite a bit more efficient than the loop I wrote three decades ago, but also much more complex, and there's less and less low-hanging fruit left. In the end the proposal here could make it easier to reason about the performance of the eval loop, because there will be fewer one-off hacks, and instead a more systematic approach.
Sure. Do you have a specific text change in mind (or even just a suggestion about where we should insert some language about details shifting and learning lessons? Are there other things you'd like to see changed in the PEP? Also, can you outline a specific process that you would be comfortable with here, given that we're breaking a certain amount of new ground here process-wise? Or should we just change the PEP type to Standards Track and submit it to the Steering Council for review? -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

Hi Brett, On 26/05/2021 3:56 am, Brett Cannon wrote:
The PEP is a "we're doing this" document. Maybe it shouldn't be a PEP at all? I've changed its status to "draft" for now. I want to document what we are doing as publicly as possible and a PEP seems like a good way to do that. I also want to reiterate that the PEP doesn't propose changing the language, libraries, Python API or C API in any way. It is just information about how we plan to speed up the interpreter.
Suppose it were a standards PEP, what would that mean if it were rejected? Rejection of a PEP is a choice in favor of an alternative, but what is that alternative? You can't simply say the "status quo" as that would implicitly prevent any development at all on the bytecode interpreter. Cheers, Mark. p.s. For those not at the language summit, here's my grand plan for CPython: https://docs.google.com/presentation/d/1_cvQUwO2WWsaySyCmIy9nj9by4JKnkbiPCqt...

On Wed, May 26, 2021 at 4:23 AM Mark Shannon <mark@hotpy.org> wrote:
Thanks!
Sure, but it isn't a minor change either; if it was then you would have done it already. 😉 This is an architectural shift in how the interpreter works, so it isn't a minor thing.
Then we don't do it.
Rejection of a PEP is a choice in favor of an alternative, but what is that alternative?
The status quo.
You can't simply say the "status quo" as that would implicitly prevent any development at all on the bytecode interpreter.
I don't think that logic holds; if *your* approach happened to be rejected in terms of how to change how the interpreter works it doesn't mean that *no* approach would be accepted (and you can tweak this to be about performance or anything else). PEPs are not submitted where it's "pick from these 2 options" or "pick this or you can never touch the topic/idea again". I think this is all a question of process and how do we want to handle these sorts of architectural changes that don't necessarily impact APIs (and thus users directly), but do impact all of us from a perspective of maintenance and understanding? Typically we haven't worried about it because quite frankly none of us have had the time to really tackle something this large or it was so speculative it was done outside of the repo (e.g. gilectomy). If people want to stick with informal consent, then great! If people want to have a more formal discussion, that works too! But my point is this falls into an odd gray area in our overall process that I am suggesting we try to figure out. -Brett

I've gotten some questions privately about my responses on this topic and so I wanted to take the opportunity to clarify a couple of things. First, I totally support the technical work that Guido, Eric, and Mark have planned. This whole line of questions is not meant to act as stop energy for this specific project in any way. I apologize if that didn't come across in my emails. Second, this is totally a process question from my POV. The answer in the end might be that a PEP is not even necessary and it's more about sharing a design doc, or maybe it's a PEP as Standards Track, or maybe an Informational PEP is the right thing in the end; I simply don't know and hence my questions on this topic. I'm just trying to figure out how we as a group want to discuss and handle projects of this size going forward in case getting corporate support to start tackling these bigger projects becomes more normal. On Wed, May 26, 2021 at 1:00 PM Brett Cannon <brett@python.org> wrote:

Hi, The gilectomy was mentioned earlier in the thread. This project was created by a core dev who had the permission to push his work directly into the development branch, but the work was done in a fork. Another optimization project example was my experimental "FAT Python" project. The main idea was to specialize functions with some assumptions, and check these assumptions at the function entry. I started in a fork and then wrote 3 PEPs to propose to merge the main changes into the Python development branch:
* PEP 509 -- Add a private version to dict
* PEP 510 -- Specialize functions with guards
* PEP 511 -- API for code transformers
PEP 509 was accepted since it could be used for other optimizations: Python now uses the dictionary version to optimize LOAD_GLOBAL. The two other PEPs were rejected. Even if I was very disappointed when they were rejected, it seems like it was a wise choice since I was never able to make Python significantly faster (like 2x faster). It was only 10-20% faster on some micro-benchmarks. I also had the permission to push directly, and I think that it was nice to confront my design and ideas with the community. I also understand that optimizing Python is really hard and it requires a lot of preparation work. It's hard to sell the preparation work since it only introduces regressions and noise without any concrete speedup. All of this work is required to implement the real interesting optimization. Moreover, few people are eager to review a Python fork with deep and large changes in Python internals. The work must be re-done step by step with more "atomic" (small) changes to have a readable Git history. The Instagram team behind the Cinder project may be interested to review Mark's work before it's merged into the Python development branch. A design PEP would be more convenient than reviewing a concrete implementation. Victor
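[Editor's note: a hedged illustration of the PEP 509 idea Victor mentions. The real mechanism lives in C inside the dict object and the LOAD_GLOBAL opcode; the class names below are invented for the example. A per-dict version tag lets a cached lookup be reused until the dict changes.]

    class VersionedDict(dict):
        """A dict that bumps a version counter on item assignment/deletion.
        (Only these two mutation paths are instrumented, for brevity.)"""
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.version = 0
        def __setitem__(self, key, value):
            super().__setitem__(key, value)
            self.version += 1
        def __delitem__(self, key):
            super().__delitem__(key)
            self.version += 1

    class CachedGlobalLoad:
        """Stands in for one LOAD_GLOBAL site with an inline cache."""
        def __init__(self, name):
            self.name = name
            self.cached_version = -1
            self.cached_value = None
        def load(self, globals_dict):
            if globals_dict.version == self.cached_version:
                return self.cached_value          # fast path: no dict lookup
            value = globals_dict[self.name]        # slow path: real lookup
            self.cached_version = globals_dict.version
            self.cached_value = value
            return value

    g = VersionedDict(x=1)
    load_x = CachedGlobalLoad("x")
    assert load_x.load(g) == 1
    assert load_x.load(g) == 1      # served from the cache
    g["x"] = 2                      # version bump invalidates the cache
    assert load_x.load(g) == 2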

I can't speak for others on the Cinder team, but I would definitely be happy to help review, and a design document would be great for that. I'm certainly curious what the design is for the Microsoft implementation and how it differs from our shadow code implementation. Right now I certainly don't have enough information to know what the differences are. And of course the reason to open source Cinder was to help avoid duplication of effort as obviously there's a lot of interest in making Python faster! So if there's interest in our shadow code implementation I'd be happy to work to break it out into more reviewable pieces as well - it's got some pretty natural ways to split it up. On Thu, May 27, 2021 at 2:21 PM Victor Stinner <vstinner@python.org> wrote:

On Wed, May 12, 2021 at 10:48 AM Mark Shannon <mark@hotpy.org> wrote:
I was curious what you meant by "is more suitable to a collaborative, open-source development model," but I didn't see it elaborated on in the PEP. If this is indeed a selling point, it might be worth mentioning that and saying why. --Chris

On Wed, May 12, 2021 at 14:45, Mark Shannon <mark@hotpy.org> wrote:
Hi Mark, I think this seems like a nice proposal... I do have some questions related to the PEP though (from the point of view of implementing a debugger over it some things are kind of vague to me): 1. When will the specialization happen? (i.e.: is bytecode expected to be changed while the code is running inside a frame or must it all be done prior to entering a frame on a subsequent call?) 2. When the adaptive specialization happens, will that be reflected on the actual bytecode seen externally in the frame or is that all internal? Will clients be able to make sense of that? -- i.e.: In the debugger right now I have a need on some occasions to detect the structure of the code from the bytecode itself (for instance to detect whether some exception would be handled or unhandled at raise time just given the bytecode). 3. Another example: I'm working right now on a feature to step into a method. To do that right now my approach is: - Compute the function call names and bytecode offsets in a given frame. - When a frame is called (during a frame.f_trace call), check the parent frame bytecode offset (frame.f_lasti) to detect if the last thing was the expected call (and if it was, break the execution). This seems reasonable given the current implementation, where bytecodes are all fixed and there's a mapping from the frame.f_lasti ... Will that still work with the specializing adaptive interpreter? 4. Will it still be possible to change the frame.f_code prior to execution from a callback set in `PyThreadState.interp.eval_frame` (which will change the code to add a breakpoint to the bytecode and later call `_PyEval_EvalFrameDefault`)? Note: this is done in the debugger so that Python can run without any tracing until the breakpoint is hit (tracing is set afterwards to actually pause the execution as well as doing step operations). Best regards, Fabio
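[Editor's note: a rough, simplified sketch of the step-into technique in point 3, reconstructed for illustration. It assumes the pre-3.11 behaviour that frame.f_lasti is the offset of the currently executing CALL instruction in the caller's unmodified bytecode; opcode names and offsets shift between CPython versions, which is exactly why Fabio is asking whether this keeps working.]

    import dis
    import sys

    def call_offsets(code):
        """Map bytecode offsets of call instructions to a best-effort target name."""
        offsets = {}
        last_name = None
        for ins in dis.get_instructions(code):
            if ins.opname in ("LOAD_GLOBAL", "LOAD_NAME", "LOAD_METHOD", "LOAD_ATTR"):
                last_name = ins.argval
            elif ins.opname.startswith("CALL"):
                offsets[ins.offset] = last_name
        return offsets

    def make_tracer(target_code, wanted_name):
        targets = call_offsets(target_code)
        def tracer(frame, event, arg):
            if event == "call" and frame.f_back is not None:
                caller = frame.f_back
                if caller.f_code is target_code and targets.get(caller.f_lasti) == wanted_name:
                    # On pre-3.11 interpreters, f_lasti points at the CALL opcode here.
                    print(f"stepped into {frame.f_code.co_name} "
                          f"called from offset {caller.f_lasti}")
            return None
        return tracer

    def helper():
        return 42

    def main():
        return helper() + 1

    sys.settrace(make_tracer(main.__code__, "helper"))
    main()
    sys.settrace(None)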

Hi Fabio, On 13/05/2021 7:11 pm, Fabio Zadrozny wrote:
The specialization is adaptive, so it can happen at any time during execution.
The bytecode, as externally visible, will be unchanged. All specializations will be internal and should be invisible to all Python tools.
If you are implementing this in Python, then everything should work as it does now. OOI, would inserting a breakpoint at offset 0 in the callee function work?
Since frame.f_code is read-only in Python, I assume you mean in C. I can make no guarantees about the layout or meaning of fields in the C frame struct, I'm afraid. But I'm sure we can get something to work for you. Cheers, Mark.
Best regards,
Fabio

Ok... this part is all done in Python, so, if frame.f_lasti is still updated properly according to the original bytecode while executing the super instructions, then all seems to work properly on my side ;).
OOI, would inserting a breakpoint at offset 0 in the callee function work?
Yes... if you're curious, for the breakpoint to actually work, what is done is to generate bytecode which calls a function to set the tracing and later generates a spurious line event (so that the tracing function is then able to make the pause). The related code that generates this bytecode would be: https://github.com/fabioz/PyDev.Debugger/blob/pydev_debugger_2_4_1/_pydevd_f...
Yes, it's indeed done in C (cython in this case... the related code for reference is: https://github.com/fabioz/PyDev.Debugger/blob/pydev_debugger_2_4_1/_pydevd_f... ). I'm fine in going through some other way or using some other API (this is quite specific to a debugger after all), I just wanted to let you know of the use case so that something along those lines can still be supported (currently on CPython, this is as low-overhead for debugging as I can think of, but since there'll be an adaptive bytecode specializer, using any other custom API would also be reasonable). Cheers, Fabio



Hi Terry, On 13/05/2021 5:32 am, Terry Reedy wrote:
I broadly agree, but CPython is largely synonymous with Python and CPython is slower than it could be. The phrase was not meant to upset anyone. How would you rephrase it, bearing in mind that needs to be short?
It is a legitimate concern that CPython is bad for the environment, and one that I hope we can address by speeding up CPython. Since, faster == less energy for the same amount of work, making CPython faster will reduce the amount of CO2 produced to do that work and hopefully make it less of a concern. Of course, compared to the environmental disaster that is BitCoin, it's not a big deal.
Yes, one of the great things about Python is that almost every library of any size has Python bindings. But there is a difference between making code that is already written in C/Fortran available to Python and telling people to write code in C/Fortran because their Python code is too slow. We want people to be able to write code in Python and have it perform at the level they would get from a good Javascript or lua implementation.
It is still important to speed up Python though. If a program does 95% of its work in a C++ library and 5% in Python, it can easily spend the majority of its time in Python because CPython is a lot slower than C++ (in general). Cheers, Mark.
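To put rough numbers on that (with an assumed, illustrative slowdown factor rather than a measured one): if CPython were, say, 50x slower than C++ per unit of work, a program doing 95% of its work in a C++ library and 5% in Python spends about 0.95 time units in C++ and 0.05 x 50 = 2.5 time units in the interpreter -- roughly 70% of its wall-clock time in Python.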

On Thu, 13 May 2021 at 09:23, Mark Shannon <mark@hotpy.org> wrote:
How about simply "The CPython interpreter, while sufficiently fast for much use, could be faster"? Along with the following sentence, this seems to me to state the situation fairly but in a way that motivates this proposal. Paul

On Thu, May 13, 2021 at 9:18 AM Mark Shannon <mark@hotpy.org> wrote:
[...] hopefully make it less of a concern.
Of course, compared to the environmental disaster that is BitCoin, it's not a big deal.
Every little helps. Please switch off the light as you leave the room. [...]
It is still important to speed up Python though.
Agreed.

I suggest we keep it really simple, and name the implementation. Building on Steve Holden’s suggestion: There is broad interest in improving the performance of the cPython runtime. (Interpreter?) -CHB -- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

On Tue, May 18, 2021 at 8:51 PM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
I feel the need to redress the balance of names here. This thread has had a mere two Chrises so far, and I am improving that statistic by 50%. ChrisA (used to style his name as "chrisa" but people complained that I looked like a girl)

On Thu, May 13, 2021 at 09:18:27AM +0100, Mark Shannon wrote:
Work expands to fill the time available: if Python is 10% more efficient, people will use that extra speed to do 10% more work. There will be no nett savings in CO2 and if anything a faster Python will lead to more people using it and hence a nett increase in Python's share of the CO2 emissions. Let's make Python faster, but don't fool ourselves that we're doing it for the environment. Every time people improve the efficiency of some resource, we respond by using more of that resource. -- Steve

Mark Shannon writes:
On 13/05/2021 5:32 am, Terry Reedy wrote:
The claim that starts the Motivation section, "Python is widely acknowledged as slow.", has multiple problems.
How would you rephrase it, bearing in mind that needs to be short?
We can make CPython run significantly faster, at a reasonable cost in developer time, without otherwise changing the semantics of the language. If you have good justification for saying "as fast as the best JS/Lua implementations" or whatever, feel free to substitute that for "significantly faster". And now this:
It is a legitimate concern that CPython is bad for the environment,
It is not. I do this for a living (5 hours in a research hackathon just this afternoon on a closely related topic[1]), and I assure you that such "concern" is legitimate only as a matter of purely speculative metaphysics. We don't have the data to analyze the possibilities, and we don't even have the models if we did have the data. The implied model that gets you from your tautology to "concern" is just plain wrong -- work to be done is not independent of the cost of doing it[2], not to mention several other relevant variables, and cannot be made so in a useful model.
and hopefully make it less of a concern.
It is only a concern in the Tucker Carlson "just asking questions" mode of "concern". Really -- it's *that* bad.
So say that. Nothing to be ashamed of there! The work you propose to do is valuable for a lot of valid reasons, the most important of which is "because we can and there's no immediate downside".[3] Stick to those. Footnotes: [1] Yoshida, M., Turnbull, S.J. Voluntary provision of environmental offsets under monopolistic competition. Int Tax Public Finance (2021). https://doi.org/10.1007/s10797-020-09630-5. Paywalled, available from the author, rather specialized, though. Two works-in-progress are much more closely related, but I have a paranoid coauthor so can't say more at this time. :-) [2] As Steven d'Aprano points out colorfully, using Parkinson's Law. [3] Look up Braess's Paradox for a classic and mathematically simple example of how reducing cost "with no immediate downside" can increase expense "once everything works itself out."

On 5/13/2021 4:18 AM, Mark Shannon wrote:
Others have given some fine suggestions. Take your pick. [snip]
We want people to be able to write code in Python and have it perform at the level they would get from a good Javascript or lua implementation.
I agree with others that this is a good way to state the goal. It also seems on the face of it reasonable, though not trivial. I get the impression that you are proposing to use python-adapted variations of techniques already used for such implementations.
It is still important to speed up Python though.
I completely agree. Some application areas are amenable to speedup by resorting to C libraries, often already available. Others are not. The latter involve lots of idiosyncratic business logic, individual numbers rather than arrays of numbers, and strings. Numpy-based applications gain firstly from using unboxed arrays of machine ints and floats instead of lists (and lists of lists) of boxed ints and floats, and secondly from C- or assembly-coded routines. Python strings are already arrays of machine ints (codepoints). Basic operations on strings, such as 'substring in string', are already coded in C working on machine values. So the low-hanging fruit has already been picked.
I believe the ratio for the sort of numerical computing getting bogus complaints is sometimes more like 95% of *time* in compiled C and only, say, 5% of *time* in the Python interpreter. So even if the interpreter ran instantly, it would make almost no difference -- for such applications. -- Terry Jan Reedy
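For comparison, a back-of-the-envelope Amdahl's-law figure (not from the thread): with 95% of wall-clock time in compiled C, an infinitely fast interpreter yields at most 1 / 0.95 ≈ 1.05x overall, and a 2x faster interpreter only about 1 / (0.95 + 0.05 / 2) ≈ 1.03x.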

On Thu, 20 May 2021 at 04:58, Terry Reedy <tjreedy@udel.edu> wrote:
Not necessarily, because if the interpreter is faster then it opens up new options that perhaps don't involve the same C routines. The situation right now is that it is often faster to do more "computation" than needed using efficient C routines rather than do precisely what is needed in bare Python. If the bare Python part becomes faster then maybe you don't need the C routine at all.

To give a concrete example, in SymPy I have written a pure Python implementation of typed sparse matrices (this is much faster than the public Matrix class so don't compare with that). I would like to use the flint library to speed up some of these matrix calculations and the flint library has a highly optimised C/assembly implementation of dense matrices of arbitrary precision integers. Which of these implementations is faster for e.g. matrix multiplication depends on how sparse the matrix actually is. If I have a large matrix, say 1000 x 1000, and only 1% of the elements are nonzero then the pure Python sparse implementation is faster (it can be much faster as the density reduces since it does not have the same big-O characteristics). On the other hand for fully dense matrices where all elements are nonzero the flint implementation is consistently around 100x faster. The break even point where both implementations take equal time is around about 5% density. What that means is that for a 1000 x 1000 matrix with 10% of elements nonzero it is faster to ask flint to construct an enormous dense matrix and perform a huge number of arithmetic operations (mostly involving zeros) than it is to use a pure Python implementation that has more favourable asymptotic complexity and theoretically computes the result with 100x fewer arithmetic "operations". In this situation there is a sliding scale where the faster the Python interpreter gets the less often you benefit from calling the C routine in the first place.

Although this is a very specific example it illustrates something that I see very often, which is that while the efficient C routines can make things "run at the speed of C" you can often end up optimising things to use an approach that would seem inefficient if you were working in C directly. This happens because it works out faster from the perspective of pure Python code that is encumbered by interpreter overhead and has a limited range of C routines to choose from. If the interpreter overhead is less, then the options to choose from are improved. Also for many applications it is much easier for the programmer to write an algorithm directly in loops rather than coming up with a vectorised version based on e.g. numpy arrays. Vectorisation as a way of optimising code is actually work for the programmer.

There is another tradeoff here which is not about C speed vs Python speed but about programmer time vs CPU time. If a straight-forward Python implementation is already "fast enough" then you don't need to spend time thinking about how to translate that into something that would possibly run faster (potentially at the expense of code readability). In the case of SymPy/flint if the maximum speed gain of flint was only 10x then I might not bother using it at all to avoid the complexity of having multiple implementations to choose from and external dependencies etc. -- Oscar
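A minimal sketch of the kind of pure-Python sparse representation being described (illustrative only, not SymPy's actual implementation): the work is proportional to the number of matching nonzero pairs rather than to n**3, which is why it can beat a heavily optimised dense routine at low density.

    from collections import defaultdict

    def sparse_matmul(A, B):
        # A and B map (row, col) -> nonzero value; zeros are simply absent.
        rows_of_B = defaultdict(list)
        for (k, j), b in B.items():
            rows_of_B[k].append((j, b))
        C = defaultdict(int)
        for (i, k), a in A.items():
            for j, b in rows_of_B.get(k, ()):
                C[i, j] += a * b
        return {ij: v for ij, v in C.items() if v}

At 1% density a 1000 x 1000 operand has roughly 10,000 nonzeros and about 10 per row, so the inner loop touches on the order of 10**5 element pairs instead of the ~10**9 multiply-adds of a dense product; the faster the interpreter handles each pair, the higher the density at which this approach remains competitive.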

On 5/20/2021 10:49 AM, Oscar Benjamin wrote:
On Thu, 20 May 2021 at 04:58, Terry Reedy <tjreedy@udel.edu> wrote:
Not necessarily
In the context I carefully defined, where Python is falsely accused of endangering the earth, by people who set up strawpeople images of how Python is actually used and who care nothing about programmer time and joy, yes, necessarily. However, in the related context you define, faster Python could help save the earth by reducing the need for brute-force C routines when they are grossly inefficient. How ironic that would be.
-- Terry Jan Reedy

Oscar Benjamin writes:
Sure, but what's also happening here is that you're optimizing programmer cost by not writing the sparse algorithm in C, C++, or Rust. So I haven't done the math, but I guess to double the percentage of nonzero matrix elements that constitutes the breakeven point you need to double the speed of the Python runtime, and I don't think that's going to happen any time soon. As far as I can see, any reasonably anticipatable speedup is quite marginal for you (a 10% speedup in arithmetic is, I hope, a dream, but that would get you from 5% to 5.5% -- is that really a big deal?)
Absolutely. But the real problem you face is that nobody is writing routines for sparse matrices in languages that compile to machine code (or worse, not wrapping already available C libraries).
Sure, but my guesstimate is that that would require a 90% speedup in Python arithmetic. Really, is that going to happen? I feel your pain (even though for me it's quite theoretical, my own data is dense, even impenetrable). But I just don't see even potential 10% or 20% speedups in Python overcoming the generic need for programmers to either (1) accept the practical limits to the size of data they can work with in Python or (2) bite the bullet and write C (or ctypes) that can do the calculations 100x as fast as a well-tuned Python program. I'm all for Mark's work, and I'm glad somebody's willing to pay him some approximation to what he's worth, even though I probably won't benefit myself (nor my students). But I really don't see the economics of individual programmers changing very much -- 90% of us will just use the tried-and-true packages (some of which are accelerated like NumPy and Pandas), 9% will think for ten minutes and choose (1) or (2) above, and 1% will do the careful breakeven analysis you do, and write (and deal with the annoyances of) hybrid code. Steve

I find this whole conversation confusing -- does anyone really think a substantial performance boost to cPython is not a "good thing"? Worth the work? Maybe not, but it seems that Mark, Guido, and MS think it is -- more power to them! Anyway, regarding "potential 10% or 20% speedups in Python": I believe the folks involved think they may get a factor of two speedup -- but in any case, Oscar has a point -- there is a trade-off of effort vs performance, and increasing the performance of cPython moves that trade-off point, even if just a little. I like Oscar's example, because it's got hard numbers attached to it, but the principle is the same for any time you are considering writing, or even using, a non-python library.
Oddly missing from this conversation is PyPy -- which can buy you a lot of performance for some types of code in pure Python, and things like Cython or numba, which can buy you a lot with slightly modified Python. All those options are why Python is very useful today -- but none of them make the case that making cPython run faster isn't a worthy goal. -CHB -- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

Christopher Barker writes:
I find this whole conversation confusing -- does anyone really think a substantial performance boost to cPython is not a "good thing"?
I don't understand why you think anybody, except maybe some crank who caused the editors of Science or whatever it was to seriously embarrass themselves, opposes the goal of making cPython run faster. All I want is some sanity when advocating changes to Python. For performance work, tell us how much faster cPython is going to be, explain where you got your numbers, and let us decide how we'll use the cycles saved. There's been a lot of nonsense peddled in support of this proposal by the proponent and third parties, when all anybody needs is: Mark says he can make cPython noticeably faster, and we believe him! More important, Microsoft does. Steve

On 5/12/2021 1:40 PM, Mark Shannon wrote:
This is an informational PEP about a key part of our plan to improve CPython performance for 3.11 and beyond.
What is the purpose of this PEP? It seems in part to be like a Standards Track PEP in that it proposes a new (revised) implementation idea for the CPython bytecode interpreter. Do you intend this to not constitute approval of even the principle? One of the issues in the new project gave formulas for the cost versus benefit calculations underlying specialization. Depending on your purpose, it might be good to include them. They certainly gave some clarity to me. -- Terry Jan Reedy

Hi Terry, On 13/05/2021 8:20 am, Terry Reedy wrote:
I will make it a standards PEP if anyone feels that would be better. We can implement PEP 659 incrementally, without any single large change to the implementation and without any changes to the language or API/ABI, so a standards PEP didn't seem necessary to us. However, because it is a large change to the implementation overall, it seemed worth documenting and doing so in a clearly public fashion. Hence the informational PEP.
Which ones in particular? I can add something like them to the PEP. Cheers, Mark.

On Thu, May 13, 2021 at 1:38 AM Mark Shannon <mark@hotpy.org> wrote:
I personally think it should be a Standards Track PEP. This PEP isn't documenting some detail like PEP 13 or some release schedule, but is instead proposing a rather major change to the interpreter which a lot of us will need to understand in order to support the code (and I do realize the entire area of "what requires a PEP and what doesn't" is very hazy). -Brett

On Tue, May 25, 2021 at 12:34 PM Brett Cannon <brett@python.org> wrote:
Does that also mean you think the design should be completely hashed out and approved by the SC ahead of merging the implementation? Given the amount of work, that would run into another issue -- many of the details of the design can't be fixed until the implementation has proceeded, and we'd end up with a long-living fork of the implementation followed by a giant merge. My preference (and my promise at the Language Summit) is to avoid mega-PRs and instead work on this incrementally. Now, we've done similar things before (for example, the pattern matching implementation was a long-living branch), but the difference is that for pattern matching, the implementation followed the design, whereas for the changes to the bytecode interpreter that we're undertaking here, much of the architecture will be designed as the implementation proceeds, based on what we learn during the implementation. Or do you think the "Standards Track" PEP should just codify general agreement that we're going to implement a specializing adaptive interpreter, with the level of detail that's currently in the PEP? I don't recall other standards track PEPs that don't also spell out the specification of the proposal in detail. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

On Tue, May 25, 2021 at 1:50 PM Łukasz Langa <lukasz@langa.pl> wrote:
I think it's different -- the problems with the Gilectomy were pretty predictable (slower single-core perf due to way more locking calls), but it was not predictable whether Larry would be able to overcome them (I was rooting for him the whole time). Here, we're looking at something where Mark has prototyped the proposed approach extensively (HotPy, HotPy2), and the question is more whether Python 3.11 is going to be 15% faster or 50%. And some of the ideas have also been prototyped by the existing inline caches (some of the proposal is just to do more of those, and reducing the overhead by specializing opcodes), and further validated by Dino's work at Facebook/Instagram on Shadowcode (part of Cinder), which also specializes opcodes. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
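For readers unfamiliar with the technique, a toy sketch of the pattern these projects share -- a warm-up counter, a guarded fast path, and a fallback that de-specializes when the guard fails. This is purely illustrative (not CPython, Cinder, or HotPy code) and the threshold is invented:

    _MISSING = object()

    class AdaptiveLoadAttr:
        # A toy "instruction" that counts how often it runs, then rewrites
        # itself internally into a form guarded by a cheap type check.
        WARMUP = 8  # invented threshold

        def __init__(self, name):
            self.name = name
            self.counter = 0
            self.expected_type = None

        def execute(self, obj):
            if type(obj) is self.expected_type:              # cheap guard
                value = obj.__dict__.get(self.name, _MISSING)
                if value is not _MISSING:
                    return value                             # specialized fast path
                self.expected_type = None                    # miss: de-specialize
            self.counter += 1
            if self.counter >= self.WARMUP and self.name in getattr(obj, "__dict__", {}):
                self.expected_type = type(obj)               # specialize for this type
            return getattr(obj, self.name)                   # generic slow path

The real implementations do the equivalent in C, with the specialized form stored alongside (or in place of) the original instruction so that the bytecode visible to Python tools is unchanged.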

To potentially help provide a little bit of additional detail around our approach I've spent some time writing up our internal details of the shadow byte code implementation, and landed that in our Cinder repo here: https://github.com/facebookincubator/cinder/blob/cinder/3.8/CinderDoc/shadow.... That might at least spark discussion or ideas about possible internal implementation details or things which could be different/more efficient in our implementation. I've also had a version of it against 3.10 going for a while (as internally we're still at 3.8) and I've updated it to a relatively recent merge of 3.11 main. I've pushed the latest version of that here: https://github.com/DinoV/cpython/tree/shadowcode_rebase_2021_05_12. The 3.11 version obviously isn't as battle tested as what we've been running in production for some time now, but it is pretty much the same. It is missing our improved global caching which uses dictionary watches though. And it is a rather large PR (almost 7k lines) but over 1/3rd of that is the test cases. Also just to inform the discussion around potential performance benefits, here's how that alone is currently benchmarking versus the base commit:

cpython_310_opt_rig.json
========================
Performance version: 1.0.1
Report on Linux-5.2.9-229_fbk15_hardened_4185_g357f49b36602-x86_64-with-glibc2.28
Number of logical CPUs: 48
Start date: 2021-05-17 21:57:08.095822
End date: 2021-05-17 22:40:33.374232

cpython_ghdino_opt_rig.json
===========================
Performance version: 1.0.1
Report on Linux-5.2.9-229_fbk15_hardened_4185_g357f49b36602-x86_64-with-glibc2.28
Number of logical CPUs: 48
Start date: 2021-05-21 17:25:24.410644
End date: 2021-05-21 18:02:53.524314

| Benchmark | cpython_310_opt_rig.json | cpython_ghdino_opt_rig.json | Change | Significance |
| 2to3 | 498 ms | 459 ms | 1.09x faster | Significant (t=15.60) |
| chameleon | 13.4 ms | 12.6 ms | 1.07x faster | Significant (t=11.10) |
| chaos | 163 ms | 135 ms | 1.21x faster | Significant (t=33.07) |
| crypto_pyaes | 171 ms | 147 ms | 1.16x faster | Significant (t=24.93) |
| deltablue | 11.7 ms | 8.38 ms | 1.40x faster | Significant (t=70.51) |
| django_template | 73.7 ms | 68.1 ms | 1.08x faster | Significant (t=13.12) |
| dulwich_log | 108 ms | 98.6 ms | 1.10x faster | Significant (t=18.11) |
| fannkuch | 734 ms | 731 ms | 1.00x faster | Not significant |
| float | 166 ms | 140 ms | 1.18x faster | Significant (t=29.38) |
| go | 345 ms | 305 ms | 1.13x faster | Significant (t=31.29) |
| hexiom | 14.4 ms | 13.1 ms | 1.10x faster | Significant (t=15.95) |
| json_dumps | 19.6 ms | 18.1 ms | 1.09x faster | Significant (t=13.85) |
| json_loads | 37.5 us | 34.8 us | 1.08x faster | Significant (t=16.23) |
| logging_format | 14.5 us | 10.9 us | 1.33x faster | Significant (t=43.42) |
| logging_silent | 274 ns | 238 ns | 1.15x faster | Significant (t=23.00) |
| logging_simple | 13.4 us | 10.2 us | 1.31x faster | Significant (t=46.73) |
| mako | 23.1 ms | 22.3 ms | 1.04x faster | Significant (t=5.78) |
| meteor_contest | 151 ms | 152 ms | 1.01x slower | Not significant |
| nbody | 217 ms | 208 ms | 1.04x faster | Significant (t=6.52) |
| nqueens | 153 ms | 145 ms | 1.06x faster | Significant (t=10.43) |
| pathlib | 29.2 ms | 24.5 ms | 1.19x faster | Significant (t=27.86) |
| pickle | 14.6 us | 14.6 us | 1.00x slower | Not significant |
| pickle_dict | 36.3 us | 35.4 us | 1.03x faster | Significant (t=6.24) |
| pickle_list | 5.55 us | 5.44 us | 1.02x faster | Significant (t=3.42) |
| pickle_pure_python | 708 us | 576 us | 1.23x faster | Significant (t=56.02) |
| pidigits | 262 ms | 255 ms | 1.03x faster | Significant (t=6.37) |
| pyflate | 1.02 sec | 919 ms | 1.11x faster | Significant (t=24.26) |
| python_startup | 13.1 ms | 13.1 ms | 1.01x faster | Not significant |
| python_startup_no_site | 8.69 ms | 8.56 ms | 1.01x faster | Not significant |
| raytrace | 758 ms | 590 ms | 1.28x faster | Significant (t=62.09) |
| regex_compile | 256 ms | 227 ms | 1.13x faster | Significant (t=29.88) |
| regex_dna | 256 ms | 256 ms | 1.00x faster | Not significant |
| regex_effbot | 4.29 ms | 4.35 ms | 1.01x slower | Not significant |
| regex_v8 | 35.7 ms | 35.5 ms | 1.00x faster | Not significant |
| richards | 117 ms | 98.3 ms | 1.19x faster | Significant (t=31.70) |
| scimark_fft | 559 ms | 573 ms | 1.02x slower | Significant (t=-6.02) |
| scimark_lu | 254 ms | 249 ms | 1.02x faster | Not significant |
| scimark_monte_carlo | 162 ms | 126 ms | 1.29x faster | Significant (t=41.31) |
| scimark_sor | 305 ms | 281 ms | 1.09x faster | Significant (t=19.82) |
| scimark_sparse_mat_mult | 7.51 ms | 7.59 ms | 1.01x slower | Not significant |
| spectral_norm | 218 ms | 220 ms | 1.01x slower | Not significant |
| telco | 9.65 ms | 9.56 ms | 1.01x faster | Not significant |
| unpack_sequence | 82.4 ns | 75.5 ns | 1.09x faster | Significant (t=15.12) |
| unpickle | 21.0 us | 19.9 us | 1.05x faster | Significant (t=8.02) |
| unpickle_list | 6.49 us | 6.76 us | 1.04x slower | Significant (t=-7.46) |
| unpickle_pure_python | 494 us | 419 us | 1.18x faster | Significant (t=26.60) |
| xml_etree_generate | 144 ms | 140 ms | 1.03x faster | Significant (t=3.75) |
| xml_etree_iterparse | 167 ms | 159 ms | 1.04x faster | Significant (t=7.17) |
| xml_etree_parse | 212 ms | 209 ms | 1.02x faster | Not significant |
| xml_etree_process | 114 ms | 102 ms | 1.11x faster | Significant (t=16.92) |

Skipped 5 benchmarks only in cpython_310_opt_rig.json: sympy_expand, sympy_integrate, sympy_str, sympy_sum, tornado_http

And here's the almost entirely non-significant memory benchmarks:

cpython_310_mem.json
====================
Performance version: 1.0.1
Report on Linux-5.2.9-229_fbk15_hardened_4185_g357f49b36602-x86_64-with-glibc2.28
Number of logical CPUs: 48
Start date: 2021-05-18 13:09:32.100009
End date: 2021-05-18 13:46:54.655953

cpython_ghdino_mem.json
=======================
Performance version: 1.0.1
Report on Linux-5.2.9-229_fbk15_hardened_4185_g357f49b36602-x86_64-with-glibc2.28
Number of logical CPUs: 48
Start date: 2021-05-19 17:17:30.891269
End date: 2021-05-20 10:44:09.117795

| Benchmark | cpython_310_mem.json | cpython_ghdino_mem.json | Change | Significance |
| 2to3 | 21.2 MB | 21.6 MB | 1.02x larger | Not significant |
| chameleon | 16.5 MB | 16.5 MB | 1.00x smaller | Not significant |
| chaos | 8303.8 kB | 8170.0 kB | 1.02x smaller | Not significant |
| crypto_pyaes | 7630.8 kB | 7549.6 kB | 1.01x smaller | Not significant |
| deltablue | 9620.0 kB | 9839.4 kB | 1.02x larger | Significant (t=-8.20) |
| django_template | 22.3 MB | 22.6 MB | 1.01x larger | Not significant |
| dulwich_log | 11.6 MB | 11.7 MB | 1.00x larger | Not significant |
| fannkuch | 7174.6 kB | 7195.0 kB | 1.00x larger | Not significant |
| float | 16.7 MB | 18.3 MB | 1.10x larger | Not significant |
| go | 9132.4 kB | 9170.4 kB | 1.00x larger | Not significant |
| hexiom | 8311.8 kB | 8372.6 kB | 1.01x larger | Not significant |
| json_dumps | 9406.6 kB | 9413.0 kB | 1.00x larger | Not significant |
| json_loads | 7444.0 kB | 7453.0 kB | 1.00x larger | Not significant |
| logging_format | 11.0 MB | 10.1 MB | 1.08x smaller | Significant (t=17.51) |
| logging_silent | 7651.0 kB | 7706.2 kB | 1.01x larger | Not significant |
| logging_simple | 10.3 MB | 10.4 MB | 1.01x larger | Not significant |
| mako | 13.7 MB | 13.9 MB | 1.02x larger | Not significant |
| meteor_contest | 9474.6 kB | 9512.0 kB | 1.00x larger | Not significant |
| nbody | 7365.4 kB | 7461.4 kB | 1.01x larger | Not significant |
| nqueens | 7471.0 kB | 7487.4 kB | 1.00x larger | Not significant |
| pathlib | 8682.4 kB | 8732.0 kB | 1.01x larger | Not significant |
| pickle | 7935.2 kB | 7942.8 kB | 1.00x larger | Not significant |
| pickle_dict | 7930.6 kB | 7933.2 kB | 1.00x larger | Not significant |
| pickle_list | 7934.2 kB | 7956.6 kB | 1.00x larger | Not significant |
| pickle_pure_python | 7962.4 kB | 7971.2 kB | 1.00x larger | Not significant |
| pidigits | 7396.4 kB | 7435.0 kB | 1.01x larger | Not significant |
| pyflate | 36.9 MB | 37.2 MB | 1.01x larger | Not significant |
| python_startup | 9499.6 kB | 9624.0 kB | 1.01x larger | Not significant |
| python_startup_no_site | 9479.6 kB | 9630.8 kB | 1.02x larger | Not significant |
| raytrace | 8239.0 kB | 8273.0 kB | 1.00x larger | Not significant |
| regex_compile | 8602.2 kB | 8662.6 kB | 1.01x larger | Not significant |
| regex_dna | 15.0 MB | 15.1 MB | 1.01x larger | Not significant |
| regex_effbot | 8054.6 kB | 8094.8 kB | 1.00x larger | Not significant |
| regex_v8 | 13.0 MB | 13.0 MB | 1.00x larger | Not significant |
| richards | 7837.2 kB | 7841.2 kB | 1.00x larger | Not significant |
| scimark_fft | 8037.0 kB | 8118.8 kB | 1.01x larger | Not significant |
| scimark_lu | 8059.2 kB | 8107.2 kB | 1.01x larger | Not significant |
| scimark_monte_carlo | 7968.2 kB | 8020.2 kB | 1.01x larger | Not significant |
| scimark_sor | 7995.0 kB | 8065.0 kB | 1.01x larger | Not significant |
| scimark_sparse_mat_mult | 8512.2 kB | 8549.4 kB | 1.00x larger | Not significant |
| spectral_norm | 7184.4 kB | 7217.8 kB | 1.00x larger | Not significant |
| telco | 7857.2 kB | 7672.2 kB | 1.02x smaller | Significant (t=38.26) |
| unpack_sequence | 8809.6 kB | 8835.8 kB | 1.00x larger | Not significant |
| unpickle | 7943.4 kB | 7965.8 kB | 1.00x larger | Not significant |
| unpickle_list | 7948.6 kB | 7925.6 kB | 1.00x smaller | Not significant |
| unpickle_pure_python | 7922.0 kB | 7955.8 kB | 1.00x larger | Not significant |
| xml_etree_generate | 11.5 MB | 11.7 MB | 1.02x larger | Not significant |
| xml_etree_iterparse | 12.1 MB | 12.0 MB | 1.01x smaller | Not significant |
| xml_etree_parse | 11.6 MB | 11.5 MB | 1.01x smaller | Not significant |
| xml_etree_process | 12.1 MB | 12.5 MB | 1.03x larger | Significant (t=-3.04) |

On Tue, May 25, 2021 at 2:05 PM Guido van Rossum <guido@python.org> wrote:

On Tue., May 25, 2021, 12:58 Guido van Rossum, <guido@python.org> wrote:
This. Having this as an informational PEP that's already marked as Active seems off somehow to me. I guess it feels more "we're doing this" (which I know isn't intended) rather than "this is our plan, what do you all think? All good?"
I don't recall other standards track PEPs that don't also spell out the specification of the proposal in detail.
I also am not aware of a PEP that's proposed restructuring the eval loop like this either. 😉 I'm personally fine with the current level of detail, and with saying that, given the scope, details may shift as things move forward and lessons are learned, with the PEP updated accordingly. But that's just me and I don't know if others agree (hence the reason I'm suggesting this be Standards Track). -Brett

On Tue, May 25, 2021 at 7:56 PM Brett Cannon <brett@python.org> wrote:
Right. I have no power to unilaterally decide that "we're doing this", in the sense of "we're definitely merging this", and neither do Eric and Mark. But given the reactions during and since the Language Summit I had assumed that there isn't much dissent, so that "it's okay to try this" seems a reasonable conclusion. And given *that*, I'm not sure that there's much of a difference between the two positions. But I'm not trying to stifle discussion, and there is plenty of work that we (the three authors) can do before we're at the point of no return. (In fact, even if we were to get called back, all we'd need to do would be to revert some commits -- this has happened before.)
Right. Usually we just discuss ideas to improve the eval loop on bpo or in PRs, occasionally on python-dev, but not in PEPs. Over the years the eval loop has become quite a bit more efficient than the loop I wrote three decades ago, but also much more complex, and there's less and less low-hanging fruit left. In the end the proposal here could make it easier to reason about the performance of the eval loop, because there will be fewer one-off hacks, and instead a more systematic approach.
Sure. Do you have a specific text change in mind (or even just a suggestion about where we should insert some language about details shifting and learning lessons? Are there other things you'd like to see changed in the PEP? Also, can you outline a specific process that you would be comfortable with here, given that we're breaking a certain amount of new ground here process-wise? Or should we just change the PEP type to Standards Track and submit it to the Steering Council for review? -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

Hi Brett, On 26/05/2021 3:56 am, Brett Cannon wrote:
The PEP is a "we're doing this" document. Maybe it shouldn't be a PEP at all? I've changed its status to "draft" for now. I want to document what we are doing as publicly as possible and a PEP seems like a good way to do that. I also want to reiterate that the PEP doesn't propose changing the language, libraries, Python API or C API in any way. It is just information about how we plan to speed up the interpreter.
Suppose it were a standards PEP, what would that mean if it were rejected? Rejection of a PEP is a choice in favor of an alternative, but what is that alternative? You can't simply say the "status quo" as that would implicitly prevent any development at all on the bytecode interpreter. Cheers, Mark. p.s. For those not at the language summit, here's my grand plan for CPython: https://docs.google.com/presentation/d/1_cvQUwO2WWsaySyCmIy9nj9by4JKnkbiPCqt...

On Wed, May 26, 2021 at 4:23 AM Mark Shannon <mark@hotpy.org> wrote:
Thanks!
Sure, but it isn't a minor change either; if it was then you would have done it already. 😉 This is an architectural shift in how the interpreter works, so it isn't a minor thing.
Then we don't do it.
Rejection of a PEP is a choice in favor of an alternative, but what is that alternative?
The status quo.
You can't simply say the "status quo" as that would implicitly prevent any development at all on the bytecode interpreter.
I don't think that logic holds; if *your *approach happened to be rejected in terms of how to change how the interpreter works it doesn't mean that *no* approach would be accepted (and you can tweak this to be about performance or anything else). PEPs are not submitted where it's "pick from these 2 options" or "pick this or you can never touch the topic/idea again". I think this is all a question of process and how do we want to handle these sort of architectural changes that don't necessarily impact APIs (and thus users directly), but do impact all of us from a perspective of maintenance and understanding? Typically we haven't worried about it because quite frankly none of us have had the time to really tackle something this large or it was so speculative it was done outside of the repo (e.g. gilectomy). If people want to stick with informal consent, then great! If people want to have a more formal discussion, that works too! But my point is this falls into an odd gray area in our overall process that I am suggesting we try to figure out. -Brett

I've gotten some questions privately about my responses on this topic and so I wanted to take the opportunity to clarify a couple of things. First, I totally support the technical work that Guido, Eric, and Mark have planned. This whole line of questions is not meant to act as stop energy for this specific project in any way. I apologize if that didn't come across in my emails. Second, this is totally a process question from my POV. The answer in the end might be that a PEP is not even necessary and it's more about sharing a design doc, or maybe it's a PEP as Standards Track, or maybe an Informational PEP is the right thing in the end; I simply don't know and hence my questions on this topic. I'm just trying to figure out how we as a group want to discuss and handle projects of this size going forward in case getting corporate support to start tackling these bigger projects becomes more normal. On Wed, May 26, 2021 at 1:00 PM Brett Cannon <brett@python.org> wrote:

Hi, The gilectomy was mentioned earlier in the thread. This project was created by a core dev who had the permission to push his work directly into the development branch, but it was done on a fork.

Another optimization project example was my experimental "FAT Python" project. The main idea was to specialize functions with some assumptions, and check these assumptions at the function entry. I started in a fork and then wrote 3 PEPs to propose to merge the main changes into the Python development branch:

* PEP 509 -- Add a private version to dict
* PEP 510 -- Specialize functions with guards
* PEP 511 -- API for code transformers

The PEP 509 was accepted since it could be used for other optimizations: Python now uses the dictionary version to optimize LOAD_GLOBAL. The two other PEPs were rejected. Even if I was very disappointed when they were rejected, it seems like it was a wise choice since I was never able to make Python significantly faster (like 2x faster). It was only 10-20% faster on some micro-benchmarks.

I also had the permission to push directly, and I think that it was nice to confront my design and ideas to the community. I also understand that optimizing Python is really hard and it requires a lot of preparation work. It's hard to sell the preparation work since it only introduces regressions and noise without any concrete speedup. All of this work is required to implement the really interesting optimization. Moreover, few people are eager to review a Python fork with deep and large changes in Python internals. The work must be re-done step by step with more "atomic" (small) changes to have a readable Git history.

The Instagram team behind the Cinder project may be interested to review Mark's work before it's merged into the Python development branch. A design PEP would be more convenient than reviewing a concrete implementation. Victor
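A toy illustration of the "specialize under an assumption, check a guard at the function entry" idea (hypothetical helper, not the actual PEP 510 API):

    import builtins

    def specialize(generic, fast, guard):
        # Use the specialized version only while the assumption encoded in
        # `guard` still holds; otherwise fall back to the generic version.
        def wrapper(*args, **kwargs):
            if guard():
                return fast(*args, **kwargs)
            return generic(*args, **kwargs)
        return wrapper

    # Example assumption: the name "len" still refers to the builtin, so a
    # specialized body may rely on its exact semantics.
    _builtin_len = builtins.len
    def len_not_shadowed():
        return (globals().get("len", _builtin_len) is _builtin_len
                and builtins.len is _builtin_len)

PEP 509's private dictionary version tag is what lets the real implementation turn a guard like this into a couple of integer comparisons instead of dictionary lookups.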

I can't speak for others on the Cinder team, but I would definitely be happy to help review, and a design document would be great for that. I'm certainly curious what the design is for the Microsoft implementation and how it differs from our shadow code implementation. Right now I certainly don't have enough information to know what the differences are. And of course the reason to open source Cinder was to help avoid duplication of effort as obviously there's a lot of interest in making Python faster! So if there's interest in our shadow code implementation I'd be happy to work to break it out into more reviewable pieces as well - it's got some pretty natural ways to split it up. On Thu, May 27, 2021 at 2:21 PM Victor Stinner <vstinner@python.org> wrote:

On Wed, May 12, 2021 at 10:48 AM Mark Shannon <mark@hotpy.org> wrote:
I was curious what you meant by "is more suitable to a collaborative, open-source development model," but I didn't see it elaborated on in the PEP. If this is indeed a selling point, it might be worth mentioning that and saying why. --Chris

participants (17)
- Brett Cannon
- Chris Angelico
- Chris Jerdonek
- Christopher Barker
- Dino
- Fabio Zadrozny
- gopinathinchennai01@gmail.com
- Guido van Rossum
- Mark Shannon
- Oscar Benjamin
- Paul Moore
- Stephen J. Turnbull
- Steve Holden
- Steven D'Aprano
- Terry Reedy
- Victor Stinner
- Łukasz Langa