Micro-optimizations by adding special-case bytecodes?

Hi folks,

I was looking at some `dis` output today, and I was wondering if anyone has investigated optimizing Python (slightly) by adding special-case bytecodes for common expressions or statements involving constants? For example, I (and, based on a quick grep of the stdlib, many others) write "x is None" and "x is not None" very often. Or "return True" or "return None" or "return 1" and things like that.

These all expand into two bytecodes, which seems pretty non-optimal (LOAD_CONST + COMPARE_OP or LOAD_CONST + RETURN_VALUE). It seems we could get an easy speedup for these common cases by adding a peephole optimization and some new opcodes (maybe COMPARE_IS_SMALL_CONST and RETURN_SMALL_CONST for these cases).

I'm not proposing to do this yet, as I'd need to benchmark to see how much of a gain (if any) it would amount to, but I'm just wondering if there's any previous work on this kind of thing. Or, if not, any other thoughts before I try it?

Thanks,
Ben
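To make the two-bytecode expansion concrete, here is roughly what `dis` shows for these patterns on a 3.6/3.7-era CPython (a sketch; exact offsets and the COMPARE_OP oparg vary by version):

    import dis

    def check(x):
        return x is None

    def ret_none():
        return None

    dis.dis(check)
    # Roughly:
    #   LOAD_FAST     0 (x)
    #   LOAD_CONST    0 (None)
    #   COMPARE_OP    8 (is)
    #   RETURN_VALUE

    dis.dis(ret_none)
    # Roughly:
    #   LOAD_CONST    0 (None)
    #   RETURN_VALUE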

Python 3.6 added an opcode specifically for dicts with constant keys: https://bugs.python.org/issue27140. Though I guess that's a slightly different case as it's not a peephole-fused opcode.

And as a follow-up, Python 2.7 did add fused opcodes related to conditional jumps: http://bugs.python.org/issue4715.

Hi Ben, On 24/05/17 19:07, Ben Hoyt wrote:
This is exactly what I looked into just over a year ago. As Stephane suggests, I did this by adding new opcodes that the peephole optimizer generated and the interpreter loop understood (but the compiler itself did not need to know anything about these new opcodes, so that makes things much easier).

Adding new opcodes like this at the time wasn't straightforward because of issues with the build process (see this thread: https://mail.python.org/pipermail/python-dev/2015-December/142600.html - it starts out as a question about the bytecode format but ended up with some very useful information on the build process). Note that since that thread, a couple of things have changed - the bytecode is now wordcode, so some of my original questions aren't relevant, and some of the things I had a problem with in the build system are now auto-generated with a new 'make' target. So it _should_ be easier now than it was then.

In terms of the results I got once I had things building and running, I didn't manage to find any particular magic bullets that gave me a significant enough speedup. Perhaps I just didn't pick the right opcode sequences or the right test cases (though what I was trying to do was quite successful in terms of replacing branches-to-RETURN with a single RETURN - LOAD_CONST/RETURN_VALUE became RETURN_CONST, and therefore if the target of an unconditional branch was a RETURN_CONST op, the branch op could be replaced by the RETURN_CONST). I figured that one thing every function or method needs to do is return, so I tried to make that more efficient. I only had two weeks to spend on it though ...

I was trying to do that by avoiding trips around the interpreter loop, as that was historically something that would give speedups. However, with the new computed-goto version of the interpreter I came to the conclusion that it's not as important as it used to be. I was building with gcc, though, and what I *didn't* do was disable the computed-goto code (it's controlled by a #define) to see if my changes improved performance on platforms that can't use it.

I have other opcode sequences that I identified that might be useful to look at further. I didn't (and still don't) have enough bandwidth to *drive* something like this through, but if you want to do that I'd be more than happy to be kept in the loop on what you're doing and can possibly find time to write some code too.

Regards, E.

Interesting -- thanks very much, Erik. Yeah, I wonder if the wordcode will change things too. But interesting to know you got not much of a speedup!

Also, the "x is None" and "x is not None" were the opcodes I started out thinking would be helpful, so I'm interested to try those. I'm wondering if the simplest way to test would be to add the new opcodes and handle them in ceval.c, but then do all the bytecode manipulation in a pure Python peephole optimizer ("bytecode munger"). Then if it worked we could put those optimizations in the compiler or peephole.c.

-Ben

On Wed, May 24, 2017 at 4:14 PM, Erik <python@lucidity.plus.com> wrote:
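A rough sketch of the scanning half of such a pure-Python "bytecode munger" - it only finds the LOAD_CONST(None) + COMPARE_OP pairs the new opcodes would replace, and assumes dis.get_instructions reports the COMPARE_OP argval as 'is'/'is not' (true on 3.6/3.7); actually rewriting the code object and fixing up jump targets is left out:

    import dis

    def find_is_none_pairs(func):
        """Yield (offset, kind) where LOAD_CONST(None) is followed by COMPARE_OP is/is not."""
        instrs = list(dis.get_instructions(func))
        for prev, curr in zip(instrs, instrs[1:]):
            if (prev.opname == "LOAD_CONST" and prev.argval is None
                    and curr.opname == "COMPARE_OP"
                    and curr.argval in ("is", "is not")):
                yield prev.offset, curr.argval

    def example(x, y=None):
        if x is None:
            return 0
        return x if y is not None else -1

    print(list(find_is_none_pairs(example)))  # e.g. [(2, 'is'), (14, 'is not')]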

On 24/05/17 21:56, Ben Hoyt wrote:
But interesting to know you got not much of a speedup!
I think that improvements at the hardware level in terms of parallelizing instruction and data fetching (and branch prediction) in even the cheapest processors these days have largely trivialized the amount of time it takes the interpreter loop to read another opcode and branch to the code that executes it. I think that the answer to speeding things up is better algorithms at a higher level rather than micro-optimizations. But it's still fun to _try_ this sort of thing, isn't it? ;)
There used to be just such a thing in pure Python (but not with _new_ opcodes, obviously) - Skip Montanaro wrote it IIRC. Probably about 1997 or so. I think that may be where the current peephole optimizer originated (the way it works is certainly similar to the Python version I remember experimenting with back then). E.

When testing things like this, as well as testing whether it speeds up your target cases, remember to check that it doesn't slow everything else down due to the increased size of the eval code pushing something out of instruction cache or some such effect. -- Greg

On 24/05/17 11:07, Ben Hoyt wrote:
[snip] Hi Ben, What you are suggesting is an ad-hoc form of what is known as "static superinstructions". Take a look at http://www.complang.tuwien.ac.at/projects/interpreters.html Cheers, Mark.
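To make the term concrete, here is a toy stack-machine sketch (purely illustrative - not CPython's ceval loop) of what fusing two instructions into one "superinstruction" buys: one dispatch instead of two for the same work.

    def run(code, consts, stack):
        """Minimal stack VM: each loop iteration is one instruction dispatch."""
        for op, arg in code:
            if op == "LOAD_CONST":
                stack.append(consts[arg])
            elif op == "COMPARE_IS":
                right, left = stack.pop(), stack.pop()
                stack.append(left is right)
            elif op == "COMPARE_IS_NONE":   # fused "superinstruction"
                stack.append(stack.pop() is None)
        return stack

    # Two dispatches: LOAD_CONST None; COMPARE_IS
    print(run([("LOAD_CONST", 0), ("COMPARE_IS", None)], [None], [42]))  # [False]
    # One dispatch: COMPARE_IS_NONE
    print(run([("COMPARE_IS_NONE", None)], [], [42]))                    # [False]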

Hi Ben,

I am not convinced that combining operations will have a significant impact in terms of performance. Mark Shannon implemented that in his HotPy project.

I proposed a RETURN_NONE opcode to combine LOAD_CONST with RETURN_VALUE. The issue was rejected because I failed to show any speedup. https://bugs.python.org/issue28800

I would be interested to restart/finish my registervm project to use register-based bytecode. It makes it possible to implement more optimisations and reduce the number of instructions. In my experience, fewer instructions = faster code. http://faster-cpython.readthedocs.io/registervm.html Mark's bytecode uses registers but also a stack.

Victor

On 24 May 2017 8:09 PM, "Ben Hoyt" <benhoyt@gmail.com> wrote:
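To illustrate the instruction-count argument: on a 3.6/3.7-era CPython the statement "c = a + b" takes four stack instructions, while a register (three-address) design could in principle use one. The BINARY_ADD_REG opcode below is hypothetical, shown only to suggest the register form; it is not a real CPython opcode.

    import dis

    def f(a, b):
        c = a + b
        return c

    # Stack-based CPython uses four instructions for "c = a + b":
    #   LOAD_FAST a; LOAD_FAST b; BINARY_ADD; STORE_FAST c
    dis.dis(f)

    # A register-based design could express the same statement as a single
    # three-address instruction, e.g.:
    #   BINARY_ADD_REG  c, a, b    (hypothetical opcode, for illustration only)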

Thanks, Victor. That's very helpful. So RETURN_NONE (and probably RETURN_SMALL_CONST) are not worth it, based on your empirical tests. Your patch shows how (relatively) straightforward it is to test out new opcodes.

I'm still optimistic about the value of COMPARE_IS_NONE and COMPARE_IS_NOT_NONE, though. Mainly because I've done a quick expansion of LOAD_CONST(None) + COMPARE_OP and it's quite a bit more code and many more instructions than COMPARE_IS_NONE would be:

LOAD_CONST(None) + COMPARE_OP:

    PyObject *value = ((PyTupleObject *)(consts))->ob_item[oparg];
    value->ob_refcnt++;
    *stack_pointer++ = value;
    FAST_DISPATCH();

    PyObject *right = *--stack_pointer;
    PyObject *left = stack_pointer[-1];
    // cmp_outcome(), presumably inlined
    int r = 0;
    switch (compare_oparg) {
    case PyCmp_IS:     r = (left == right); break;
    case PyCmp_IS_NOT: r = (left != right); break;
    case ...
    }
    PyObject *res = r ? Py_True : Py_False;
    res->ob_refcnt++;
    if (--(left)->ob_refcnt == 0) _Py_Dealloc(left);
    if (--(right)->ob_refcnt == 0) _Py_Dealloc(right);
    stack_pointer[-1] = res;
    if (res == NULL) goto error;
    PREDICT(POP_JUMP_IF_FALSE);
    PREDICT(POP_JUMP_IF_TRUE);
    DISPATCH();

COMPARE_IS_NONE:

    PyObject *left = stack_pointer[-1];  // TOP()
    PyObject *res = (left == Py_None) ? Py_True : Py_False;
    res->ob_refcnt++;
    if (--(left)->ob_refcnt == 0) _Py_Dealloc(left);
    stack_pointer[-1] = res;  // SET_TOP(res)
    PREDICT(POP_JUMP_IF_FALSE);
    PREDICT(POP_JUMP_IF_TRUE);
    DISPATCH();

You don't have to get the const arg, there are fewer increfs/decrefs, you skip a pop, you don't have to test res == NULL (because it's Py_True or Py_False, which are never NULL), and if there are separate COMPARE_IS_NONE and COMPARE_IS_NOT_NONE you don't have to switch on the compare arg (though I'm not sure if that part will be worth it).

For reference, based on a grep, " is None" occurs 2737 times in the CPython source tree, and " is not None" 2010 times. And I know personally I often use them in loops as well as at the start of functions (for mutable default arg handling).

Still, the performance proof will be in the pudding! I might hack these two opcodes together and test it at some point.

-Ben

On Thu, May 25, 2017 at 6:47 AM, Victor Stinner <victor.stinner@gmail.com> wrote:

Hi Ben, for what you're interested in, you might take a look at WPython 1.0 ( https://code.google.com/archive/p/wpython/downloads ) and 1.1 ( https://code.google.com/archive/p/wpython2/downloads ). They cover a lot of optimizations (as you can see from a brief look at the slides): RETURN_CONST and fusing some opcodes for binary operations are only some of them. For this reason, it's also very difficult to micro-benchmark every single change... :-/ Cheers, Cesare

2017-05-25 16:28 GMT+02:00 Ben Hoyt <benhoyt@gmail.com>:

I've now got a patch to add new COMPARE_IS_NONE and COMPARE_IS_NOT_NONE bytecodes to CPython to attempt to speed up the very common "is None" and "is not None" expressions. It's a very simple change overall. Here is my patch: https://gist.github.com/benhoyt/e5ba19afe7b869fd743c1c39fc2afdf8

Note: The intent here is a proof of concept, not a CPython-quality implementation. I've put the code to generate these in compile.c, though it may be better off in peephole.c. Also, I'm not sure two new bytecodes are needed -- it'd be simpler and probably almost as fast to have one bytecode instruction COMPARE_NONE with an argument that determines the "is" vs "is not" part.

Anyway, it looks like a nice improvement on a micro-benchmark, with a 20% speed increase for "is None" and "is not None" expressions using timeit. See results at [1] below. However, I'm not very familiar with the python/performance benchmarks and with my first try at running those I'm getting wildly varying results (most faster, but some tests say 1.7x slower, so I'm pretty sure this is an artifact of how I'm running them). See [2] below for my results. Would someone with more experience running those be able to run the performance tests and post results? Any other thoughts welcome as well.

-Ben

[1] Here are the micro-benchmarks using timeit against the latest master (normal build and also PGO) as well as with my patch:

latest master (e613e6add5f07ff6aad5802924596b631b707d2a, June 27)

./configure; make -j; ../test_none.sh
    x = 1234; x is None     -- 20000000 loops, best of 5: 19.8 nsec per loop
    x = 1234; x is not None -- 10000000 loops, best of 5: 20 nsec per loop
    x = None; x is None     -- 10000000 loops, best of 5: 20.7 nsec per loop
    x = None; x is not None -- 10000000 loops, best of 5: 20.8 nsec per loop
    avg 20.3 nsec per loop

./configure --enable-optimizations; make -j; ../test_none.sh
    x = 1234; x is None     -- 20000000 loops, best of 5: 18.6 nsec per loop
    x = 1234; x is not None -- 20000000 loops, best of 5: 19.2 nsec per loop
    x = None; x is None     -- 10000000 loops, best of 5: 19.5 nsec per loop
    x = None; x is not None -- 20000000 loops, best of 5: 19.4 nsec per loop
    avg 19.2 nsec per loop

is_none_bytecode branch (with my patch)

./configure; make -j; ../test_none.sh
    x = 1234; x is None     -- 20000000 loops, best of 5: 16 nsec per loop
    x = 1234; x is not None -- 20000000 loops, best of 5: 16.4 nsec per loop
    x = None; x is None     -- 20000000 loops, best of 5: 16.6 nsec per loop
    x = None; x is not None -- 20000000 loops, best of 5: 15.5 nsec per loop
    avg 16.1 nsec per loop (21% faster than master)

./configure --enable-optimizations; make -j; ../test_none.sh
    x = 1234; x is None     -- 20000000 loops, best of 5: 15.6 nsec per loop
    x = 1234; x is not None -- 20000000 loops, best of 5: 16.4 nsec per loop
    x = None; x is None     -- 20000000 loops, best of 5: 15.7 nsec per loop
    x = None; x is not None -- 20000000 loops, best of 5: 15.4 nsec per loop
    avg 15.8 nsec per loop (18% faster than master)

[2] Benchmarks comparing master and is_none_bytecode patch (each compiled with --enable-optimizations) using python/performance:

    Benchmark               | master_opt | is_none_bytecode_opt
    ------------------------+------------+-----------------------------
    2to3                    | 617 ms     | 541 ms: 1.14x faster (-12%)
    chameleon               | 19.9 ms    | 18.6 ms: 1.07x faster (-7%)
    crypto_pyaes            | 208 ms     | 201 ms: 1.04x faster (-3%)
    deltablue               | 13.8 ms    | 12.9 ms: 1.07x faster (-7%)
    django_template         | 245 ms     | 233 ms: 1.05x faster (-5%)
    dulwich_log             | 132 ms     | 126 ms: 1.05x faster (-5%)
    fannkuch                | 898 ms     | 849 ms: 1.06x faster (-5%)
    float                   | 204 ms     | 191 ms: 1.07x faster (-7%)
    genshi_text             | 57.5 ms    | 54.8 ms: 1.05x faster (-5%)
    genshi_xml              | 122 ms     | 114 ms: 1.07x faster (-6%)
    hexiom                  | 18.4 ms    | 26.9 ms: 1.46x slower (+46%)
    html5lib                | 155 ms     | 265 ms: 1.72x slower (+72%)
    json_dumps              | 23.0 ms    | 30.6 ms: 1.33x slower (+33%)
    logging_format          | 23.1 us    | 21.6 us: 1.07x faster (-6%)
    logging_simple          | 19.6 us    | 18.1 us: 1.09x faster (-8%)
    mako                    | 35.4 ms    | 33.5 ms: 1.06x faster (-5%)
    meteor_contest          | 177 ms     | 169 ms: 1.04x faster (-4%)
    nqueens                 | 184 ms     | 174 ms: 1.06x faster (-5%)
    pathlib                 | 46.7 ms    | 45.1 ms: 1.03x faster (-3%)
    pickle                  | 19.5 us    | 17.4 us: 1.12x faster (-11%)
    pickle_dict             | 58.6 us    | 49.9 us: 1.17x faster (-15%)
    pickle_list             | 7.45 us    | 6.86 us: 1.09x faster (-8%)
    python_startup          | 28.1 ms    | 31.0 ms: 1.10x slower (+10%)
    regex_compile           | 335 ms     | 353 ms: 1.05x slower (+5%)
    regex_dna               | 230 ms     | 247 ms: 1.07x slower (+7%)
    regex_effbot            | 4.16 ms    | 4.49 ms: 1.08x slower (+8%)
    regex_v8                | 33.0 ms    | 36.2 ms: 1.10x slower (+10%)
    richards                | 120 ms     | 124 ms: 1.03x slower (+3%)
    scimark_fft             | 539 ms     | 593 ms: 1.10x slower (+10%)
    scimark_monte_carlo     | 183 ms     | 172 ms: 1.07x faster (-6%)
    scimark_sparse_mat_mult | 6.42 ms    | 6.54 ms: 1.02x slower (+2%)
    spectral_norm           | 251 ms     | 233 ms: 1.08x faster (-7%)
    sqlalchemy_declarative  | 246 ms     | 229 ms: 1.07x faster (-7%)
    sqlalchemy_imperative   | 48.8 ms    | 45.9 ms: 1.06x faster (-6%)
    sqlite_synth            | 4.56 us    | 4.69 us: 1.03x slower (+3%)
    sympy_integrate         | 31.4 ms    | 33.5 ms: 1.07x slower (+7%)
    telco                   | 9.76 ms    | 9.99 ms: 1.02x slower (+2%)
    unpack_sequence         | 73.8 ns    | 76.9 ns: 1.04x slower (+4%)
    unpickle                | 24.1 us    | 24.8 us: 1.03x slower (+3%)
    unpickle_list           | 6.04 us    | 6.46 us: 1.07x slower (+7%)
    xml_etree_parse         | 223 ms     | 245 ms: 1.10x slower (+10%)
    xml_etree_iterparse     | 175 ms     | 203 ms: 1.16x slower (+16%)
    xml_etree_process       | 144 ms     | 149 ms: 1.03x slower (+3%)

Not significant (15): chaos; go; json_loads; nbody; pickle_pure_python; pidigits; python_startup_no_site; raytrace; scimark_lu; scimark_sor; sympy_expand; sympy_sum; sympy_str; unpickle_pure_python; xml_etree_generate

On Thu, May 25, 2017 at 10:28 AM, Ben Hoyt <benhoyt@gmail.com> wrote:

(Victor wears his benchmark hat.) 2017-06-28 15:50 GMT+02:00 Ben Hoyt <benhoyt@gmail.com>:
Hum, please use perf timeit instead of timeit, it's more reliable. See also: "How to get reproducible benchmark results" http://perf.readthedocs.io/en/latest/run_benchmark.html#how-to-get-reproduct...
FYI you can add the -G option to perf compare_to to sort results by speed (grouping faster & slower). It gives a more readable table. I also like using --min-speed=5 to ignore changes smaller than 5%; it reduces the noise and makes the table more readable. Victor
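For instance, a perf-based version of the is-None micro-benchmark could look something like the sketch below. This is not Ben's test_none.sh (which isn't shown in the thread); the benchmark names are made up, and the import assumes the current pyperf package (in 2017 it was still named "perf").

    # bench_is_none.py
    # CLI equivalent: python3 -m pyperf timeit -s "x = 1234" "x is None"
    import pyperf  # the 2017-era package was named "perf"; use "import perf" there

    runner = pyperf.Runner()
    runner.timeit(name="is_none_int", stmt="x is None", setup="x = 1234")
    runner.timeit(name="is_not_none_int", stmt="x is not None", setup="x = 1234")
    runner.timeit(name="is_none_none", stmt="x is None", setup="x = None")
    runner.timeit(name="is_not_none_none", stmt="x is not None", setup="x = None")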

On Wed, 28 Jun 2017 09:50:10 -0400 Ben Hoyt <benhoyt@gmail.com> wrote:
I don't want to discourage you, but this is unlikely to produce significant speedups on real-world code. The simple reason is that comparing to None is not a bottleneck for any real application -- comparing to None is probably damn fast compared to everything else the real application does. That said, it would be nice if you could get stable benchmark results to validate that theory (or not!) ;-) Regards Antoine.

That's true. However, I don't think that makes small optimizations worthless. For example, the core devs speed up memory allocation or method calls or whatever by a few percent, and consider it worthwhile, because it all adds up.
That said, it would be nice if you could get stable benchmark results to validate that theory (or not!) ;-)
Yep, I'm more than willing to accept that outcome if that's what the results show! :-) -Ben

On 25/05/17 03:47, Victor Stinner wrote:
I don't think that I did ;) HotPy implemented a trace-based optimising interpreter that performed type specialisation and deferred evaluation of objects, with a "dumb" compiler that applied standard optimisations to the resulting traces. It was technologically similar to PyPy (at the time), but designed for ease of engineering.
The "registers" were used to hold values across calls and the like when optimising traces, but I wouldn't use that design if I were to do it again.

participants (9)
- Antoine Pitrou
- Ben Hoyt
- Cesare Di Mauro
- Erik
- Greg Ewing
- Mark Shannon
- Stephane Wirtel
- Victor Stinner
- Xavier Morel