[Python-Dev] Micro-optimizations by adding special-case bytecodes?

Wed Jun 28 09:50:10 EDT 2017

I've now got a patch to add new COMPARE_IS_NONE and
COMPARE_IS_NOT_NONE bytecodes to CPython to attempt to speed up the
very common "is None" and "is not None" expressions. It's a very
simple change overall. Here is my patch:
https://gist.github.com/benhoyt/e5ba19afe7b869fd743c1c39fc2afdf8

Note: The intent here is a proof of concept, not a CPython-quality
implementation. I've put the code to generate these in compile.c,
though it may be better off in peephole.c. Also, I'm not sure two new
bytecodes are needed -- it'd be simpler and probably almost as fast to
have one bytecode instruction COMPARE_NONE with an argument that
determines the "is" vs "is not" part.

Anyway, it looks like a nice improvement on a micro-benchmark, with a
20% speed increase for "is None" and "is not None" expressions using
timeit. See results at [1] below.

However, I'm not very familiar with the python/performance benchmarks
and with my first try at running those I'm getting wildly varying
results (most faster, but some tests say 1.7x slower, so I'm pretty
sure this is an artifact of how I'm running them). See [2] below for
my results. Would someone with more experience running those be able
to run the performance tests and post results?

Any other thoughts welcome as well.

-Ben

[1] Here are the micro-benchmarks using timeit against the latest
master (normal build and also PGO) as well as with my patch:

latest master (e613e6add5f07ff6aad5802924596b631b707d2a, June 27)

./configure
make -j
../test_none.sh
x = 1234; x is None -- 20000000 loops, best of 5: 19.8 nsec per loop
x = 1234; x is not None -- 10000000 loops, best of 5: 20 nsec per loop
x = None; x is None -- 10000000 loops, best of 5: 20.7 nsec per loop
x = None; x is not None -- 10000000 loops, best of 5: 20.8 nsec per loop
avg 20.3 nsec per loop

./configure --enable-optimizations
make -j
../test_none.sh
x = 1234; x is None -- 20000000 loops, best of 5: 18.6 nsec per loop
x = 1234; x is not None -- 20000000 loops, best of 5: 19.2 nsec per loop
x = None; x is None -- 10000000 loops, best of 5: 19.5 nsec per loop
x = None; x is not None -- 20000000 loops, best of 5: 19.4 nsec per loop
avg 19.2 nsec per loop

is_none_bytecode branch (with my patch)

./configure
make -j
../test_none.sh
x = 1234; x is None -- 20000000 loops, best of 5: 16 nsec per loop
x = 1234; x is not None -- 20000000 loops, best of 5: 16.4 nsec per loop
x = None; x is None -- 20000000 loops, best of 5: 16.6 nsec per loop
x = None; x is not None -- 20000000 loops, best of 5: 15.5 nsec per loop
avg 16.1 nsec per loop (21% faster than master)

./configure --enable-optimizations
make -j
../test_none.sh
x = 1234; x is None -- 20000000 loops, best of 5: 15.6 nsec per loop
x = 1234; x is not None -- 20000000 loops, best of 5: 16.4 nsec per loop
x = None; x is None -- 20000000 loops, best of 5: 15.7 nsec per loop
x = None; x is not None -- 20000000 loops, best of 5: 15.4 nsec per loop
avg 15.8 nsec per loop (18% faster than master)

[2] Benchmarks comparing master and is_none_bytecode patch (each
compiled with --enable-optimizations) using python/performance:

+-------------------------+------------+------------------------------+
| Benchmark               | master_opt | is_none_bytecode_opt         |
+=========================+============+==============================+
| 2to3                    | 617 ms     | 541 ms: 1.14x faster (-12%)  |
+-------------------------+------------+------------------------------+
| chameleon               | 19.9 ms    | 18.6 ms: 1.07x faster (-7%)  |
+-------------------------+------------+------------------------------+
| crypto_pyaes            | 208 ms     | 201 ms: 1.04x faster (-3%)   |
+-------------------------+------------+------------------------------+
| deltablue               | 13.8 ms    | 12.9 ms: 1.07x faster (-7%)  |
+-------------------------+------------+------------------------------+
| django_template         | 245 ms     | 233 ms: 1.05x faster (-5%)   |
+-------------------------+------------+------------------------------+
| dulwich_log             | 132 ms     | 126 ms: 1.05x faster (-5%)   |
+-------------------------+------------+------------------------------+
| fannkuch                | 898 ms     | 849 ms: 1.06x faster (-5%)   |
+-------------------------+------------+------------------------------+
| float                   | 204 ms     | 191 ms: 1.07x faster (-7%)   |
+-------------------------+------------+------------------------------+
| genshi_text             | 57.5 ms    | 54.8 ms: 1.05x faster (-5%)  |
+-------------------------+------------+------------------------------+
| genshi_xml              | 122 ms     | 114 ms: 1.07x faster (-6%)   |
+-------------------------+------------+------------------------------+
| hexiom                  | 18.4 ms    | 26.9 ms: 1.46x slower (+46%) |
+-------------------------+------------+------------------------------+
| html5lib                | 155 ms     | 265 ms: 1.72x slower (+72%)  |
+-------------------------+------------+------------------------------+
| json_dumps              | 23.0 ms    | 30.6 ms: 1.33x slower (+33%) |
+-------------------------+------------+------------------------------+
| logging_format          | 23.1 us    | 21.6 us: 1.07x faster (-6%)  |
+-------------------------+------------+------------------------------+
| logging_simple          | 19.6 us    | 18.1 us: 1.09x faster (-8%)  |
+-------------------------+------------+------------------------------+
| mako                    | 35.4 ms    | 33.5 ms: 1.06x faster (-5%)  |
+-------------------------+------------+------------------------------+
| meteor_contest          | 177 ms     | 169 ms: 1.04x faster (-4%)   |
+-------------------------+------------+------------------------------+
| nqueens                 | 184 ms     | 174 ms: 1.06x faster (-5%)   |
+-------------------------+------------+------------------------------+
| pathlib                 | 46.7 ms    | 45.1 ms: 1.03x faster (-3%)  |
+-------------------------+------------+------------------------------+
| pickle                  | 19.5 us    | 17.4 us: 1.12x faster (-11%) |
+-------------------------+------------+------------------------------+
| pickle_dict             | 58.6 us    | 49.9 us: 1.17x faster (-15%) |
+-------------------------+------------+------------------------------+
| pickle_list             | 7.45 us    | 6.86 us: 1.09x faster (-8%)  |
+-------------------------+------------+------------------------------+
| python_startup          | 28.1 ms    | 31.0 ms: 1.10x slower (+10%) |
+-------------------------+------------+------------------------------+
| regex_compile           | 335 ms     | 353 ms: 1.05x slower (+5%)   |
+-------------------------+------------+------------------------------+
| regex_dna               | 230 ms     | 247 ms: 1.07x slower (+7%)   |
+-------------------------+------------+------------------------------+
| regex_effbot            | 4.16 ms    | 4.49 ms: 1.08x slower (+8%)  |
+-------------------------+------------+------------------------------+
| regex_v8                | 33.0 ms    | 36.2 ms: 1.10x slower (+10%) |
+-------------------------+------------+------------------------------+
| richards                | 120 ms     | 124 ms: 1.03x slower (+3%)   |
+-------------------------+------------+------------------------------+
| scimark_fft             | 539 ms     | 593 ms: 1.10x slower (+10%)  |
+-------------------------+------------+------------------------------+
| scimark_monte_carlo     | 183 ms     | 172 ms: 1.07x faster (-6%)   |
+-------------------------+------------+------------------------------+
| scimark_sparse_mat_mult | 6.42 ms    | 6.54 ms: 1.02x slower (+2%)  |
+-------------------------+------------+------------------------------+
| spectral_norm           | 251 ms     | 233 ms: 1.08x faster (-7%)   |
+-------------------------+------------+------------------------------+
| sqlalchemy_declarative  | 246 ms     | 229 ms: 1.07x faster (-7%)   |
+-------------------------+------------+------------------------------+
| sqlalchemy_imperative   | 48.8 ms    | 45.9 ms: 1.06x faster (-6%)  |
+-------------------------+------------+------------------------------+
| sqlite_synth            | 4.56 us    | 4.69 us: 1.03x slower (+3%)  |
+-------------------------+------------+------------------------------+
| sympy_integrate         | 31.4 ms    | 33.5 ms: 1.07x slower (+7%)  |
+-------------------------+------------+------------------------------+
| telco                   | 9.76 ms    | 9.99 ms: 1.02x slower (+2%)  |
+-------------------------+------------+------------------------------+
| unpack_sequence         | 73.8 ns    | 76.9 ns: 1.04x slower (+4%)  |
+-------------------------+------------+------------------------------+
| unpickle                | 24.1 us    | 24.8 us: 1.03x slower (+3%)  |
+-------------------------+------------+------------------------------+
| unpickle_list           | 6.04 us    | 6.46 us: 1.07x slower (+7%)  |
+-------------------------+------------+------------------------------+
| xml_etree_parse         | 223 ms     | 245 ms: 1.10x slower (+10%)  |
+-------------------------+------------+------------------------------+
| xml_etree_iterparse     | 175 ms     | 203 ms: 1.16x slower (+16%)  |
+-------------------------+------------+------------------------------+
| xml_etree_process       | 144 ms     | 149 ms: 1.03x slower (+3%)   |
+-------------------------+------------+------------------------------+

Not significant (15): chaos; go; json_loads; nbody;
pickle_pure_python; pidigits; python_startup_no_site; raytrace;
scimark_lu; scimark_sor; sympy_expand; sympy_sum; sympy_str;
unpickle_pure_python; xml_etree_generate

On Thu, May 25, 2017 at 10:28 AM, Ben Hoyt <benhoyt at gmail.com> wrote:
> Thanks, Victor. That's very helpful. So RETURN_NONE (and probably
> RETURN_SMALL_CONST) are not worth it, based on your empirical tests. Your
> patch shows how (relatively) straight-forward it is to test out new opcodes.
>
> I'm still optimistic about the value of COMPARE_IS_NONE and
> COMPARE_IS_NOT_NONE, though. Mainly because I've done a quick expansion of
> LOAD_CONST(None) + COMPARE_OP and it's quite a bit more code and many more
> instructions than COMPARE_IS_NONE would be:
>
> LOAD_CONST(None)
> COMPARE_OP
>     PyObject *value = ((PyTupleObject *)(consts))->ob_item[oparg];
>     value->ob_refcnt++;
>     *stack_pointer++ = value;
>     FAST_DISPATCH();
>     PyObject *right = *--stack_pointer;
>     PyObject *left = stack_pointer[-1]
>
>     // cmp_outcome(), presumably inlined
>     int r = 0;
>     switch (compare_oparg) {
>     case PyCmp_IS:
>         r = (left == right);
>         break;
>     case PyCmp_IS_NOT:
>         r = (left != right);
>         break;
>     case ...
>     }
>     PyObject *res = r ? Py_True : Py_False;
>     res->ob_refcnt++;
>
>     if (--(left)->ob_refcnt == 0)
>         _Py_Dealloc(left);
>     if (--(right)->ob_refcnt == 0)
>         _Py_Dealloc(right);
>     stack_pointer[-1] = res;
>     if (res == NULL)
>         goto error;
>
>     PREDICT(POP_JUMP_IF_FALSE);
>     PREDICT(POP_JUMP_IF_TRUE);
>     DISPATCH();
>
>
> COMPARE_IS_NONE
>     PyObject* left = stack_pointer[-1];  // TOP()
>     PyObject* res = (left == Py_None) ? Py_True : Py_False;
>     res->ob_refcnt++;
>     if (--(left)->ob_refcnt == 0)
>         _Py_Dealloc(left);
>     stack_pointer[-1] = res;  // SET_TOP(res)
>     PREDICT(POP_JUMP_IF_FALSE);
>     PREDICT(POP_JUMP_IF_TRUE);
>     DISPATCH();
>
> You don't have to get the const arg, there are fewer increfs/decrefs, you
> skip a pop, you don't have to test res==NULL (because it's Py_True or
> Py_False, which are never NULL), and if there are separate COMPARE_IS_NONE
> and COMPARE_IS_NOT_NONE you don't have to switch on the compare arg (though
> I'm not sure if that part will be worth it).
>
> For reference, based on a grep, " is None" occurs 2737 times in the CPython
> source tree, and " is not None" 2010 times. And I know personally I often
> use them in loops as well is at the start of functions (for mutable default
> arg handling).
>
> Still, the performance proof will be in the pudding! I might hack these two
> opcodes together and test it at some point.
>
> -Ben
>
> On Thu, May 25, 2017 at 6:47 AM, Victor Stinner <victor.stinner at gmail.com>
> wrote:
>>
>> Hi Ben,
>>
>> I am not convinced that combining operations will have a significant
>> impact in term of performance. Mark Shanon implemented that in his HotPy
>> project.
>>
>> I proposed a RETURN_NONE opcode to combine LOAD_CONST with RETURN_VALUE.
>> The issue was rejected because I failed to show any speedup.
>>
>> https://bugs.python.org/issue28800
>>
>> I would be interested to restart/finish my registervm project to use
>> register-based bytecode. It allows to implement more optmisations and reduce
>> the number of instructions. In my experience, less instructions = faster
>> code.
>>
>> http://faster-cpython.readthedocs.io/registervm.html
>>
>> Mark's bytecode uses registers but also a stack.
>>
>> Victor
>>
>> Le 24 mai 2017 8:09 PM, "Ben Hoyt" <benhoyt at gmail.com> a écrit :
>>>
>>> Hi folks,
>>>
>>> I was looking at some `dis` output today, and I was wondering if anyone
>>> has investigated optimizing Python (slightly) by adding special-case
>>> bytecodes for common expressions or statements involving constants?
>>>
>>> For example, I (and, based on a quick grep of the stdlib, many others)
>>> write "x is None" and "x is not None" very often. Or "return True" or
>>> "return None" or "return 1" and things like that. These all expand into two
>>> bytecodes, which seems pretty non-optimal (LOAD_CONST + COMPARE_OP or
>>> LOAD_CONST + RETURN_VALUE). It seems we could get an easy speedup for these
>>> common cases by adding a peephole optimization and some new opcodes (maybe
>>> COMPARE_IS_SMALL_CONST and RETURN_SMALL_CONST for these cases).
>>>
>>> I'm not proposing to do this yet, as I'd need to benchmark to see how
>>> much of a gain (if any) it would amount to, but I'm just wondering if
>>> there's any previous work on this kind of thing. Or, if not, any other
>>> thoughts before I try it?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> _______________________________________________
>>> Python-Dev mailing list
>>> Python-Dev at python.org
>>> https://mail.python.org/mailman/listinfo/python-dev
>>> Unsubscribe:
>>> https://mail.python.org/mailman/options/python-dev/victor.stinner%40gmail.com
>>>
>