CFFI: better performance when calling a function from address

Hi all, I am JIT-generating functions in my code and calling them by first ffi.cast'ing their addresses to a function pointer. In my microbenchmarks it has pretty much the same call performance as when using cffi ABI mode (dumping the functions to a shared library first) and is around 250ns per call slower than when using API mode. I haven't looked at the generated assembly yet, but I guess pypy has to be more careful, since not all information from API mode is available. So the question is: is there any way to provide the missing information, since I basically know everything about the function behind the pointer that a compiler would know? Cheers, Dimitri.
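[For context, the pattern in question looks roughly like this. It's a minimal sketch that uses libc's ``strlen`` as a stand-in for a JIT-generated function; the variable names and the signature are illustrative, and ``ffi.dlopen(None)`` assumes a Unix-like system.]

```python
import cffi

ffi = cffi.FFI()
ffi.cdef("size_t strlen(const char *);")
C = ffi.dlopen(None)  # ABI mode: open the C library itself (Unix only)

# Pretend this integer came from a JIT: the raw address of the function.
addr = int(ffi.cast("uintptr_t", C.strlen))

# Rebuild a callable function pointer from the bare address.
fptr = ffi.cast("size_t(*)(const char *)", addr)
print(fptr(b"hello"))  # -> 5
```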

Hi Dimitri, On Wed, 26 Sep 2018 at 21:19, Dimitri Vorona via pypy-dev <pypy-dev@python.org> wrote:
In my microbenchmarks it has pretty much the same call performance as when using cffi ABI mode (dumping the functions to a shared library first) and is around 250ns per call slower than when using API mode.
I doubt that these microbenchmarks are relevant. But just in case, I found out that the JIT is producing two extra instructions in the ABI case, if you call ``lib.foobar()``. These two instructions are caused by reading the ``foobar`` method on the ``lib`` object. If you write instead ``foobar()``, with either ``foobar = lib.foobar`` or ``from _x_cffi.lib import foobar`` done earlier, then the speed is exactly the same. A bientôt, Armin.
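[The binding trick Armin suggests can be illustrated on any attribute lookup. This is a plain-Python sketch, not cffi code: ``Lib``/``foobar`` stand in for a cffi ``lib`` object and its function.]

```python
import timeit

class Lib:  # stand-in for a cffi in-line ABI `lib` object
    @staticmethod
    def foobar():
        return 42

lib = Lib()

# Attribute read inside the hot loop:
t_attr = timeit.timeit(lambda: lib.foobar(), number=100_000)

# Hoist the lookup out once, as suggested above:
foobar = lib.foobar
t_bound = timeit.timeit(lambda: foobar(), number=100_000)

# On PyPy, the bound version avoids the two extra instructions
# (reading `foobar` off `lib` and guarding on it) in the loop.
print(t_attr, t_bound)
```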

Hi Carl Friedrich, On Wed, 26 Sep 2018 at 22:28, Carl Friedrich Bolz-Tereick <cfbolz@gmx.de> wrote:
Couldn't that slowness of getattr be fixed by making the lib objects eg use module dicts or something?
If we use the out-of-line API mode then ``lib`` is an RPython object, but if we use the in-line ABI mode then it's a pure Python object. More precisely, it's a singleton instance of a newly created Python class, and the two extra instructions are reading and guard_value'ing the map. It might be possible to rewrite the whole section of pure-Python code that builds the ``lib`` for the in-line ABI mode, but that looks like it would be even slower on CPython. And I don't like the idea of duplicating---or even making any unnecessary changes to---this slightly-fragile-looking logic... Anyway, I'm not sure I understand how just a guard_value on the map of an object can cause a 250 ns slow-down. I'd rather have thought it would cause no measurable difference. Maybe I missed another difference. Maybe the effect is limited to microbenchmarks. Likely another mystery of modern CPUs. A bientôt, Armin.
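[To make the difference concrete: the in-line ABI ``lib`` is roughly equivalent to the following pure-Python construction. This is a heavily simplified sketch, not cffi's actual code; ``make_lib`` and ``FFILibrary``'s contents here are illustrative.]

```python
def make_lib(functions):
    # A fresh class is created per lib; its single instance carries the
    # functions as instance attributes. On PyPy, that instance has a "map"
    # (hidden class) which the JIT must read and guard_value on for each
    # `lib.foobar` lookup it cannot hoist out.
    class FFILibrary:
        pass
    lib = FFILibrary()
    for name, fn in functions.items():
        setattr(lib, name, fn)
    return lib

lib = make_lib({"foobar": lambda: 42})
print(lib.foobar())  # -> 42
```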

Hi guys, thanks for your replies. 250ns sounded like a lot, and apparently it was: I can't reproduce it anymore. Thanks for the confirmation that API and ABI modes should have the same performance. I looked at the jitlog, and the API, ABI, and cast-pointer versions seem to produce exactly the same code (assuming I bind the function to its own variable). Cheers, Dimitri. On Wed, Sep 26, 2018 at 11:05 PM Armin Rigo <armin.rigo@gmail.com> wrote:

Hi again, On Wed, 26 Sep 2018 at 21:19, Dimitri Vorona via pypy-dev <pypy-dev@python.org> wrote:
In my microbenchmarks it has pretty much the same call performance as when using cffi ABI mode (dumping the functions to a shared library first) and is around 250ns per call slower than when using API mode.
I haven't looked at the generated assembly yet, but I guess pypy has to be more careful, since not all information from API mode is available.
Just for completeness, the documentation of CFFI says indeed that the API mode is faster than the ABI mode. That's true mostly on CPython, where the ABI mode always requires using libffi for the calls, which is slow. On PyPy, the JIT has got enough information to do mostly the same thing in the end. (And before the JIT, PyPy uses libffi both for the ABI and the API mode for simplicity.) A bientôt, Armin.
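[For completeness, the API mode Armin refers to builds a real C extension at install time, roughly like this. The module name ``_x_cffi`` and the toy ``foobar`` are illustrative; this sketch needs a C compiler available when ``ffi.compile()`` runs.]

```python
import cffi

ffi = cffi.FFI()
ffi.cdef("int foobar(void);")
# In API mode, real C source is compiled in; calls then go through
# generated C stubs instead of libffi, which is what makes it fast
# on CPython.
ffi.set_source("_x_cffi", """
    static int foobar(void) { return 42; }
""")

if __name__ == "__main__":
    ffi.compile(verbose=False)   # writes and compiles _x_cffi.c
    from _x_cffi import lib
    print(lib.foobar())          # direct call into compiled code
```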

participants (3)
- Armin Rigo
- Carl Friedrich Bolz-Tereick
- Dimitri Vorona