New GitHub issue #123500 from neonene:<br>

<hr>

<pre>

# Bug report

### Bug description:

Some callables in C are implemented with the `METH_METHOD|METH_FASTCALL` signature.  They can be 5%-15% slower than equivalents that use only `METH_FASTCALL` (or `METH_O`) and call `PyType_GetModuleByDef` to get the module.

For example, I measured the difference on Windows PGO builds by duplicating functions:

* `CDataType_from_buffer_copy()` in `_ctypes.c`, which is not called during PGO profiling:

  ```py

  from timeit import timeit

  setup = """if 1:

      import ctypes

      buf = bytearray(16)

      cls = ctypes.c_char * len(buf)

  """

  # with a warmup

  for _ in range(2):

      # METH_METHOD|METH_FASTCALL (as-is)

      r0 = timeit(s0 := f'cls.from_buffer_copy (buf)', setup)

      # METH_FASTCALL (no `defining_class`) + PyType_GetModuleByDef

      r1 = timeit(s1 := f'cls.from_buffer_copy1(buf)', setup)

  print(s0, r0, 1 + (1 - r0 / r0))

  print(s1, r1, 1 + (1 - r1 / r0))

  ```

  ```py

  cls.from_buffer_copy (buf) 0.15552800190635024 1.0

  cls.from_buffer_copy1(buf) 0.13187471489945893 1.1520837837364741

  ```

* `dec_mpd_qquantize()` in `_decimal.c` profiled with 6800 calls (unfair?):

  ```py

  # legacy (as-is)

  d1.quantize (d2) 0.1694609627971658 1.0

  # METH_METHOD|METH_FASTCALL (`defining_class`) + _PyType_GetModuleState

  d1.quantize1(d2) 0.1408861404022900 1.168621857938327

  # METH_FASTCALL (no `defining_class`) + PyType_GetModuleByDef

  d1.quantize2(d2) 0.1258157708973158 1.257553074049807

  ```

  <details><summary>Script (expand)</summary>

  ```py

  from timeit import timeit

  setup = """if 1:

      from _decimal import Decimal

      d1,d2 = Decimal(1.414), Decimal('0.01')

  """

  for _ in range(2):

      r0 = timeit(s0 := f'd1.quantize (d2)', setup)

      r1 = timeit(s1 := f'd1.quantize1(d2)', setup)

      r2 = timeit(s2 := f'd1.quantize2(d2)', setup)

  print(s0, r0, 1 + (1 - r0 / r0))

  print(s1, r1, 1 + (1 - r1 / r0))

  print(s2, r2, 1 + (1 - r2 / r0))

  ```

  </details>
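
The third column in the results above is `1 + (1 - r / r0)`, i.e. 1 plus the fraction of the baseline time that was saved. A minimal sketch that recomputes it from the reported `from_buffer_copy` timings; the helper name `relative_gain` is hypothetical and used only for illustration:

```python
def relative_gain(r: float, r0: float) -> float:
    """Return 1 + (1 - r / r0), the third column printed by the scripts above."""
    return 1 + (1 - r / r0)


# Timings copied from the from_buffer_copy results above.
r0 = 0.15552800190635024  # METH_METHOD|METH_FASTCALL (as-is)
r1 = 0.13187471489945893  # METH_FASTCALL + PyType_GetModuleByDef

gain = relative_gain(r1, r0)  # 1 + fraction of baseline time saved
speedup = r0 / r1             # conventional speedup ratio, for comparison

print(f"{gain:.4f} {speedup:.4f}")
```

This figure is not the same as the conventional speedup ratio `r0 / r`, which is slightly larger for the same pair of timings.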

Observations:

* The number of arguments had little to do with this.

* The gaps seem to be consistent as long as the compared functions are equally exercised (or equally unexercised) during profiling.

* The same goes for non-PGO builds and built-in modules (e.g. `_sre`), where the impact may be less significant.

### CPython versions tested on:

CPython main branch

### Operating systems tested on:

Windows

</pre>

<hr>

<a href="https://github.com/python/cpython/issues/123500">View on GitHub</a>

<p>Labels: type-bug</p>

<p>Assignee: </p>