[Python-Dev] Micro-benchmarks for PEP 580
songofacandy at gmail.com
Wed Jul 11 03:53:58 EDT 2018
First of all, please don't be so defensive.
I'm just saying that I need an example target application; I'm not against PEP 580.
Actually, I lean toward PEP 580 rather than PEP 576, although I wonder whether
some parts of PEP 580 could be simplified or postponed.
But PEP 580 is very complicated. I need enough evidence of what
PEP 580 provides before voting for it.
I know Python is important for data scientists, and Cython is important
for them too.
But I don't have any example target applications, because I'm not a data
scientist and I don't use Jupyter, numpy, etc.
Python's performance test suite doesn't contain such applications either,
so we can't measure or estimate the benefits.
That's why I have requested a real-world sample application again and again.
In my experience, I need Cython to make hotspots faster. The calling overhead
is much smaller than the hotspot itself. I haven't focused on inter-extension
calling performance because `cdef` or `cpdef` is enough.
So I really don't have any target application whose bottleneck is
Cython's calling performance.
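To illustrate the point above, here is a rough pure-Python sketch (the function
names and numbers are mine, not from this thread) of why per-call overhead
rarely matters when the callee itself is the hotspot:

```python
import timeit

def hotspot(n):
    # Simulated numeric hotspot: the kind of inner loop one would
    # typically move into Cython with `cdef`/`cpdef`.
    total = 0
    for i in range(n):
        total += i * i
    return total

def trivial():
    # Near-empty function: its cost is almost pure call overhead.
    return None

# Same number of calls for both; the hotspot's own work dominates,
# so shaving the per-call overhead changes the total very little.
call_overhead = timeit.timeit(trivial, number=100_000)
hotspot_time = timeit.timeit(lambda: hotspot(1_000), number=100_000)
print(f"call overhead: {call_overhead:.3f}s, hotspot: {hotspot_time:.3f}s")
```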
> >>> Currently, we create a temporary long object to pass the argument.
> >>> If there were a protocol for exposing the format used by PyArg_Parse*, we
> >>> could bypass the temporary Python object and call myfunc_impl directly.
> Note that this is not fast at all. It actually has to parse the format
> description at runtime. For really fast calls, this should be avoided, and
> it can be avoided by using a str object for the signature description and
> interning it. That relies on signature normalisation, which requires a
> proper formal specification of C/C++ signature strings, which ... is pretty
> much the can of worms that Antoine mentioned.
Please don't start a discussion about details.
This is just an example of a possible future optimization.
(And I'm happy to hear that Cython will be tackling this.)
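As a toy illustration of the interning idea from the quote above (all names
here are hypothetical; this is not a real CPython or Cython API): once two
signature description strings are normalised and interned, matching them is an
O(1) identity check instead of a character-by-character parse at call time.

```python
import sys

def normalize_signature(sig: str) -> str:
    # Hypothetical normalisation step: collapse whitespace so that
    # equivalent spellings of a C signature map to one canonical string,
    # then intern it so equal signatures share one object.
    return sys.intern(" ".join(sig.split()))

sig_a = normalize_signature("double (double,  double)")
sig_b = normalize_signature("double (double, double)")

# Identity comparison replaces runtime parsing of the format string.
assert sig_a is sig_b
```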
> > If my idea has a 50% gain and current PEP 580 has only a 5% gain,
> > why should we accept PEP 580?
> Because PEP 580 is also meant as a preparation for a fast C-to-C call
> interface in Python.
> Unpacking C callables is quite an involved protocol, and we have been
> pushing the idea around and away in the Cython project for some seven
> years. It's about time to consider it more seriously now, and there are
> plans to implement it on top of PEP 580 (which is also mentioned in the PEP).
I want to see it before accepting PEP 580.
> > And PEP 576 seems a much simpler and more straightforward way to
> > expose FASTCALL.
> Except that it does not get us one step forward on the path towards what
> you proposed. So, why would *you* prefer it over PEP 580?
I prefer PEP 580!
I just don't have enough rationale to accept PEP 580's complexity.
>> But I have a worry about it: if we do it for all functions, it makes the
>> Python binary fatter and consumes more CPU cache. Once the CPU cache
>> starts thrashing, application performance degrades quickly.
> Now, I'd like to see benchmark numbers for that before I believe it. Macro
> benchmarks, not micro benchmarks! *wink*
Yes, when I try inlining argument parsing or other optimizations with
significant memory overhead, I'll try a macro benchmark of cache efficiency.
But for now, I'm working on making Python's memory footprint smaller.
INADA Naoki <songofacandy at gmail.com>