[Cython] CEP1000: Native dispatch through callables

mark florisson markflorisson88 at gmail.com
Sat Apr 14 23:00:26 CEST 2012


On 14 April 2012 20:08, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
> On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
>>
>> Travis Oliphant recently raised the issue on the NumPy list of what
>> mechanisms to use to box native functions produced by his Numba so that
>> SciPy functions can call them, e.g. (I'm making the numba part up):
>
> <snip>
>
> This thread is turning into one of those big ones...
>
> But I think it is really worth it in the end; I'm getting excited about the
> possibility down the road of importing functions using normal Python
> mechanisms and still getting fast calls.
>
> Anyway, to organize discussion I've tried to revamp the CEP and describe
> both the intern-way and the strcmp-way.
>
> The wiki appears to be down, so I'll post it below...
>
> Dag
>
> = CEP 1000: Convention for native dispatches through Python callables =
>
> Many callable objects are simply wrappers around native code.  This
> holds for any Cython function, f2py functions, manually
> written CPython extensions, Numba, etc.
>
> Obviously, when native code calls other native code, it would be
> nice to skip the significant cost of boxing and unboxing all the arguments.
> Early binding at compile-time is only possible
> between different Cython modules, not between all the tools
> listed above.
>
> [[enhancements/nativecall|CEP 523]] deals with Cython-specific aspects
> (and is out-of-date w.r.t. this CEP); this CEP is intended to be about
> a cross-project convention only. If successful, this CEP may be
> proposed as a PEP in a modified form.
>
> Motivating example (looking a year or two into the future):
>
> {{{
> @numba
> def f(x): return 2 * x
>
> @cython.inline
> def g(x : cython.double): return 3 * x
>
> from fortranmod import h
>
> print f(3)
> print g(3)
> print h(3)
> print scipy.integrate.quad(f, 0.2, 3) # fast callback!
> print scipy.integrate.quad(g, 0.2, 3) # fast callback!
> print scipy.integrate.quad(h, 0.2, 3) # fast callback!
>
> }}}
>
> == The native-call slot ==
>
> We need a ''fast'' way to probe whether a callable object supports
> this CEP.  Other mechanisms, such as an attribute in a dict, are too
> slow for many purposes (quoting robertwb: "We're trying to get a 300ns
> dispatch down to 10ns; you do not want a 50ns dict lookup"). (Obviously,
> if you call a callable in a loop you can fetch the pointer outside
> of the loop. But in particular if this becomes a language feature
> in Cython it will be used in all sorts of places.)
>
> So we hack another type slot into existing and future CPython
> implementations in the following way: This CEP provides a C header
> that, for all Python versions, defines a macro
> {{{Py_TPFLAGS_UNOFFICIAL_EXTRAS}}} for a free bit in
> {{{tp_flags}}} in the {{{PyTypeObject}}}.
>
> If that flag is set, then we extend {{{PyTypeObject}}}
> as follows:
> {{{
> typedef struct {
>    PyTypeObject tp_main;
>    size_t tp_unofficial_flags;
>    size_t tp_nativecall_offset;
> } PyUnofficialTypeObject;
> }}}
>
> {{{tp_unofficial_flags}}} is unused and should be all 0 for the time
> being, but can be used later to indicate features beyond this CEP.
>
> If {{{tp_nativecall_offset != 0}}}, this CEP is supported, and
> the information for doing a native dispatch on a callable {{{obj}}}
> is located at
> {{{
> (char*)obj + ((PyUnofficialTypeObject*)obj->ob_type)->tp_nativecall_offset;
> }}}
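>
> For illustration, a caller's probe could then look roughly like this
> (a sketch only, on top of the header described above; error handling
> and the shape of the data behind the offset are left open):
> {{{
> #include <Python.h>
> /* plus the CEP-provided header defining Py_TPFLAGS_UNOFFICIAL_EXTRAS
>    and PyUnofficialTypeObject */
>
> /* Return a pointer to the native-call information, or NULL if the
>    object's type does not support this CEP. */
> static void *
> get_nativecall_info(PyObject *obj)
> {
>     PyTypeObject *tp = Py_TYPE(obj);
>     PyUnofficialTypeObject *utp;
>
>     if (!(tp->tp_flags & Py_TPFLAGS_UNOFFICIAL_EXTRAS))
>         return NULL;
>     utp = (PyUnofficialTypeObject *)tp;
>     if (utp->tp_nativecall_offset == 0)
>         return NULL;
>     return (char *)obj + utp->tp_nativecall_offset;
> }
> }}}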
>
> === GIL-less access ===
>
> It is OK to access the native-call table without holding the GIL. This
> should of course only be used to call functions that state in their
> signature that they don't need the GIL.
>
> This is important for JITted callables that would like to rewrite their
> table as more specializations get added; if one needs to reallocate
> the table, the old table must linger long enough that all
> threads that are currently accessing it are done with it.
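>
> A rough sketch of what that means for a JIT (not part of the CEP; the
> names are made up, and the deferred-free policy here is simply "never
> free"):
> {{{
> #include <stdlib.h>
> #include <string.h>
>
> /* Hypothetical: the callable keeps a pointer to its current
>    descriptor table at tp_nativecall_offset.  To add a specialization,
>    build a complete new table and publish it with a single pointer
>    store; the old table is not freed right away, so a thread that is
>    reading it without the GIL stays safe. */
> static int
> publish_new_table(void *volatile *slot, const void *data, size_t size)
> {
>     void *new_table = malloc(size);
>     if (new_table == NULL)
>         return -1;
>     memcpy(new_table, data, size);
>     *slot = new_table;  /* single store; must be atomic on the platform */
>     /* The previous table is leaked (or put on a deferred-free list);
>        freeing it immediately could break a concurrent GIL-less reader. */
>     return 0;
> }
> }}}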
>
> == Native dispatch descriptor ==
>
> The final format for the descriptor is not agreed upon yet; this sums
> up the major alternatives.
>
> The descriptor should be a list of specializations/overloads, each
> described by a function pointer and a signature specification
> string, such as "id)i" for {{{int f(int, double)}}}.
>
> The way it is stored must cater for two cases; first, when the caller
> expects one or more hard-coded signatures:
> {{{
> if (obj has signature "id)i") {
>    call;
> } else if (obj has signature "if)i") {
>    call with promoted second argument;
> } else {
>    box all arguments;
>    PyObject_Call;
> }
> }}}

There may be a lot of promotion/demotion combinations (you likely only
want the former), especially for multiple arguments, so perhaps it
makes sense to limit ourselves a bit. For instance, for numeric scalar
argument types we could restrict ourselves to long (and the unsigned
counterparts), double and double complex. So char, short and int
scalars would be promoted to long, float to double and float complex
to double complex. Anything bigger, like long long, would be matched
exactly. Promotions (and the associated demotions in the callee, if
necessary) should be fairly cheap compared to checking all
combinations or going through the Python layer.
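
Concretely, the caller side might then look something like this
(sketch only; the signature "ld)d" and the names are made up):

{{{
/* The caller has a C int and a float and wants a double back.  With
   the restricted set of scalar signatures it only has to look for the
   "promoted" signature "ld)d" (long, double -> double) and cast its
   arguments, instead of probing "if)d", "id)d", "ld)d", ...
   separately. */
typedef double (*ld_d_func)(long, double);

static double
call_with_promotion(ld_d_func fp, int a, float b)
{
    return fp((long)a, (double)b);  /* cheap promotions at the call site */
}
}}}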

> The second is when a call stack is built dynamically while parsing the
> string. Since this has higher overhead anyway, optimizing for the first
> case makes sense.
>
> === Approach 1: Interning/run-time allocated IDs ===
>
>
> 1A: Let each overload have a struct
> {{{
> struct {
>    size_t signature_id;
>    char *signature;
>    void *func_ptr;
> };
> }}}
> Within each process run, there is a 1:1 correspondence between
> {{{signature}}} and {{{signature_id}}}. The {{{signature_id}}} is
> allocated by some central registry.
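>
> A caller-side sketch for this variant (illustrative only; how the
> needle ID is obtained from the central registry, and how the entry
> count is found, are exactly the open points discussed below):
> {{{
> typedef struct {
>     size_t signature_id;
>     char *signature;
>     void *func_ptr;
> } nativecall_entry;
>
> /* Cached at module init, e.g. sig_id_id_i = intern_signature("id)i");
>    intern_signature is the hypothetical registry call. */
> static size_t sig_id_id_i;
>
> static void *
> find_id_i(const nativecall_entry *entries, size_t n)
> {
>     size_t i;
>     for (i = 0; i < n; i++) {
>         if (entries[i].signature_id == sig_id_id_i)
>             return entries[i].func_ptr;
>     }
>     return NULL;   /* caller falls back to boxing + PyObject_Call */
> }
> }}}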
>
> 1B: Intern the string instead:
> {{{
> struct {
>    char *signature; /* pointer must come from the central registry */
>    void *func_ptr;
> };
> }}}
> However this is '''not''' trivial: since signature strings can
> be allocated on the heap (e.g., a JIT would do this), interned strings
> must be memory managed and reference counted. This could be done by
> having each object pass the signature string to the interning machinery
> '''both''' when incref-ing and when decref-ing it.
> Using Python {{{bytes}}} objects is another option.
>
> ==== Discussion ====
>
> '''The cost of comparing a signature''': comparing a global variable
> (the needle) to a value that is guaranteed to already be in cache
> (the candidate match).
>
> '''Pros:'''
>
>  * Conceptually simple struct format.
>
> '''Cons:'''
>
>  * Requires a registry for interning strings. This must be
>   "handshaked" between the implementors of this CEP (probably by a
>   "first to get at {{{sys.modules["_nativecall"]}}} sticks it there"
>   rule), as we can't ship a common dependency library for this CEP.
>   One possible shape of the handshake is sketched below.
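>
> For illustration, that handshake could look roughly like this (none
> of these names are part of the CEP; the capsule/attribute layout and
> error handling are just one possibility):
> {{{
> #include <Python.h>
>
> typedef size_t (*intern_signature_t)(const char *sig);
>
> /* This module's own implementation, defined elsewhere; only used if
>    we are the first to register. */
> size_t my_intern_impl(const char *sig);
>
> /* Return the process-wide interning function: reuse the one already
>    published in sys.modules["_nativecall"], or publish ours. */
> static intern_signature_t
> get_intern_func(void)
> {
>     PyObject *mod, *caps;
>
>     mod = PyImport_AddModule("_nativecall");           /* borrowed */
>     if (mod == NULL)
>         return NULL;
>     caps = PyObject_GetAttrString(mod, "intern_v1");
>     if (caps != NULL) {
>         intern_signature_t f = (intern_signature_t)
>             PyCapsule_GetPointer(caps, "_nativecall.intern_v1");
>         Py_DECREF(caps);
>         return f;
>     }
>     PyErr_Clear();                     /* nobody registered it yet */
>     caps = PyCapsule_New((void *)my_intern_impl,
>                          "_nativecall.intern_v1", NULL);
>     if (caps == NULL)
>         return NULL;
>     if (PyModule_AddObject(mod, "intern_v1", caps) < 0) {
>         Py_DECREF(caps);
>         return NULL;
>     }
>     return my_intern_impl;
> }
> }}}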
>
> === Approach 2: Efficient strcmp of verbatim signatures ===
>
> The idea is to store the full signatures and the function pointers together
> in the same memory area, but still have some structure to allow for quick
> scanning through the list.
>
> Each entry has the structure {{{[signature_string, funcptr]}}}
> where:
>
>  * The signature string has variable length, but the length is a
>   multiple of 8 bytes on all platforms. The {{{funcptr}}} is always
>   8 bytes (it is padded on 32-bit systems).
>
>  * The total size of the entry should be a multiple of 16 bytes
>   (i.e., the signature data should be 8 bytes, or 24 bytes, or...)
>
>  * All but the first chunk of signature data should start with a
>   continuation character "-", i.e. a really long signature string
>   could be {{{"iiiidddd-iiidddd-iiidddd-)d"}}}. That is, a "-" is
>   inserted at every position in the string divisible by 8, except the
>   first.
>
> The point is that if you know a signature, you can quickly scan
> through the binary blob for it in 128-bit increments,
> without worrying about the variable-size nature of each entry.  The
> rules above protect against spurious matches.
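>
> As a rough sketch, matching one short hard-coded signature could look
> like this (assuming, which is not yet specified, that the blob ends
> with an all-zero 8-byte chunk):
> {{{
> #include <stdint.h>
> #include <string.h>
>
> /* Find the entry for "id)i" (it fits in one NUL-padded 8-byte chunk)
>    and return its function pointer, or NULL if it is absent. */
> static void *
> find_id_i(const char *table)
> {
>     uint64_t needle, chunk, fp_bits;
>     const char *p = table;
>
>     memcpy(&needle, "id)i\0\0\0\0", 8);
>     for (;;) {
>         memcpy(&chunk, p, 8);
>         if (chunk == 0)
>             return NULL;               /* assumed end-of-table marker */
>         if (chunk == needle) {
>             memcpy(&fp_bits, p + 8, 8);    /* funcptr padded to 8 bytes;
>                                               assumes it occupies the
>                                               low-order bytes */
>             return (void *)(uintptr_t)fp_bits;
>         }
>         p += 16;   /* entries are multiples of 16 bytes, and chunks at
>                       these positions either start an entry or begin
>                       with '-', so this cannot give a false match */
>     }
> }
> }}}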
>
> ==== Optional: Encoding ====
>
> The strcmp approach can be made efficient for larger signatures by
> using a more compact encoding than ASCII. E.g., an encoding could
> use 4 bits for the 12 most common symbols and 8 bits
> for 64 symbols (for a total of 76 symbols), of which some could be
> letter combinations ("Zd", "T{"). This should be reasonably simple
> to encode and decode.
>
> The CEP should provide C routines in a header file to work with the
> signatures. Callers that wish to parse the format string and build a
> call stack on the fly should probably work with the encoded
> representation.
>
> ==== Discussion ====
>
> '''The cost of comparing a signature''': For the vast majority of
> functions, the cost is comparing a 64-bit number stored in the CPU
> instruction stream (needle) to a value that is guaranteed to already
> be in cache (candidate match).
>
> '''Pros:'''
>
>  * Readability-wise, one can use the C switch statement to dispatch
>
>  * "Stateless data", for compiled code it does not require any
>   run-time initialization like interning does
>
>  * One less pointer-dereference in the common case of a short
>   signature
>
> '''Cons:'''
>
>  * Long signatures will require more than 8 bytes to store and could
>   thus be more expensive than interned strings
>
>  * Format looks uglier in the form of literals in C source code
>
>
> == Signature strings ==
>
> Example: The function
> {{{
> int f(double x, float y);
> }}}
> would have the signature string {{{"df)i"}}} (or, to save space,
> {{{"idf"}}}).
>
> Fields would follow the PEP 3118 extensions of the struct-module format
> string, but with some modifications:
>
>  * The format should be canonical and suitable for {{{strcmp}}}-like
>   comparison: no whitespace, no field names (TBD: what else?)

I think alignment is also a troublemaker. Maybe we should allow '@'
(which cannot appear in the signature string itself but would be the
default, i.e. native size, alignment and byte order) and '^', meaning
unaligned native size and byte order (to be used for packed structs).

>  * TBD: Information about GIL requirements (nogil, with gil?), how
>   exceptions are reported

Maybe that could be a separate list, to be consulted mostly for
explicit casts (I think PyErr_Occurred() would be the default for
non-object return types).

>  * TBD: Support for Cython-specific constructs like memoryview slices
>   (so that arrays with strides and shape can be passed faster than
>   passing an {{{"O"}}}).

Definitely, maybe something simple like M{1f}, for a 1D memoryview
slice of floats.


