[Cython] SEP 201 draft: Native callable objects
Dag Sverre Seljebotn
d.s.seljebotn at astro.uio.no
Fri Jun 1 16:25:18 CEST 2012
On 06/01/2012 03:49 PM, Dag Sverre Seljebotn wrote:
> On 05/31/2012 10:13 PM, Robert Bradshaw wrote:
>> On Thu, May 31, 2012 at 12:29 PM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no> wrote:
>>> On 05/31/2012 08:50 PM, Robert Bradshaw wrote:
>>>>
>>>> On Thu, May 31, 2012 at 7:04 AM, Dag Sverre Seljebotn
>>>> <d.s.seljebotn at astro.uio.no> wrote:
>>>>>
>>>>> [Discussion on numfocus at googlegroups.com please]
>>>>>
>>>>> I've uploaded a draft-state SEP 201 (previously CEP 1000):
>>>>>
>>>>> https://github.com/numfocus/sep/blob/master/sep201.rst
>>>>>
>>>>> """
>>>>> Many callable objects are simply wrappers around native code. This
>>>>> holds
>>>>> for
>>>>> any Cython function, f2py functions, manually written CPython
>>>>> extensions,
>>>>> Numba, etc.
>>>>>
>>>>> Obviously, when native code calls other native code, it would be
>>>>> nice to
>>>>> skip the significant cost of boxing and unboxing all the arguments.
>>>>> """
>>>>>
>>>>>
>>>>> The thread about this on the Cython list is almost endless:
>>>>>
>>>>> http://thread.gmane.org/gmane.comp.python.cython.devel/13416/focus=13443
>>>>>
>>>>>
>>>>> There was a long discussion on the key-comparison vs. interned-string
>>>>> approach. I've written both up in SEP 201 since it was the major
>>>>> point of
>>>>> contention. There was some benchmarks starting here:
>>>>>
>>>>> http://thread.gmane.org/gmane.comp.python.cython.devel/13416/focus=13443
>>>>>
>>>>>
>>>>> And why provide a table and not a get_function_pointer starting here:
>>>>>
>>>>> http://thread.gmane.org/gmane.comp.python.cython.devel/13416/focus=13443
>>>>>
>>>>>
>>>>> For those who followed that and don't want to read the entire spec,
>>>>> the
>>>>> aspect of flags is new. How do we avoid to duplicate entries/check
>>>>> against
>>>>> two signatures for cases like a GIL-holding caller wanting to call a
>>>>> nogil
>>>>> function? My take: For key-comparison you can compare under a mask,
>>>>> for
>>>>> interned-string we should have additional flags field.
>>>>>
>>>>> The situation is a bit awkward: The Cython list consensus (well, me
>>>>> and
>>>>> Robert Bradshaw) decided on what is "Approach 1" (key-comparison)
>>>>> in SEP
>>>>> 201. I pushed for that.
>>>>>
>>>>> Still, now that a month has passed, I just think key-comparison is too
>>>>> ugly,
>>>>> and that the interning mechanism shouldn't be *that* hard to code up,
>>>>> probably 500 lines of C code if one just requires the GIL in a first
>>>>> iteration, and that keeping the spec simpler is more important.
>>>>>
>>>>> So I'm tentatively proposing Approach 2.
>>>>
>>>>
>>>> I'm still not convinced that a hybrid approach, where signatures below
>>>> some cutoff are compiled down to keys, is not a worthwhile approach.
>>>> This gets around variable-length keys (both the complexity and
>>>> possible runtime costs for long keys) and allows simple libraries to
>>>> produce and consume fast callables without participating in the
>>>> interning mechanism.
>>>
>>> I still think this gives us the "worst of both worlds", all the
>>> disadvantages and none of the advantages.
>>
>> It avoids the one of the primary disadvantage of keys, namely the
>> variable length complexity.
>>
>>> How many simple libraries are there really? Cython on one end, the
>>> magnificently complicated NumPy ufuncs on the other? Thinking big,
>>> perhaps
>>> PyPy and Julia? Cython, PyPy, Julia would all have to deal with long
>>> signatures anyway. And NumPy ufuncs are already complicated so even more
>>> low-level stuff wouldn't hurt.
>>
>> I was thinking of, for example, a differential equation solver written
>> in C, C++, or Fortran that could take a PyNativeCallableTable*
>> directly, primarily avoiding welding this spec to Python.
>
> I'm not sure how real-world that is in the end. But, the size of Cython
> generated code would be kept down for most modules as it wouldn't need
> to bundle an interner.
>
> AND, a problem with interning is spreading the signature strings all
> over memory (in the event you actually need to look at the contents).
> With a smart interner I guess this can be eliminated to some extent, but
> much better if one doesn't have to worry, and if all short signatures
> are keys you don't.
>
> Playing along:
>
> a) It'd be very nice to avoid explicit decoding. I think one should be
> able to cast the key to char[]; this a) avoids having to allocate a
> buffer on the stack to pass to a Decode function, b) let's you inspect
> the table in a debugger easily.
>
> b) Flags are needed in addition to interning; GIL status and exception
> return values do not require exact matches. I think more than 3 bits are
> needed for flags => our minimal padded table entry size is actually 24
> bytes! (And this is OK, my benchmarks weren't affected by 8-byte vs
> 16-byte comparisons, branching is so dominating.)
>
> Now, 16 bits seems about right for flags, so this means we can actually
> for free use 14-char keys (12 for signature data, one for \0, one for
> guard)
>
> That pushes the number of non-interning signatures high enough to make
> it really fit 95% of the use-cases. I feel 6 chars is a little low,
> remember that a "pointer to a double complex" is "&Zd" by itself unless
> we play with encoding.
>
> BUT, it then gets rather complicated to have things work on
> little-endian vs. big-endian though as the guard byte must be in
> different positions. If you want to align the pointer you get this:
>
> typedef struct {
> void *funcptr;
> union {
> union {
> struct {
> uint16_t interned_flags;
> uint16_t padding1;
> uint32_t padding2;
> uint64_t interned_sig;
> };
> struct {
> uint16_t flags;
> char sig[14];
> };
> } big_endian;
>
> union {
> struct {
> uint64_t interned_sig;
> uint32_t padding1;
> uint16_t padding2;
> uint16_t interned_flags;
> };
> struct {
> char sig[14];
> uint16_t flags;
> };
> } little_endian;
> };
> };
>
> (interned_flags and flags is really the same, I just didn't want to mess
> with the struct alignment)
>
> So I think this is *almost* there, but it certainly gets complicated
> because of endianness issues.
>
> Of course, an alternative is to not have the interned_sig be 64-bit
> aligned. Or, play with adapting the string/guard bytes in the middle,
> but that sort of breaks a) above.
OK, now I feel silly.
If we need flags (as I believe we do), and the flag-containing quadword
is being compared anyway, there's no reason at all to play tricks with
aligned pointers and guard bytes.
Simplest approach with a 128-bit compare (which, as I said, doesn't hurt
one bit and may be needed anyway to filter on GIL-ness), is then
struct {
union {
char *interned_sig;
char signature[8]
};
uint64_t flags; // first 8 bits always 0, for terminating \0
void *funcptr;
};
One could also complicate this again to eat a few more flag bits for
signature chars...
Dag
More information about the cython-devel
mailing list