[Cython] New early-binding concept [was: CEP1000]

Thu Apr 19 12:43:15 CEST 2012

On 04/19/2012 10:35 AM, Vitja Makarov wrote:
> 2012/4/19 Dag Sverre Seljebotn<d.s.seljebotn at astro.uio.no>:
>> On 04/19/2012 08:41 AM, Stefan Behnel wrote:
>>>
>>> Dag Sverre Seljebotn, 18.04.2012 23:35:
>>>>
>>>> from numpy import sqrt, sin
>>>>
>>>> cdef double f(double x):
>>>>      return sqrt(x * x) # or sin(x * x)
>>>>
>>>> Of course, here one could get the pointer in the module at import time.
>>>
>>>
>>> That optimisation would actually be very worthwhile all by itself. I mean,
>>> we know what signatures we need for globally imported functions throughout
>>> the module, so we can reduce the call to a single jump through a function
>>> pointer (although likely with a preceding NULL check, which the branch
>>> prediction would be happy to give us for free). At least as long as sqrt
>>> is
>>> not being reassigned, but that should hit the 99% case.
>>>
>>>
>>>> However, here:
>>>>
>>>> from numpy import sqrt
>>
>>
>> Correction: "import numpy as np"
>>
>>>>
>>>> cdef double f(double x):
>>>>      return np.sqrt(x * x) # or np.sin(x * x)
>>>>
>>>> the __getattr__ on np sure is larger than any effect we discuss.
>>>
>>>
>>> Yes, that would have to stay a .pxd case, I guess.
>>
>>
>> How about this mini-CEP:
>>
>> Modules are allowed to specify __nomonkey__ (or __const__, or
>> __notreassigned__), a list of strings naming module-level variables where
>> "we don't hold you responsible if you assume no monkey-patching of these".
>>
>> When doing "import numpy as np", then (assuming "np" is never reassigned in
>> the module), at import time we check all names looked up from it in
>> __nomonkey__, and if so treat them as "from numpy import sqrt as 'np.sqrt'",
>> i.e. the "np." is just a namespace mechanism.
>>
>> Needs a bit more work, it ignores the possibility that others could
>> monkey-patch "np" in the Cython module.
>>
>> Problem with .pxd is that currently you need to pick one overload (np.sqrt
>> works for n-dimensional arrays too, or takes a list and returns an array).
>> And even after adding 3-4 language features to Cython to make this work,
>> you're stuck with having to reimplement parts of NumPy in the pxd files just
>> so that you can early bind from Cython.
>>
>
> Sorry, I'm a bit late.
>
> When should __nomonkey__ be checked at compile time or at import time?

At import time. At compile time we generate one (potential) function 
pointer per call-signature we might try. At import time we fill them in 
if they are in __nomonkey__ using CEP 1000. At call time we likely() 
around the pointer being non-empty, since the cost of a dict-lookup is 
so large anyway.

> It seems to me that compiler must guess function signature at compile
> time. And then check it at runtime.

Yes. Just like Fortran 77, where you don't declare functions, just call 
them.At least with Cython it'll just go slower if you get them wrong, we 
won't get a crash :-)

If you want to help the compiler along explicitly, you would instead do 
something like

cdef double (*sqrt_double)(double)
from numpy import sqrt

> What if integer signature is guessed for sqrt() based on the argument
> type sqrt(16) should this call fallback to PyObject_Call() or cast an
> integer to a double at some point?

a) np.sqrt could export functions for all basic types (this is how NumPy 
currently works under the hood anyway)

b) It doesn't help here, but I also imagine Cython doing a 3-step or 
4-step call down the line:

- Direct call using the types given.
- Promote all scalars to 64 bit, try again.
[- (Optional if an FFI library or LLVM is available): Parse signature 
string of function and build call dynamically using a FFI library)]
- Python call

Without an FFI library, I think giving the user a speedup if he/she 
writes sqrt(3.) rather than sqrt(3) is fine...

c) An optimize-for-host-libraries compilation flag could of course just 
probe for the signatures, similar to profile-guided optimization.

> I've tried to implement trivial approach for CyFunction. Trivial means
> that function accepts PyObjects as arguments and returns an PyObject,
> so trivial signature is only one integer: 1 + len(args). If signature
> match occurs dirct C-function is called and PyObject_Call() is used
> otherwise. I didn't succeed because of argument cloning problems, we
> discussed before.
>
> About dict lookups: it's possible to speedup dict lookup by a constant
> key if we have access to dict's internal implementation. I've
> implemented it for module-level lookups here:
>
> https://github.com/vitek/cython/commit/1d134fe54a74e6fc6d39d09973db499680b2a8d9
>
> And it gave 4 times speed up for dummy test:
>
> def foo():
>      cdef int i, r = 0
>      o = foo
>      for i in range(10000000):
>          if o is foo:
>              r += 1
>
> %timeit foo()
> 1 loops, best of 3: 229 ms per loop
>
> %timeit foo_optimized()
> 10 loops, best of 3: 54.1 ms per loop
>

Cool! Am I right that that translates to 5.4 ns? That's pretty good. 
(What CPU did you use?)

Still, a function pointer call done at import time appears to be roughly 
1 ns, so a full sqrt bound the way I proposed would be 2-3 ns, so 5.4ns 
in addition still relatively large.

But, it does mean that __nomonkey__, if not completely invalid, is 
perhaps not exactly high-priority. For a JIT that would consult it at 
compile time the gain would be higher though.

Dag