
Hi there,

Reading Nathaniel's summary from the numpy dev meeting, it looks like there is a consensus on using cython in numpy for the Python-C interfaces.

This has been on my radar for a long time: it was part of my rationale for splitting multiarray into multiple "independent" .c files half a decade ago. I took the opportunity of the EuroScipy sprints to look back into this, but before looking into it further, I'd like to make sure I am not going astray:

1. The transition has to be gradual.

2. The obvious way I can think of to allow cython in multiarray is modifying multiarray such that cython "owns" the PyMODINIT_FUNC and the module PyModuleDef table.

3. We start using cython for the parts that are mostly menial refcount work. Things like the functions in calculation.c are obvious candidates.

Step 2 should not be disruptive, and does not look like a lot of work: there are < 60 methods in the table, and most of them should be fairly straightforward to cythonize. At worst, we could just keep them as-is outside cython and just "export" them in cython.

Does that sound like an acceptable plan? If so, I will start working on a PR for step 2.

David
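A rough sketch of what step 2 could look like. All names here (`_multiarray_c.h`, `c_array_transpose`) are illustrative placeholders, not the real numpy internals: the cython-generated module owns the init function and method table, while existing C implementations are kept as-is and merely re-exported through `cdef extern`:

```cython
# multiarray.pyx -- hypothetical sketch; names are placeholders,
# not the actual numpy internals.

# Existing C implementations stay in their .c files; we only
# declare them here so the cython module can re-export them.
cdef extern from "_multiarray_c.h":
    object c_array_transpose(object self, object args)

def transpose(self, args):
    """Thin cython wrapper around the existing C implementation."""
    return c_array_transpose(self, args)

# Menial refcount-heavy methods (step 3) would instead be rewritten
# directly in cython, which then manages the refcounts itself.
# Compiling this .pyx makes cython generate the PyMODINIT_FUNC and
# the module's method table, i.e. cython now "owns" module init.
```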

On Sun, Aug 30, 2015 at 2:44 PM, David Cournapeau <cournape@gmail.com> wrote:
Hi there,
Reading Nathaniel's summary from the numpy dev meeting, it looks like there is a consensus on using cython in numpy for the Python-C interfaces.
This has been on my radar for a long time: it was part of my rationale for splitting multiarray into multiple "independent" .c files half a decade ago. I took the opportunity of the EuroScipy sprints to look back into this, but before looking into it further, I'd like to make sure I am not going astray:
1. The transition has to be gradual
Yes, definitely.
2. The obvious way I can think of to allow cython in multiarray is modifying multiarray such that cython "owns" the PyMODINIT_FUNC and the module PyModuleDef table.
That seems like a plausible place to start.

In the longer run, I think we'll need to figure out a strategy to have source code divided over multiple .pyx files (for the same reason we want multiple .c files -- it'll just be impossible to work with otherwise). And this will be difficult for annoying technical reasons: since we definitely do *not* want to increase the API surface exposed by multiarray.so, we will need to compile these multiple .pyx and .c files into a single module, and have them talk to each other via internal interfaces. But Cython is currently very insistent that every .pyx file should be its own extension module, and that the interface between different files should be via public APIs.

I spent some time poking at this, and I think it's possible but will take a few kluges, at least initially. IIRC the tricky points I noticed are:

- For everything except the top-level .pyx file, we'd need to call the generated module initialization functions "by hand", and have a bit of utility code to let us access the symbol tables for the resulting modules.

- We'd need some preprocessor hack (or something?) to prevent the non-main module initialization functions from being exposed at the .so level (e.g. via 'cdef extern from "foo.h"', where 'foo.h' re-#defines PyMODINIT_FUNC to remove the visibility declaration).

- By default 'cdef' functions are name-mangled, which is annoying if you want to be able to do direct C calls between different .pyx and .c files. You can fix this by adding a 'public' declaration to your cdef function. But 'public' also adds dllexport stuff, which would need to be hacked out as per above.

I think the best strategy for this is to do whatever horrible things are necessary to get an initial version working (on a branch, of course), and then once that's done, assess what changes we want to ask the cython folks for to let us eliminate the gross parts.
(Insisting on compiling everything into the same .so will probably also help at some point in avoiding Cython-Related Binary Size Blowup Syndrome (CRBSBS), because the masses of boilerplate could in principle be shared between the different files. I think some modern linkers are even clever enough to eliminate this kind of duplicate code automatically, since C++ suffers from a similar problem.)
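The visibility hack described above could look roughly like this. This is a sketch under assumptions: the header name is hypothetical, it assumes a GCC-compatible compiler, and the exact macro expansion differs between Python 2 and 3:

```c
/* npy_hidden_init.h -- hypothetical header, pulled in via
 * 'cdef extern from "npy_hidden_init.h"' in each non-main .pyx.
 *
 * Python.h defines PyMODINIT_FUNC with default (exported)
 * visibility; re-#defining it marks the generated init function
 * hidden, so it never appears in multiarray.so's dynamic symbol
 * table, and it can only be called "by hand" from inside the .so. */
#include <Python.h>

#undef PyMODINIT_FUNC
#if PY_MAJOR_VERSION >= 3
/* Python 3: module init functions return PyObject* */
#define PyMODINIT_FUNC __attribute__((visibility("hidden"))) PyObject *
#else
/* Python 2: module init functions return void */
#define PyMODINIT_FUNC __attribute__((visibility("hidden"))) void
#endif
```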
3. We start using cython for the parts that are mostly menial refcount work. Things like functions in calculation.c are obvious candidates.
Step 2 should not be disruptive, and does not look like a lot of work: there are < 60 methods in the table, and most of them should be fairly straightforward to cythonize. At worst, we could just keep them as-is outside cython and just "export" them in cython.
Does that sound like an acceptable plan?
If so, I will start working on a PR for step 2.
Makes sense to me! -n -- Nathaniel J. Smith -- http://vorpus.org

On Tue, Sep 1, 2015 at 8:16 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Sun, Aug 30, 2015 at 2:44 PM, David Cournapeau <cournape@gmail.com> wrote:
Hi there,
Reading Nathaniel's summary from the numpy dev meeting, it looks like there is a consensus on using cython in numpy for the Python-C interfaces.
This has been on my radar for a long time: it was part of my rationale for splitting multiarray into multiple "independent" .c files half a decade ago. I took the opportunity of the EuroScipy sprints to look back into this, but before looking into it further, I'd like to make sure I am not going astray:
1. The transition has to be gradual
Yes, definitely.
2. The obvious way I can think of to allow cython in multiarray is modifying multiarray such that cython "owns" the PyMODINIT_FUNC and the module PyModuleDef table.
That seems like a plausible place to start.
In the longer run, I think we'll need to figure out a strategy to have source code divided over multiple .pyx files (for the same reason we want multiple .c files -- it'll just be impossible to work with otherwise). And this will be difficult for annoying technical reasons, since we definitely do *not* want to increase the API surface exposed by multiarray.so, so we will need to compile these multiple .pyx and .c files into a single module, and have them talk to each other via internal interfaces. But Cython is currently very insistent that every .pyx file should be its own extension module, and the interface between different files should be via public APIs.
I spent some time poking at this, and I think it's possible but will take a few kluges at least initially. IIRC the tricky points I noticed are:
- For everything except the top-level .pyx file, we'd need to call the generated module initialization functions "by hand", and have a bit of utility code to let us access the symbol tables for the resulting modules
- We'd need some preprocessor hack (or something?) to prevent the non-main module initialization functions from being exposed at the .so level (e.g. via 'cdef extern from "foo.h"', where 'foo.h' re-#defines PyMODINIT_FUNC to remove the visibility declaration).
- By default 'cdef' functions are name-mangled, which is annoying if you want to be able to do direct C calls between different .pyx and .c files. You can fix this by adding a 'public' declaration to your cdef function. But 'public' also adds dllexport stuff which would need to be hacked out as per above.
I think the best strategy for this is to do whatever horrible things are necessary to get an initial version working (on a branch, of course), and then once that's done assess what changes we want to ask the cython folks for to let us eliminate the gross parts.
Agreed.

Regarding multiple cython .pyx files and symbol pollution, I think it would be fine to have an internal API with the required prefix (say `_npy_cpy_`) in a core library, and control the exported symbols at the .so level. This is how many large libraries work in practice (e.g. MKL), and it is a model well understood by library users.

I will start the cythonize process without caring about any of that, though: one large .pyx file, and everything built together into one .so. That will avoid having to fight both cython and distutils at the same time :)

David
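Controlling the exported symbols at the .so level can be done with a linker version script, along these lines. A sketch under assumptions: this presumes GNU ld, and the file name is illustrative (only the module entry points are real Python conventions):

```
/* multiarray.map -- hypothetical GNU ld version script, passed to
   the final link via -Wl,--version-script=multiarray.map */
{
    global:
        initmultiarray;      /* Python 2 module entry point */
        PyInit_multiarray;   /* Python 3 module entry point */
    local:
        *;                   /* everything else, including the
                                _npy_cpy_* internal API, stays
                                out of the dynamic symbol table */
};
```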
(Insisting on compiling everything into the same .so will probably also help at some point in avoiding Cython-Related Binary Size Blowup Syndrome (CRBSBS), because the masses of boilerplate could in principle be shared between the different files. I think some modern linkers are even clever enough to eliminate this kind of duplicate code automatically, since C++ suffers from a similar problem.)
3. We start using cython for the parts that are mostly menial refcount work. Things like functions in calculation.c are obvious candidates.
Step 2 should not be disruptive, and does not look like a lot of work: there are < 60 methods in the table, and most of them should be fairly straightforward to cythonize. At worst, we could just keep them as-is outside cython and just "export" them in cython.
Does that sound like an acceptable plan?
If so, I will start working on a PR for step 2.
Makes sense to me!
-n
-- Nathaniel J. Smith -- http://vorpus.org

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
participants (2)
- David Cournapeau
- Nathaniel Smith