Starting to work on runtime plugin system for plugin (automatic sse optimization, etc...)

Hi,
I've just started working on a prototype for a plugin system for numpy. The plugin system aims at providing a framework for the following use cases:
- runtime selection of blas/lapack/etc.: instead of hardcoding one blas/lapack implementation in the binary, numpy could choose the SSE-optimized one if the CPU supports SSE, etc.
- this could also be used for core numpy, for example ufuncs: if we want to start implementing some tight loops with aggressively optimized code (SSE, etc.), we could again ship with a default pure C implementation, and choose the best one at runtime.
- we could even have a system to choose a different implementation (for example, right now, scipy is shipped with a slow fft, mainly for licensing reasons, and people installing fftw could then tell scipy to use fftw instead of the included one).
Right now, the prototype does not do much, and only works on linux; I mainly focused on automatic generation of the plugin from a list of functions, and on transparent use from numpy's point of view. It provides the plugin api through pure function pointers, without the need for the user to be aware of it. For example, if you have an api with the following functions:
    void foo1();
    int foo2();
    int foo3(int);
    int foo4(double* , double*);
    int foo5(double* , double*, int);
The current implementation would build the boilerplate to load those functions, etc., and you would just use those functions in numpy like the following:
    init_foo();
    /* all functions are prefixed with npyw, for numpy wrapper */
    npyw_foo1();
    npyw_foo2(n);
    etc...
The code can be found there: https://code.launchpad.net/~david-ar/+junk/numplug
And some thinking (pretty low content for now): http://www.scipy.org/RuntimeOptimization
cheers,
David
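
To make the wrapper idea above concrete, here is a minimal sketch of what such generated boilerplate might look like for a single function. It is not taken from the numplug prototype: foo3_generic, foo3_optimized and have_fast_path are hypothetical names, and both implementations live in the same file instead of being loaded from a plugin.

```c
#include <stddef.h>

/* Two interchangeable implementations of the same API function.
 * Both are hypothetical; a real plugin would provide the second one
 * from a separately built shared library. */
static int foo3_generic(int n)   { return 2 * n; }
static int foo3_optimized(int n) { return n + n; }  /* stand-in for an SSE version */

/* Hypothetical capability check; always picks the generic path here. */
static int have_fast_path(void) { return 0; }

/* The wrapper is only a function pointer; it stays NULL until init_foo() runs. */
int (*npyw_foo3)(int) = NULL;

/* Fills in the wrapper pointers. Returns 0 on success. */
int init_foo(void)
{
    npyw_foo3 = have_fast_path() ? foo3_optimized : foo3_generic;
    return 0;
}
```

After calling init_foo(), client code calls npyw_foo3(n) with the same syntax as an ordinary function call, which is why only a rename is needed at the call sites.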

2008/4/28 David Cournapeau <david@ar.media.kyoto-u.ac.jp>:
- this could also be used for core numpy, for example ufuncs: if we want to start implementing some tight loop with aggressively optimized code (SSE, etc...), we could again ship with a default pure C implementation, and choose the best one at runtime.
This would be a *fantastic* addition, especially if a user can add his own ufuncs written in, say Cython.
Right now, the prototype does not do much, and only works for linux; I mainly focused on automatic generation of the plugin from a list of functions, and transparent use from numpy point of view. It provides the plugin api through pure function pointers, without the need for the user to be aware of it. For example, if you have an api with the following functions:
void foo1(); int foo2(); int foo3(int); int foo4(double* , double*); int foo5(double* , double*, int);
The current implementation would build the boilerplate to load those functions, etc... and you would just use those functions in numpy like the following:
I assume that, since you call it a plugin system, it can be done at runtime a-la ctypes? Cheers Stéfan

On Tue, Apr 29, 2008 at 1:00 AM, Stéfan van der Walt <stefan@sun.ac.za> wrote:
I assume that, since you call it a plugin system, it can be done at runtime a-la ctypes?
I am not sure I understand what you mean exactly by a-la ctypes, but yes, the actual implementation of the npyw_* functions would be decided at runtime; that's the whole point. For example, instead of directly using the cblas_*dot* functions in blasdot, we would use npyw_cblas*dot* functions, which would point to something in an SSE3-optimized atlas if run on an SSE3 CPU, etc. For the actual code change in numpy and scipy to be minimal, it should only involve renaming functions, which is what this first prototype focuses on.
Once the system is ready to be integrated (not anytime soon; for one, we would need support in the build system for building dynamically loaded libraries), this would mean numpy and scipy would work like matlab, which ships with different blas/lapack implementations (atlas_sse, atlas_sse2, mkl), without the user having to deal with this kind of low-level detail. It would also help to make (if only by providing the incentive) a cleaner separation between pure C implementation and python C api boilerplate code.
I think that optimizing some ufuncs with SSE and co without runtime selection would be a nightmare to deploy. It is basically linked to the directions I am most interested in for future python numpy releases.
cheers,
David
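
A rough sketch of the runtime choice described above, under several assumptions: npyw_ddot and its simplified signature are hypothetical (this is not the cblas interface), the "SSE3" version is only a placeholder that reuses the reference loop, and CPU detection relies on the GCC/Clang builtin __builtin_cpu_supports, falling back to the generic path on other compilers or architectures.

```c
#include <stddef.h>

/* Reference dot product in plain C. */
static double ddot_generic(size_t n, const double *x, const double *y)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += x[i] * y[i];
    return acc;
}

/* Placeholder for an SSE3-tuned version (hypothetical): a real plugin would
 * forward to an optimized library; here it reuses the reference loop so the
 * sketch stays self-contained. */
static double ddot_sse3(size_t n, const double *x, const double *y)
{
    return ddot_generic(n, x, y);
}

/* Returns nonzero when the running CPU reports SSE3 support. */
static int cpu_has_sse3(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse3");
#else
    return 0;  /* unknown compiler or architecture: stay on the generic path */
#endif
}

/* Wrapper pointer used by client code instead of a hardcoded implementation. */
double (*npyw_ddot)(size_t, const double *, const double *) = NULL;

/* Picks the implementation once, based on what the CPU reports. */
void init_npyw_dot(void)
{
    npyw_ddot = cpu_has_sse3() ? ddot_sse3 : ddot_generic;
}
```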

2008/4/28 David Cournapeau <cournape@gmail.com>:
On Tue, Apr 29, 2008 at 1:00 AM, Stéfan van der Walt <stefan@sun.ac.za> wrote:
I assume that, since you call it a plugin system, it can be done at runtime a-la ctypes?
What I meant was: would the plugin "slots" be decided beforehand, or could we manipulate them at runtime? I.e. what I would really enjoy doing is define arbitrary ufuncs and plug them in (not only the blas funcs and a select few others). Either way, this is good news for distributors -- it would make it so much easier to provide an optimised version of scipy to windows users, who can't easily compile it themselves. Cheers Stéfan

On Tue, Apr 29, 2008 at 4:03 AM, Stéfan van der Walt <stefan@sun.ac.za> wrote:
What I meant was: would the plugin "slots" be decided beforehand, or could we manipulate them at runtime? I.e. what I would really enjoy doing is define arbitrary ufuncs and plug them in (not only the blas funcs and a select few others).
For each set of functions, there is an init function (just like python extensions) that you must call before calling any function, and you can do what you want there: this could be controlled by environment variables and the like. I still don't have a clear idea of the API for handling the different ways you might want to load the plugin (multi-threaded vs single-threaded, SSE1/2/3/4 vs no SSE, etc.), though. I would prefer every plugin to have exactly the same signature for the init function, and would like to avoid global state as much as possible.
The thing I wanted to do but quickly gave up on because of the complexity is unloading/reloading a plugin. For example, at the beginning, you may want to load atlas for the blas, and then use mkl instead.
cheers,
David
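
As an illustration of an init function driven by an environment variable, here is a small sketch. Everything in it is an assumption made for the example: the NPYW_BLAS variable, the backend names, and the select_blas helper do not come from the prototype, and the "backends" only report their name.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-ins for two backends; each just reports which one is active. */
static const char *blas_reference_name(void) { return "reference"; }
static const char *blas_atlas_name(void)     { return "atlas"; }

/* Wrapper pointer, filled in by init_blas() or select_blas(). */
const char *(*npyw_blas_name)(void) = NULL;

/* Selection logic kept separate from getenv() so it can be exercised directly.
 * NULL means "no preference" and selects the reference backend. */
int select_blas(const char *choice)
{
    if (choice == NULL || strcmp(choice, "reference") == 0) {
        npyw_blas_name = blas_reference_name;
        return 0;
    }
    if (strcmp(choice, "atlas") == 0) {
        npyw_blas_name = blas_atlas_name;
        return 0;
    }
    return -1;  /* unknown backend: leave the current pointer untouched */
}

/* Same signature for every plugin: no arguments, 0 on success. */
int init_blas(void)
{
    return select_blas(getenv("NPYW_BLAS"));
}
```

Note that the wrapper pointer is itself global state, so this sketch does not address the wish to avoid it.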

Stéfan van der Walt wrote:
What I meant was: would the plugin "slots" be decided beforehand, or could we manipulate them at runtime? I.e. what I would really enjoy doing is define arbitrary ufuncs and plug them in (not only the blas funcs and a select few others).
Do you want to define the ufuncs at runtime ? Or do you want to be able to select the ufuncs at runtime ? cheers, David

2008/4/29 David Cournapeau <david@ar.media.kyoto-u.ac.jp>:
Stéfan van der Walt wrote:
What I meant was: would the plugin "slots" be decided beforehand, or could we manipulate them at runtime? I.e. what I would really enjoy doing is define arbitrary ufuncs and plug them in (not only the blas funcs and a select few others).
Do you want to define the ufuncs at runtime ? Or do you want to be able to select the ufuncs at runtime ?
Both, eventually, but I realise that this is somewhat out of the scope of your original suggestion. Don't let me distract you, I'm very keen to see what you come up with! Cheers Stéfan

Stéfan van der Walt wrote:
Don't let me distract you, I'm very keen to see what you come up with!
Well, the point of making my preliminary work public is to get "distracted", as you put it :) It is really easy to come up with something not that useful without feedback or people's remarks. Typically, I would never have thought about the usefulness of defining ufuncs on the fly; I have my own ideas on how/why I do things, but I would consider it a failure if it were limited to that.
cheers,
David

This would be a *fantastic* addition, especially if a user can add his own ufuncs written in, say Cython.
I'd like to add some large number to whatever *fantastic* means in terms of +N! Best, Matthew

David, I briefly took a look at your code, and I have a very, very important observation.
Your implementation makes use of low-level dlopening. You are then going to have to manage all the oddities of runtime loading on the different systems. In this case, 'libtool' could really help. I know it is GPL, but AFAIK it has some special licensing that lets you ship it with your code under the same license terms as your code.
But I definitely think that a better approach would be using a stubs mechanism à la Tcl, or what is currently used in some extension modules in core python, like cStringIO. In short, you access all your functions through a pointer to a struct (statically or heap allocated) where each struct member is filled with a pointer to a function. This is pretty similar to C++ virtual tables, or Cython cdef classes with cdef methods. And then you just let Python do the work of dynamically loading extension modules. The numpy C/API uses a similar approach, but uses an array. IMHO, the struct approach is cleaner.
What do you think about this?
-- Lisandro Dalcín --------------- Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC) Instituto de Desarrollo Tecnológico para la Industria Química (INTEC) Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) PTLC - Güemes 3450, (3000) Santa Fe, Argentina Tel/Fax: +54-(0)342-451.1594
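
For comparison, the struct-of-function-pointers ("stubs") idea from the message above might be sketched as follows. The foo_stubs type, the get_foo_stubs accessor and the behaviour of the two functions are hypothetical and only illustrate the layout; they are not the Tcl or cStringIO mechanisms themselves.

```c
/* One struct member per API function (the "stubs" table). */
typedef struct {
    int (*foo3)(int);
    int (*foo5)(double *, double *, int);
} foo_stubs;

/* Hypothetical foo3: returns its argument plus one. */
static int foo3_impl(int n) { return n + 1; }

/* Hypothetical foo5: copies n doubles from src to dst and returns n. */
static int foo5_impl(double *dst, double *src, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i];
    return n;
}

/* Statically allocated table; an extension module could export a pointer to it. */
static const foo_stubs default_stubs = { foo3_impl, foo5_impl };

/* Accessor that client code calls once to obtain the table. */
const foo_stubs *get_foo_stubs(void)
{
    return &default_stubs;
}
```

Call sites then read api->foo3(n) instead of foo3(n), where api is the pointer returned by get_foo_stubs().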

On Tuesday 29 April 2008, Lisandro Dalcin wrote:
Your implementation makes use of low-level dlopening. You are then going to have to manage all the oddities of runtime loading on the different systems.
Argh. -1 for a hard dependency on dlopen(). At some point in my life, I might be forced to compile numpy on an IBM Bluegene/L, which does *not* have dynamic linking at all. (Btw, anybody done something like this before?) Andreas

Andreas Klöckner wrote:
Argh. -1 for a hard dependency on dlopen().
There is no hard dependency on dlopen; there is a hard dependency on runtime loading, because, well, that's the point of a plugin system. It should not be difficult to disable the plugin system for platforms that do not support it, though (and do as today), but I am not sure it is really useful.
At some point in my life, I might be forced to compile numpy on an IBM Bluegene/L, which does *not* have dynamic linking at all. (Btw, anybody done something like this before?)
How will you build numpy in the case of a system without dynamic linking ? The only solution is then to build numpy and link it statically to the python interpreter. Systems without dynamic linking are common (embedded systems), though. cheers, David

On Tuesday 29 April 2008, David Cournapeau wrote:
Andreas Klöckner wrote:
Argh. -1 for a hard dependency on dlopen().
There is no hard dependency on dlopen; there is a hard dependency on runtime loading, because, well, that's the point of a plugin system. It should not be difficult to disable the plugin system for platforms that do not support it, though (and do as today), but I am not sure it is really useful.
As long as it's easy to disable (for example with a preprocessor define), I guess I'm ok.
At some point in my life, I might be forced to compile numpy on an IBM Bluegene/L, which does *not* have dynamic linking at all. (Btw, anybody done something like this before?)
How will you build numpy in the case of a system without dynamic linking ? The only solution is then to build numpy and link it statically to the python interpreter. Systems without dynamic linking are common (embedded systems), though.
Yes, obviously everything will need to be linked into one big static executable blob. I am somewhat certain that distutils will be of no help there, so I will need to "roll my own". There is a CMake-based build of Python for BG/L, I was planning to work off that. But so far, I might not end up having to do all that, for which I'd be endlessly grateful. Andreas
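
The preprocessor switch suggested above could look roughly like this. It is only a sketch: the NPYW_NO_PLUGINS macro is a made-up name, and foo3 stands in for a library function that a static build would link directly.

```c
/* Stand-in for the library function that would normally be linked directly. */
static int foo3(int n) { return 3 * n; }

#ifdef NPYW_NO_PLUGINS
/* Static build: the wrapper name is a plain alias and init is a no-op. */
#define npyw_foo3 foo3
static int init_foo(void) { return 0; }
#else
/* Plugin build: the wrapper is a pointer filled in by init_foo(). */
static int (*npyw_foo3)(int) = 0;
static int init_foo(void)
{
    npyw_foo3 = foo3;  /* a real build would resolve this at runtime */
    return 0;
}
#endif
```

Client code calls init_foo() and npyw_foo3(n) in both configurations, so building with -DNPYW_NO_PLUGINS removes the indirection without touching the call sites.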

Andreas Klöckner wrote:
Yes, obviously everything will need to be linked into one big static executable blob. I am somewhat certain that distutils will be of no help there, so I will need to "roll my own". There is a CMake-based build of Python for BG/L, I was planning to work off that.
You will have to build numpy too. Not that I want to discourage you, but that will be a hell of a lot of work.
cheers,
David

On Tuesday 29 April 2008, David Cournapeau wrote:
Andreas Klöckner wrote:
Yes, obviously everything will need to be linked into one big static executable blob. I am somewhat certain that distutils will be of no help there, so I will need to "roll my own". There is a CMake-based build of Python for BG/L, I was planning to work off that.
You will have to build numpy too. Not that I want to discourage you, but that will be a hell of a lot of work.
The good news is that Bluegene/P (the next version of that architecture) *does* support dynamic linking. It's probably broken in some obscure way, but that's (hopefully) better than non-existent. :) In any case, if I can't dodge porting my code to BG/L, you'll hear from me. :)
Andreas

Andreas Klöckner wrote:
But so far, I might not end up having to do all that, for which I'd be endlessly grateful.
If you really need it, note that numpy can be built with scons instead of distutils, and the scons scripts are now available in numpy svn (and will be included in the release sources starting from 1.1.0). Scons is severely lacking in the cross-compilation department, but I think scons scripts are easier to adapt to cmake than distutils setup.py files if you need to use cmake :) I would actually be quite interested in making numpy build in a cross-compilation environment with scons.
cheers,
David

Lisandro Dalcin wrote:
David, I briefly took a look at your code, and I have a very, very important observation.
Your implementation makes use of low-level dlopening. You are then going to have to manage all the oddities of runtime loading on the different systems. In this case, 'libtool' could really help. I know it is GPL, but AFAIK it has some special licensing that lets you ship it with your code under the same license terms as your code.
Ok, there are several issues here:
1. cross-platform runtime loading
2. how to access the plugin capabilities (function pointer, interfaces, etc.)
3. how to build
1: the implementation is not cross-platform, but the API is. It took me ~2 hours to refactor symbol loading and get an implementation for posix/win32/Mac OS X. I don't know of any OS with runtime loading capabilities but without the ability to load a file and a symbol from it; if an OS lacks it, it cannot be used by python anyway: everything would have to be built statically at the same time as python, which we do not support in numpy anyway, AFAIK.
2: I studied several approaches quite a bit before using this one. That was my main concern at first. For plugins, you have the following possibilities that I am aware of:
- raw function pointers
- COM
- pre-defined API
Raw function pointers are the simplest, but are not really scalable. COM is this big monstrosity, extremely awkward to use, but it can be extended ad nauseam, without a pre-defined interface. By pre-defined API, I mean something like VST plugins and co (used for music software, where a host can load many plugins to provide sound effects; it is the de facto standard, and Mac OS X has its built-in thing called AudioUnit, which is the same thing for what matters here).
For the usage I have in mind (blas, fft, lapack, etc.), the API cannot be pre-defined (each one is totally different), so I quickly dismissed the VST-like approach. Then there is the COM thing, which is really complicated. Although each plugin interface is totally different (blas vs lapack), they are relatively set in stone, so I thought that by using generated code, the scalability problem of raw pointers could be alleviated.
3: libtool does not know about windows, and I think it is way overkill. We can't use libtool for building (which is one of the big things libtool provides). A dlopen-like approach is not as portable as libtool, but it is as portable as python, which is good enough for numpy :)
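
The point about a cross-platform API over a platform-specific implementation (item 1 above) can be sketched with a thin loader layer. This is not the refactored code from the prototype; the npyw_lib_* names are hypothetical, and only the standard dlopen/dlsym/dlclose and LoadLibraryA/GetProcAddress/FreeLibrary calls are assumed.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
typedef HMODULE npyw_lib;

/* Returns NULL when the library cannot be loaded. */
npyw_lib npyw_lib_open(const char *path)
{
    return LoadLibraryA(path);
}

/* Returns NULL when the symbol is not found. */
void *npyw_lib_symbol(npyw_lib lib, const char *name)
{
    return (void *)GetProcAddress(lib, name);
}

void npyw_lib_close(npyw_lib lib)
{
    FreeLibrary(lib);
}
#else
#include <dlfcn.h>
typedef void *npyw_lib;

/* Returns NULL when the library cannot be loaded. */
npyw_lib npyw_lib_open(const char *path)
{
    return dlopen(path, RTLD_NOW | RTLD_LOCAL);
}

/* Returns NULL when the symbol is not found. */
void *npyw_lib_symbol(npyw_lib lib, const char *name)
{
    return dlsym(lib, name);
}

void npyw_lib_close(npyw_lib lib)
{
    dlclose(lib);
}
#endif
```

The generated init functions would then only call these three functions, so each platform needs just one small implementation block. On some systems the POSIX branch must be linked with -ldl.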
But I definitely think that a better approach would be using a stubs mechanism à la Tcl, or what is currently used in some extension modules in core python, like cStringIO. In short, you access all your functions through a pointer to a struct (statically or heap allocated) where each struct member is filled with a pointer to a function. This is pretty similar to C++ virtual tables, or Cython cdef classes with cdef methods. And then you just let Python do the work of dynamically loading extension modules. The numpy C/API uses a similar approach, but uses an array. IMHO, the struct approach is cleaner.
What do you think about this?
It may be cleaner, but I am not convinced it buys us much. With my approach, all that is needed for the existing code (numpy.core, numpy.linalg, etc.) is a renaming of the functions used (which can be done in 5 minutes with sed), because in C, calling a function and calling through a function pointer look exactly the same at the call site. With a struct (or array) of function pointers, you would have to replace all the function calls for blas, lapack, etc. with a system for passing the table of function pointers around. That sounds like a nightmare to me, because I don't see how to do that automatically. Maybe I just don't see it; in that case, what would be your approach?
I was convinced that the function pointer approach is usable by looking at liboil, which does exactly what we need: http://liboil.freedesktop.org/wiki/ (you can look at the liboil/liboilfuncs* and liboil/liboilfunc.* files). Liboil's approach is more complicated, because the function pointer is provided by a function factory (so that each function can be initialized differently), but I don't think we need the factory in our case (and if we do, we can add it without changing anything in the code which uses the plugin).
cheers,
David
participants (6)
- Andreas Klöckner
- David Cournapeau
- David Cournapeau
- Lisandro Dalcin
- Matthew Brett
- Stéfan van der Walt