Looking for people interested in helping with Python compiler to LLVM
Hey all, I gave a lightning talk this morning on numba which is the start of a Python compiler to machine code through the LLVM tool-chain. It is proof of concept stage only at this point (use it only if you are interested in helping develop the code at this point). The only thing that works is a fast-vectorize capability on a few functions (without for-loops). But, it shows how creating functions in Python that can be used by the NumPy runtime in various ways. Several NEPS that will be discussed in the coming months will use this concept. Right now there is very little design documentation, but I will be adding some in the days ahead, especially if I get people who are interested in collaborating on the project. I did talk to Fijal and Alex of the PyPy project at PyCon and they both graciously suggested that I look at some of the PyPy code which walks the byte-code and does translation to their intermediate representation for inspiration. Again, the code is not ready for use, it is only proof of concept, but I would like to get feedback and help especially from people who might have written compilers before. The code lives at: https://github.com/ContinuumIO/numba -Travis
what is the difference to http://www.python.org/dev/peps/pep-3146/ ? On Sun, Mar 11, 2012 at 4:35 PM, Travis Oliphant <teoliphant@gmail.com>wrote:
Hey all,
I gave a lightning talk this morning on numba which is the start of a Python compiler to machine code through the LLVM tool-chain. It is proof of concept stage only at this point (use it only if you are interested in helping develop the code at this point). The only thing that works is a fast-vectorize capability on a few functions (without for-loops). But, it shows how creating functions in Python that can be used by the NumPy runtime in various ways. Several NEPS that will be discussed in the coming months will use this concept.
Right now there is very little design documentation, but I will be adding some in the days ahead, especially if I get people who are interested in collaborating on the project. I did talk to Fijal and Alex of the PyPy project at PyCon and they both graciously suggested that I look at some of the PyPy code which walks the byte-code and does translation to their intermediate representation for inspiration.
Again, the code is not ready for use, it is only proof of concept, but I would like to get feedback and help especially from people who might have written compilers before. The code lives at: https://github.com/ContinuumIO/numba
-Travis
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
It says : " Numba is a NumPy aware optimizing compiler for Python. It uses the remarkable LLVM compiler infrastructure to compile Python byte-code to machine code especially for use in the NumPy run-time and SciPy modules."* * If this description is correct, Numba is an additional pass once the cpython bytecode has be produced by cpython. Is it correct?? Is python bytecote a good intermediate representation to perform numpy related optimization? One current big issue with numpy is that C=A+B+D produces temporaries. numexpr addresses this issue and it would be great to get the same result by default in numpy. numexpr also optimizes polynomials using Horner's method. It is hard to do that at bytecode level, isn't it? Unladen swallow wanted to replace the full cpython implementation by a jit compiler built using LLVM...but unladen swallow is dead. Xavier
what is the difference to http://www.python.org/dev/peps/pep-3146/ ?
On Sun, Mar 11, 2012 at 4:35 PM, Travis Oliphant <teoliphant@gmail.com <mailto:teoliphant@gmail.com>> wrote:
Hey all,
I gave a lightning talk this morning on numba which is the start of a Python compiler to machine code through the LLVM tool-chain. It is proof of concept stage only at this point (use it only if you are interested in helping develop the code at this point). The only thing that works is a fast-vectorize capability on a few functions (without for-loops). But, it shows how creating functions in Python that can be used by the NumPy runtime in various ways. Several NEPS that will be discussed in the coming months will use this concept.
Right now there is very little design documentation, but I will be adding some in the days ahead, especially if I get people who are interested in collaborating on the project. I did talk to Fijal and Alex of the PyPy project at PyCon and they both graciously suggested that I look at some of the PyPy code which walks the byte-code and does translation to their intermediate representation for inspiration.
Again, the code is not ready for use, it is only proof of concept, but I would like to get feedback and help especially from people who might have written compilers before. The code lives at: https://github.com/ContinuumIO/numba
-Travis
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org <mailto:NumPy-Discussion@scipy.org> http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
11.03.2012 15:12, xavier.gnata@gmail.com kirjoitti: [clip]
If this description is correct, Numba is an additional pass once the cpython bytecode has be produced by cpython. Is it correct?? Is python bytecote a good intermediate representation to perform numpy related optimization?
One current big issue with numpy is that C=A+B+D produces temporaries. numexpr addresses this issue and it would be great to get the same result by default in numpy. numexpr also optimizes polynomials using Horner's method. It is hard to do that at bytecode level, isn't it?
My impression is that dealing with Python's bytecode is not necessarily significantly harder than dealing with the AST. Your example reads 1 0 LOAD_NAME 0 (A) 3 LOAD_NAME 1 (B) 6 BINARY_ADD 7 LOAD_NAME 2 (D) 10 BINARY_ADD 11 STORE_NAME 3 (C) For instance, interpreting the bytecode (e.g. loop body) once with dummy objects lets you know what the final compound expression is.
Unladen swallow wanted to replace the full cpython implementation by a jit compiler built using LLVM... but unladen swallow is dead.
To get speed gains, you need to optimize not only the bytecode interpreter side, but also the object space --- Python classes, strings and all that. Keeping in mind Python's dynamism, there are potential side effects everywhere. I guess this is what sunk the swallow. Just speeding up effectively statically typed code dealing with arrays and basic types, on the other hand, sounds much easier. The PyPy guys have a much more ambitious approach, and are getting nice results. Also with arrays --- as I understand, the fact that they want to be able to do this sort of optimization is the main reason why they want to reimplement the core parts of Numpy in RPython. The second issue is that unfortunately their emulation of CPython's C-API is at the moment seems to have quite large overheads. Porting Numpy on top of that is possible --- I managed to get basic things (apart from string/unicode arrays) to work, but things took very large speed hits (of the order ~ 100x for things like `arange(10000).sum()`). This pushes the speed advantage of Numpy to a bit too large array sizes. The reason is probably that Numpy uses PyObjects internally heavily, which accumulates the cost of passing objects through the emulation layer. -- Pauli Virtanen
On 03/11/2012 03:52 PM, Pauli Virtanen wrote: > 11.03.2012 15:12, xavier.gnata@gmail.com kirjoitti: > [clip] >> If this description is correct, Numba is an additional pass once the >> cpython bytecode has be produced by cpython. >> Is it correct?? >> Is python bytecote a good intermediate representation to perform numpy >> related optimization? >> >> One current big issue with numpy is that C=A+B+D produces temporaries. >> numexpr addresses this issue and it would be great to get the same >> result by default in numpy. >> numexpr also optimizes polynomials using Horner's method. It is hard to >> do that at bytecode level, isn't it? > My impression is that dealing with Python's bytecode is not necessarily > significantly harder than dealing with the AST. > > Your example reads > > 1 0 LOAD_NAME 0 (A) > 3 LOAD_NAME 1 (B) > 6 BINARY_ADD > 7 LOAD_NAME 2 (D) > 10 BINARY_ADD > 11 STORE_NAME 3 (C) > > For instance, interpreting the bytecode (e.g. loop body) once with dummy > objects lets you know what the final compound expression is. ok. > >> Unladen swallow wanted to replace the full cpython implementation by a >> jit compiler built using LLVM... but unladen swallow is dead. > To get speed gains, you need to optimize not only the bytecode > interpreter side, but also the object space --- Python classes, strings > and all that. Keeping in mind Python's dynamism, there are potential > side effects everywhere. I guess this is what sunk the swallow. Correct : impact of dynamism underestimated and unexpected bugs in LLVM --> dead swallow. > Just speeding up effectively statically typed code dealing with arrays > and basic types, on the other hand, sounds much easier. it is :) > The PyPy guys have a much more ambitious approach, and are getting nice > results. Also with arrays --- as I understand, the fact that they want > to be able to do this sort of optimization is the main reason why they > want to reimplement the core parts of Numpy in RPython. It sonds ok for numpy but it is no sense because numpy is only the beginning of the story. Numpy is not *that* useful without scipy, matplotlib and so on. > The second issue is that unfortunately their emulation of CPython's > C-API is at the moment seems to have quite large overheads. Porting > Numpy on top of that is possible --- I managed to get basic things > (apart from string/unicode arrays) to work, but things took very large > speed hits (of the order ~ 100x for things like `arange(10000).sum()`). > This pushes the speed advantage of Numpy to a bit too large array sizes. > The reason is probably that Numpy uses PyObjects internally heavily, > which accumulates the cost of passing objects through the emulation layer. Yes. We have two types of users : 1) Guys using python without numpy. These guys want the python core language to go faster (with the full complexity of a dynamic language). 2) science users. These guys are fine with python speed because their codes spend 99% the time in numpy/scipy/any_other_science_module_using_c-api. Who cares if it takes 1.1s (in python) instead of 1.0 (in C) to read the data input file if it takes hours (or even minutes) to process them (using the C-api and doing the basics with temporaries and not optimized loops)?
Den 11. mars 2012 kl. 15:52 skrev Pauli Virtanen <pav@iki.fi>:
To get speed gains, you need to optimize not only the bytecode interpreter side, but also the object space --- Python classes, strings and all that. Keeping in mind Python's dynamism, there are potential side effects everywhere. I guess this is what sunk the swallow.
Just speeding up effectively statically typed code dealing with arrays and basic types, on the other hand, sounds much easier.
Yes, it would be like psyco, except it has LLVM as compiler and knows about NumPy. Sturla
Den 11.03.2012 15:52, skrev Pauli Virtanen:
To get speed gains, you need to optimize not only the bytecode interpreter side, but also the object space --- Python classes, strings and all that. Keeping in mind Python's dynamism, there are potential side effects everywhere. I guess this is what sunk the swallow.
I'm not sure what scared off or killed the bird. Maybe they just approached it from the wrong side. Psyco has proven that some "algorithmic" Python code can be accelerated by an order of magnitude or two. But Psyco did not use an optimizing compiler like LLVM, it just generated unoptimized x86 binary code from Python. Also, a JIT for a dynamic language is possible. One example is the Strongtalk JIT compiler for Smalltalk. It was purchased by Sun and renamed "Java VM Hotspot", because Sun could not produce a JIT for its static language Java. The static nature of Java does not have anything to do with the performance of its JIT compiler. Maybe there are differences between Python and Smalltalk that makes the latter more easy to JIT compile? Or perhaps there are differences between Hotspot/Strongtalk and LLVM that makes the latter less efficient for dynamic languages? I don't know. But making a performant JIT for all of Python is obviously not easy. A third example of a fast JIT for a dynamic language is LuaJIT. It can often make "interpreted" and duck-typed Lua run faster than statically compiled C. Like psyco, LuaJIT just focuses on making algorithmic code with a few elementary types run fast. Another approach to speeding up dynamic languages is optional static typing and static compilation. Cython, Bigloo (et al.), and CMUCL/SBCL are exellent examples of that. Maybe one could use type annotations like Cython in pure Python mode? Python 3 has even got type annotations in the syntax. I think (but I am not 100% sure) that the main problem with JIT'ing Python is its dynamic attributes. So maybe some magic with __slots__ or __metaclass__ could be used to turn that dynamicity off? So a JIT like Numba would only work with builtin types (int, float, list, dict, set, tuple), NumPy arrays, and some less-dynamic classes. It would not speed up all of Python, but a sufficient subset to make scientists happy. Sturla
On Mar 11, 2012, at 1:59 AM, Mic wrote:
what is the difference to http://www.python.org/dev/peps/pep-3146/ ?
To fully expound the difference would take a lot of discussion. But, summarizing: * Numba is not nearly as ambitious as US (Numba will be fine with some user-directed information and with a subset of the language that gets compiled) * Numba will focus on compilation rather than JITing --- in other words it won't be trying to detect hot-spots and compile segments (actually LLVM is not a good candidate for that sort of thing as it is not a very fast or memory-efficient compiler for a lot of reasons, I believe this is why PyPy does not use it anymore, for example). * Numba will be much closer to Cython in spirit than Unladen Swallow (or PyPy) --- people who just use Cython for a loop or two will be able to use Numba instead * Numba will borrow the idea of call-sites from IronPython wherein a function is replaced by a generic function that dispatches based on input types to either cached compiled code for the types specified or the generic function. * Numba will be mainly about trying to leverage the code-generation of LLVM which multiple hardware manufacturers are using (Intel's OpenCL support, Nvidia's PTX backend, Apple's CLang, etc.) for NumPy arrays -Travis
Den 11.03.2012 23:11, skrev Travis Oliphant:
* Numba will be much closer to Cython in spirit than Unladen Swallow (or PyPy) --- people who just use Cython for a loop or two will be able to use Numba instead
This is perhaps the most important issue for scientific and algorithmic codes. Not having to resort to Cython, C, C++ or Fortran to get decent performance for algorithmic code would be great. It could also put Python/Numba high up on the Debian shootout ;-) Or one could benchmark a pure Python version of timsort against the standard library. Sturla
Hi, Le 12/03/2012 00:21, Sturla Molden a écrit :
It could also put Python/Numba high up on the Debian shootout ;-)
Can you tell a bit more about it ? (I just didn't understand the whole sentence in fact ;-) ) Thanks ! -- Pierre
Doesent Theano does the same, only via GCC compilation?
One major difference is that Theano doesn't attempt to parse existing Python (byte)code: you need to explicitly code with the Theano syntax (which tries to be close to Numpy, but can end up looking quite different, especially if you want to control the program flow with loops and ifs for instance). A potentially interesting avenue would be to parse Python (byte)code to generate a Theano graph. It'd be nice if numba could output some intermediate information that would represent the computational graph being compiled, so that Theano could re-use it directly :) (probably much easier said than done though) -=- Olivier Le 12 mars 2012 12:57, Till Stensitzki <mail.till@gmx.de> a écrit :
Doesent Theano does the same, only via GCC compilation?
On 03/10/2012 10:35 PM, Travis Oliphant wrote:
Hey all,
I gave a lightning talk this morning on numba which is the start of a Python compiler to machine code through the LLVM tool-chain. It is proof of concept stage only at this point (use it only if you are interested in helping develop the code at this point). The only thing that works is a fast-vectorize capability on a few functions (without for-loops). But, it shows how creating functions in Python that can be used by the NumPy runtime in various ways. Several NEPS that will be discussed in the coming months will use this concept.
Right now there is very little design documentation, but I will be adding some in the days ahead, especially if I get people who are interested in collaborating on the project. I did talk to Fijal and Alex of the PyPy project at PyCon and they both graciously suggested that I look at some of the PyPy code which walks the byte-code and does translation to their intermediate representation for inspiration.
Again, the code is not ready for use, it is only proof of concept, but I would like to get feedback and help especially from people who might have written compilers before. The code lives at: https://github.com/ContinuumIO/numba
Hi Travis, me and Mark F. has been talking today about whether some of numba and Cython development could overlap -- not right away, but in the sense that if Cython gets some features for optimization of numerical code, then make it easy for numba to reuse that functionality. This may be sort of off-topic re: the above-- but part of the goal of this post is to figure out numba's intended scope. If there isn't an overlap, that's good to know in itself. Question 1: Did you look at Clyther and/or Copperhead? Though similar, they target GPUs...but at first glance they look as though they may be parsing Python bytecode to get their ASTs... (didn't check though) Question 2: What kind of performance are you targeting -- in the short term, and in the long term? Is competing with "Fortran-level" performance a goal at all? E.g., for ufunc computations with different iteration orders such as "a + b.T" (a and b in C-order), one must do blocking to get good performance. And when dealing with strided arrays, copying small chunks at the time will sometimes help performance (and sometimes not). This is optimization strategies which (as I understand it) is quite beyond what NumPy iterators etc. can provide. And the LLVM level could be too low -- one has quite a lot of information when generating the ufunc/reduction/etc. that would be thrown away when generating LLVM code. Vectorizing compilers do their best to reconstruct this information; I know nothing about what actually exists here for LLVM. They are certainly a lot more complicated to implement and work with than making use of on higher-level information available before code generation. The idea we've been playing with is for Cython to define a limited subset of its syntax tree (essentially the "GIL-less" subset) seperate from the rest of Cython, with a more well-defined API for optimization passes etc., and targeted for a numerical optimization pipeline. This subset would actually be pretty close to what numba needs to compile, even if the overlap isn't perfect. So such a pipeline could possibly be shared between Cython and numba, even if Cython would use it at compile-time and numba at runtime, and even if the code generation backend is different (the code generation backend is probably not the hard part...). To be concrete, the idea is: (Cython|numba) -> high-level numerical compiler and loop-structure/blocking optimizer (by us on a shared parse tree representation) -> (LLVM/C/OpenCL) -> low-level optimization (by the respective compilers) Some algorithms that could be shareable are iteration strategies (already in NumPy though), blocking strategies, etc. Even if this may be beyond numba's (and perhaps Cython's) current ambition, it may be worth thinking about, if nothing else then just for how Cython's code should be structured. (Mark F., how does the above match how you feel about this?) Dag
On Mar 13, 2012, at 12:58 AM, Dag Sverre Seljebotn wrote:
On 03/10/2012 10:35 PM, Travis Oliphant wrote:
Hey all,
I gave a lightning talk this morning on numba which is the start of a Python compiler to machine code through the LLVM tool-chain. It is proof of concept stage only at this point (use it only if you are interested in helping develop the code at this point). The only thing that works is a fast-vectorize capability on a few functions (without for-loops). But, it shows how creating functions in Python that can be used by the NumPy runtime in various ways. Several NEPS that will be discussed in the coming months will use this concept.
Right now there is very little design documentation, but I will be adding some in the days ahead, especially if I get people who are interested in collaborating on the project. I did talk to Fijal and Alex of the PyPy project at PyCon and they both graciously suggested that I look at some of the PyPy code which walks the byte-code and does translation to their intermediate representation for inspiration.
Again, the code is not ready for use, it is only proof of concept, but I would like to get feedback and help especially from people who might have written compilers before. The code lives at: https://github.com/ContinuumIO/numba
Hi Travis,
me and Mark F. has been talking today about whether some of numba and Cython development could overlap -- not right away, but in the sense that if Cython gets some features for optimization of numerical code, then make it easy for numba to reuse that functionality.
That would be very, very interesting.
This may be sort of off-topic re: the above-- but part of the goal of this post is to figure out numba's intended scope. If there isn't an overlap, that's good to know in itself.
Question 1: Did you look at Clyther and/or Copperhead? Though similar, they target GPUs...but at first glance they look as though they may be parsing Python bytecode to get their ASTs... (didn't check though)
I have looked at both projects although Clyther more in depth. Clyther is parsing bytecode to get the AST (through a sub-project by the same author called Meta: http://srossross.github.com/Meta/html/index.html).
Question 2: What kind of performance are you targeting -- in the short term, and in the long term? Is competing with "Fortran-level" performance a goal at all?
In the short-term, I'm targeting C-equivalent performance (like weave). In the long-term, I'm targeting optimized high-level expressions (i.e. Fortran-level) with GPU and mulit-core.
E.g., for ufunc computations with different iteration orders such as "a + b.T" (a and b in C-order), one must do blocking to get good performance. And when dealing with strided arrays, copying small chunks at the time will sometimes help performance (and sometimes not).
This is optimization strategies which (as I understand it) is quite beyond what NumPy iterators etc. can provide.
And the LLVM level could be too low -- one has quite a lot of information when generating the ufunc/reduction/etc. that would be thrown away when generating LLVM code.
It doesn't need to be thrown away at all. It could be used to generate appropriate code for the arrays being used. The long-term idea is to actually be aware of NumPy arrays and encourage expression of high-level constructs which generate optimized code using chunking, blocking, AVX instructions, multiple threads, etc. To do this, it may make more sense to actually emit OpenMP (unless LLVM grows standard threading intrinsics). This is not out of the question.
Vectorizing compilers do their best to reconstruct this information; I know nothing about what actually exists here for LLVM. They are certainly a lot more complicated to implement and work with than making use of on higher-level information available before code generation.
The idea we've been playing with is for Cython to define a limited subset of its syntax tree (essentially the "GIL-less" subset) seperate from the rest of Cython, with a more well-defined API for optimization passes etc., and targeted for a numerical optimization pipeline.
This subset would actually be pretty close to what numba needs to compile, even if the overlap isn't perfect. So such a pipeline could possibly be shared between Cython and numba, even if Cython would use it at compile-time and numba at runtime, and even if the code generation backend is different (the code generation backend is probably not the hard part...). To be concrete, the idea is:
(Cython|numba) -> high-level numerical compiler and loop-structure/blocking optimizer (by us on a shared parse tree representation) -> (LLVM/C/OpenCL) -> low-level optimization (by the respective compilers)
Some algorithms that could be shareable are iteration strategies (already in NumPy though), blocking strategies, etc.
Even if this may be beyond numba's (and perhaps Cython's) current ambition, it may be worth thinking about, if nothing else then just for how Cython's code should be structured.
This kind of collaboration would be very nice. I agree, there might be some kind of intermediate representation that would be good for both projects. -Travis
(Mark F., how does the above match how you feel about this?)
Dag _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On 13 March 2012 09:19, Travis Oliphant <travis@continuum.io> wrote:
On Mar 13, 2012, at 12:58 AM, Dag Sverre Seljebotn wrote:
On 03/10/2012 10:35 PM, Travis Oliphant wrote:
Hey all,
I gave a lightning talk this morning on numba which is the start of a
Python compiler to machine code through the LLVM tool-chain. It is proof
of concept stage only at this point (use it only if you are interested
in helping develop the code at this point). The only thing that works is
a fast-vectorize capability on a few functions (without for-loops). But,
it shows how creating functions in Python that can be used by the NumPy
runtime in various ways. Several NEPS that will be discussed in the
coming months will use this concept.
Right now there is very little design documentation, but I will be
adding some in the days ahead, especially if I get people who are
interested in collaborating on the project. I did talk to Fijal and Alex
of the PyPy project at PyCon and they both graciously suggested that I
look at some of the PyPy code which walks the byte-code and does
translation to their intermediate representation for inspiration.
Again, the code is not ready for use, it is only proof of concept, but I
would like to get feedback and help especially from people who might
have written compilers before. The code lives at:
https://github.com/ContinuumIO/numba
Hi Travis,
me and Mark F. has been talking today about whether some of numba and Cython development could overlap -- not right away, but in the sense that if Cython gets some features for optimization of numerical code, then make it easy for numba to reuse that functionality.
That would be very, very interesting.
This may be sort of off-topic re: the above-- but part of the goal of this post is to figure out numba's intended scope. If there isn't an overlap, that's good to know in itself.
Question 1: Did you look at Clyther and/or Copperhead? Though similar, they target GPUs...but at first glance they look as though they may be parsing Python bytecode to get their ASTs... (didn't check though)
I have looked at both projects although Clyther more in depth. Clyther is parsing bytecode to get the AST (through a sub-project by the same author called Meta: http://srossross.github.com/Meta/html/index.html).
Question 2: What kind of performance are you targeting -- in the short term, and in the long term? Is competing with "Fortran-level" performance a goal at all?
In the short-term, I'm targeting C-equivalent performance (like weave). In the long-term, I'm targeting optimized high-level expressions (i.e. Fortran-level) with GPU and mulit-core.
E.g., for ufunc computations with different iteration orders such as "a + b.T" (a and b in C-order), one must do blocking to get good performance. And when dealing with strided arrays, copying small chunks at the time will sometimes help performance (and sometimes not).
This is optimization strategies which (as I understand it) is quite beyond what NumPy iterators etc. can provide.
As for blocking, this could be done by the numpy iterators themselves, by simply introducing more dimensions with appropriate shape and strides (not saying that's a solution :).
And the LLVM level could be too low -- one has quite a lot of information when generating the ufunc/reduction/etc. that would be thrown away when generating LLVM code.
It doesn't need to be thrown away at all. It could be used to generate appropriate code for the arrays being used. The long-term idea is to actually be aware of NumPy arrays and encourage expression of high-level constructs which generate optimized code using chunking, blocking, AVX instructions, multiple threads, etc.
To do this, it may make more sense to actually emit OpenMP (unless LLVM grows standard threading intrinsics). This is not out of the question.
That would be interesting, my experience with OpenMP is that the standard doesn't define (ironically enough) the use of OpenMP in the context of threading, and indeed, trying to use OpenMP outside of the main thread simply segfaults your program. If llvm would get such features, one must be prepared to make the OpenMP runtime thread-safe as well (hopefully it will be in the first place, like I believe Intel's implementation).
Vectorizing compilers do their best to reconstruct this information; I know nothing about what actually exists here for LLVM. They are certainly a lot more complicated to implement and work with than making use of on higher-level information available before code generation.
The idea we've been playing with is for Cython to define a limited subset of its syntax tree (essentially the "GIL-less" subset) seperate from the rest of Cython, with a more well-defined API for optimization passes etc., and targeted for a numerical optimization pipeline.
This subset would actually be pretty close to what numba needs to compile, even if the overlap isn't perfect. So such a pipeline could possibly be shared between Cython and numba, even if Cython would use it at compile-time and numba at runtime, and even if the code generation backend is different (the code generation backend is probably not the hard part...). To be concrete, the idea is:
(Cython|numba) -> high-level numerical compiler and loop-structure/blocking optimizer (by us on a shared parse tree representation) -> (LLVM/C/OpenCL) -> low-level optimization (by the respective compilers)
Some algorithms that could be shareable are iteration strategies (already in NumPy though), blocking strategies, etc.
Even if this may be beyond numba's (and perhaps Cython's) current ambition, it may be worth thinking about, if nothing else then just for how Cython's code should be structured.
This kind of collaboration would be very nice. I agree, there might be some kind of intermediate representation that would be good for both projects.
-Travis
(Mark F., how does the above match how you feel about this?)
I would like collaboration, but from a technical perspective I think this would be much more involved than just dumping the AST to an IR and generating some code from there. For vector expressions I think sharing code would be more feasible than arbitrary (parallel) loops, etc. Cython as a compiler can make many decisions that a Python (bytecode) compiler can't make (at least without annotations and a well-defined subset of the language (not so much the syntax as the semantics)). I think in numba, if parallelism is to be supported, you will want a prange-like construct, as proving independence between iterations can be very hard to near impossible for a compiler. As for code generation, I'm not sure how llvm would do things like slicing arrays, reshaping, resizing etc (for vector expressions you can first evaluate all slicing and indexing operations and then compile the remaining vector expression), but for loops and array reassignment within loops this would have to invoke the actual slicing code from the llvm code (I presume). There are many other things, like bounds checking, wraparound, etc, that are all supported in both numpy and Cython, but going through an llvm layer would as far as I can see, require re-implementing those, at least if you want top-notch performance. Personally, I think for non-trivial performance-critical code (for loops with indexing, slicing, function calls, etc) Cython is a better target. So for vector expressions I think Cython and Numba could work together by specifying AST transformations that operate on vector expressions. For the purposes of Cython it would go from the Cython AST to the IR and after transformations either back to the Cython AST, or directly to llvm. For Cython, going from that code to llvm is not necessarily more useful than C and OpenCL, as you will know the types anyway at compile time and you can immediately exploit multicore as well as SIMD parallelism. In the face of blocking and chunking etc, certain specializations may be created in advance for Cython, or it could even generate a C version (+ openmp + auto-vectorization appeasing pragmas), an OpenCL version for the CPU and possibly a different one for the GPU, and a numba + numba IR version, i.e. feed the IR at runtime to numba and have it compile to llvm. If the compiler additionally fuses vector expressions together, this will be even more powerful. Finally, as for non-vector-expression code, I really believe Cython is a better target. cython.inline can have high overhead (at least the first time it has to compile), but with better (numpy-aware) type inference or profile guided optimizations (see recent threads on the cython-dev mailing list), in addition to things like prange, I personally believe Cython targets most of the use cases where numba would be able to generate performing code.
Dag _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
(Mark F., how does the above match how you feel about this?)
I would like collaboration, but from a technical perspective I think this would be much more involved than just dumping the AST to an IR and generating some code from there. For vector expressions I think sharing code would be more feasible than arbitrary (parallel) loops, etc. Cython as a compiler can make many decisions that a Python (bytecode) compiler can't make (at least without annotations and a well-defined subset of the language (not so much the syntax as the semantics)). I think in numba, if parallelism is to be supported, you will want a prange-like construct, as proving independence between iterations can be very hard to near impossible for a compiler.
I completely agree that you have to define some kind of syntax to get parallelism. But, a prange construct would not be out of the question, of course.
As for code generation, I'm not sure how llvm would do things like slicing arrays, reshaping, resizing etc (for vector expressions you can first evaluate all slicing and indexing operations and then compile the remaining vector expression), but for loops and array reassignment within loops this would have to invoke the actual slicing code from the llvm code (I presume).
There could be some analysis on the byte-code, prior to emitting the llvm code in order to handle lots of things. Basically, you have to "play" the byte-code on a simple machine anyway in order to emit the correct code. The big thing about Cython is you have to typedef too many things that are really quite knowable from the code. If Cython could improve it's type inference, then it would be a more suitable target.
There are many other things, like bounds checking, wraparound, etc, that are all supported in both numpy and Cython, but going through an llvm layer would as far as I can see, require re-implementing those, at least if you want top-notch performance. Personally, I think for non-trivial performance-critical code (for loops with indexing, slicing, function calls, etc) Cython is a better target.
With libclang it is really quite possible to imagine a cython -> C target that itself compiles to llvm so that you can do everything at that intermediate layer. However, LLVM is a much better layer for optimization than C now that there are a lot of people collaborating on that layer. I think it would be great if Cython targeted LLVM actually instead of C.
Finally, as for non-vector-expression code, I really believe Cython is a better target. cython.inline can have high overhead (at least the first time it has to compile), but with better (numpy-aware) type inference or profile guided optimizations (see recent threads on the cython-dev mailing list), in addition to things like prange, I personally believe Cython targets most of the use cases where numba would be able to generate performing code.
Cython and Numba certainly overlap. However, Cython requires: 1) learning another language 2) creating an extension module --- loading bit-code files and dynamically executing (even on a different machine from the one that initially created them) can be a powerful alternative for run-time compilation and distribution of code. These aren't show-stoppers obviously. But, I think some users would prefer an even simpler approach to getting fast-code than Cython (which currently doesn't do enought type-inference and requires building a dlopen extension module). -Travis
Dag _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Tue, Mar 13, 2012 at 5:18 PM, Travis Oliphant <travis@continuum.io> wrote:
Cython and Numba certainly overlap. However, Cython requires:
1) learning another language
So is the goal for numba to actually handle arbitrary Python code with correct semantics, i.e., it's actually a compiled implementation of Python-the-language? (I feel like most of where Cython-the-language differs from Python-the-language is that Cython adds extensions for stuff where getting speed out of Python-the-language would just be too hard. Dynamic type inference for numpy arrays is definitely a good start, but you can't even, say, promote a Python integer to a C integer without changing semantics...)
2) creating an extension module --- loading bit-code files and dynamically executing (even on a different machine from the one that initially created them) can be a powerful alternative for run-time compilation and distribution of code.
Totally agreed on this point, the workflow for Cython could definitely be smoother. -- Nathaniel
As far as I understand, the goal is not to handle arbitrary Python code, because this would become too difficult as is not necessary when you have a simple math oriented function which you want to speed up. The idea is to create something similar to http://www.enthought.com/~ischnell/paper.html which only handles some restricted Python code, but using LLVM (instead of C) as the target language. - Ilan On Tue, Mar 13, 2012 at 1:39 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Tue, Mar 13, 2012 at 5:18 PM, Travis Oliphant <travis@continuum.io> wrote:
Cython and Numba certainly overlap. However, Cython requires:
1) learning another language
So is the goal for numba to actually handle arbitrary Python code with correct semantics, i.e., it's actually a compiled implementation of Python-the-language? (I feel like most of where Cython-the-language differs from Python-the-language is that Cython adds extensions for stuff where getting speed out of Python-the-language would just be too hard. Dynamic type inference for numpy arrays is definitely a good start, but you can't even, say, promote a Python integer to a C integer without changing semantics...)
2) creating an extension module --- loading bit-code files and dynamically executing (even on a different machine from the one that initially created them) can be a powerful alternative for run-time compilation and distribution of code.
Totally agreed on this point, the workflow for Cython could definitely be smoother.
-- Nathaniel _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On 13 March 2012 18:18, Travis Oliphant <travis@continuum.io> wrote:
(Mark F., how does the above match how you feel about this?)
I would like collaboration, but from a technical perspective I think this would be much more involved than just dumping the AST to an IR and generating some code from there. For vector expressions I think sharing code would be more feasible than arbitrary (parallel) loops, etc. Cython as a compiler can make many decisions that a Python (bytecode) compiler can't make (at least without annotations and a well-defined subset of the language (not so much the syntax as the semantics)). I think in numba, if parallelism is to be supported, you will want a prange-like construct, as proving independence between iterations can be very hard to near impossible for a compiler.
I completely agree that you have to define some kind of syntax to get parallelism. But, a prange construct would not be out of the question, of course.
As for code generation, I'm not sure how llvm would do things like slicing arrays, reshaping, resizing etc (for vector expressions you can first evaluate all slicing and indexing operations and then compile the remaining vector expression), but for loops and array reassignment within loops this would have to invoke the actual slicing code from the llvm code (I presume).
There could be some analysis on the byte-code, prior to emitting the llvm code in order to handle lots of things. Basically, you have to "play" the byte-code on a simple machine anyway in order to emit the correct code. The big thing about Cython is you have to typedef too many things that are really quite knowable from the code. If Cython could improve it's type inference, then it would be a more suitable target.
There are many other things, like bounds checking, wraparound, etc, that are all supported in both numpy and Cython, but going through an llvm layer would as far as I can see, require re-implementing those, at least if you want top-notch performance. Personally, I think for non-trivial performance-critical code (for loops with indexing, slicing, function calls, etc) Cython is a better target.
With libclang it is really quite possible to imagine a cython -> C target that itself compiles to llvm so that you can do everything at that intermediate layer. However, LLVM is a much better layer for optimization than C now that there are a lot of people collaborating on that layer. I think it would be great if Cython targeted LLVM actually instead of C.
Finally, as for non-vector-expression code, I really believe Cython is a better target. cython.inline can have high overhead (at least the first time it has to compile), but with better (numpy-aware) type inference or profile guided optimizations (see recent threads on the cython-dev mailing list), in addition to things like prange, I personally believe Cython targets most of the use cases where numba would be able to generate performing code.
Cython and Numba certainly overlap. However, Cython requires:
1) learning another language 2) creating an extension module --- loading bit-code files and dynamically executing (even on a different machine from the one that initially created them) can be a powerful alternative for run-time compilation and distribution of code.
These aren't show-stoppers obviously. But, I think some users would prefer an even simpler approach to getting fast-code than Cython (which currently doesn't do enought type-inference and requires building a dlopen extension module).
Dag and I have been discussing this at PyCon, and here is my take on it (at this moment :). Definitely, if you can avoid Cython then that is easier and more desirable in many ways. So perhaps we can create a third project called X (I'm not very creative, maybe ArrayExprOpt), that takes an abstract syntax tree in a rather simple form, performs code optimizations such as rewriting loops with array accesses to vector expressions, fusing vector expressions and loops, etc, and spits out a transformed AST containing these optimizations. If runtime information is given such as actual shape and stride information the transformations could figure out there and then whether to do things like collapsing, axes swapping, blocking (as in, introducing more axes or loops to retain discontiguous blocks in the cache), blocked memory copies to contiguous chunks, etc. The AST could then also say whether the final expressions are vectorizable. Part of this functionality is already in numpy's nditer, except that this would be implicit and do more (and hopefully with minimal overhead). So numba, Cython and maybe numexpr could use the functionality, simply by building the AST from Python and converting back (if necessary) to its own AST. As such, the AST optimizer would be only part of any (runtime) compiler's pipeline, and it should be very flexible to retain any information (metadata regarding actual types, control flow information, etc) provided by the original AST. It would not do control flow analysis, type inference or promotion, etc, but only deal with abstract types like integers, reals and arrays (C, Fortran or partly contiguous or strided). It would not deal with objects, but would allow to insert nodes like UnreorderableNode and SideEffectNode wrapping parts of the original AST. In short, it should be as easy as possible to convert from an original AST to this project's AST and back again afterwards. As the project matures many optimizations may be added that deal with all sorts of loop restructuring and ways to efficiently utilize the cache as well as enable vectorization and possibly parallelism. Perhaps it could even generate a different AST depending on whether execution target the CPU or the GPU (with optionally available information such as cache sizes, GPU shared/local memory sizes, etc). Seeing that this would be a part of my master dissertation, my supervisor would require me to write the code, so at least until August I think I would have to write (at least the bulk of) this. Otherwise I can also make other parts of my dissertation's project more prominent to make up for it. Anyway, my question is, is there interest from at least the numba and numexpr projects (if code can be transformed into vector operations, it makes sense to use numexpr for that, I'm not sure what numba's interest is in that).
-Travis
Dag _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
This sounds a lot like Theano, did you look into it? -=- Olivier Le 20 mars 2012 13:49, mark florisson <markflorisson88@gmail.com> a écrit :
(Mark F., how does the above match how you feel about this?)
I would like collaboration, but from a technical perspective I think this would be much more involved than just dumping the AST to an IR and generating some code from there. For vector expressions I think sharing code would be more feasible than arbitrary (parallel) loops, etc. Cython as a compiler can make many decisions that a Python (bytecode) compiler can't make (at least without annotations and a well-defined subset of the language (not so much the syntax as the semantics)). I think in numba, if parallelism is to be supported, you will want a prange-like construct, as proving independence between iterations can be very hard to near impossible for a compiler.
I completely agree that you have to define some kind of syntax to get
On 13 March 2012 18:18, Travis Oliphant <travis@continuum.io> wrote: parallelism. But, a prange construct would not be out of the question, of course.
As for code generation, I'm not sure how llvm would do things like slicing arrays, reshaping, resizing etc (for vector expressions you can first evaluate all slicing and indexing operations and then compile the remaining vector expression), but for loops and array reassignment within loops this would have to invoke the actual slicing code from the llvm code (I presume).
There could be some analysis on the byte-code, prior to emitting the
llvm code in order to handle lots of things. Basically, you have to "play" the byte-code on a simple machine anyway in order to emit the correct code. The big thing about Cython is you have to typedef too many things that are really quite knowable from the code. If Cython could improve it's type inference, then it would be a more suitable target.
There are many other things, like bounds checking, wraparound, etc, that are all supported in both numpy and Cython, but going through an llvm layer would as far as I can see, require re-implementing those, at least if you want top-notch performance. Personally, I think for non-trivial performance-critical code (for loops with indexing, slicing, function calls, etc) Cython is a better target.
With libclang it is really quite possible to imagine a cython -> C
target that itself compiles to llvm so that you can do everything at that intermediate layer. However, LLVM is a much better layer for optimization than C now that there are a lot of people collaborating on that layer. I think it would be great if Cython targeted LLVM actually instead of C.
Finally, as for non-vector-expression code, I really believe Cython is a better target. cython.inline can have high overhead (at least the first time it has to compile), but with better (numpy-aware) type inference or profile guided optimizations (see recent threads on the cython-dev mailing list), in addition to things like prange, I personally believe Cython targets most of the use cases where numba would be able to generate performing code.
Cython and Numba certainly overlap. However, Cython requires:
1) learning another language 2) creating an extension module --- loading bit-code files and
dynamically executing (even on a different machine from the one that initially created them) can be a powerful alternative for run-time compilation and distribution of code.
These aren't show-stoppers obviously. But, I think some users would
prefer an even simpler approach to getting fast-code than Cython (which currently doesn't do enought type-inference and requires building a dlopen extension module).
Dag and I have been discussing this at PyCon, and here is my take on it (at this moment :).
Definitely, if you can avoid Cython then that is easier and more desirable in many ways. So perhaps we can create a third project called X (I'm not very creative, maybe ArrayExprOpt), that takes an abstract syntax tree in a rather simple form, performs code optimizations such as rewriting loops with array accesses to vector expressions, fusing vector expressions and loops, etc, and spits out a transformed AST containing these optimizations. If runtime information is given such as actual shape and stride information the transformations could figure out there and then whether to do things like collapsing, axes swapping, blocking (as in, introducing more axes or loops to retain discontiguous blocks in the cache), blocked memory copies to contiguous chunks, etc. The AST could then also say whether the final expressions are vectorizable. Part of this functionality is already in numpy's nditer, except that this would be implicit and do more (and hopefully with minimal overhead).
So numba, Cython and maybe numexpr could use the functionality, simply by building the AST from Python and converting back (if necessary) to its own AST. As such, the AST optimizer would be only part of any (runtime) compiler's pipeline, and it should be very flexible to retain any information (metadata regarding actual types, control flow information, etc) provided by the original AST. It would not do control flow analysis, type inference or promotion, etc, but only deal with abstract types like integers, reals and arrays (C, Fortran or partly contiguous or strided). It would not deal with objects, but would allow to insert nodes like UnreorderableNode and SideEffectNode wrapping parts of the original AST. In short, it should be as easy as possible to convert from an original AST to this project's AST and back again afterwards.
As the project matures many optimizations may be added that deal with all sorts of loop restructuring and ways to efficiently utilize the cache as well as enable vectorization and possibly parallelism. Perhaps it could even generate a different AST depending on whether execution target the CPU or the GPU (with optionally available information such as cache sizes, GPU shared/local memory sizes, etc).
Seeing that this would be a part of my master dissertation, my supervisor would require me to write the code, so at least until August I think I would have to write (at least the bulk of) this. Otherwise I can also make other parts of my dissertation's project more prominent to make up for it. Anyway, my question is, is there interest from at least the numba and numexpr projects (if code can be transformed into vector operations, it makes sense to use numexpr for that, I'm not sure what numba's interest is in that).
-Travis
Dag _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
We talked some about Theano. There are some differences in project goals which means that it makes sense to make this a seperate project: Cython wants to use this to generate C code up front from the Cython AST at compilation time; numba also has a different frontend (parsing of python bytecode) and a different backend (LLVM). However, it may very well be possible that Theano could be refactored so that the more essential algorithms working on the syntax tree could be pulled out and shared with cython and numba. Then the question is whether the core of Theano is smart enough to compete with Fortran compilers and support arbitraily strided inputs optimally. Otherwise one might as well start from scratch. I'll leave that for Mark to figure out... Dag -- Sent from my Android phone with K-9 Mail. Please excuse my brevity. Olivier Delalleau <shish@keba.be> wrote: This sounds a lot like Theano, did you look into it? -=- Olivier Le 20 mars 2012 13:49, mark florisson <markflorisson88@gmail.com> a écrit : On 13 March 2012 18:18, Travis Oliphant <travis@continuum.io> wrote:
(Mark F., how does the above match how you feel about this?)
I would like collaboration, but from a technical perspective I think this would be much more involved than just dumping the AST to an IR and generating some code from there. For vector expressions I think sharing code would be more feasible than arbitrary (parallel) loops, etc. Cython as a compiler can make many decisions that a Python (bytecode) compiler can't make (at least without annotations and a well-defined subset of the language (not so much the syntax as the semantics)). I think in numba, if parallelism is to be supported, you will want a prange-like construct, as proving independence between iterations can be very hard to near impossible for a compiler.
I completely agree that you have to define some kind of syntax to get parallelism. But, a prange construct would not be out of the question, of course.
As for code generation, I'm not sure how llvm would do things like slicing arrays, reshaping, resizing etc (for vector expressions you can first evaluate all slicing and indexing operations and then compile the remaining vector expression), but for loops and array reassignment within loops this would have to invoke the actual slicing code from the llvm code (I presume).
There could be some analysis on the byte-code, prior to emitting the llvm code in order to handle lots of things. Basically, you have to "play" the byte-code on a simple machine anyway in order to emit the correct code. The big thing about Cython is you have to typedef too many things that are really quite knowable from the code. If Cython could improve it's type inference, then it would be a more suitable target.
There are many other things, like bounds checking, wraparound, etc, that are all supported in both numpy and Cython, but going through an llvm layer would as far as I can see, require re-implementing those, at least if you want top-notch performance. Personally, I think for non-trivial performance-critical code (for loops with indexing, slicing, function calls, etc) Cython is a better target.
With libclang it is really quite possible to imagine a cython -> C target that itself compiles to llvm so that you can do everything at that intermediate layer. However, LLVM is a much better layer for optimization than C now that there are a lot of people collaborating on that layer. I think it would be great if Cython targeted LLVM actually instead of C.
Finally, as for non-vector-expression code, I really believe Cython is a better target. cython.inline can have high overhead (at least the first time it has to compile), but with better (numpy-aware) type inference or profile guided optimizations (see recent threads on the cython-dev mailing list), in addition to things like prange, I personally believe Cython targets most of the use cases where numba would be able to generate performing code.
Cython and Numba certainly overlap. However, Cython requires:
1) learning another language 2) creating an extension module --- loading bit-code files and dynamically executing (even on a different machine from the one that initially created them) can be a powerful alternative for run-time compilation and distribution of code.
These aren't show-stoppers obviously. But, I think some users would prefer an even simpler approach to getting fast-code than Cython (which currently doesn't do enought type-inference and requires building a dlopen extension module).
Dag and I have been discussing this at PyCon, and here is my take on it (at this moment :). Definitely, if you can avoid Cython then that is easier and more desirable in many ways. So perhaps we can create a third project called X (I'm not very creative, maybe ArrayExprOpt), that takes an abstract syntax tree in a rather simple form, performs code optimizations such as rewriting loops with array accesses to vector expressions, fusing vector expressions and loops, etc, and spits out a transformed AST containing these optimizations. If runtime information is given such as actual shape and stride information the transformations could figure out there and then whether to do things like collapsing, axes swapping, blocking (as in, introducing more axes or loops to retain discontiguous blocks in the cache), blocked memory copies to contiguous chunks, etc. The AST could then also say whether the final expressions are vectorizable. Part of this functionality is already in numpy's nditer, except that this would be implicit and do more (and hopefully with minimal overhead). So numba, Cython and maybe numexpr could use the functionality, simply by building the AST from Python and converting back (if necessary) to its own AST. As such, the AST optimizer would be only part of any (runtime) compiler's pipeline, and it should be very flexible to retain any information (metadata regarding actual types, control flow information, etc) provided by the original AST. It would not do control flow analysis, type inference or promotion, etc, but only deal with abstract types like integers, reals and arrays (C, Fortran or partly contiguous or strided). It would not deal with objects, but would allow to insert nodes like UnreorderableNode and SideEffectNode wrapping parts of the original AST. In short, it should be as easy as possible to convert from an original AST to this project's AST and back again afterwards. As the project matures many optimizations may be added that deal with all sorts of loop restructuring and ways to efficiently utilize the cache as well as enable vectorization and possibly parallelism. Perhaps it could even generate a different AST depending on whether execution target the CPU or the GPU (with optionally available information such as cache sizes, GPU shared/local memory sizes, etc). Seeing that this would be a part of my master dissertation, my supervisor would require me to write the code, so at least until August I think I would have to write (at least the bulk of) this. Otherwise I can also make other parts of my dissertation's project more prominent to make up for it. Anyway, my question is, is there interest from at least the numba and numexpr projects (if code can be transformed into vector operations, it makes sense to use numexpr for that, I'm not sure what numba's interest is in that).
-Travis
Dag _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
I doubt Theano is already as smart as you'd want it to be right now, however the core mechanisms are there to perform graph optimizations and move computations to GPU. It may save time to start from there instead of starting all over from scratch. I'm not sure though, but it looks like it would be worth considering it at least. -=- Olivier Le 20 mars 2012 15:40, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> a écrit :
** We talked some about Theano. There are some differences in project goals which means that it makes sense to make this a seperate project: Cython wants to use this to generate C code up front from the Cython AST at compilation time; numba also has a different frontend (parsing of python bytecode) and a different backend (LLVM).
However, it may very well be possible that Theano could be refactored so that the more essential algorithms working on the syntax tree could be pulled out and shared with cython and numba. Then the question is whether the core of Theano is smart enough to compete with Fortran compilers and support arbitraily strided inputs optimally. Otherwise one might as well start from scratch. I'll leave that for Mark to figure out...
Dag -- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
Olivier Delalleau <shish@keba.be> wrote:
This sounds a lot like Theano, did you look into it?
-=- Olivier
Le 20 mars 2012 13:49, mark florisson <markflorisson88@gmail.com> a écrit :
(Mark F., how does the above match how you feel about this?)
I would like collaboration, but from a technical perspective I think this would be much more involved than just dumping the AST to an IR and generating some code from there. For vector expressions I think sharing code would be more feasible than arbitrary (parallel) loops, etc. Cython as a compiler can make many decisions that a Python (bytecode) compiler can't make (at least without annotations and a well-defined subset of the language (not so much the syntax as the semantics)). I think in numba, if parallelism is to be supported, you will want a prange-like construct, as proving independence between iterations can be very hard to near impossible for a compiler.
I completely agree that you have to define some kind of syntax to get
On 13 March 2012 18:18, Travis Oliphant <travis@continuum.io> wrote: parallelism. But, a prange construct would not be out of the question, of course.
As for code generation, I'm not sure how llvm would do things like slicing arrays, reshaping, resizing etc (for vector expressions you can first evaluate all slicing and indexing operations and then compile the remaining vector expression), but for loops and array reassignment within loops this would have to invoke the actual slicing code from the llvm code (I presume).
There could be some analysis on the byte-code, prior to emitting the
llvm code in order to handle lots of things. Basically, you have to "play" the byte-code on a simple machine anyway in order to emit the correct code. The big thing about Cython is you have to typedef too many things that are really quite knowable from the code. If Cython could improve it's type inference, then it would be a more suitable target.
There are many other things, like bounds checking, wraparound, etc, that are all supported in both numpy and Cython, but going through an llvm layer would as far as I can see, require re-implementing those, at least if you want top-notch performance. Personally, I think for non-trivial performance-critical code (for loops with indexing, slicing, function calls, etc) Cython is a better target.
With libclang it is really quite possible to imagine a cython -> C
target that itself compiles to llvm so that you can do everything at that intermediate layer. However, LLVM is a much better layer for optimization than C now that there are a lot of people collaborating on that layer. I think it would be great if Cython targeted LLVM actually instead of C.
Finally, as for non-vector-expression code, I really believe Cython is a better target. cython.inline can have high overhead (at least the first time it has to compile), but with better (numpy-aware) type inference or profile guided optimizations (see recent threads on the cython-dev mailing list), in addition to things like prange, I personally believe Cython targets most of the use cases where numba would be able to generate performing code.
Cython and Numba certainly overlap. However, Cython requires:
1) learning another language 2) creating an extension module --- loading bit-code files and
dynamically executing (even on a different machine from the one that initially created them) can be a powerful alternative for run-time compilation and distribution of code.
These aren't show-stoppers obviously. But, I think some users would
prefer an even simpler approach to getting fast-code than Cython (which currently doesn't do enought type-inference and requires building a dlopen extension module).
Dag and I have been discussing this at PyCon, and here is my take on it (at this moment :).
Definitely, if you can avoid Cython then that is easier and more desirable in many ways. So perhaps we can create a third project called X (I'm not very creative, maybe ArrayExprOpt), that takes an abstract syntax tree in a rather simple form, performs code optimizations such as rewriting loops with array accesses to vector expressions, fusing vector expressions and loops, etc, and spits out a transformed AST containing these optimizations. If runtime information is given such as actual shape and stride information the transformations could figure out there and then whether to do things like collapsing, axes swapping, blocking (as in, introducing more axes or loops to retain discontiguous blocks in the cache), blocked memory copies to contiguous chunks, etc. The AST could then also say whether the final expressions are vectorizable. Part of this functionality is already in numpy's nditer, except that this would be implicit and do more (and hopefully with minimal overhead).
So numba, Cython and maybe numexpr could use the functionality, simply by building the AST from Python and converting back (if necessary) to its own AST. As such, the AST optimizer would be only part of any (runtime) compiler's pipeline, and it should be very flexible to retain any information (metadata regarding actual types, control flow information, etc) provided by the original AST. It would not do control flow analysis, type inference or promotion, etc, but only deal with abstract types like integers, reals and arrays (C, Fortran or partly contiguous or strided). It would not deal with objects, but would allow to insert nodes like UnreorderableNode and SideEffectNode wrapping parts of the original AST. In short, it should be as easy as possible to convert from an original AST to this project's AST and back again afterwards.
As the project matures many optimizations may be added that deal with all sorts of loop restructuring and ways to efficiently utilize the cache as well as enable vectorization and possibly parallelism. Perhaps it could even generate a different AST depending on whether execution target the CPU or the GPU (with optionally available information such as cache sizes, GPU shared/local memory sizes, etc).
Seeing that this would be a part of my master dissertation, my supervisor would require me to write the code, so at least until August I think I would have to write (at least the bulk of) this. Otherwise I can also make other parts of my dissertation's project more prominent to make up for it. Anyway, my question is, is there interest from at least the numba and numexpr projects (if code can be transformed into vector operations, it makes sense to use numexpr for that, I'm not sure what numba's interest is in that).
-Travis
Dag _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On 20 March 2012 20:49, Olivier Delalleau <shish@keba.be> wrote:
I doubt Theano is already as smart as you'd want it to be right now, however the core mechanisms are there to perform graph optimizations and move computations to GPU. It may save time to start from there instead of starting all over from scratch. I'm not sure though, but it looks like it would be worth considering it at least.
Thanks for the suggestion Olivier, as Dag said we discusses it, and indeed we (or I) should look a lot deeper into it and see what components are reusable there and discuss with the Theano community if and how we can collaborate.
-=- Olivier
Le 20 mars 2012 15:40, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> a écrit :
We talked some about Theano. There are some differences in project goals which means that it makes sense to make this a seperate project: Cython wants to use this to generate C code up front from the Cython AST at compilation time; numba also has a different frontend (parsing of python bytecode) and a different backend (LLVM).
However, it may very well be possible that Theano could be refactored so that the more essential algorithms working on the syntax tree could be pulled out and shared with cython and numba. Then the question is whether the core of Theano is smart enough to compete with Fortran compilers and support arbitraily strided inputs optimally. Otherwise one might as well start from scratch. I'll leave that for Mark to figure out...
Dag -- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
Olivier Delalleau <shish@keba.be> wrote:
This sounds a lot like Theano, did you look into it?
-=- Olivier
Le 20 mars 2012 13:49, mark florisson <markflorisson88@gmail.com> a écrit :
On 13 March 2012 18:18, Travis Oliphant <travis@continuum.io> wrote:
> > (Mark F., how does the above match how you feel about this?)
I would like collaboration, but from a technical perspective I think this would be much more involved than just dumping the AST to an IR and generating some code from there. For vector expressions I think sharing code would be more feasible than arbitrary (parallel) loops, etc. Cython as a compiler can make many decisions that a Python (bytecode) compiler can't make (at least without annotations and a well-defined subset of the language (not so much the syntax as the semantics)). I think in numba, if parallelism is to be supported, you will want a prange-like construct, as proving independence between iterations can be very hard to near impossible for a compiler.
I completely agree that you have to define some kind of syntax to get parallelism. But, a prange construct would not be out of the question, of course.
As for code generation, I'm not sure how llvm would do things like slicing arrays, reshaping, resizing etc (for vector expressions you can first evaluate all slicing and indexing operations and then compile the remaining vector expression), but for loops and array reassignment within loops this would have to invoke the actual slicing code from the llvm code (I presume).
There could be some analysis on the byte-code, prior to emitting the llvm code in order to handle lots of things. Basically, you have to "play" the byte-code on a simple machine anyway in order to emit the correct code. The big thing about Cython is you have to typedef too many things that are really quite knowable from the code. If Cython could improve it's type inference, then it would be a more suitable target.
There are many other things, like bounds checking, wraparound, etc, that are all supported in both numpy and Cython, but going through an llvm layer would as far as I can see, require re-implementing those, at least if you want top-notch performance. Personally, I think for non-trivial performance-critical code (for loops with indexing, slicing, function calls, etc) Cython is a better target.
With libclang it is really quite possible to imagine a cython -> C target that itself compiles to llvm so that you can do everything at that intermediate layer. However, LLVM is a much better layer for optimization than C now that there are a lot of people collaborating on that layer. I think it would be great if Cython targeted LLVM actually instead of C.
Finally, as for non-vector-expression code, I really believe Cython is a better target. cython.inline can have high overhead (at least the first time it has to compile), but with better (numpy-aware) type inference or profile guided optimizations (see recent threads on the cython-dev mailing list), in addition to things like prange, I personally believe Cython targets most of the use cases where numba would be able to generate performing code.
Cython and Numba certainly overlap. However, Cython requires:
1) learning another language 2) creating an extension module --- loading bit-code files and dynamically executing (even on a different machine from the one that initially created them) can be a powerful alternative for run-time compilation and distribution of code.
These aren't show-stoppers obviously. But, I think some users would prefer an even simpler approach to getting fast-code than Cython (which currently doesn't do enought type-inference and requires building a dlopen extension module).
Dag and I have been discussing this at PyCon, and here is my take on it (at this moment :).
Definitely, if you can avoid Cython then that is easier and more desirable in many ways. So perhaps we can create a third project called X (I'm not very creative, maybe ArrayExprOpt), that takes an abstract syntax tree in a rather simple form, performs code optimizations such as rewriting loops with array accesses to vector expressions, fusing vector expressions and loops, etc, and spits out a transformed AST containing these optimizations. If runtime information is given such as actual shape and stride information the transformations could figure out there and then whether to do things like collapsing, axes swapping, blocking (as in, introducing more axes or loops to retain discontiguous blocks in the cache), blocked memory copies to contiguous chunks, etc. The AST could then also say whether the final expressions are vectorizable. Part of this functionality is already in numpy's nditer, except that this would be implicit and do more (and hopefully with minimal overhead).
So numba, Cython and maybe numexpr could use the functionality, simply by building the AST from Python and converting back (if necessary) to its own AST. As such, the AST optimizer would be only part of any (runtime) compiler's pipeline, and it should be very flexible to retain any information (metadata regarding actual types, control flow information, etc) provided by the original AST. It would not do control flow analysis, type inference or promotion, etc, but only deal with abstract types like integers, reals and arrays (C, Fortran or partly contiguous or strided). It would not deal with objects, but would allow to insert nodes like UnreorderableNode and SideEffectNode wrapping parts of the original AST. In short, it should be as easy as possible to convert from an original AST to this project's AST and back again afterwards.
As the project matures many optimizations may be added that deal with all sorts of loop restructuring and ways to efficiently utilize the cache as well as enable vectorization and possibly parallelism. Perhaps it could even generate a different AST depending on whether execution target the CPU or the GPU (with optionally available information such as cache sizes, GPU shared/local memory sizes, etc).
Seeing that this would be a part of my master dissertation, my supervisor would require me to write the code, so at least until August I think I would have to write (at least the bulk of) this. Otherwise I can also make other parts of my dissertation's project more prominent to make up for it. Anyway, my question is, is there interest from at least the numba and numexpr projects (if code can be transformed into vector operations, it makes sense to use numexpr for that, I'm not sure what numba's interest is in that).
-Travis
> Dag > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
How about: * http://www.hotpy.org/ * http://pypy.org/numpydonate.html On Wed, Mar 21, 2012 at 11:14 PM, mark florisson <markflorisson88@gmail.com>wrote:
On 20 March 2012 20:49, Olivier Delalleau <shish@keba.be> wrote:
I doubt Theano is already as smart as you'd want it to be right now, however the core mechanisms are there to perform graph optimizations and move computations to GPU. It may save time to start from there instead of starting all over from scratch. I'm not sure though, but it looks like it would be worth considering it at least.
Thanks for the suggestion Olivier, as Dag said we discusses it, and indeed we (or I) should look a lot deeper into it and see what components are reusable there and discuss with the Theano community if and how we can collaborate.
-=- Olivier
Le 20 mars 2012 15:40, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> a écrit :
We talked some about Theano. There are some differences in project goals which means that it makes sense to make this a seperate project: Cython wants to use this to generate C code up front from the Cython AST at compilation time; numba also has a different frontend (parsing of python bytecode) and a different backend (LLVM).
However, it may very well be possible that Theano could be refactored so that the more essential algorithms working on the syntax tree could be pulled out and shared with cython and numba. Then the question is whether the core of Theano is smart enough to compete with Fortran compilers and support arbitraily strided inputs optimally. Otherwise one might as well start from scratch. I'll leave that for Mark to figure out...
Dag -- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
Olivier Delalleau <shish@keba.be> wrote:
This sounds a lot like Theano, did you look into it?
-=- Olivier
Le 20 mars 2012 13:49, mark florisson <markflorisson88@gmail.com> a
:
On 13 March 2012 18:18, Travis Oliphant <travis@continuum.io> wrote:
>> >> (Mark F., how does the above match how you feel about this?) > > I would like collaboration, but from a technical perspective I
> this would be much more involved than just dumping the AST to an IR > and generating some code from there. For vector expressions I think > sharing code would be more feasible than arbitrary (parallel) loops, > etc. Cython as a compiler can make many decisions that a Python > (bytecode) compiler can't make (at least without annotations and a > well-defined subset of the language (not so much the syntax as the > semantics)). I think in numba, if parallelism is to be supported, you > will want a prange-like construct, as proving independence between > iterations can be very hard to near impossible for a compiler.
I completely agree that you have to define some kind of syntax to get parallelism. But, a prange construct would not be out of the question, of course.
> > As for code generation, I'm not sure how llvm would do things like > slicing arrays, reshaping, resizing etc (for vector expressions you > can first evaluate all slicing and indexing operations and then > compile the remaining vector expression), but for loops and array > reassignment within loops this would have to invoke the actual > slicing > code from the llvm code (I presume).
There could be some analysis on the byte-code, prior to emitting the llvm code in order to handle lots of things. Basically, you have to "play" the byte-code on a simple machine anyway in order to emit the correct code. The big thing about Cython is you have to typedef too many things
really quite knowable from the code. If Cython could improve it's type inference, then it would be a more suitable target.
> There are many other things, like > bounds checking, wraparound, etc, that are all supported in both > numpy > and Cython, but going through an llvm layer would as far as I can > see, > require re-implementing those, at least if you want top-notch > performance. Personally, I think for non-trivial
> code (for loops with indexing, slicing, function calls, etc) Cython > is > a better target.
With libclang it is really quite possible to imagine a cython -> C target that itself compiles to llvm so that you can do everything at that intermediate layer. However, LLVM is a much better layer for optimization than C now that there are a lot of people collaborating on that layer. I think it would be great if Cython targeted LLVM actually instead of C.
> > Finally, as for non-vector-expression code, I really believe Cython > is > a better target. cython.inline can have high overhead (at least the > first time it has to compile), but with better (numpy-aware) type > inference or profile guided optimizations (see recent threads on
> cython-dev mailing list), in addition to things like prange, I > personally believe Cython targets most of the use cases where numba > would be able to generate performing code.
Cython and Numba certainly overlap. However, Cython requires:
1) learning another language 2) creating an extension module --- loading bit-code files and dynamically executing (even on a different machine from the one that initially created them) can be a powerful alternative for run-time compilation and distribution of code.
These aren't show-stoppers obviously. But, I think some users would prefer an even simpler approach to getting fast-code than Cython (which currently doesn't do enought type-inference and requires building a
écrit think that are performance-critical the dlopen
extension module).
Dag and I have been discussing this at PyCon, and here is my take on it (at this moment :).
Definitely, if you can avoid Cython then that is easier and more desirable in many ways. So perhaps we can create a third project called X (I'm not very creative, maybe ArrayExprOpt), that takes an abstract syntax tree in a rather simple form, performs code optimizations such as rewriting loops with array accesses to vector expressions, fusing vector expressions and loops, etc, and spits out a transformed AST containing these optimizations. If runtime information is given such as actual shape and stride information the transformations could figure out there and then whether to do things like collapsing, axes swapping, blocking (as in, introducing more axes or loops to retain discontiguous blocks in the cache), blocked memory copies to contiguous chunks, etc. The AST could then also say whether the final expressions are vectorizable. Part of this functionality is already in numpy's nditer, except that this would be implicit and do more (and hopefully with minimal overhead).
So numba, Cython and maybe numexpr could use the functionality, simply by building the AST from Python and converting back (if necessary) to its own AST. As such, the AST optimizer would be only part of any (runtime) compiler's pipeline, and it should be very flexible to retain any information (metadata regarding actual types, control flow information, etc) provided by the original AST. It would not do control flow analysis, type inference or promotion, etc, but only deal with abstract types like integers, reals and arrays (C, Fortran or partly contiguous or strided). It would not deal with objects, but would allow to insert nodes like UnreorderableNode and SideEffectNode wrapping parts of the original AST. In short, it should be as easy as possible to convert from an original AST to this project's AST and back again afterwards.
As the project matures many optimizations may be added that deal with all sorts of loop restructuring and ways to efficiently utilize the cache as well as enable vectorization and possibly parallelism. Perhaps it could even generate a different AST depending on whether execution target the CPU or the GPU (with optionally available information such as cache sizes, GPU shared/local memory sizes, etc).
Seeing that this would be a part of my master dissertation, my supervisor would require me to write the code, so at least until August I think I would have to write (at least the bulk of) this. Otherwise I can also make other parts of my dissertation's project more prominent to make up for it. Anyway, my question is, is there interest from at least the numba and numexpr projects (if code can be transformed into vector operations, it makes sense to use numexpr for that, I'm not sure what numba's interest is in that).
-Travis
> >> Dag >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion@scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion@scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On 03/25/2012 08:44 PM, Mic wrote:
How about: * http://www.hotpy.org/
The front page says a 10x speedup. That's a bit short of the almost 1000x speedup required for numerical code (that is, for some examples Python is thousands of times slower than C or Fortran). Well -- I'm sure hotpy could get beyond 10x as well -- it's just to say that they have probably not looked much on the case of numerical computation.
Well, you can start with these...: http://technicaldiscovery.blogspot.com/2011/10/thoughts-on-porting-numpy-to-... http://blog.streamitive.com/2011/10/19/more-thoughts-on-arrays-in-pypy/ Dag
Hi, Up to now, I didn't post frequently to this mailing list. So I present myself a little bit. I'm Frédéric Bastien, one of Theano senior developer. I'm not a student nor a professor, but a staff of our lab. So I more time for Theano then most people in our lab, but I have other work too.
From now on, I'm going to read every weekday email on the numpy mailing list. I think this will help communication between the Theano and numpy community. We have many common goal and this can only be useful to all of us. Also, if possible, I'll try to go to conference where discussion will occur on the big change coming to numpy.
So now, here is my comments/questions to those change. Currently Theano include some type of functionality that I didn't saw in the this thread or others related to the refactoring of Numpy. 1) Theano do numerical stability optimization. (simple example: log(1 + x) -> log1p(x)) 2) It allow to do symbolic "graph generation". We use this to do symbolic differentiation, to compute the R op and L op used in Hessian free. 3) Automatically make some operation inplace on there input: a + b -> a += b 4) Mathematical optimization like a*2+a -> a * 3 5) Implementation optimization (planed in the new numpy) Is there any plan to add 1-4 in the new project? Will it be doable(especially 2). The Theano interface that ask the user to create a Theano function was caused by the fact it is hard/time consuming to work on the AST that we looked a little bit. I didn't looked at the python AST myself, so I don't know if it was possible or just hard. Do someone know if there is all the information that we need? I don't recall discussion on having Theano work at the byte-code level. I never looked at python byte-code, but I remember having read that it change at each version of python? If that is true, how do you plan to handle this? I don't think that rewriting each optimization again at each release of python is acceptable. Event if python change the byte-code only at each X.Y release. I like the idea of having a common library to do array expression optimization (the arrayExprOpt idea). Theano optimizations cover much more type of optimization as said then just element-wise optimization. Up to now, I only remember having read about the element-wise optimization. Did I forget something? Do you plan more then this? If the goal stay at this level, it mean in the end it target to replace numexpr, but I think we should target more then this as there is many high level optimization possible that are useful. There is many projects that do optimized code generation that use SSE* instruction for tensor operation like blitz++(already named) and eigen2. Redoing all this work is time consuming. So any reuse can be useful. Don't forget that many implementation optimization are size and hardware dependent. Do you plan to go that low level? If there is such low level optimization, I suppose it will be only for the frequent case. I'm not sure if this thread is the right one for this, but we started a new project for a full tensor on the GPU. It will be the used by Theano, PyCUDA and PyOpenCL. But this ask change to the ndarray interface. The current code is a C library that use cython for its interface. We made a mailing list for it[1]. The current version is in a fork. I think it will be highly valuable that we have in the end the same gpu nd array with numpy too. It only have been announced as a goal for the new Numpy up to now. We have some work done. It will be usable shortly. But don't expect all Numpy function implemented yet :) Frédéric Bastien [1] https://github.com/inducer/compyte/wiki
On Mar 20, 2012, at 12:49 PM, mark florisson wrote:
Cython and Numba certainly overlap. However, Cython requires:
1) learning another language 2) creating an extension module --- loading bit-code files and dynamically executing (even on a different machine from the one that initially created them) can be a powerful alternative for run-time compilation and distribution of code.
These aren't show-stoppers obviously. But, I think some users would prefer an even simpler approach to getting fast-code than Cython (which currently doesn't do enought type-inference and requires building a dlopen extension module).
Dag and I have been discussing this at PyCon, and here is my take on it (at this moment :).
Definitely, if you can avoid Cython then that is easier and more desirable in many ways. So perhaps we can create a third project called X (I'm not very creative, maybe ArrayExprOpt), that takes an abstract syntax tree in a rather simple form, performs code optimizations such as rewriting loops with array accesses to vector expressions, fusing vector expressions and loops, etc, and spits out a transformed AST containing these optimizations. If runtime information is given such as actual shape and stride information the transformations could figure out there and then whether to do things like collapsing, axes swapping, blocking (as in, introducing more axes or loops to retain discontiguous blocks in the cache), blocked memory copies to contiguous chunks, etc. The AST could then also say whether the final expressions are vectorizable. Part of this functionality is already in numpy's nditer, except that this would be implicit and do more (and hopefully with minimal overhead).
So numba, Cython and maybe numexpr could use the functionality, simply by building the AST from Python and converting back (if necessary) to its own AST. As such, the AST optimizer would be only part of any (runtime) compiler's pipeline, and it should be very flexible to retain any information (metadata regarding actual types, control flow information, etc) provided by the original AST. It would not do control flow analysis, type inference or promotion, etc, but only deal with abstract types like integers, reals and arrays (C, Fortran or partly contiguous or strided). It would not deal with objects, but would allow to insert nodes like UnreorderableNode and SideEffectNode wrapping parts of the original AST. In short, it should be as easy as possible to convert from an original AST to this project's AST and back again afterwards.
I think this is a very interesting project, and certainly projects like numba can benefit of it. So, in order to us have an idea on what you are after, can we assume that your project (call it X) would be kind of an compiler optimizer, and then the produced, optimized code could be feed into numba for optimized LLVM code generation (that on its turn, can be run on top of CPUs or GPUs or a combination)? Is that correct? Giving that my interpretation above is correct, it is bit more difficult to me to see how your X project could be of benefit for numexpr. In fact, I actually see this the other way round: once the optimizer has discovered the vectorization parts, then go one step further and generate code that uses numexpr automatically (call this, vectorization through numexpr). This is what you mean, or I'm missing something?
As the project matures many optimizations may be added that deal with all sorts of loop restructuring and ways to efficiently utilize the cache as well as enable vectorization and possibly parallelism. Perhaps it could even generate a different AST depending on whether execution target the CPU or the GPU (with optionally available information such as cache sizes, GPU shared/local memory sizes, etc).
Seeing that this would be a part of my master dissertation, my supervisor would require me to write the code, so at least until August I think I would have to write (at least the bulk of) this. Otherwise I can also make other parts of my dissertation's project more prominent to make up for it. Anyway, my question is, is there interest from at least the numba and numexpr projects (if code can be transformed into vector operations, it makes sense to use numexpr for that, I'm not sure what numba's interest is in that).
I'm definitely interested for the numexpr part. It is just that I'm still struggling to see the big picture on this. But the general idea is really appealing. Thanks, -- Francesc Alted
participants (15)
-
Dag Sverre Seljebotn -
Francesc Alted -
Frédéric Bastien -
Ilan Schnell -
mark florisson -
Mic -
Nathaniel Smith -
Olivier Delalleau -
Pauli Virtanen -
Pierre Haessig -
Sturla Molden -
Till Stensitzki -
Travis Oliphant -
Travis Oliphant -
xavier.gnata@gmail.com