Contributing to PyPy [especially NumPy]

Hi, I am willing to contribute to PyPy, especially to the NumPy port. The main reason NumPy is my main interest is that, as a Ph.D. student in Applied Mathematics, I really hope that one day we will be able to perform numerical computation without using heavy bindings to C/Fortran or an intermediate solution like Cython. I haven't contributed to any open-source project yet, so I may have to learn some conventions. However, I'm a regular user of DVCS. Looking at the source code and the dev mailing list, it seems a big refactoring is currently underway on the numpy branch. Are there any small fixes or features easy enough for a newbie on the topic to implement? Samuel

On Fri, Oct 14, 2011 at 5:59 PM, Samuel Vaiter <samuel.vaiter@gmail.com> wrote:
Hi! It's great to hear from you!
Yes, there are definitely small things that you can work on. A good start would be to pick a missing feature from numpy and start implementing it. There is usually someone on IRC who can help if you have some immediate questions. Do you want me to find you some small thing? Cheers, fijal

Hi, thanks for your answer.
Yeah, it might be a good thing for a start to have a "tutor", if you have the time to think about it. @Stefan: I agree. By "intermediate solution" I mean the total time: time to think about the data structure + time to implement + time to execute. I use NumPy almost all the time as a tool for "exploration". I don't need the fastest execution time, I only care about the total time ;) Cython is a great tool, but 90% of my issues are index-related: NumPy and Matlab are slow when you loop over the indices of your array. And I think PyPy's NumPy (will) provide a much better solution in MY case. Regards, Samuel

On Sun, Oct 16, 2011 at 6:41 PM, Samuel Vaiter <samuel.vaiter@gmail.com> wrote:
Hi, I'm on holiday now, but maybe I can think of something. Alex: any idea? Cheers, fijal

On Sun, Oct 16, 2011 at 2:03 PM, Maciej Fijalkowski <fijall@gmail.com> wrote:
Adding new ufuncs is a great introductory task for numpy. I'm not sure which ones we're missing, but I'm sure we are :) You could also add Python-ufunc support, that is, the ability to create a ufunc from a Python function. This shouldn't be too difficult. Alex -- "I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire) "The people's good is the highest law." -- Cicero
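[Editor's note: for readers unfamiliar with the feature Alex mentions, in CPython's NumPy the "ufunc from a Python function" idea is spelled np.frompyfunc. A minimal sketch of the behaviour being discussed; the function name is just an illustration:]

```python
import numpy as np

# np.frompyfunc turns an ordinary Python function into a ufunc that
# broadcasts elementwise over arrays -- the "Python-ufunc support"
# feature discussed above.
def clip_square(x):
    # A plain Python function: square the input, capping the result at 100.
    return min(x * x, 100)

# frompyfunc(func, nin, nout); the resulting ufunc yields object arrays,
# so we cast back to a numeric dtype for convenience.
usquare = np.frompyfunc(clip_square, 1, 1)

a = np.array([2, 5, 20])
print(usquare(a).astype(np.int64))  # [  4  25 100]
```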

Samuel Vaiter, 14.10.2011 17:59:
The main reason NumPy is my main interest is that, as a Ph.D. student in Applied Mathematics, I really hope that one day we will be able to perform numerical computation without using heavy bindings to C/Fortran or an intermediate solution like Cython.
I guess you didn't mean it that way, but "intermediate solution" makes it sound like you expect any of these to go away one day. They sure won't. Manually optimised C and Fortran code will always beat JIT compilers, especially in numerics. It's a game they can't win - whenever JIT compilers get too close to hand optimised code, someone will come along and write better code. Stefan

On Sun, Oct 16, 2011 at 2:29 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
I guess what you say is at best [citation needed]. We have proven already that we can perform several optimizations that are very hard to perform at the C level. And indeed, while you can always argue "well, you can just write a better compiler", that's true for JITs too. And we're only at the beginning of what we can do. One example that I have in mind is array expressions that depend on runtime data - we can optimize them fairly well in the JIT (which means SSE and whatnot), but you just can't get the same thing in C, because you're unable to compile native code for each particular set of array operations. Cheers, fijal

On Sun, Oct 16, 2011 at 11:50 AM, Maciej Fijalkowski <fijall@gmail.com> wrote:
Another example, which no Fortran compiler will ever be able to do: if you create a ufunc from a Python function, we can still inline it into the assembler that's emitted for an operation, so a + b * sin(my_ufunc(c)) still generates a single loop in assembler. Alex
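[Editor's note: to make concrete what "a single loop" means here, a Python-level sketch; my_ufunc is the thread's placeholder name, and its body below is an arbitrary stand-in:]

```python
import math

def my_ufunc(x):
    # Arbitrary stand-in for a user-defined Python ufunc.
    return x + 1.0

def fused(a, b, c):
    # The shape of the machine code the JIT emits: one pass over the
    # data with the Python function inlined, instead of one loop (and
    # one temporary array) per operation in a + b * sin(my_ufunc(c)).
    return [ai + bi * math.sin(my_ufunc(ci)) for ai, bi, ci in zip(a, b, c)]

print(fused([1.0, 2.0], [0.0, 0.0], [3.0, 4.0]))  # b is zero, so just a: [1.0, 2.0]
```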

Maciej Fijalkowski, 16.10.2011 17:50:
I guess what you say is at best [citation needed].
Feel free to quote me. :D
We have proven already that we can perform several optimizations that are very hard to perform at the C level. And indeed, while you can always argue "well, you can just write a better compiler", that's true for JITs too.
I wasn't comparing a JIT to another compiler. I was comparing it to a human programmer. A JIT, just like any other compiler, will never be able to *understand* the code it compiles, and it can only apply the optimisations that it was taught.

JITs are nice when you need performance quickly and don't care about the last few CPU cycles. However, there are cases where it's not very satisfactory to learn that your JIT compiler, in its current state, can only get you up to, say, 90%, or even 95%, of the speed that you need for your problem. In those cases where you do care about the last 5% - and numerics people care about them surprisingly often - you will eventually end up using a low-level language, usually C or Fortran, to make sure you get as much out of your code as possible. JIT compilers are structurally much harder to manually override than static compilers, and they are certainly not designed to help with the "but I know what I'm doing" cases.

Mind you, I'm not saying that JIT compilers aren't capable of generating surprisingly fast code. They certainly are, and what they deliver is often "good enough". I'm just saying that statically compiled low-level languages will *always* have a raison d'être, if only to deliver support for the "I know what I'm doing" cases. Stefan

On Sun, Oct 16, 2011 at 5:34 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
I wasn't comparing a JIT to another compiler. I was comparing it to a human programmer. A JIT, just like any other compiler, will never be able to *understand* the code it compiles, and it can only apply the optimisations that it was taught.
I don't understand your argument. There are *many* situations where the best time to make a decision on how to generate machine code is runtime, not offline compile time. There are many things in numpy that are very difficult to make fast because we can't know how to perform them without information that is only known at runtime, e.g.:
- non-trivial iteration. The neighborhood iterators we have in numpy are very slow because of the cost of function calls that can't really be bypassed, unless you start to generate hundreds or even more small functions for each special case (number of dimensions, whether the stride value == item size, etc.)
- anything that requires specialization on the type. A typical example is the sparse array code in scipy.sparse. Some of it is C++ code templated on the contained value and index size. But because the generated code is so big, we need to restrict the available types; otherwise we would need to compile literally thousands of functions which do exactly the same thing, even though most people don't use most of them.
One area where a JIT is not that useful is replacing existing Fortran/C code. Not only am I doubtful about a JIT beating the Intel compiler for linear algebra stuff, but, more significantly, rewriting the existing codebase would take many man-years. Some of this code requires deep knowledge of the underlying computation, and there is also the issue of correctness in floating point code generation. Given that decade-old compilers get it wrong, I would expect pypy jit to have quite a few funky corner cases as well. cheers, David

Hi David, On Sun, Oct 16, 2011 at 19:13, David Cournapeau <cournape@gmail.com> wrote:
(...) and there is also the issue of correctness in floating point code generation. Given that decade-old compilers get it wrong, I would expect pypy jit to have quite a few funky corner cases as well.
No, we should not have corner cases, because we don't go there at all. We know very well that rewriting operations on floats can slightly change their results, so we don't do it. In other words the JIT produces a sequence of residual operations that has bit-wise the same effect as the original sequence of Python operations. (More precisely, it seems that we only replace FLOAT_MUL(x, 1.0) by "x" and FLOAT_MUL(x, -1.0) by "-x", as well as simplify repeated FLOAT_NEG's and assume that FLOAT_MUL's are commutative. As far as I can tell these trivial optimizations are all bit-wise correct, at least on modern FPUs.) A bientôt, Armin.

On Sun, Oct 16, 2011 at 5:21 PM, Armin Rigo <arigo@tunes.org> wrote:
I should note that I implemented all of these optimizations by verifying that GCC would do them, as well as by testing on obscure values. Note for example that we don't constant-fold x + 0.0 (it changes the sign of x at x == -0.0). Alex
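[Editor's note: the signed-zero pitfall Alex refers to is easy to check from Python; these are plain IEEE 754 semantics, the same on CPython and PyPy:]

```python
import math

x = -0.0
# x + 0.0 flips the sign of a negative zero, so the JIT cannot
# constant-fold it away...
assert math.copysign(1.0, x) == -1.0
assert math.copysign(1.0, x + 0.0) == 1.0
# ...while x * 1.0 preserves the sign of zero (and every other bit),
# so rewriting FLOAT_MUL(x, 1.0) -> x is safe.
assert math.copysign(1.0, x * 1.0) == -1.0
print("signed-zero checks passed")
```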

On Sun, Oct 16, 2011 at 10:21 PM, Armin Rigo <arigo@tunes.org> wrote:
Interesting to know. But then, wouldn't this limit the speed gains to be expected from the JIT? And I am not sure I understand how you can "not go there" if you want to vectorize code to use SIMD instruction sets? David

Hi, On Sun, Oct 16, 2011 at 23:41, David Cournapeau <cournape@gmail.com> wrote:
Interesting to know. But then, wouldn't this limit the speed gains to be expected from the JIT?
Yes, to some extent. It cannot give you the last bit of performance improvement you could expect from arithmetic optimizations, but (as usual) you already get the several-times improvements of e.g. removing the boxing and unboxing of float objects. Personally I'm wary of going down that path, because it means that the results we get could suddenly change their least significant digit(s) when the JIT kicks in. At least, there are multiple tests in the standard Python test suite that would fail because of that.
And I am not sure I understand how you can "not go there" if you want to vectorize code to use SIMD instruction sets?
I'll leave fijal to answer this question in detail :-) I suppose that the goal is first to use SIMD when explicitly requested in the RPython source, in the numpy code that operates on matrices, and not do the harder job of automatically unrolling and SIMD-ing loops containing Python float operations. But even the latter could be done without giving up on the idea that all Python operations should be preserved in a bit-exact way (e.g. by using SIMD on 64-bit floats, not on 32-bit floats). A bientôt, Armin.
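[Editor's note: Armin's worry that reordering could change the "least significant digit(s)" is easy to demonstrate, since IEEE 754 addition is not associative:]

```python
# Double-precision addition is not associative, so a JIT that
# reassociated float arithmetic would silently change results.
lhs = (0.1 + 0.2) + 0.3
rhs = 0.1 + (0.2 + 0.3)
print(lhs)  # 0.6000000000000001
print(rhs)  # 0.6
assert lhs != rhs
```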

On Mon, Oct 17, 2011 at 12:10 AM, Armin Rigo <arigo@tunes.org> wrote:
The thing is that, as with Python, there are scenarios where we can optimize a lot (like you said: by doing type specialization, folding array operations, or using multithreading based on runtime decisions) without having to squeeze out the last 2% of performance. This is the approach that has worked great for optimizing Python so far (concentrate on the larger picture).
For now we restrict SIMD operations to explicit array arithmetic and we don't do automatic vectorization. We'll see later what we do with it :) Cheers, fijal

Maciej Fijalkowski, 17.10.2011 09:34:
That's what I meant. It's not surprising that a JIT compiler can be faster than an interpreter, and it's not surprising that it can optimise generic code into several-times-faster specialised code. That's what JIT compilers are there for, and PyPy does a really good job at that. It's much harder to reach the performance of specialised, hand-tuned code, though. And there is a lot of specialised, hand-tuned code in SciPy and Sage, for example. That's a different kind of game than the "running generic Python code faster than CPython" business, however worthy that is by itself. Stefan

On Mon, Oct 17, 2011 at 10:16 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
We're not trying to compete, though. The plan is to reuse specialized hand-tuned code where it exists and to compete in the areas where SciPy or NumPy don't cater well (like a Python ufunc, type specialization, or a chain of array operations). Cheers, fijal

On 10/17/2011 12:10 AM Armin Rigo wrote:
I'm wondering how you handle high-level loop optimizations vs floating-point order-sensitive calculations. E.g., if a source loop has z[i] = a*b*c, might you hoist b*c without considering that assert a*b*c == a*(b*c) might fail, as in:

>>> a = b = 1e-200; c = 1e200
>>> assert a*b*c == a*(b*c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
>>> a*b*c, a*(b*c)
(0.0, 1e-200)
Regards, Bengt Richter

On Mon, Oct 17, 2011 at 6:12 AM, Bengt Richter <bokr@oz.net> wrote:
No, you would never hoist b * c, because b * c isn't an operation in that loop; the only ops that exist are:

t1 = a * b
t2 = t1 * c
z[i] = t2

Even if we did do arithmetic reassociation (which we don't, yet), you can't do it on floats. Alex

On 10/17/2011 01:26 PM Alex Gaynor wrote:
No, you would never hoist b * c, because b * c isn't an operation in that loop; the only ops that exist are:
t1 = a * b
t2 = t1 * c
z[i] = t2
d'oh
Even if we did do arithmetic reassociation (which we don't, yet), you can't do it on floats.
Hm, what if you could statically prove that the fp ops gave bit-wise exactly the same results when reordered (given you have 53 significant bits to play with)? (Maybe a more theoretical question than a practical one.) Regards, Bengt Richter P.S. What did you mean by the teaser "(which we don't, yet)"? When would you?

On Sun, Oct 16, 2011 at 6:34 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
I just claim you're wrong here: there are cases where you can't beat the JIT compiler, precisely because some things depend on runtime data and you can't encode all the possibilities in statically compiled code (at least in theory). Granted, you might want to have access to emitting assembler on the fly and do it better than a compiler, but either way you sometimes need a way to emit assembler on the fly.
Mind you, I'm not saying that JIT compilers aren't capable of generating surprisingly fast code. They certainly are, and what they deliver is often "good enough". I'm just saying that statically compiled low-level languages will *always* have a raison d'être, if only to deliver support for the "I know what I'm doing" cases.
We still have to implement JITs in something :)

Maciej Fijalkowski, 16.10.2011 20:01:
Regarding David's response, I agree that there are cases where JITs can help in limiting the code explosion that you'd get from statically generating all possible optimised cases for generic code. A JIT only needs to instantiate the cases that really exist at runtime. Obviously, that does not automatically mean that the JIT will generate code that is as fast as or faster than what a programmer would write for *one* of the specific cases by tuning the code accordingly. It just means that it will generate better code *on average* when looking at the whole corpus, because it can simply (and quickly) adapt the code to more cases as needed. Whether that's what the programmer wants depends on the use case. I see the advantage especially in library code that needs to deal with basically all cases efficiently, as David pointed out. Stefan
participants (7)
- Alex Gaynor
- Armin Rigo
- Bengt Richter
- David Cournapeau
- Maciej Fijalkowski
- Samuel Vaiter
- Stefan Behnel