Microbenchmark: Summing over array of doubles

Sun Aug 1 19:54:58 EDT 2004

On 31 Jul 2004, Yaroslav Bulatov wrote:

> I'm doing intensive computation on arrays in Python, so if you have
> suggestions on Python/C solutions that could push the envelope, please
> let me know.

If you're doing mostly vector calculations as opposed to summing, I've
been doing some work on adding SIMD support to numarray, with pleasing
results (around 2x speedups).  I've also done some work adding local
parallel processing support to numarray, with not-so-pleasing results
(mostly due to Python overhead).

Regarding your results:

numarray should be just as fast as the -O2 C version.  I was puzzled at
first as to where the speed discrepancy came from, but the culprit is
the -O2 flag:  gcc -O2 notices that sum is never used, and thus removes
the summation from the loop entirely, leaving only the loop counter.  As
a matter of fact, there isn't even an fadd instruction in the assembler
output:

        call    clock
        movl    %eax, %esi
        movl    $9999999, %ebx
.L11:
        decl    %ebx
        jns     .L11
        subl    $16, %esp
        call    clock

As you can see, the 21ms you're seeing is the time spent counting down
from 9,999,999 to 0.  To obtain correct results, add a line such as
'printf("%f\n",sum);' after the main loop in the C version.  This will
force gcc to leave the actual calculation in place and give you accurate
results.

The above fix will likely render numarray faster than the C version.  
Using gcc -O3 rather than gcc -O2 will get fairer results, as this is what 
numarray uses.

Is there any reason why in the Python/numarray version, you use 
Numeric's RandomArray rather than numarray.random_array?  It shouldn't 
affect your results, but it would speed up initialization time a bit.

There are a few inefficiencies in the pytime module (mostly involving
range() and *args/**kwargs), but I don't think they'll have much impact
on your results.  Instead, I'd suggest running the numarray/Numeric
tests using Psyco to remove much of the Python overhead.

For completeness, I'd also suggest both running the Java version using a 
JIT compiler such as Kaffe, and compiling it natively using gcj (the 
latter should approach the speed of C).