[Numpy-discussion] arange(start, stop, step) and floating point (Ticket #8)

Thu Feb 9 20:36:09 EST 2006

Well, my results are different.

SVN r2087:
> python -m timeit -s "from numpy import arange" "arange(10000.0)"
10000 loops, best of 3: 21.1 usec per loop

SVN r2088:
> python -m timeit -s "from numpy import arange" "arange(10000.0)"
10000 loops, best of 3: 25.6 usec per loop

I am using gcc version 3.3.4 with the following flags: -msse2
-mfpmath=sse -fno-strict-aliasing -DNDEBUG -g -O3.
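
For anyone who wants to repeat the measurement from a script rather than the shell, here is a minimal sketch using the standard-library timeit module. The helper name best_usec_per_loop is made up for illustration; it only mimics the "N loops, best of 3" summary that the command line prints.

```python
import timeit


def best_usec_per_loop(stmt, setup="pass", number=10000, repeat=3):
    """Return the best per-loop time in microseconds over `repeat` runs.

    Hypothetical helper: mirrors the "10000 loops, best of 3" summary
    of `python -m timeit`, with the loop count fixed by the caller.
    """
    # timeit.repeat returns one total time (in seconds) per run.
    totals = timeit.repeat(stmt, setup=setup, repeat=repeat, number=number)
    return min(totals) / number * 1e6


# Stdlib-only demonstration statement, so the sketch runs without numpy.
print("%.3g usec per loop" % best_usec_per_loop("sum(range(100))", number=1000))
```

With numpy installed, the measurement above would correspond to best_usec_per_loop("arange(10000.0)", setup="from numpy import arange").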

The timing is consistent with the change in the DOUBLE_fill loop:

r2087:
   1b8f0:       f2 0f 11 08             movsd  %xmm1,(%eax)
   1b8f4:       f2 0f 58 ca             addsd  %xmm2,%xmm1
   1b8f8:       83 c0 08                add    $0x8,%eax
   1b8fb:       39 c8                   cmp    %ecx,%eax
   1b8fd:       72 f1                   jb     1b8f0 <DOUBLE_fill+0x30>

r2088:
   1b9d0:       f2 0f 2a c2             cvtsi2sd %edx,%xmm0
   1b9d4:       42                      inc    %edx
   1b9d5:       f2 0f 59 c1             mulsd  %xmm1,%xmm0
   1b9d9:       f2 0f 58 c2             addsd  %xmm2,%xmm0
   1b9dd:       f2 0f 11 00             movsd  %xmm0,(%eax)
   1b9e1:       83 c0 08                add    $0x8,%eax
   1b9e4:       39 ca                   cmp    %ecx,%edx
   1b9e6:       7c e8                   jl     1b9d0 <DOUBLE_fill+0x20>


The loop was 5 instructions before the change and 8 instructions
after.  It is possible that the 387 FPU can do addition and multiplication
in parallel, which would explain why you don't see the difference.

Nevertheless, I would like to withdraw my prior objections.  I think
the code is now more numerically correct, and that is worth the
slow-down on some platforms.
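
To illustrate the numerical point, here is a minimal pure-Python sketch of the two fill strategies. It is a simplification, not the numpy code: the function names are made up, and the real DOUBLE_fill takes the first two elements from the buffer rather than receiving start and delta as arguments.

```python
def fill_by_addition(start, delta, length):
    """Running sum, in the spirit of the r2087 loop."""
    out = []
    value = start
    for _ in range(length):
        out.append(value)
        value += delta  # each addition rounds, and the error carries forward
    return out


def fill_by_multiplication(start, delta, length):
    """Independent elements, in the spirit of the r2088 loop."""
    # Each element is rounded on its own, so errors do not accumulate.
    return [start + i * delta for i in range(length)]


added = fill_by_addition(0.0, 0.1, 11)
multiplied = fill_by_multiplication(0.0, 0.1, 11)
print(repr(added[10]), repr(multiplied[10]))
```

With a step of 0.1 the running sum has already drifted off 1.0 by the eleventh element, while start + i*delta gives exactly 1.0 there.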

By the way, while I was playing with the code, I also tried the
recommendation of using a[i] instead of *p:

--- numpy/core/src/arraytypes.inc.src   (revision 2088)
+++ numpy/core/src/arraytypes.inc.src   (working copy)
@@ -1652,9 +1652,8 @@
        @typ@ start = buffer[0];
        @typ@ delta = buffer[1];
        delta -= start;
-       buffer += 2;
-       for (i=2; i<length; i++, buffer++) {
-               *buffer = start + i*delta;
+       for (i=2; i!=length; ++i) {
+               buffer[i] = start + i*delta;
        }
 }

The resulting optimized code for the loop was:

   1b9d0:       f2 0f 2a c0             cvtsi2sd %eax,%xmm0
   1b9d4:       f2 0f 59 c1             mulsd  %xmm1,%xmm0
   1b9d8:       f2 0f 58 c2             addsd  %xmm2,%xmm0
   1b9dc:       f2 0f 11 04 c2          movsd  %xmm0,(%edx,%eax,8)
   1b9e1:       40                      inc    %eax
   1b9e2:       39 c8                   cmp    %ecx,%eax
   1b9e4:       75 ea                   jne    1b9d0 <DOUBLE_fill+0x20>

This is one instruction fewer because "add    $0x8,%eax" was eliminated,
and all the pointer arithmetic and the store (buffer[i] = ...) are now done
in a single instruction, "movsd  %xmm0,(%edx,%eax,8)".

The timing, however, did not change:

> python -m timeit -s "from numpy import arange" "arange(10000.0)"
10000 loops, best of 3: 25.6 usec per loop

My change may be worth committing because the C code is shorter and
arguably more understandable (at least by Fortran addicts :-).
Travis?



On 2/9/06, Tim Hochberg <tim.hochberg at cox.net> wrote:
> # baseline
> arange(10000.0) took 4.39404812623 seconds for 100000 reps

> # multiply instead of repeated add.
> arange(10000.0) took 4.34652784083 seconds for 100000 reps